Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type FAB (standard) Sure Solution 1000325.1 : Noisy I2C bus causes picld to report false environmental FRU failures.
PreviouslyPublishedAs 200444

Product
Sun Fire 280R Server
Sun Fire V880 Server
Sun Fire V490 Server
Sun Fire V890 Server
Sun Fire V480 Server

Bug Id <SUNBUG: 6214188>, <SUNBUG: 6250164>, <SUNBUG: 6286821>, <SUNBUG: 6337779>, <SUNBUG: 4801542>

Impact

On a small number of systems, noisy I2C bus signals cause picld, the system environmental monitoring daemon, to report and log false environmental conditions. Whether a customer sees this issue depends on the system configuration, application services, I/O load and possibly external environmental factors.

The false conditions may lead to multiple service calls and FRU replacements for parts that appear to be bad. Because the replaced FRUs do not resolve the reported issues, the result is customer dissatisfaction. In extreme cases on some platforms, a false critical temperature sensor reading may trigger an immediate system power-down, causing an unplanned outage.

Symptoms

The following symptoms have been observed on the platforms noted:

Temperature changes on a single sensor that are inconsistent with actual temperature changes in the system's environment, or with other temperature sensor readings.

V880/V890:

    Jan 3 20:25:54 hostname picld[72]: [ID 916734 daemon.error] CRITICAL : LOW TEMPERATURE DETECTED -50, IOB_AMB_TEMPERATURE_SENSOR
    Jan 4 00:41:11 hostname picld[72]: [ID 916734 daemon.error] CRITICAL : LOW TEMPERATURE DETECTED -50, MB_AMB_TEMPERATURE_SENSOR

V480/V490:

    Jan 2 06:59:57 hostname picld[9752]: [ID 690822 daemon.error] CRITICAL : HIGH TEMPERATURE DETECTED 120, CPU0_DIE_TEMPERATURE_SENSOR
    Jan 5 02:47:28 hostname picld[60]: [ID 690822 daemon.error] CRITICAL : HIGH TEMPERATURE DETECTED 120, CPU2_DIE_TEMPERATURE_SENSOR

280R:

    Jan 5 21:07:21 hostname picld[177]: [ID 733768 daemon.error] WARNING : LOW TEMPERATURE DETECTED -1, CPU0_DIE_TEMPERATURE_SENSOR
    Jan 5 22:20:54 hostname picld[177]: [ID 733768 daemon.error] WARNING : LOW TEMPERATURE DETECTED -1, CPU1_DIE_TEMPERATURE_SENSOR

Hot-plug or keyswitch events that are known not to have physically occurred.

    Jan 1 18:21:29 hostname picld[78]: [ID 293134 daemon.error] Device PS0 unplugged
    Jan 1 18:21:34 hostname picld[78]: [ID 673612 daemon.error] Device PS0 Plugged in
    Jan 3 03:44:49 hostname picld[72]: [ID 707422 daemon.error] Keyswitch position could not be determined
    Jan 3 05:42:01 hostname picld[60]: [ID 727222 daemon.error] Device DISK1 inserted
    Jan 3 05:45:10 hostname picld[60]: [ID 789240 daemon.error] Device DISK1 removed
    Jan 3 05:45:45 hostname picld[60]: [ID 727222 daemon.error] Device DISK1 inserted

A FRU failure is reported and, shortly afterward, the same FRU is reported as OK, with no physical interaction or known external changes occurring between the reported events.

    Jan 1 18:56:37 hostname picld[78]: [ID 625010 daemon.error] WARNING: Device PS2 failure detected
    Jan 1 18:56:37 hostname picld[78]: [ID 702911 daemon.error] PS2_FAN_FAIL_SENSOR
    Jan 1 18:58:25 hostname picld[78]: [ID 449286 daemon.error] Device PS2 OK

Secondary Fan Tray failures without a Primary Fan Tray failure on Sun Fire V880 & V890 servers. On these systems, the Secondary Fan Trays do not run unless the Primary Fan Tray has failed or is not present, so a Secondary Fan Tray that is not even running cannot sensibly be reported as failed.

    Jan 1 18:49:48 hostname picld[78]: [ID 625010 daemon.error] WARNING: Device CPU0_SEC_FAN failure detected
      ^^--This fan should not even be running to fail unless the primary already failed and is not running
    Jan 1 18:50:00 hostname picld[78]: [ID 300385 daemon.error] Secondary fan failure, device CPU0_SEC_FAN
      ^^--This fan should not even be running to fail unless the primary already failed and is not running
    Jan 1 18:57:54 hostname picld[78]: [ID 300385 daemon.error] Secondary fan failure, device CPU0_SEC_FAN
      ^^--This fan should not even be running to fail unless the primary already failed and is not running
    Jan 3 03:25:55 hostname picld[72]: [ID 625010 daemon.error] WARNING: Device CPU0_PRIM_FAN failure detected
    Jan 4 20:12:59 hostname picld[72]: [ID 625010 daemon.error] WARNING: Device IO0_PRIM_FAN failure detected
      ^^--Failed over to Secondary OK
    Jan 4 21:15:42 hostname picld[72]: [ID 625010 daemon.error] WARNING: Device IO0_SEC_FAN failure detected
      ^^--Failed back over to Primary OK even though it was supposedly already failed
    Jan 4 21:15:49 hostname picld[72]: [ID 300385 daemon.error] Secondary fan failure, device IO0_SEC_FAN
      ^^--This fan should not even be running to fail unless the primary is failed and not running; this bouncing with and without OK messages indicates false events.

Multiple environmental failures are reported from multiple FRUs, with little consistency to indicate any specific one as faulty. The majority of the reported events occur at times of peak I/O application load. Where FRUs have been replaced, the replacement does not resolve the issue and the same FRU is reported as failed again later, which would be unlikely for a real failure.

Consistency and timing of the reported FRU failure are the key to distinguishing the false failures covered by this patch fix from real failures that require a FRU replacement. If only one FRU event is occurring and a real failure is suspected, run the OBDiag i2c or SunVTS env/i2c tests over a number of passes (e.g. 50) with the system under normal or no I/O load. If failures do not recur when tested repeatedly under these conditions, then it is most likely not a true fault. If repeatable and consistent failures occur under OBDiag/SunVTS tests with the FRU installed, regardless of the I/O load conditions, and do not occur when the tests are repeated with the FRU removed, then it is likely a real FRU failure.

Root Cause

The picl environmental monitoring daemon needs to account for noisy data and correct for it, as noise is in the nature of the complex I2C bus implementations on the affected platforms. Due to the nature of some application data transfers on the system bus, the noise may persist long enough that picl retries still return a false reading. This has been most prevalent on systems whose prime application usage is backup software such as Solstice Backup, Legato Networker and Veritas NetBackup, possibly due to the streaming block I/O these applications generate on the system bus.

The false failures cannot be resolved with hardware replacements. Use patchadd or showrev to check whether the required patches are installed. If they are installed, check whether the picl I2C event tuning files are present. If not, create and tune them per the instructions below until the false failures are no longer reported.

Workaround

On a system that is affected by this problem, the retry count values can be tuned until false sensor readings are no longer seen. False sensor readings have caused loss of service (due to shutdowns) and unneeded replacement of parts that were incorrectly diagnosed as faulty.

The original fix for the problem was developed in response to CR 4801542. This fix introduced the concept of a retry value when reading certain sensors. Unfortunately, the retry logic was not applied to many sensors, and it used a global hard-coded retry value of 2 retries, with a 1-second sleep period between retries.
Some sensors were also given a 1-second sleep period in between retries; other sensors were retried according to the "interval" parameter already in place in the platforms' platsvcd.conf files. These new patches extend the original fix by:
Resolution 1. Apply the appropriate picld patch for Solaris 8, 9 or 10:
2. Monitor the system for further false positive events with the patch default retry value of five (5). If no further events occur, then no additional action is necessary.

3. If further false positive events occur, make a note of which event sensors are being reported. Shut down picld:

   a) For Solaris 8 or 9, use the following command as root:

      # /etc/init.d/picld stop

   b) For Solaris 10, use the following command as root:

      # svcadm disable -t picl

4. Create the following tuning files appropriate for the platform. Each platform will have a sun4u i2cparam.conf file and a platform-specific i2cparam.conf file. Comments are allowed in these files, and always begin with a # in the first column. These files are not created by the patch installation process, and must be created manually with a text editor.

   The i2cparam.conf text files specify the retry values and retry sleep time on a per-sensor basis. Picl will read the sensor N consecutive times, and the reading must stay the same each time for picl to treat it as genuine rather than a false reading. The default value is 5 retries. The retry sleep time tells picl to sleep for the designated number of seconds between readings, to allow whatever other bus activity is occurring to complete, reducing the chance that signal noise is still present on each of the N consecutive reads. The default value is 1 second of sleep time between retries.

   a) For all platforms, create a text file named "i2cparam.conf" in the "/usr/platform/sun4u/lib" directory.
This example documents all of the possible sun4u retry values:

    # /usr/platform/sun4u/lib/i2cparam.conf file
    # retry value for reading temperature
    n_read_temp 5
    # retry value for reading keyswitch
    n_retry_keyswitch 5
    retry_sleep_keyswitch 1
    # retry value for detecting hotplug
    n_retry_hotplug 5
    retry_sleep_hotplug 1
    # retry value for detecting hotplug, fan
    n_retry_fan_hotplug 5
    retry_sleep_fan_hotplug 1
    # retry value for detecting fan presence
    n_retry_fan_present 5
    retry_sleep_fan_present 1
    # end of /usr/platform/sun4u/lib/i2cparam.conf file

   b) For the platform-specific picl events, create a text file named "i2cparam.conf" in the platform directory listed as follows. These examples document all of the possible platform-specific retry values:

The V880-specific "i2cparam.conf" file lives in "/usr/platform/SUNW,Sun-Fire-880/lib" and the V890-specific "i2cparam.conf" file lives in "/usr/platform/SUNW,Sun-Fire-V890/lib". Both platforms use identical platform-specific sensors, so use this same example:

    # /usr/platform/SUNW,Sun-Fire-880/lib/i2cparam.conf file
    # retry value for checking power supply hotplug status
    n_retry_pshp_status 5
    retry_sleep_pshp_status 1
    # retry value for detecting overcurrent
    n_read_overcurrent 5
    # retry value for detecting failed devices
    n_retry_devicefail 5
    retry_sleep_devicefail 1
    # retry value for detecting fan faults
    n_read_fanfault 5
    # retry value for detecting power supply hotplug
    n_retry_pshp 5
    retry_sleep_pshp 1
    # retry value for detecting disk faults
    n_retry_diskfault 5
    retry_sleep_diskfault 1
    # retry value for out of temperature range shutdowns
    n_retry_temp_shutdown 5
    retry_sleep_temp_shutdown 1
    # end of /usr/platform/SUNW,Sun-Fire-880/lib/i2cparam.conf file

The V480-specific "i2cparam.conf" file lives in "/usr/platform/SUNW,Sun-Fire-480R/lib" and the V490-specific "i2cparam.conf" file lives in "/usr/platform/SUNW,Sun-Fire-V490/lib".
Both platforms use identical platform-specific sensors, so use this same example:

    # /usr/platform/SUNW,Sun-Fire-480R/lib/i2cparam.conf file
    # retry value for fan status
    n_retry_fan 5
    retry_sleep_fan 1
    # retry value for power supply status
    n_retry_ps_status 5
    retry_sleep_ps_status 1
    # retry value for detecting power supply hotplug
    n_retry_pshp 5
    retry_sleep_pshp 1
    # retry value for detecting disk hotplug
    n_retry_diskhp 5
    retry_sleep_diskhp 1
    # retry value for out of temperature range shutdowns
    n_retry_temp_shutdown 5
    retry_sleep_temp_shutdown 1
    # retry value for detecting fsp faults
    n_retry_fsp_fault 5
    retry_sleep_fsp_fault 1
    # end of /usr/platform/SUNW,Sun-Fire-480R/lib/i2cparam.conf file

The 280R-specific "i2cparam.conf" file lives in "/usr/platform/SUNW,Sun-Fire-280R/lib". Here is an example that documents all of the possible 280R-specific retry values:

    # /usr/platform/SUNW,Sun-Fire-280R/lib/i2cparam.conf file
    # retry value for reading temperature
    n_retry_temp 5
    retry_sleep_temp 1
    # retry value for detecting hotplug events
    n_retry_hotplug 5
    retry_sleep_hotplug 1
    # end of /usr/platform/SUNW,Sun-Fire-280R/lib/i2cparam.conf file

5. Based on the sensors noted in step 3 that are still reporting false events, change the N retry value to 8 in either the sun4u or the platform-specific tuning file, whichever applies to that sensor event type. For example, if the V880 is still reporting false fan tray hotplug, PSU hotplug and temperature reading events after applying the patch, modify the appropriate values as follows:

    change "n_read_temp" value from "5" to "8"
    change "n_retry_fan_hotplug" value from "5" to "8"
    change "n_retry_pshp" value from "5" to "8"

   Most systems will not need tuning of the retry sleep values. However, for some systems with heavy application CPU or I/O usage, the retry sleep value for these events should also be increased to reduce the impact of picld retries on the performance of the system's applications. It is recommended to try doubling this in line with the N-retries value, changing from 1 second to 2 seconds, from 2 seconds to 4 seconds, and so on for subsequent tuning changes. If the values used are excessive, then you may see more of the "PSVC application death detected" messages noted below in Step 7.

6. Restart picld with the tuning files in place; picld will read the files and use these values.

   a) For Solaris 8 or 9, use the following command as root:

      # /etc/init.d/picld start

   b) For Solaris 10 or later with the Solaris service management facility, use the following command as root:

      # svcadm enable picl

7. Monitor the system for further false positive events during peak load periods where prior false positives may have occurred. If no further false events occur during these peaks, then do no additional tuning. If picld reports a future failure that repeats consistently and follows a FRU or a known event occurring, then this will be a real FRU failure or event detection. If further events occur for the sensors tuned in Step 5 at the times when events previously occurred, then it is recommended to:
For example, if, for the V880 events tuned in the Step 5 example, temperature and PSU hotplug events still occur but fan hotplug events are no longer observed, then change only the "n_read_temp" and "n_retry_pshp" values from "8" to "16".

It is critical that the system be continuously monitored for further false events and tuned accordingly. By monitoring the system, tuning as needed, and then re-monitoring and re-tuning, this patch should allow you and/or the customer to eliminate or minimize the problem. You will always need to restart picld after making any changes to the tunables.

Due to the additional system overhead of doing the extra retries on each sensor in each event class that is tuned, it is NOT recommended to tune to a high value immediately, or to tune events for which no false positives have been recorded. There is no maximum value for retries. However, an inappropriately high retry value may delay picld's response to real events, so that other service or administration actions take longer than they would on an untuned system; e.g. keyswitch changes or disk hotplug events may not be seen for several minutes after the event occurs.

In some cases, if the retry value is too high for the system load, the policy may never be runnable, because application activity and other picl sensor readings occur in a cyclic fashion and the retries need more CPU time than is available between runs. This results in periodic daemon errors similar to the following, which can be ignored as an artifact of the heavily loaded system:

    Jan 4 06:57:15 hostname picld[5433]: [ID 230523 daemon.error] PSVC application death detected
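The escalation described above (raise only the parameters whose events are still being reported, from 5 to 8 and then from 8 to 16) can be sketched as a small text transformation. This is an illustration only and is not part of the patch: the helper names `next_retry_value` and `bump_params` are hypothetical, and the rule "go to 8 first, then double" is just one reading of the example values given in this document.

```python
def next_retry_value(current):
    """Return the next retry count to try: 8 first, then double (8 -> 16 -> 32)."""
    # Assumption: mirrors this document's examples (5 -> 8, then 8 -> 16).
    return 8 if current < 8 else current * 2


def bump_params(conf_text, names):
    """Raise the retry value of the named parameters in i2cparam.conf text.

    Comment lines, blank lines, and parameters not listed in `names`
    are passed through unchanged.
    """
    out = []
    for line in conf_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] in names and fields[1].isdigit():
            out.append("%s %d" % (fields[0], next_retry_value(int(fields[1]))))
        else:
            out.append(line)
    return "\n".join(out) + "\n"


sample = """# retry value for reading temperature
n_read_temp 8
# retry value for detecting hotplug, fan
n_retry_fan_hotplug 8
retry_sleep_fan_hotplug 1
"""

# Temperature events still occur; fan hotplug events no longer do.
print(bump_params(sample, {"n_read_temp"}))
```

The result would still have to be written back to the sun4u or platform-specific file that owns the parameter, and picld restarted, before the new values take effect.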
NOTE: There is a way of running picl that displays all of the retry parameters currently being used, including the defaults when no "i2cparam.conf" files exist yet. This is useful when creating new "i2cparam.conf" files (i.e. copy & paste) and/or when the retry parameter documentation is not readily available. This works for Solaris 8, 9 and 10.

1) Stop picld (i.e., /etc/init.d/picld stop or svcadm disable -t picl).

2) Run picl directly using the "-i -v 100" command line options.

   "-i" means to run interactively
   "-v 100" means to be very verbose

   Example:

    host-root## ps -ef | grep picld
    root 56 1 0 Feb 27 ? 0:00 /usr/lib/picl/picld
    root 4440 340 0 15:30:14 console 0:00 grep picld
    host-root## /etc/init.d/picld stop
    host-root## ps -ef | grep picld
    root 4448 340 0 15:32:43 console 0:00 grep picld
    host-root## cd /usr/lib/picl
    host-root## ./picld -i -v 100
    ...
    # No /usr/platform/sun4u/lib/i2cparam.conf file, using defaults
    n_read_temp 5
    n_retry_keyswitch 5
    retry_sleep_keyswitch 1
    n_retry_hotplug 5
    retry_sleep_hotplug 1
    n_retry_fan_hotplug 5
    retry_sleep_fan_hotplug 1
    n_retry_fan_present 5
    retry_sleep_fan_present 1
    # No /usr/platform/SUNW,Sun-Fire-V890/lib/i2cparam.conf file, using defaults
    n_retry_pshp_status 5
    retry_sleep_pshp_status 1
    n_read_overcurrent 5
    n_retry_devicefail 5
    retry_sleep_devicefail 1
    n_read_fanfault 5
    n_retry_pshp 5
    retry_sleep_pshp 1
    n_retry_diskfault 5
    retry_sleep_diskfault 1
    n_retry_temp_shutdown 5
    retry_sleep_temp_shutdown 1
    ...

3) Stop picl with Ctrl-C (^C), then restart it with either svcadm or the init.d script.

There will be no further changes to picld to modify the behavior or nature of this fix until such time as Predictive Self-Healing (PSH) on Solaris 10 or later has the algorithmic smarts to understand and filter the false values for I2C data sensors. There are currently no engineering plans to address this scenario within the PSH framework.
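The copy-and-paste approach from the NOTE above could also be scripted. The sketch below splits captured verbose output into one parameter list per "i2cparam.conf" path. It assumes the output has one parameter per line and uses the "# No <path> file, using defaults" header shown in the example; what picld prints when a tuning file already exists is not documented here, so that case is not handled. The function names `split_defaults` and `render` are hypothetical.

```python
import re

# Assumption: header and parameter lines look exactly like the example output.
HEADER = re.compile(r"^# No (\S+) file, using defaults$")
PARAM = re.compile(r"^(n_\w+|retry_sleep_\w+)\s+(\d+)$")


def split_defaults(verbose_output):
    """Group 'name value' lines under the i2cparam.conf path they belong to."""
    files = {}
    current = None
    for raw in verbose_output.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            current = header.group(1)
            files[current] = []
            continue
        param = PARAM.match(line)
        if param and current is not None:
            files[current].append((param.group(1), int(param.group(2))))
    return files


def render(path, params):
    """Format one group as i2cparam.conf text with a leading comment line."""
    lines = ["# %s file" % path]
    lines += ["%s %d" % (name, value) for name, value in params]
    return "\n".join(lines) + "\n"


captured = """...
# No /usr/platform/sun4u/lib/i2cparam.conf file, using defaults
n_read_temp 5
n_retry_keyswitch 5
retry_sleep_keyswitch 1
# No /usr/platform/SUNW,Sun-Fire-V890/lib/i2cparam.conf file, using defaults
n_retry_pshp 5
retry_sleep_pshp 1
...
"""

for path, params in split_defaults(captured).items():
    print(render(path, params))
```

Each rendered block is only a starting point with the default values; it would be reviewed, saved under the path named in its first line, and then tuned as described in the Resolution steps.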
Previously Published As 102279

Internal Contributor/submitter oliver.sharwood@sun.com
Internal Eng Business Unit Group KE Authors
Internal Eng Responsible Engineer justin.frank@sun.com
Internal Services Knowledge Engineer sean.hassall@sun.com
Internal Resolution Patches 111792-15, 118558-22, 121944-01
Internal Kasp FAB Legacy ID 102279
Internal Sun Alert & FAB Admin Info
Critical Category:
Significant Change Date:
Avoidance: Patch
Responsible Manager: null
Original Admin Info: null
Internal SA-FAB Eng Submission Noisy I2C Bus causes picld to report false environmental FRU failures.

Product_uuid
296f2476-0a18-11d6-86cf-c8096baa086c|Sun Fire 280R Server
29726712-0a18-11d6-8636-c7e996b581dc|Sun Fire V880 Server
5c71fc02-5e51-11d7-8add-8938754df22a|Sun Fire V490 Server
5d2816fe-5e51-11d7-8de2-d7bc0dd226fc|Sun Fire V890 Server
a2b9bc2b-52c6-45c2-a3e0-f19bd2c86953|Sun Fire V480 Server

Attachments
This solution has no attachment