Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type FAB (standard) Sure
Solution 1222158.1 : NEMHydra's Main Power Shuts Down Unexpectedly
Oracle Confidential (PARTNER). Do not distribute to customers
Applies to:
Sun Blade 6000 System - Version: Not Applicable to Not Applicable - Release: N/A to N/A
Exadata Database Machine X2-8 - Version: Not Applicable to Not Applicable [Release: N/A to N/A]
Information in this document applies to any platform.

Escalation ID: 41143985

Affected Parts: (FRU/CRU Part Number / Description)
540-7695 - 16-Port Virtualized Multi-Fabric Network Express Module (X4238)

Symptoms
Sun Blade 6000 SPARC blades will panic, and x86 blades will lose communication with the affected NEM.

Example panic string:
    panic[cpu100]/thread=2a104663ca0:

Check the FMA errors after the blade reboots. Look for the following signature to determine whether the NEM went through a surprise down event because it processed a KILLALL signal. The FMA event will be on one of the NEM modules:
    grep pcie_ue_status */fma/*fmdump*

Impact
The main power is turned off, causing the NEMHydra to power off. The blades' OS then reacts to the loss of a network device.

Changes

Contributing Factors
Sun Blade 6000 Virtualized Multi-Fabric 10GbE Network Express Module. Increased I2C activity could result in a corrupted read/write to the ADM1066. The ADM1066 is a standalone power sequencer and monitoring device that monitors multiple voltage rails and is also responsible for initiating power-down of the NEM with the KILLALL signal. The NEM contains two of these ADM1066 devices.

Cause

Root Cause
Because this failure cannot be reproduced consistently, we do not know which device asserts KILLALL on the NEM's ADM1066. The SAS expander is a possible suspect: by design it asserts KILLALL when the I2C temperature readings (ambient and junction) exceed 75C and 120C, respectively. Although these temperatures were never observed in a failing environment, a corrupted temperature read could have this effect. With the KILLALL signal blocked on the ADM1066, the SAS expander can no longer shut down the NEM because of a false overtemp reading. However, the SAS expander will still turn on the LED when the warning threshold (65C, 100C) is actually reached. In addition, when a real NEM overtemp occurs, the voltage would increase, and the main power sequencer will detect it and shut down the NEM.

Solution

Workaround
No workaround available - see Resolution section.

Resolution
The final resolution to this issue is to update the NEMHydra's main power sequencer firmware to prevent the KILLALL signal from being asserted on the ADM1066.
The firmware is available from the installation section of SW2.2. Download the Sun Blade 6000 Virtualized Multi-Fabric 10GbE Network Express Module Software 2.2 from:
https://cds.sun.com/is-bin/INTERSHOP.enfinity/WFS/CDS-CDS_SMI-Site/en_US/-/USD/ViewProductDetail-Start?ProductRef=SB6000VMF10GbESW-2.2-OTH-G-F@CDS-CDS_SMI&ProductUUID=aFqJ_hCyhJwAAAEpHzZudFxG&ProductID=aFqJ_hCyhJwAAAEpHzZudFxG&Origin=ViewProductDetail-Start&ERROR_User=UserNotLoggedIn

Refer to the attached document for the Power Sequencer firmware update instructions. The update requires ILOM "escalation mode" and an FE onsite to perform the instructions.

References
For information about FAB documents, their release processes, implementation strategies and billing information, go to the following URL:
* http://tns.central/fab
In addition to the above you may email:
* FAB-Manager@sun.com

Contacts
Contributor: daniel.p.lord@oracle.com
Responsible Engineer: richard.j.li@oracle.com
Responsible Manager: david.mullenex@oracle.com
Business Unit Group: Systems Group-x64 (X4100-X4600 (and M2), V20z/V40z/V60z/V65z, @Ultra20/40 (and M2) Workstations), Systems Group-SVS (SPARC Volume Systems, Horizontal @Systems,(includes T2000/Ontario)

Attachments
This solution has no attachment
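
Supplementary sketch for the FMA check under Symptoms: when several NEM data collections are unpacked side by side, it can help to list only the fmdump files that mention pcie_ue_status instead of printing every matching line. The function name find_pcie_ue_files and its optional directory argument are hypothetical and used here for illustration only. The */fma/*fmdump* layout is taken from the grep command in Symptoms. The sketch does not decode the status value; it only narrows down which NEM reported the event.

```shell
#!/bin/sh
# Hypothetical helper (not part of any Sun/Oracle tool): print the names of
# fmdump files under BASE_DIR/*/fma/ that contain "pcie_ue_status".
# BASE_DIR defaults to the current directory, matching the original grep.
find_pcie_ue_files() {
    base_dir="${1:-.}"
    # -l prints only the names of files that contain the pattern;
    # errors from an unmatched glob are discarded.
    grep -l pcie_ue_status "$base_dir"/*/fma/*fmdump* 2>/dev/null
}
```

Calling find_pcie_ue_files with no argument from the directory that holds the collected data is intended to cover the same files as the original grep. An empty result means no file matched.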