Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-73-1222158.1
Update Date:2011-02-18

Solution Type  FAB (standard) Sure

Solution  1222158.1 :   NEMHydra's Main Power Shuts Down Unexpectedly  


Related Items
  • Exadata Database Machine X2-8
  • Sun Blade 6000 System
Related Categories
  • GCS>Sun Microsystems>Sun FAB>Standard>Reactive




In this Document
  Symptoms
  Changes
  Cause
  Solution


Oracle Confidential (PARTNER). Do not distribute to customers
Reason: FABs available to Internals and Partners only

Applies to:

Sun Blade 6000 System - Version: Not Applicable to Not Applicable - Release: N/A to N/A
Exadata Database Machine X2-8 - Version: Not Applicable to Not Applicable   [Release: N/A to N/A]
Information in this document applies to any platform.
__________



Escalation ID: 41143985
_________

Affected Parts: (FRU/CRU Part Number / Description)

540-7695 - 16-Port Virtualized Multi-Fabric Network Express Module (X4238)

Symptoms

SPARC blades in the Sun Blade 6000 will panic, and x86 blades will lose communication with the affected NEM.

Example panic string:
    panic[cpu100]/thread=2a104663ca0:
    Fatal error has occured in: PCIe fabric.(0x0)(0x41)

Check the FMA errors after the blade reboots. Look for the following signature to determine whether the panic was caused by a surprise down event on the NEM, which occurs when the NEM processes a KILLALL signal. The FMA event will be reported against one of the NEMs:

    grep pcie_ue_status */fma/*fmdump*

A value of pcie_ue_status = 0x20 indicates a surprise down event.
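
The signature check above can be sketched as a small script. This is a minimal sketch, assuming each matching line in the fmdump output has the form `pcie_ue_status = 0x20` as in the example; the function names are hypothetical. The value 0x20 corresponds to bit 5 (Surprise Down Error) of the PCIe Uncorrectable Error Status register, so the sketch tests that bit rather than comparing the whole value, in case other error bits are set in the same event.

```python
import re

# Bit 5 of the PCIe Uncorrectable Error Status register (Surprise Down Error).
SURPRISE_DOWN = 0x20

# Assumed line format, taken from the example above: "pcie_ue_status = 0x20".
_STATUS_RE = re.compile(r"pcie_ue_status\s*=\s*(0x[0-9a-fA-F]+)")


def ue_status_values(text):
    """Return every pcie_ue_status value found in the text, as integers."""
    return [int(m.group(1), 16) for m in _STATUS_RE.finditer(text)]


def has_surprise_down(text):
    """Return True if any pcie_ue_status value has the surprise down bit set."""
    return any(v & SURPRISE_DOWN for v in ue_status_values(text))
```

To use it, read the contents of a saved fmdump output file into a string and pass that string to `has_surprise_down`.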

Impact

The NEM's main power is turned off, which powers off the NEMHydra. The operating system on each blade then reacts to the loss of the network device.

Changes

Contributing Factors

Sun Blade 6000 Virtualized Multi-Fabric 10GbE Network Express Module.

Increased i2c activity could cause a corrupted read or write to the ADM1066.

The ADM1066 is a standalone power sequencer and monitoring device that monitors multiple voltage rails and is also in charge of initiating power down by asserting the KILLALL signal to the NEM. The NEM contains two of these ADM1066 devices.

Cause

Root Cause

Because this failure cannot be reproduced consistently, we do not know which device asserts KILLALL on the NEM's ADM1066. The SAS expander is a possible suspect because, by design, it asserts KILLALL when the i2c temperature readings exceed 75C (ambient) or 120C (junction). Although these temperatures were never observed in a failing environment, a corrupted temperature read could have the same effect.

By blocking the KILLALL signal on the ADM1066, the SAS expander can no longer shut down the NEM due to a false overtemp reading. However, the SAS expander will still turn on the LED when the warning threshold (65C ambient, 100C junction) is actually reached. Also, when a real NEM overtemp occurs, the voltage would increase and the main power sequencer will detect it and shut down the NEM.
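
The threshold behavior described above can be illustrated with a short sketch. It is not the SAS expander firmware: the function names, the integer Celsius encoding of the readings, and the exact comparison operators are all assumptions made for illustration. The example also shows how a single flipped bit in an otherwise normal reading could cross the shutdown threshold, which is the kind of corrupted read suspected here.

```python
# Thresholds in degrees Celsius, taken from the text above.
WARNING = {"ambient": 65, "junction": 100}   # LED is turned on
SHUTDOWN = {"ambient": 75, "junction": 120}  # KILLALL asserted (before the fix)


def classify(sensor, reading_c):
    """Classify one reading as 'ok', 'warning', or 'shutdown' (assumed logic)."""
    if reading_c > SHUTDOWN[sensor]:
        return "shutdown"
    if reading_c >= WARNING[sensor]:
        return "warning"
    return "ok"


def flip_bit(value, bit):
    """Simulate a one-bit corruption of a reading (hypothetical failure mode)."""
    return value ^ (1 << bit)


normal = 40                      # hypothetical healthy ambient reading
corrupted = flip_bit(normal, 6)  # 40 XOR 64 = 104
print(classify("ambient", normal))     # ok
print(classify("ambient", corrupted))  # shutdown
```

In this sketch a real 40C ambient reading is harmless, but the same reading with one corrupted bit is classified as a shutdown condition even though the hardware never got hot.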

Solution

Workaround

No workaround available - see Resolution section.

Resolution

The final resolution to this issue is to update the NEMHydra's main power sequencer firmware to prevent the KILLALL signal from being asserted on the ADM1066. The firmware is available from the installation section of SW2.2.

Download the Sun Blade 6000 Virtualized Multi-Fabric 10GbE Network Express Module Software 2.2 from:

https://cds.sun.com/is-bin/INTERSHOP.enfinity/WFS/CDS-CDS_SMI-Site/en_US/-/USD/ViewProductDetail-Start?ProductRef=SB6000VMF10GbESW-2.2-OTH-G-F@CDS-CDS_SMI&ProductUUID=aFqJ_hCyhJwAAAEpHzZudFxG&ProductID=aFqJ_hCyhJwAAAEpHzZudFxG&Origin=ViewProductDetail-Start&ERROR_User=UserNotLoggedIn

Refer to the attached document for the Power Sequencer firmware update instructions. The update requires ILOM "escalation mode" and an FE onsite to perform it.

References


For information about FAB documents, their release processes, implementation strategies and billing information, go to the following URL:

* http://tns.central/fab

In addition to the above, you may email:

* FAB-Manager@sun.com


Contacts

Contributor: daniel.p.lord@oracle.com
Responsible Engineer: richard.j.li@oracle.com
Responsible Manager: david.mullenex@oracle.com
Business Unit Group: Systems Group-x64 (X4100-X4600 (and M2), V20z/V40z/V60z/V65z, @Ultra20/40 (and M2) Workstations), Systems Group-SVS (SPARC Volume Systems, Horizontal @Systems,(includes T2000/Ontario)

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.