Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-73-1019279.1
Update Date: 2010-08-17
Keywords:

Solution Type  FAB (standard) Sure

Solution  1019279.1 :   False CPU Thermal Trip errors are being seen on Sun Blade X6250.  


Related Items
  • Sun Blade X6250 Server Module
Related Categories
  • GCS>Sun Microsystems>Sun FAB>Standard>Reactive

PreviouslyPublishedAs
238104


Bug Id
<SUNBUG: 6671702>, <SUNBUG: 6691825>

Product
Sun Blade X6250 Server Module

Date of Resolved Release
23-May-2008

Thermal Trip errors on Sun Blade X6250 Modules (see details below).

Affected X-Options:

X4511A   1.86 GHz CPU, Xeon E5320 Quad-Core, 80W
X4512A   1.60 GHz CPU, Xeon L5310 Quad-Core, 50W
X4513A   2.33 GHz CPU, Xeon E5345 Quad-Core, 80W
X4514A   2.66 GHz CPU, Xeon X5355 Quad-Core, 120W
X4515A   3.00 GHz CPU, Xeon X5365 Quad-Core, 120W
X4517A   2.50 GHz CPU, Xeon L5420 Quad-Core, 50W
X4518A   2.33 GHz CPU, Xeon E5410 Quad-Core, 80W
X4519A   2.83 GHz CPU, Xeon E5440 Quad-Core, 80W
X4520A   3.16 GHz CPU, Xeon X5460 Quad-Core, 120W

Affected Parts:

All Intel 5300/5400 series (Clovertown/Harpertown) CPUs are affected.

Impact

Systems are reporting false Thermal Trip errors on CPU modules.  On some systems the false errors appear only in the ELOM SEL log; on others the CPU is also falsely shown as disabled in the ELOM CLI/GUI.  Note that when this issue is the cause, all of the signs described above are false, and the host OS still shows all CPUs and cores online and functioning normally.

To help distinguish a real event from a false one, the paragraph below describes how a system behaves when a real Thermal Event has taken place.

As far as the ELOM is concerned, the error messages for a real event are the same as for a false Thermal Trip.  The difference is that in a real event the host system reboots right after the time of the Thermal Trip error, not before.  In the false events, the Thermal Trip errors occur just after a normal reboot of the host, or randomly while the host system is up and running with no associated reboot.  In a real event the host reboots unexpectedly, and the OS will not show the CPUs as available if the CPU was truly disabled by the BIOS.
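
The timing rule described above can be sketched in a few lines of Python.  This is only an illustration, not a Sun-provided tool: the function name classify_trip, the five-minute window, and the result strings are all assumptions chosen for the example, and the reboot times would have to be collected from the host separately.

```python
from datetime import datetime, timedelta

# Hypothetical window: how soon after a trip a reboot must follow
# to count as "right after" the Thermal Trip error.
REBOOT_WINDOW = timedelta(minutes=5)


def classify_trip(trip_time, reboot_times, os_shows_all_cpus, window=REBOOT_WINDOW):
    """Apply the rule of thumb from this document to one Thermal Trip event.

    A trip is treated as possibly real if the host rebooted shortly AFTER
    the trip, or if the host OS no longer shows all CPUs. A reboot that
    happened before the trip does not count.
    """
    rebooted_after = any(
        timedelta(0) <= reboot - trip_time <= window for reboot in reboot_times
    )
    if rebooted_after or not os_shows_all_cpus:
        return "possibly real"
    return "likely false"


if __name__ == "__main__":
    # Trip timestamp taken from the SEL example in this document;
    # the reboot time is made up for illustration.
    trip = datetime(2008, 3, 6, 15, 4, 1)
    reboot_before_trip = [datetime(2008, 3, 6, 15, 3, 30)]
    print(classify_trip(trip, reboot_before_trip, os_shows_all_cpus=True))
```

A "possibly real" result only means the event deserves a closer look; it does not by itself confirm a hardware fault.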

This has become a service issue because engineers have been replacing Blades incorrectly due to these false failure signs.  This has in turn caused a shortage of X6250 Blades in some Regions.

Contributing Factors

All Sun Blade X6250 Server Modules running firmware/BIOS earlier than SW1.3 are impacted by this issue.

Symptoms

Example SEL log error:

   Nonrecoverable ,2008/03/06 15:04:01 ,Processor 0 thermal trip detected

Example ipmitool sel elist output:

   # ipmitool -H x.x.x.x -U root sel elist
     53 | 12/05/2007 | 14:15:43 | Processor Processor 0 | Thermal Trip | Asserted
     8b | 01/28/2008 | 08:46:14 | Processor Processor 0 | Thermal Trip | Asserted
     c6 | 02/26/2008 | 11:47:15 | Processor Processor 0 | Thermal Trip | Asserted
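
When many blades need to be checked, the Thermal Trip entries can be pulled out of saved sel elist output with a short script.  The following Python sketch assumes the six-field, pipe-separated layout shown in the example above; the function names parse_sel_elist and thermal_trips are hypothetical, and lines that do not match the layout are simply skipped.

```python
from datetime import datetime

# Sample text copied from the ipmitool example in this document.
SAMPLE = """\
  53 | 12/05/2007 | 14:15:43 | Processor Processor 0 | Thermal Trip | Asserted
  8b | 01/28/2008 | 08:46:14 | Processor Processor 0 | Thermal Trip | Asserted
  c6 | 02/26/2008 | 11:47:15 | Processor Processor 0 | Thermal Trip | Asserted
"""


def parse_sel_elist(text):
    """Parse saved `ipmitool sel elist` output into a list of dicts."""
    events = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 6:
            continue  # not an event line in the assumed layout
        record_id, date, time, sensor, event, state = fields
        try:
            when = datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M:%S")
        except ValueError:
            continue  # timestamp is not in MM/DD/YYYY HH:MM:SS form
        events.append(
            {"id": record_id, "time": when, "sensor": sensor,
             "event": event, "state": state}
        )
    return events


def thermal_trips(events):
    """Keep only asserted Thermal Trip events."""
    return [e for e in events
            if e["event"] == "Thermal Trip" and e["state"] == "Asserted"]


if __name__ == "__main__":
    for trip in thermal_trips(parse_sel_elist(SAMPLE)):
        print(trip["id"], trip["time"].isoformat(), trip["sensor"])
```

The resulting timestamps can then be compared against the host's reboot history to judge whether each event is real or false.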

On systems that show CPUs disabled within the ELOM, you will see:

   -> show CPU0

   /SYS/CPU/CPU0
       Targets:

       Properties:
           Designation = CPU 0
           Manufacturer = Intel
           Name = Clovertown
           Speed = 2333MHz
           Status = disabled
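
If the output of the show command has been captured to a file, the Status property can be read programmatically.  The Python sketch below assumes the "Name = value" layout shown above; the helper name parse_properties is hypothetical and not part of any ELOM tooling.

```python
# Sample text copied from the ELOM example in this document.
SAMPLE = """\
-> show CPU0

/SYS/CPU/CPU0
    Targets:

    Properties:
        Designation = CPU 0
        Manufacturer = Intel
        Name = Clovertown
        Speed = 2333MHz
        Status = disabled
"""


def parse_properties(text):
    """Collect every 'Name = value' line of captured ELOM output into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # only lines that actually contain '='
            props[key.strip()] = value.strip()
    return props


if __name__ == "__main__":
    props = parse_properties(SAMPLE)
    if props.get("Status") == "disabled":
        # Per this document, a disabled status in the ELOM alone is not proof
        # of a real fault; confirm against what the host OS reports.
        print("ELOM reports", props.get("Designation"), "as disabled")
```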

Root Cause

There are two root causes of this issue.  The first, which covers Thermal Trips on Clovertown CPUs, is an ELOM firmware issue.  During power on/off of the system there is some signal noise on LN93, which the ELOM reads as a false thermal trip.  To resolve this, temperature judgement has been added to the ELOM to fix the false thermal trip readings.
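
The actual ELOM firmware logic is not described in this document.  Purely as an illustration of what a temperature judgement could look like, the Python sketch below accepts a trip signal only when the measured CPU temperature is also high enough to make a trip plausible.  The function name and the 90 degree threshold are assumptions made for the example, not values taken from the firmware.

```python
# Hypothetical threshold in degrees Celsius; the real firmware value is unknown.
TRIP_PLAUSIBLE_TEMP_C = 90.0


def confirmed_thermal_trip(trip_signal_asserted, cpu_temp_c,
                           threshold_c=TRIP_PLAUSIBLE_TEMP_C):
    """Report a thermal trip only if the signal AND the temperature agree.

    A trip signal seen while the CPU is cool (for example, noise during
    power on/off) is rejected instead of being logged as a thermal trip.
    """
    return trip_signal_asserted and cpu_temp_c >= threshold_c


if __name__ == "__main__":
    # Noise on the trip line while the CPU reads 45 C: rejected.
    print(confirmed_thermal_trip(True, 45.0))
    # Trip line asserted while the CPU reads 95 C: accepted.
    print(confirmed_thermal_trip(True, 95.0))
```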

The second, which covers Thermal Trips on Harpertown CPUs, is a BIOS issue.  Recent changes to the ELOM command structure have left the BIOS out of sync with that structure, which causes the CPUs to show as disabled.  This will be resolved by syncing the BIOS with the ELOM command structure.

Corrective Action

Workaround:

If the Thermal Trip errors being experienced are determined to be false, they can be ignored, and no action should be taken until the SW release resolving this issue is available.

Resolution:

This issue will be resolved in the next firmware/BIOS release, SW1.3.

References:

Escalation ID:  65875681, 1-23543493



Internal Contributor/submitter
Michael.Tabor@Sun.COM

Internal Eng Responsible Engineer
Gyanesh.Sharma@Sun.COM Responsible Manager: Subban.Raghunathan@Sun.COM

Internal Services Knowledge Engineer
Joe.Davis@Sun.COM

Internal Eng Business Unit Group
NSG (Network Systems Group)

Internal Sun Alert & FAB Admin Info
21-May-2008: Completed draft and sent to Extended Review.
23-May-2008: Incorporated feedback from Ext Rvw and sending to Publish.
17-Dec-2009: Replaced Product with Swordfish Nomenclature


Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.