Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-73-1020827.1
Update Date: 2010-08-27
Keywords:

Solution Type  FAB (standard) Sure

Solution  1020827.1 :   Intermittent Sun Fire X4500 system hangs with watchdog timeouts.  


Related Items
  • Sun Fire X4500 Server
Related Categories
  • GCS>Sun Microsystems>Sun FAB>Standard>Reactive

PreviouslyPublishedAs
265588


Bug Id
<SUNBUG: 6746949>

Product
Sun Fire X4500 Server

Date of Resolved Release
12-Aug-2009

Intermittent Sun Fire x4500 system hangs with watchdog timeouts (see details below).

Affected Parts:

371-0856-xx   2.6GHz Dual Core CPU, AMD Opteron 285 E6 Stepping (95Watt), RoHS:Y
371-1779-xx   2.8GHz Dual Core CPU, AMD Opteron 290 E6 Stepping, RoHS:YL

Impact

System hangs have been observed under certain workloads in 2P configurations with AMD Opteron Rev E processors (family 0Fh, revision E6).

Contributing Factors

Sun Fire X4500 systems that contain either of the Affected Parts listed above and run Solaris 10 U4-U7 (ZFS) are impacted by this issue.
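
The two contributing factors above can be combined into a simple triage check. The following is a minimal sketch, assuming the CPU part number is available as a string such as 371-0856-02 and the Solaris 10 update level as an integer. The function name and input formats are hypothetical and are not part of any Sun tool, and the sketch does not check whether ZFS is in use.

```python
# Hypothetical triage helper; names and input formats are assumptions.
AFFECTED_PART_PREFIXES = ("371-0856", "371-1779")  # from the Affected Parts list
AFFECTED_SOLARIS10_UPDATES = range(4, 8)           # Solaris 10 U4 through U7


def is_exposed(cpu_part_number: str, solaris10_update: int) -> bool:
    """Return True when both contributing factors in this FAB apply."""
    part = cpu_part_number.strip()
    # The dash level (-xx) is ignored; only the base part number is compared.
    has_affected_cpu = part.startswith(AFFECTED_PART_PREFIXES)
    has_affected_os = solaris10_update in AFFECTED_SOLARIS10_UPDATES
    return has_affected_cpu and has_affected_os
```

For example, is_exposed("371-0856-02", 5) is true, while the same part on Solaris 10 U8 is outside the range stated in this document.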

Symptoms

The expected behavior is a hard system hang that requires a power cycle to recover.  HDT cannot break into the CPUs for analysis.  At times the system becomes sluggish, responding to a few commands before the hard hang occurs.

The SEL log will show nothing, since the system has frozen.  A sync flood reset cannot occur, and the BIOS cannot report anything.

Root Cause

Debug information shows that a probe message hung within the CPU.  While a definitive root cause is not known at this time, evidence points to contention between the TLB miss resolution hardware and a probe as the cause of the hang.  This is further supported by experimental evidence that the hang does not occur when the workaround is applied.

Corrective Action

Workaround:
 
Upgrade to SW 1.6 (or greater).  If a customer experiences a Watchdog Time Out hang, use the following workaround:
 
The AMD recommendation is to disable caching of page table data in the L2 cache.  The default BIOS setting enables TLB caching.  In limited testing, a 3 to 5% performance loss was observed with caching disabled.

  SW 1.6 (x4500) - BIOS 0ABIG024
 
  BIOS setup option:
 
  F2 --> Advanced
       --> CPU Configuration
       --> Force TLB Caching disabled = Enabled
 
Resolution:
 
The field should escalate the case to TSC if the workaround is not acceptable to the customer.
 
Identification of Affected Parts (how to):
 
The Sun Fire x4500 has two CPU FRUs:
     
  371-0856  2.6GHz Dual Core CPU, AMD Opteron 285 E6 Stepping (95Watt), RoHS:Y
  371-1779  2.8GHz Dual Core CPU, AMD Opteron 290 E6 Stepping, RoHS:YL
 
Both of these CPUs are Rev E, E6 stepping.
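
When the FRU label is not at hand, the Opteron model number in the CPU brand string can serve as a first-pass check. The following is a minimal sketch, assuming the brand string is available as text (for example from `psrinfo -pv` on Solaris). The exact string format and the function name are assumptions. A match only indicates an Opteron 285 or 290; the FRU part number should still be confirmed.

```python
import re

# Opteron model numbers of the two affected FRUs (from this FAB).
AFFECTED_OPTERON_MODELS = {"285", "290"}


def affected_models_in(cpu_text: str) -> set:
    """Return the affected Opteron model numbers mentioned in cpu_text."""
    # Matches e.g. "Opteron 290" or "Opteron(tm) Processor 285";
    # four-digit models such as 2218 are not matched.
    found = re.findall(
        r"Opteron(?:\(tm\))?\s+(?:Processor\s+)?(\d{3})\b", cpu_text
    )
    return AFFECTED_OPTERON_MODELS.intersection(found)
```

An empty result means no affected model number was found in the supplied text.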

Comments

This issue was evaluated by the Sun Alert PMO and found not to meet criteria.

References:
  
 Escalation ID: 1-24213193, 1-24512721
 Resolution Patches: SW 1.6
 Related URL(s):  http://www.sun.com/servers/x64/x4500/downloads.jsp



Internal Contributor/submitter
Greg.Huff@Sun.COM

Internal Eng Responsible Engineer
Michael.Louie@Sun.COM

Internal Services Knowledge Engineer
Joe.Davis@Sun.COM

Internal Eng Business Unit Group
SSG WGS (Workgroup Systems)

Internal Sun Alert & FAB Admin Info
07-Aug-2009: Completed draft and sent to Extended Review.
12-Aug-2009: No feedback from Ext Rvw - sending to Publish.
19-Nov-2009: Corrected Product Name to swoRDFish inconsistency.


Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.