Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Solution Type FAB (standard) Sure
Solution 1020827.1 : Intermittent Sun Fire X4500 system hangs with watchdog timeouts.
PreviouslyPublishedAs 265588
Bug Id <SUNBUG: 6746949>
Product Sun Fire X4500 Server
Date of Resolved Release 12-Aug-2009

Intermittent Sun Fire X4500 system hangs with watchdog timeouts (see details below).

Affected Parts:
371-0856-xx 2.6GHz Dual Core CPU, AMD Opteron 285 E6 Stepping (95Watt), RoHS:Y
371-1779-xx 2.8GHz Dual Core CPU, AMD Opteron 290 E6 Stepping, RoHS:YL

Impact
System hangs have been observed in certain workloads and 2P configurations with AMD Opteron processors (Rev E) from the 0Fh family, revision E6.

Contributing Factors
Sun Fire X4500 systems that contain either of the Affected Parts listed above and run Solaris 10 U4-U7 (ZFS) are impacted by this issue.

Symptoms
The expected behavior is a hard system hang that requires a power cycle to reset. HDT cannot break into the CPUs for analysis. At times the system becomes sluggish, responding to only a few commands before the hard hang occurs.
The SEL log will show nothing, since the system has frozen. A sync flood reset cannot happen, and the BIOS cannot report anything.

Root Cause
Debug information has shown that a probe message has hung within the CPU. While a definitive root cause is not known at this time, evidence points to possible contention between the TLB miss resolution hardware and a probe as the cause of the hang. This debug information is further backed up by experimental evidence that the hang does not occur when the workaround is applied.

Corrective Action
Workaround:
Upgrade to SW 1.6 (or greater). If a customer experiences a Watchdog Time Out hang, use the following workaround:
The AMD recommendation is to disable caching of page table data in the L2 cache. The default BIOS setting enables TLB caching. Under limited testing, a 3 to 5% performance loss was observed with caching disabled.
SW 1.6 (x4500) - BIOS 0ABIG024
BIOS setup option: F2 --> Advanced --> CPU Configuration --> Force TLB Caching disabled = Enabled

Resolution:
The field should escalate the case to TSC if the workaround is not acceptable to the customer.
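
The affected-part and Solaris-release criteria above can be expressed as a quick triage check. The following is a minimal Python sketch, not a Sun-provided tool: the function names, the assumption that CPU part numbers are available as plain strings such as "371-0856-01", and the use of an integer for the Solaris 10 update level are all hypothetical. Only the two base part numbers and the U4-U7 range come from this bulletin.

```python
import re

# Base part numbers taken from the "Affected Parts" list in this bulletin.
AFFECTED_BASE_PARTS = {"371-0856", "371-1779"}

# Assumed input format: a base part number such as "371-0856", optionally
# followed by a two-character dash level such as "-01" (the "-xx" suffix).
_PART_RE = re.compile(r"^(\d{3}-\d{4})(?:-\w{2})?$")


def is_affected_part(part_number: str) -> bool:
    """Return True if the part number matches one of the affected CPU FRUs."""
    match = _PART_RE.match(part_number.strip())
    return bool(match) and match.group(1) in AFFECTED_BASE_PARTS


def is_impacted(cpu_part_numbers, solaris10_update: int) -> bool:
    """Apply the bulletin's criteria: an affected CPU and Solaris 10 U4-U7.

    How the part numbers and update level are collected is outside the
    scope of this sketch; both inputs are supplied by the caller.
    """
    has_affected_cpu = any(is_affected_part(p) for p in cpu_part_numbers)
    return has_affected_cpu and 4 <= solaris10_update <= 7
```

For example, `is_impacted(["371-0856-02", "371-0856-02"], 5)` evaluates to true, while the same CPUs with an update level of 8 fall outside the range stated in Contributing Factors.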

Identification of Affected Parts (how to):
The Sun Fire X4500 has two CPU FRUs:
371-0856 2.6GHz Dual Core CPU, AMD Opteron 285 E6 Stepping (95Watt), RoHS:Y
371-1779 2.8GHz Dual Core CPU, AMD Opteron 290 E6 Stepping, RoHS:YL
Both of these CPUs are Rev E, E6 Stepping.

Comments
This issue was evaluated by the Sun Alert PMO and found not to meet criteria.

References:
Escalation ID: 1-24213193, 1-24512721
Resolution Patches: SW 1.6
Related URL(s): http://www.sun.com/servers/x64/x4500/downloads.jsp

For information about FAB documents, its release processes, implementation strategies and billing information, go to the following URL:
For Sun Authorized Service Providers go to:
In addition to the above you may email:

Internal Contributor/submitter Greg.Huff@Sun.COM
Internal Eng Responsible Engineer Michael.Louie@Sun.COM
Internal Services Knowledge Engineer Joe.Davis@Sun.COM
Internal Eng Business Unit Group SSG WGS (Workgroup Systems)

Internal Sun Alert & FAB Admin Info
07-Aug-2009: Completed draft and sent to Extended Review.
12-Aug-2009: No feedback from Ext Rvw - sending to Publish.
19-Nov-2009: Corrected Product Name to swoRDFish inconsistency.

Attachments
This solution has no attachment