Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Solution Type  FAB (standard) Sure

Solution  1017422.1 :  Hardware: A limited number of AMD Opteron CPUs in X4100, X4200, X4500 and X4600 systems can cause unexpected system shutdown without warning or trace evidence.

PreviouslyPublishedAs  228519

Product
   Sun Fire X4100 Server
   Sun Fire X4600 Server
   Sun Fire X4200 Server
   Sun Fire X4500 Server

Bug Id
   <SUNBUG: 6439409> (X4100/X4200)
   <SUNBUG: 6515060> (X4600)

Date of Resolved Release  24-DEC-2006

Impact

A limited number of AMD Opteron Revision E CPUs manufactured prior to December 24, 2006 can, under specific conditions, generate a false internal temperature reading that causes the platform to power down without warning or trace evidence.

AMD CPUs contain an internal CPU thermal sensing circuit called the ThermSenseMacro (TSM). The TSM circuit is designed to protect the CPU and system from over-temperature conditions. A small number of AMD's single and dual core Revision E CPU TSMs (manufactured prior to week 52, 2006) may generate a false temperature reading above the 125C set point and induce a platform power down. AMD failure analysis data indicates a full-field Defects Per Million (DPM) rate of well below 500, that is, below 0.05%.

Contributing Factors

The issue is associated with operation at cooler CPU case temperatures combined with the execution of applications that generate high levels of floating point and memory access activity on AMD Opteron single and dual core Revision E Series 200 and 800 CPUs.

The following system types and part numbers could be impacted:

X4500 (has 2 CPUs):

   Sun P/N      AMD OPN       Description
   371-0856-01  OSA285FAA6CB  285 AMD Opteron Dual Core CPU (2.6 GHz) - E STEP

X4600 (has up to 8 CPUs):

   Sun P/N      AMD OPN       Description
   370-7961-01  OSA854FAA5BM  854 AMD Opteron CPU (2.8 GHz) - E STEP
   371-1759-01  OSA856FAA5BM  856 AMD Opteron CPU (3.0 GHz) - E STEP
   371-0291-01  OSA880FAA6CC  880 AMD Opteron Dual Core CPU (2.4 GHz) - E STEP
   371-1760-01  OSA885FAA6CC  885 AMD Opteron Dual Core CPU (2.6 GHz) - E STEP

X4100 and X4200 (have 1 or 2 CPUs):

   Sun P/N      AMD OPN       Description
   370-7711-01  OSA248FAA5BL  248 AMD Opteron CPU (2.2 GHz) - E STEP
   370-7937-01  OSA252FAA5BL  252 AMD Opteron CPU (2.6 GHz) non-RoHS - E STEP
   370-7272-01  OSA252FAA5BL  252 AMD Opteron CPU (2.6 GHz) - E STEP
   370-7934-01  OSA252FAA5BL  252 AMD Opteron CPU (2.6 GHz) RoHS - E STEP
   370-7962-01  OSA254FAA5BL  254 AMD Opteron CPU (2.8 GHz) - E STEP
   371-1776-01  OSA256FAA5BL  256 AMD Opteron (3.0 GHz) - E STEP
   370-7798-01  OSA265FAA6CB  265 AMD Opteron Dual Core CPU (1.8 GHz) - E STEP
   370-7799-01  OSA270FAA6CB  270 AMD Opteron Dual Core CPU (2.0 GHz) - E STEP
   370-7800-01  OSA275FAA6CB  275 AMD Opteron Dual Core CPU (2.2 GHz) - E STEP
   371-0839-01  OSA280FAA6CB  280 AMD Opteron Dual Core CPU (2.4 GHz) RoHS - E STEP 95 Watt
   370-7938-01  OSY280FAA6CB  280 AMD Opteron Dual Core CPU (2.4 GHz) - E STEP 120 Watt
   371-0856-01  OSA285FAA6CB  285 AMD Opteron Dual Core CPU (2.6 GHz) - E STEP 95 Watt
   371-0935-01  OSY285FAA6CB  285 AMD Opteron Dual Core CPU (2.6 GHz) - E STEP 120 Watt
   371-1779-01  OSA290FAA6CB  290 AMD Opteron Dual Core CPU (2.8 GHz) - E STEP 95 Watt

This FAB does not support CPU replacements for the V40z platform because that platform captures THERMTRIP events in the SEL, and because we have not had any confirmed false THERMTRIPs on the V40z platform.

Symptoms

The platform powers down without any warning and goes into standby mode.
The SP log or SEL entries will not have any indication of the cause of the power down (except for X4600 systems running BIOS 44).

Root Cause

An AMD CPU manufacturing test process did not adequately screen TSM faults. On December 24, 2006 (calendar week 52), AMD implemented a new manufacturing test process that separates out suspect CPUs.

A Regional Stocking Location (RSL) purge will not be implemented due to the extremely low potential for experiencing this issue.

Special Considerations:

There will be no charge to customers for any onsite activities or materials used related to this Field Action Bulletin.

Based on the successful CPU replacements at the five HPC sites, Sun field engineers are expected to replace CPUs for this FAB. CPU daughter card replacements are not funded or supported by AMD.

Replacement CPUs will not be stored at Sun RSLs. Instead, AMD will provide the logistics support and CPU shipments directly to and from customer sites. AMD will provide CPUs in support of this activity until April 30, 2008.

Resolution

Replacement Time Estimate: 15 minutes (per CPU)
Hot Swappable: No

Special Considerations:

Sun's Systems Group Quality Office, in advance of this FAB, actively engaged AMD to provide root cause. Replacement CPUs were provided to five high performance computing (HPC) grid accounts that experienced false THERMTRIP events. Affected CPUs were replaced by Sun field engineers.

The purpose of this FAB is to provide support for other customers who experience spurious power downs due to the false THERMTRIP event. Customers do not need to be under contract to have their product repaired if affected by this issue.

A BIOS upgrade for X4600 is available which includes diagnostics to confirm a THERMTRIP event. This FAB includes instructions for diagnosing X4600 systems both with and without the BIOS upgrade, should the customer decide not to upgrade to BIOS 44.

Final Resolution:

1. Verify that the system has reset to a standby condition.

2. If the system has rebooted, this is not a false THERMTRIP issue. Stop; this FAB does not apply to your event.

3. Check the SP logs for SEL entries by entering the following IPMI command:

      ipmitool -I lanplus -H <SP_IP address> -U root sel elist

   After pressing Return you will be prompted for the ILOM password.

4. Verify that the BIOS screen does not show 'log full'. If it does, the SP is unable to store new events, and the log cannot be used to diagnose whether you have had a false THERMTRIP event. To clear the log, go to the BIOS setup screen and follow the instructions to clear the log. The SP log can store a large number of entries and reaches this 'full' condition only under unusual conditions or multiple issues.

5. For X4100, X4200, X4500, and X4600 without the BIOS 44 upgrade:

   An SEL entry is not captured for THERMTRIP events on Galaxy platforms (with the exception of X4600 with the BIOS 44 upgrade, see step 6). A false THERMTRIP event can only be identified by ruling out other power-down events: true over-temperature conditions and manual power downs.

   Ask the customer if any manual intervention to power down the system occurred. If so, the SEL entries associated with the manual power down should be ignored, and the event should not be considered a false THERMTRIP event.

   Verify that the system did not experience a true over-temperature condition caused by the environment. In a true over-temperature condition, the system reacted properly and shut down as expected. This condition is not a false THERMTRIP, and therefore this FAB does not apply. The logs would show temperature thresholds being exceeded before the platform powered down. There are three thresholds: Upper Non Recoverable, Upper Critical and Upper Non Critical. You should see these thresholds in the SEL as they are exceeded.

   Note: if the SEL contains the following information, this FAB does not apply.
   Over-temperature Output Example:

      1f04 | 05/11/2007 | 11:10:38 | Temperature p0.t_core | Upper Critical going high | Reading 68 > Threshold 67 degrees C
      2004 | 05/11/2007 | 11:10:43 | Processor p0.fail | Predictive Failure Asserted
      2104 | 05/11/2007 | 11:11:51 | Temperature p1.t_core | Upper Critical going high | Reading 68 > Threshold 67 degrees C
      2204 | 05/11/2007 | 11:11:55 | Processor p1.fail | Predictive Failure Asserted
      2304 | 05/11/2007 | 11:12:31 | Temperature p0.t_core | Upper Non-recoverable going high | Reading 76 > Threshold 75 degrees C
      2404 | 05/11/2007 | 11:13:08 | Power Supply ps0.pwrok | State Deasserted **
      2504 | 05/11/2007 | 11:13:10 | Power Supply ps1.pwrok | State Deasserted **

      ** System has been forcefully shut down by the SP.

   Note: Time stamps in the SEL log are GMT by default.

   If you have ruled out a manual power down (verified by the customer) and a true over-temperature condition (as described in the example above), and there are no other conditions recorded in the SEL logs to explain the power down, then proceed with this FAB.

6. For X4600 with the optional BIOS 44 upgrade installed (available only on X4600):

   BIOS 44 (0ABHA044) provides an SEL entry for a THERMTRIP event (both false THERMTRIP and true over-temperature events).

   6.1. When a THERMTRIP error occurs, the system will power down by default.

   6.2. If the user powers on the system, the BIOS will detect the error and display an error message in three locations:

        1. POST: "A Thermal Event from SouthBridge occurred on last boot"
        2. DMI event log (in F2 Setup): "A Thermal Event from SouthBridge occurred on last boot"
        3. IPMItool: 1800 | 02/21/2007 | 11:04:42 | Processor | Thermal Trip | Asserted

   6.3. If you have the above POST/DMI/IPMItool messages and there were no other Service Processor (SP) warnings of an ambient over-temperature condition (a real THERMTRIP condition) just prior to the shutdown, then you have affected CPUs, which should be replaced after confirming the Datecode (see step 9.4).

7. Verify that your system has an affected AMD CPU by entering the following IPMI command:

      ipmitool -I lanplus -H <SP_IP address> -U root fru print

   After pressing Return you will be prompted for the ILOM password.

   Example of Output:

      FRU Device Description : p0.fru (ID 6)
      Product Manufacturer   : ADVANCED MICRO DEVICES
      Product Name           : DUAL CORE AMD OPTERON(TM) PROCESSOR 290
      Product Part Number    : 0F21
      Product Version        : 02

      FRU Device Description : p1.fru (ID 7)
      Product Manufacturer   : ADVANCED MICRO DEVICES
      Product Name           : DUAL CORE AMD OPTERON(TM) PROCESSOR 290
      Product Part Number    : 0F21
      Product Version        : 02

   Product Name should match one of the CPU numbers listed in the Contributing Factors section above.

8. Identification of Affected Parts before CPUs arrive onsite:

   8.1. Verify that the system symptoms match the "Final Resolution" requirements.

   8.2. Do not remove the heatsink or CPU until you have received replacement CPUs. To minimize the amount of downtime at the customer site, CPUs will be shipped in advance of opening the system.

        To request CPUs:

        1. Complete the 'TSM-CPU-Tracker' template located at...
           http://sdpsweb.central/FIN_FCO/FAB/102880/SPE/TSM-CPU-tracker.ods
           Note: Browser may show garbage on screen depending on your browser settings. If this occurs, perform a File -> Save Page As to your disk, then open it from your local disk.

        2. Create an email with the following information:
           1. Address the email to AMD-REV-E-TSM@sun.com
           2. Enter in the Subject line: 'TSM RMA Request'
           3. Enter in email body:
              1. Customer Company Name
              2. Customer Contact Name
              3. Customer Location
              4. Sun Contact Name
              5. Sun Contact Phone
              6. Sun Contact email
              7. Complete Ship-to Address
              8. Complete OPN Part # & Quantity Requested
                 Currently it is not possible to identify the failing CPU, so the total number of CPUs installed on the platform will need to be ordered.
                 Note: not all CPUs will need to be fitted (see step 9.1).
              9. If you have ruled out other possible THERMTRIP causes, such as those that would be recorded in the SEL logs, then proceed with the FAB.
           4. Attach partially completed 'TSM-CPU-Tracker' template

        3. Send email

   8.3. Upon validation of your SEL feedback, AMD will ship:
        1. CPUs, thermal grease and alcohol wipes
        2. Return shipping documentation via email response, including
           1. RMA number
           2. An updated 'TSM-CPU-Tracker' template

   8.4. AMD detailed handling, ESD requirements, packing, CPU removal, and installation instructions are located at...
        http://sdpsweb.central/FIN_FCO/FAB/102880/SPE/AMD_Handling_070502.pdf
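
The rule-out logic of steps 5 and 6 can be sketched in code. The following Python sketch is illustrative only and is not a Sun or AMD tool: the function names, the result labels and the regular expression are assumptions made for this example, and the pipe-separated field layout is inferred from the sample SEL output shown in this FAB. It cannot judge "other conditions recorded in the SEL", so an "unexplained" result still needs manual review of the full log.

```python
import re

# Assumed pattern for threshold events, based on the sample output in this FAB
# ("Upper Critical going high", "Upper Non-recoverable going high").
OVER_TEMP_EVENT = re.compile(
    r"upper\s+(non[- ]?critical|critical|non[- ]?recoverable)\s+going\s+high",
    re.IGNORECASE,
)


def parse_sel_line(line):
    """Split one 'sel elist' line into sensor and event fields, or return None."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) < 5:
        return None
    # Assumed layout: id | date | time | sensor | event | optional detail
    return {"sensor": fields[3], "event": fields[4]}


def classify_power_down(sel_lines, manual_power_down):
    """Return a hypothetical label for the power-down cause.

    'manual'            customer confirmed a manual power down (FAB does not apply)
    'over_temperature'  temperature thresholds were exceeded (FAB does not apply)
    'thermtrip_logged'  Thermal Trip entry without threshold warnings (step 6.3)
    'unexplained'       nothing in the SEL explains the power down (step 5)
    """
    entries = [e for e in (parse_sel_line(l) for l in sel_lines) if e is not None]
    if manual_power_down:
        return "manual"
    if any(
        e["sensor"].startswith("Temperature") and OVER_TEMP_EVENT.search(e["event"])
        for e in entries
    ):
        return "over_temperature"
    if any(
        e["sensor"].startswith("Processor") and e["event"].startswith("Thermal Trip")
        for e in entries
    ):
        return "thermtrip_logged"
    return "unexplained"


# Sample lines copied from the examples in this FAB.
OVER_TEMP_EXAMPLE = [
    "1f04 | 05/11/2007 | 11:10:38 | Temperature p0.t_core | Upper Critical going high | Reading 68 > Threshold 67 degrees C",
    "2004 | 05/11/2007 | 11:10:43 | Processor p0.fail | Predictive Failure Asserted",
    "2404 | 05/11/2007 | 11:13:08 | Power Supply ps0.pwrok | State Deasserted **",
]
THERMTRIP_EXAMPLE = [
    "1800 | 02/21/2007 | 11:04:42 | Processor | Thermal Trip | Asserted",
]

if __name__ == "__main__":
    print(classify_power_down(OVER_TEMP_EXAMPLE, manual_power_down=False))
    print(classify_power_down(THERMTRIP_EXAMPLE, manual_power_down=False))
    print(classify_power_down([], manual_power_down=False))
```

Over-temperature warnings are checked before the Thermal Trip entry because, with BIOS 44, a true over-temperature event also produces a Thermal Trip entry, and only the absence of threshold warnings points to a false THERMTRIP.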

9. Identification of Affected Parts after the CPUs arrive onsite:

   9.1. System design does not allow us to identify the specific offending CPU, so a visual check of each CPU is required BEFORE removing the CPU.

   9.2. Follow AMD handling instructions for careful removal of the CPU heatsink and CPU, per the 'TSM Field Remediation Process for Sun Microsystems' document located at...
        http://sdpsweb.central/FIN_FCO/FAB/102880/SPE/AMD_Handling_070502.pdf
        Note: Browser may show a blank screen depending on your browser settings. If this occurs, perform a File -> Save Page As to your disk, then open it from your local disk.

   9.3. Capture all CPU and slot information in the 'TSM-CPU-Tracker' template located at...
        http://sdpsweb.central/FIN_FCO/FAB/102880/SPE/TSM-CPU-tracker.ods
        Note: Browser may show garbage on screen depending on your browser settings. If this occurs, perform a File -> Save Page As to your disk, then open it from your local disk.

   9.4. Verify the Datecode of each CPU:
        1. Reference the CPU photo located in either the TSM-CPU-Tracker or the Handling instructions for the location of the Datecode and 'screening mark'.
        2. Affected CPUs have Datecodes of 0651 or earlier (0650, 0649, 0648, ...).
        3. CPUs with Datecodes later than 0651, or CPUs that have been etched with the 'screening mark', should not be removed.
        4. It is the FE's responsibility to ensure that only "affected" CPUs are removed.
        5. Use the alcohol wipes provided by AMD to thoroughly clean the used thermal grease from the bottom of the heatsink and the lid of the CPU. Each thermal grease syringe provided by AMD has sufficient grease for application to one (1) CPU. For detailed CPU installation instructions, please reference the 'TSM Field Remediation Process for Sun Microsystems' document located at...
           http://sdpsweb.central/FIN_FCO/FAB/102880/SPE/AMD_Handling_070502.pdf
        6. Reattach the heatsink to the original CPU and return the 'good' CPU to AMD, along with any replaced CPUs taken from other slots.

   9.5. Install a new CPU for each Datecode-validated or rescreen-validated CPU removal.

   9.6. Pack and ship CPUs per the AMD handling instructions document.
        1. Label the package with the RMA number and ship to:

              AMD
              5204 E. Ben White Blvd, MS 574
              Austin, TX 78741 USA
              Attn: Ed Zahradnik
              TSM RMA # : 6XXX XXXX [8 Digits]
              QTY : ____

        2. Send the AWB and TSM-CPU-Tracker file to AMD-REV-E-TSM@Sun.COM
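
The Datecode rule in step 9.4 can be expressed as a small check. This Python sketch is an illustration under stated assumptions, not part of the AMD handling procedure: it assumes the Datecode is exactly four digits read as a two-digit year followed by a two-digit week (so 0651 is week 51 of 2006), which is consistent with the week 52, 2006 process change described under Root Cause but is not spelled out in this FAB. The function names are hypothetical.

```python
# Assumed cutoff: Datecode 0651 (year 06, week 51) is the last affected code.
CUTOFF = (6, 51)


def parse_datecode(datecode):
    """Parse an assumed four-digit YYWW Datecode into a (year, week) tuple."""
    if len(datecode) != 4 or not (datecode.isascii() and datecode.isdigit()):
        raise ValueError("expected a four-digit Datecode, got %r" % datecode)
    return int(datecode[:2]), int(datecode[2:])


def is_affected(datecode, has_screening_mark):
    """Return True if the CPU should be treated as affected per step 9.4.

    CPUs etched with the 'screening mark' are never removed; otherwise a
    Datecode of 0651 or earlier marks the CPU as affected.
    """
    if has_screening_mark:
        return False
    return parse_datecode(datecode) <= CUTOFF


if __name__ == "__main__":
    for code, mark in [("0650", False), ("0651", False), ("0652", False), ("0640", True)]:
        print(code, mark, is_affected(code, mark))
```

Comparing (year, week) tuples rather than the raw four-digit string keeps the intent explicit; the comparison is only meaningful for two-digit years within the same century, which is a further assumption of this sketch.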

Previously Published As  102880

Comments

This issue was evaluated for an FCO and determined not to meet the criteria, due to the low potential of exposure involving very specific configurations.

For replacement materials sent from AMD to the customer site: AMD assumes all freight and customs costs and will therefore pay the freight for each movement to and from the customer site.

Related Information

Internal Contributor/submitter  Daryl.Hinz@Sun.COM, Kim.Mayman@Sun.COM
Internal Eng Business Unit Group KE Authors
Internal Eng Responsible Engineer  Derek.Tsai@Sun.COM, Michael.Louie@Sun.COM
Internal Services Knowledge Engineer  Joe.Davis@Sun.COM
Internal Escalation ID  1-16348496, 1-17642328, 1-17935330, 1-20735147, 1-20835776
Internal Kasp FAB Legacy ID  102880

Internal Sun Alert & FAB Admin Info
   Critical Category: Significant Change Date: 2007-05-15
   Avoidance: Hardware
   Responsible Manager: Rhett.Brikovskis@Sun.COM
   Original Admin Info:
   WF - Initiated draft and awtg feedback from questions asked during initial review. - Joe 4/11/07
   WF - awtg a resubmission from Kim Mayman, who is waiting on updated info from Duncan Morton, Sun Supply Chain Manager for AMD products. - Joe 4/20/07
   WF - FAB resubmitted by sponsor w/updates - Joe 4/23/07
   WF - finalized draft and sent to extended review - Joe 5/2/07
   WF - updated by submitter, still in review - Joe 5/4/07
   WF - final updates per submitter, sending to publish - Joe 5/15/07
   WF - FAB not showing as published. Put word "Hardware" at the beginning of Synopsis and will republish. - Joe 5/16/07
   WF - corrected ipmitool command in Resolution. - Joe 5/17/07
   WF - added "| Asserted" to end of step 6.2.3. - Joe 5/22/07

Product_uuid
   54e2ac49-df71-11d9-89e6-080020a9ed93|Sun Fire X4100 Server
   72cdbb85-7cd3-11da-8990-080020a9ed93|Sun Fire X4600 Server
   c6e795ef-df6f-11d9-89e6-080020a9ed93|Sun Fire X4200 Server
   f4bbfa5f-e6e5-11da-ac3d-080020a9ed93|Sun Fire X4500 Server

Attachments
   This solution has no attachment