Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-73-1019389.1
Update Date: 2010-09-16
Keywords:

Solution Type  FAB (standard) Sure

Solution  1019389.1 :   Hardware: A limited number of AMD Opteron "Rev F" CPUs in certain systems can cause instability in some specific configurations.  


Related Items
  • Sun Fire X4200 M2 Server
  • Sun Blade X6220 Server Module
  • Sun Blade X8420 Server Module
  • Sun Blade X8440 Server Module
  • Sun Fire X4100 M2 Server
  • Sun Ultra 40 M2 Workstation
  • Sun Fire X4600 M2 Server
Related Categories
  • GCS>Sun Microsystems>Sun FAB>Standard>Reactive

PreviouslyPublishedAs
239146


Bug Id
<SUNBUG: 6587404>, <SUNBUG: 6466941>, <SUNBUG: 6651341>

Product
Sun Ultra 40 M2 Workstation
Sun Fire X4100 M2 Server
Sun Fire X4200 M2 Server
Sun Netra X4200 M2 Server
Sun Fire X4600 M2 Server
Sun Blade X6220 Server Module
Sun Blade X8420 Server Module
Sun Blade X8440 Server Module

Date of Resolved Release
20-Jun-2008

A limited number of AMD Opteron CPUs can cause instability in system operation (see details below).

Impact

A limited number of AMD Opteron "Rev F" Model 2xxx and 8xxx series (e.g. 2210, 8222, etc.) CPUs manufactured prior to March 2008, under very specific conditions, can cause instability in system operation.

Instability in this case refers to the following symptoms:

  • High levels of reported uncorrectable memory errors and/or correctable ECC errors, typically tagging a single DIMM pair

  • Memory Failures pointing to unpopulated memory slots

  • System reboots during BIOS POST, during OS boot, or during ongoing system operation in conjunction with uncorrectable memory errors and/or correctable ECC errors.

  • Repeated system hangs during the BIOS boot / Power On Self Test (POST) process that are not attributable to other causes.

AMD CPUs contain a PowerNow feature, which when enabled and activated may cause some systems to manifest the above symptoms. Disabling the PowerNow feature has proven to stabilize systems affected by this issue.

IMPORTANT NOTE: The symptoms above can also be caused by issues beyond the issue addressed by this FAB.  For example, DIMM problems can also generate the same or similar symptoms.  The steps outlined in the Corrective Action section MUST be followed to determine if the system instability is caused by the issue addressed by this FAB.

Contributing Factors

Sun system designs took advantage of certain memory/CPU performance tuning features in the CPU architecture.  However, the AMD manufacturing process did not adequately screen for this issue in all tuning configurations until March 2008.

The following system types and part numbers could be impacted:

Ultra 40 M2 (has up to 2 CPUs):

Sun P/N       AMD OPN         Description
371-1911-01   OSA2210GAA6CQ   AMD Opteron Model 2210 @ 1.8 GHz (F2 Step)
371-1981-01   OSA2214GAA6CQ   AMD Opteron Model 2214 @ 2.2 GHz (F2 Step)
371-1913-01   OSA2218GAA6CQ   AMD Opteron Model 2218 @ 2.6 GHz (F2 Step)
371-1914-01   OSY2220GAA6CQ   AMD Opteron Model 2220 SE @ 2.8 GHz (F2 Step)
371-2495-01   OSA2210GAA6CX   AMD Opteron Model 2210 @ 1.8 GHz (F3 Step)
371-2497-01   OSA2214GAA6CX   AMD Opteron Model 2214 @ 2.2 GHz (F3 Step)
371-2500-01   OSA2218GAA6CX   AMD Opteron Model 2218 @ 2.6 GHz (F3 Step)
371-2501-01   OSA2220GAA6CX   AMD Opteron Model 2220 @ 2.8 GHz (F3 Step)
371-2502-01   OSA2222GAA6CX   AMD Opteron Model 2222 @ 3.0 GHz (F3 Step)
371-2503-01   OSY2222GAA6CX   AMD Opteron Model 2222 SE @ 3.0 GHz (F3 Step)
371-3487-01   OSY2224GAA6CX   AMD Opteron Model 2224 SE @ 3.2 GHz (F3 Step)

X4100 M2 (has up to 2 CPUs):

Sun P/N       AMD OPN         Description
371-1911-01   OSA2210GAA6CQ   AMD Opteron Model 2210 @ 1.8 GHz (F2 Step)
371-1912-01   OSA2216GAA6CQ   AMD Opteron Model 2216 @ 2.4 GHz (F2 Step)
371-1913-01   OSA2218GAA6CQ   AMD Opteron Model 2218 @ 2.6 GHz (F2 Step)
371-1914-01   OSY2220GAA6CQ   AMD Opteron Model 2220 SE @ 2.8 GHz (F2 Step)
371-2495-01   OSA2210GAA6CX   AMD Opteron Model 2210 @ 1.8 GHz (F3 Step)
371-2499-01   OSA2216GAA6CX   AMD Opteron Model 2216 @ 2.4 GHz (F3 Step)
371-2500-01   OSA2218GAA6CX   AMD Opteron Model 2218 @ 2.6 GHz (F3 Step)
371-2684-01   OSP2218GAA6CX   AMD Opteron Model 2218 HE @ 2.6 GHz (F3 Step)
371-2501-01   OSA2220GAA6CX   AMD Opteron Model 2220 @ 2.8 GHz (F3 Step)
371-2502-01   OSA2222GAA6CX   AMD Opteron Model 2222 @ 3.0 GHz (F3 Step)
371-2503-01   OSY2222GAA6CX   AMD Opteron Model 2222 SE @ 3.0 GHz (F3 Step)
371-3487-01   OSY2224GAA6CX   AMD Opteron Model 2224 SE @ 3.2 GHz (F3 Step)

X4200 M2 (has up to 2 CPUs):

Sun P/N       AMD OPN         Description
371-1911-01   OSA2210GAA6CQ   AMD Opteron Model 2210 @ 1.8 GHz (F2 Step)
371-1912-01   OSA2216GAA6CQ   AMD Opteron Model 2216 @ 2.4 GHz (F2 Step)
371-1913-01   OSA2218GAA6CQ   AMD Opteron Model 2218 @ 2.6 GHz (F2 Step)
371-1914-01   OSY2220GAA6CQ   AMD Opteron Model 2220 SE @ 2.8 GHz (F2 Step)
371-2495-01   OSA2210GAA6CX   AMD Opteron Model 2210 @ 1.8 GHz (F3 Step)
371-2499-01   OSA2216GAA6CX   AMD Opteron Model 2216 @ 2.4 GHz (F3 Step)
371-2500-01   OSA2218GAA6CX   AMD Opteron Model 2218 @ 2.6 GHz (F3 Step)
371-2684-01   OSP2218GAA6CX   AMD Opteron Model 2218 HE @ 2.6 GHz (F3 Step)
371-2501-01   OSA2220GAA6CX   AMD Opteron Model 2220 @ 2.8 GHz (F3 Step)
371-2502-01   OSA2222GAA6CX   AMD Opteron Model 2222 @ 3.0 GHz (F3 Step)
371-2503-01   OSY2222GAA6CX   AMD Opteron Model 2222 SE @ 3.0 GHz (F3 Step)
371-3487-01   OSY2224GAA6CX   AMD Opteron Model 2224 SE @ 3.2 GHz (F3 Step)

X4600 M2 (has up to 8 CPUs):

Sun P/N       AMD OPN         Description
371-1832-01   OSA8218GAA6CR   AMD Opteron Model 8218 @ 2.6 GHz (F2 Step)
371-1989-01   OSY8220GAA6CR   AMD Opteron Model 8220 SE @ 2.8 GHz (F2 Step)
371-2479-01   OSA8216GAA6CY   AMD Opteron Model 8216 @ 2.4 GHz (F3 Step)
371-2506-01   OSA8218GAA6CY   AMD Opteron Model 8218 @ 2.6 GHz (F3 Step)
371-2507-01   OSP8218GAA6CY   AMD Opteron Model 8218 HE @ 2.6 GHz (F3 Step)
371-2480-01   OSA8220GAA6CY   AMD Opteron Model 8220 @ 2.8 GHz (F3 Step)
371-2509-01   OSA8222GAA6CY   AMD Opteron Model 8222 @ 3.0 GHz (F3 Step)
371-3488-01   OSY8224GAA6CY   AMD Opteron Model 8224 SE @ 3.2 GHz (F3 Step)

Sun Blade X6220 (has up to 2 CPUs):

Sun P/N       AMD OPN         Description
371-2496-01   OSA2212GAA6CX   AMD Opteron Model 2212 @ 2.0 GHz (F3 Step)
371-2500-01   OSA2218GAA6CX   AMD Opteron Model 2218 @ 2.6 GHz (F3 Step)
371-2501-01   OSA2220GAA6CX   AMD Opteron Model 2220 @ 2.8 GHz (F3 Step)
371-2502-01   OSA2222GAA6CX   AMD Opteron Model 2222 @ 3.0 GHz (F3 Step)
371-3487-01   OSY2224GAA6CX   AMD Opteron Model 2224 SE @ 3.2 GHz (F3 Step)

Sun Blade X8420 (has up to 4 CPUs):

Sun P/N       AMD OPN         Description
371-1832-01   OSA8218GAA6CR   AMD Opteron Model 8218 @ 2.6 GHz (F2 Step)
371-1989-01   OSY8220GAA6CR   AMD Opteron Model 8220 SE @ 2.8 GHz (F2 Step)
371-2479-01   OSA8216GAA6CY   AMD Opteron Model 8216 @ 2.4 GHz (F3 Step)
371-2506-01   OSA8218GAA6CY   AMD Opteron Model 8218 @ 2.6 GHz (F3 Step)
371-2507-01   OSP8218GAA6CY   AMD Opteron Model 8218 HE @ 2.6 GHz (F3 Step)
371-2480-01   OSA8220GAA6CY   AMD Opteron Model 8220 @ 2.8 GHz (F3 Step)

Sun Blade X8440 (has up to 4 CPUs):

Sun P/N       AMD OPN         Description
371-2509-01   OSA8222GAA6CY   AMD Opteron Model 8222 @ 3.0 GHz (F3 Step)

Netra X4200 M2 (has up to 2 CPUs):

Sun P/N       AMD OPN         Description
371-2630-01   OSP2214GAU6CX   AMD Opteron Model 2214 HE @ 2.2 GHz (F3 Step)


Note: X2100 M2 and X2200 M2 platforms, although they do utilize similar AMD CPUs, are not affected by this FAB because PowerNow is disabled on these systems.  Depending on the BIOS FW level on these platforms, there may be a BIOS setting present in the BIOS setup which could imply PowerNow functionality.  However, despite this and regardless of the setting, PowerNow is always disabled on these platforms.
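
The part-number tables above can be turned into a simple lookup. The following is a minimal Python sketch, not a Sun or AMD tool. The OPNs are copied from the tables in this FAB; the function names are hypothetical; and the suffix-to-step mapping is only an inference from the tables (listed OPNs ending in CQ or CR are F2 step, those ending in CX or CY are F3 step). A listed OPN identifies a CPU type covered by this FAB, not a CPU that is known to be affected.

```python
# OPNs copied from the tables in this FAB (duplicates across platforms merged).
LISTED_OPNS = frozenset({
    # 2xxx series
    "OSA2210GAA6CQ", "OSA2214GAA6CQ", "OSA2216GAA6CQ", "OSA2218GAA6CQ",
    "OSY2220GAA6CQ", "OSA2210GAA6CX", "OSA2212GAA6CX", "OSA2214GAA6CX",
    "OSA2216GAA6CX", "OSA2218GAA6CX", "OSP2218GAA6CX", "OSA2220GAA6CX",
    "OSA2222GAA6CX", "OSY2222GAA6CX", "OSY2224GAA6CX", "OSP2214GAU6CX",
    # 8xxx series
    "OSA8218GAA6CR", "OSY8220GAA6CR", "OSA8216GAA6CY", "OSA8218GAA6CY",
    "OSP8218GAA6CY", "OSA8220GAA6CY", "OSA8222GAA6CY", "OSY8224GAA6CY",
})

# Assumption inferred from the tables, not an AMD decoding rule.
STEP_BY_SUFFIX = {"CQ": "F2", "CR": "F2", "CX": "F3", "CY": "F3"}


def normalize_opn(text):
    """Strip whitespace and upper-case a transcribed OPN."""
    return text.strip().upper()


def is_listed_opn(opn):
    """True if the OPN appears in one of the tables in this FAB."""
    return normalize_opn(opn) in LISTED_OPNS


def stepping(opn):
    """Return "F2" or "F3" based on the last two characters, else None."""
    return STEP_BY_SUFFIX.get(normalize_opn(opn)[-2:])
```

For example, is_listed_opn("osa2218gaa6cx") is true and stepping("OSY8220GAA6CR") gives "F2". Whether a listed CPU has already been factory screened still has to be determined from the cover markings described in the Corrective Action section.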

Symptoms

Systems could experience any or all of the following symptoms if PowerNow is enabled. (Disabling the PowerNow feature has proven to stabilize systems affected by this issue.)

  • High levels of reported uncorrectable memory errors and/or correctable ECC errors, typically tagging a single DIMM pair

  • Memory Failures pointing to unpopulated memory slots

  • System reboots during BIOS POST, during OS boot, or during ongoing system operation in conjunction with uncorrectable memory errors and/or correctable ECC errors.

  • Repeated system hangs during the BIOS boot / Power On Self Test (POST) process that are not attributable to other causes.

IMPORTANT NOTE: The symptoms above can also be caused by issues beyond the issue addressed by this FAB.  For example, DIMM problems can also generate the same or similar symptoms.  The steps outlined in the Corrective Action section MUST be followed to determine if the system instability is caused by the issue addressed by this FAB.

Root Cause

PowerNow is a power-saving technology within AMD processors. The CPU speed and Vcore are decreased while the system is under low load or idle to save power and to reduce heat and noise.   PowerNow must be enabled and must become activated to encounter this issue, but PowerNow is not the source of the issue.

The AMD Opteron processor uses DLLs (Delay Locked Loops) to control the precise timing of memory address, command and control signals relative to memory clocks.  This guarantees optimal timing margin across process, voltage, timing and frequency variations.  Sun is one of very few companies using these settings to optimize memory performance for its customers.  It was discovered that AMD manufacturing changes, made to optimize processor yields, encroached on those DLL settings, reducing the margin within which the system could be guaranteed to function optimally.

In certain memory configurations with a unique combination of selected timing characteristics, some systems may become unstable when PowerNow is enabled and becomes active.  Not all systems will experience this behavior.

In March 2008, AMD implemented a factory screen to recover the margins expected by Sun's DLL tuning implementations.   This factory screen ultimately prevents the issue.

Note:  Because of the timing margin aspects of this issue, and the interaction with motherboards and memory, not all CPUs manufactured prior to March 2008 will manifest this issue.  CPUs manufactured prior to March 2008 are only more susceptible to encountering the issue; the projected rate of encounter within the entire field population is actually quite low.  Stable systems tend to remain stable.

Disabling the PowerNow feature has proven to stabilize systems with the above symptoms, regardless of when the CPU(s) were manufactured or if they have been factory screened.

Corrective Action

Replacement Time Estimate: 10 minutes (per CPU)
Hot Swappable:
No

Resolution:


1. First, if the BIOS is not up to date, update the system BIOS to the latest version available from the sun.com website. There have been many recent modifications to memory timing on various platforms, and a BIOS update may provide the only correction that the system needs.  Systems must be running the latest BIOS version to be considered eligible for further remediation under this FAB.  If the customer has experienced service interruption as described in this FAB, and the BIOS needs updating, Sun's recommendation is to update the BIOS and also disable PowerNow per resolution item #2, below.

2. If the system is already running the latest BIOS and the instability issue(s) are still occurring, disable PowerNow in the system BIOS.  Refer to applicable product documentation for procedures describing how to do this on your specific platform.  Refer to the COMMENTS section of this FAB for links to Product Documentation.

2.1. If disabling PowerNow stabilizes the system, suggest that the customer leave PowerNow disabled as a long term solution.  Disabling PowerNow may be preferable to the inconvenience of swapping CPUs.  Sun's recommendation is to update the system BIOS and disable PowerNow to avoid this issue wherever possible.  If the customer agrees to leave PowerNow disabled, STOP.  The rest of the instructions in this FAB are not necessary.

2.2. If the system remains unstable after updating the BIOS and disabling PowerNow, STOP.  This FAB does not apply to your situation.

3. If updating the BIOS and disabling PowerNow resulted in system stability but the customer refuses to disable PowerNow as a long term solution, follow the instructions below for replacement of the CPUs in the problematic system.

3.1. Verify system is programmed with the most recent BIOS revision.

3.2. Verify that the system exhibits one or more of the symptoms with PowerNow enabled and DOES NOT exhibit them with PowerNow disabled.  Capture system event/error logs as evidence.

3.3. Request "Advance Replacement" CPUs per the procedure outlined below:

Important: Do not remove CPUs or heatsinks from the system until you have received replacement CPUs.

Note: It may not always be possible to identify a specific problematic CPU on multi-CPU platforms.  If that is the case, replacements may be ordered for all of the CPUs installed in the platform.

3.3.1. Complete the applicable 'CPU Request' and 'CPU Tracking Data' portions of the DLL-CPU-tracker.ods template, which is available via the link below:

  http://sdpsweb.central/FIN_FCO/FAB/239146/SPE/DLL-CPU-tracker.ods

3.3.2. Create an email with the following information:
    Address the email to AMD-REV-F-DLL@Sun.COM
    Enter Subject line: 'DLL RMA Request for [customer name, case ID#]'
    Enter in email body:
       Customer Company Name
       Customer Contact Name
       Customer Location
       Sun Contact Name
       Sun Contact Phone
       Sun Contact email
       Complete Ship-to Address
       Complete OPN Part Number & Quantity of Affected CPUs (*)
       Complete OPN Part Number & Quantity of CPUs Requested (*)

(*) Note: In some cases, the replacement CPU will differ from the original CPU.  Refer to the 'Replacement Matrix' tab in the DLL CPU Tracker to determine which replacement CPU to order.

3.3.3. Attach the partially completed DLL CPU Tracker document
3.3.4. Attach supporting event/error logs
3.3.5. Send the email
3.3.6. Upon validation of your event/error feedback, AMD will ship:

  - Replacement CPUs that have had the screen applied
  - thermal grease
  - alcohol wipes
  - Return shipping instructions & documentation via email response, including
     -- RMA number
     -- An updated DLL CPU Tracker template

3.3.7. Once new CPUs arrive on-site:

3.3.8. Read & adhere to the AMD packaging & handling guidelines, AMD-DLL-CPU-HandlingPackagingGuidelines.pdf, available via the link below:

  http://sdpsweb.central/FIN_FCO/FAB/239146/SPE/AMD-DLL-CPU-HandlingPackagingGuidelines.pdf

3.3.9. Identify 'potentially affected' versus 'not-affected' CPUs:

It is the FE's responsibility to ensure that only "potentially affected" CPUs are replaced in accordance with this FAB.  CPUs that have already been factory screened will bear one or both of the following markings on the cover of the CPU (refer to the CPU Reference 'Photos' tab in the DLL CPU Tracker):

If the CPU bears a "-" etch mark following the OPN (on the first line of alphanumeric text on the CPU cover) the CPU has been screened and is not affected by this issue.  Do not replace CPUs that bear this mark.

If the CPU bears a "P" as the first character in the second line of alphanumeric text on the CPU cover,  the CPU has been screened and is not affected by this issue.  Do not replace CPUs that bear this mark.

If the CPU does not bear either of the markings described above, it has not been factory screened and it is potentially affected by this issue.  This CPU may be replaced in a problematic system. 

3.3.10. Remove suspect CPUs from the system (refer to applicable product documentation for instructions.)

3.3.11. Use the alcohol wipes provided by AMD to thoroughly clean the used thermal grease from the bottom of the heatsink and the lid of the CPU.  Each thermal grease syringe provided by AMD has sufficient grease for one (1) CPU.

3.3.12. Capture all CPU and slot information in the 'CPU Tracking Data' tab in the DLL CPU Tracker.

3.3.13. Apply thermal grease, re-attach the heatsink to the replacement CPU and reinstall the CPU in the system.

3.3.14. Return the suspect CPU, any other replaced CPUs taken from other slots and any unused CPUs to AMD per RMA instructions.  Be sure to follow proper packaging and handling as well as labeling guidelines per the AMD CPU Handling & Packaging document.  To ship suspect CPUs back to AMD, label the package with the RMA number provided by AMD and ship to: 

  AMD
  5900 East Ben White Blvd,  M/S 574
  Austin, TX 78741
  DLL RMA#: _______ (provided earlier by AMD)
  QTY: ____
  Attention: Ed Zahradnik

3.3.15. Send an email containing the AWB number and the updated DLL CPU Tracker file to AMD-REV-F-DLL@Sun.COM.  It is recommended to reply to previous email threads for continuity and ease of tracking.

Note:  All original suspect CPUs and unused replacements MUST be returned on a 1 for 1 basis.  Failure to return CPUs will result in the appropriate Sun Field Service organization being billed for the advance replacement CPUs.
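
The decision points in the procedure above (the triage in steps 1 through 3, the screening-mark check in step 3.3.9, and the 1 for 1 return rule) can be sketched in Python as follows. This is an illustrative sketch only, not a Sun or AMD tool: all names are hypothetical, and the inputs (observations recorded by the engineer and the first two lines of cover text as transcribed by hand) are assumed formats.

```python
from enum import Enum


class Action(Enum):
    UPDATE_BIOS_AND_DISABLE_POWERNOW = "steps 1-2: update BIOS, disable PowerNow, re-evaluate"
    DISABLE_POWERNOW = "step 2: disable PowerNow in BIOS setup and observe"
    LEAVE_POWERNOW_DISABLED = "step 2.1: STOP, leave PowerNow disabled"
    FAB_DOES_NOT_APPLY = "step 2.2: STOP, this FAB does not apply"
    REPLACE_CPUS = "step 3: request advance replacement CPUs"


def triage(bios_is_latest, stable_with_powernow_off, customer_accepts_powernow_off):
    """Map observations to the next action in steps 1 through 3.

    stable_with_powernow_off is None until PowerNow has been disabled
    and the system has been observed.
    """
    if not bios_is_latest:
        return Action.UPDATE_BIOS_AND_DISABLE_POWERNOW
    if stable_with_powernow_off is None:
        return Action.DISABLE_POWERNOW
    if not stable_with_powernow_off:
        return Action.FAB_DOES_NOT_APPLY
    if customer_accepts_powernow_off:
        return Action.LEAVE_POWERNOW_DISABLED
    return Action.REPLACE_CPUS


def is_factory_screened(opn, line1, line2):
    """Step 3.3.9: True if either screening mark is present on the CPU cover."""
    opn = opn.strip().upper()
    text = line1.strip().upper()
    # Mark 1: a "-" etch mark following the OPN on the first line.
    pos = text.find(opn)
    dash_after_opn = pos >= 0 and text[pos + len(opn):].lstrip().startswith("-")
    # Mark 2: "P" as the first character of the second line.
    p_leads_line2 = line2.strip().startswith("P")
    return dash_after_opn or p_leads_line2


def cpus_to_return(replacements_received, replacements_installed):
    """1 for 1 rule: every advance replacement is matched by one returned CPU."""
    if not 0 <= replacements_installed <= replacements_received:
        raise ValueError("installed count must be between 0 and the number received")
    unused = replacements_received - replacements_installed
    # One suspect CPU is removed for each replacement installed.
    return {"removed": replacements_installed, "unused": unused,
            "total": replacements_installed + unused}
```

For example, triage(True, True, False) yields Action.REPLACE_CPUS, and a CPU for which is_factory_screened returns True should not be replaced under this FAB. The check is only as reliable as the transcription of the cover text; the reference photos in the DLL CPU Tracker remain the authority.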

Comments & Special Considerations

Product Documentation Links:

Sun Blade X6220 Server Module Documentation
Sun Blade 8000 Modular System Documentation
Sun Fire X4100 M2 Server Documentation
Sun Fire X4200 M2 Server Documentation
Sun Fire X4600 M2 Server Documentation
Netra X4200 M2 Server Documentation
Sun Ultra 40 M2 Workstation Documentation

This issue was evaluated as, and determined not to meet criteria for, an FCO due to the low potential of exposure involving very specific configurations and because all CPUs are to be acquired through AMD.

In some cases, the "customer" handling the receipt and return of CPUs may be the Sun Field Engineer responsible for servicing the customer account.

AMD will pay for the freight for each movement to and from the customer site.

There will be no charge to customers for any on-site activities or materials used related to this Field Action Bulletin.

Replacement CPUs will not be stored at Sun RSLs.  Instead, AMD will provide the logistics support and CPU shipments directly to/from customer sites.

This FAB will remain effective and AMD will provide CPUs in support of this activity until June 30, 2009.

For replacement materials sent from AMD to the customer site:
 Shipment terms: CIP (Carriage and Insurance Paid to customer destination)
 Exporter of record: AMD
 Importer of record*: Sun Microsystems
 Declared value of the shipment: AMD's current market price for the respective Ordering Part Number (OPN)

For replaced material returning to AMD:
 Shipment terms: FCA (Free Carrier - customer pick up location)
 Exporter of record: Sun Microsystems
 Importer of record*: AMD
 Declared value of the shipment: AMD's current market price for the respective OPN

* Importer of Record pays VAT, if applicable


References:

Escalation ID: 44303982
Radiance Cases: 38069399
Other FABs: 231245
Sun Alerts: 201246
Stop Ship Purge: P001-20507
Related URL(s): http://sdpsweb.central/FIN_FCO/FAB/239146/SPE/DLL-CPU-tracker.ods
http://sdpsweb.central/FIN_FCO/FAB/239146/SPE/AMD-DLL-CPU-HandlingPackagingGuidelines.pdf



Internal Contributor/submitter
Daryl.Hinz@Sun.COM, Jake.Bell@Sun.COM

Internal Eng Responsible Engineer
John.Nerl@Sun.COM

Internal Services Knowledge Engineer
Joe.Davis@Sun.COM

Internal Eng Business Unit Group
SSG WGS

Internal Sun Alert & FAB Admin Info
18-Jun-2008: Finalized FAB draft and sent to Extended Review.
20-Jun-2008: Incorporated feedback from Ext Rvw - sending to Publish.
17-Dec-2009: Replaced Product with Swordfish Nomenclature


Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.