Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-71-1010905.1
Update Date:2011-04-13
Keywords:

Solution Type  Technical Instruction Sure

Solution  1010905.1 :   Sun Enhanced Memory DIMM Replacement Policy for SPARC  


Related Items
  • Sun Enterprise 4500 Server
  • Sun Fire E6900 Server
  • Sun Blade 2000 Workstation
  • Sun Fire V250 Server
  • Sun Fire V480 Server
  • Sun Enterprise 5500 Server
  • Sun Enterprise 450 Server
  • Sun Fire 280R Server
  • Sun Fire E25K Server
  • Sun Ultra 450 Workstation
  • Sun Fire T2000 Server
  • Sun Fire E20K Server
  • Sun Ultra 30 Workstation
  • Sun Ultra 2 Workstation
  • Sun Ultra 80 Workstation
  • Sun Enterprise 3500 Server
  • Sun Fire 3800 Server
  • Sun Fire 6800 Server
  • Sun Netra T1400 Server
  • Sun Fire V890 Server
  • Sun Enterprise 6500 Server
  • Sun Fire E4900 Server
  • Sun Enterprise 220R Server
  • Sun Fire 12K Server
  • Sun Netra 20 Server
  • Sun Fire T1000 Server
  • Sun Fire V880 Server
  • Sun Enterprise 250 Server
  • Sun Fire 4800 Server
  • Sun Fire V1280 Server
  • Sun Fire E2900 Server
  • Sun Fire 15K Server
  • Sun Blade 1000 Workstation
  • Sun Ultra 60 Workstation
  • Sun Fire V490 Server
  • Sun Enterprise 420R Server
  • Sun Enterprise 10000 Server
  • Sun Fire 4810 Server
Related Categories
  • GCS>Sun Microsystems>Boards>Memory Module
  • GCS>Sun Microsystems>Servers>NEBS-Certified Servers
  • GCS>Sun Microsystems>Servers>Entry-Level Servers
  • GCS>Sun Microsystems>Servers>High-End Servers
  • GCS>Sun Microsystems>Servers>Midrange Servers
  • GCS>Sun Microsystems>Desktops>Workstations
  • GCS>Sun Microsystems>Servers>CMT Servers

PreviouslyPublishedAs
215045


Applies to:

Sun Enterprise 220R Server
Sun Fire 12K Server
Sun Fire 15K Server
Sun Fire E20K Server
Sun Fire E25K Server
All Platforms

Goal


Description
Sun Enhanced Memory DIMM Replacement Policy for SPARC

The rules detailed in this Policy apply to all supported machines that use the SPARC architecture.

NOTE: Acronyms used in this document and their definitions:

DIMM - Dual Inline Memory Module
UE - Uncorrectable Error
CE - Correctable Error
POST - Power On Self Test
DUE - Uncorrectable system bus data ECC error


Further definitions can be referenced in <Document:1004729.1> Introduction to Solaris[TM] Operating System CE/UE/ECC/CBB/CBI/DBB/DBI Error Messages

Solution

Sun's SPARC/Solaris DIMM Replacement Policy - Version 20100623

Note: The rules detailed in this Policy apply to all supported machines that use the SPARC architecture.

Replace a DIMM when:

1. Rule 1: POST (when run at a level which actually tests memory) fails it.

2. Rule 2: For systems with Predictive Self-Healing (Solaris 10 and later, except on UltraSPARC II-based platforms), when the system tells you to.

3. Rule 3: For all UltraSPARC II-based systems and all other systems without Predictive Self-Healing (Solaris 9 and earlier), whenever Solaris reports a UE or DUE, and investigation shows that the UE or DUE truly originated from memory, and not from a transfer from some CPU's cache, as determined by a qualified Sun Support specialist.

4. Rule 4: For two or more CEs:
4.1 Rule 4A. For all UltraSPARC II-based systems and all other systems without Predictive Self-Healing (Solaris 9 and earlier), whenever Solaris reports two or more CEs from two or more different physical addresses on each of two or more different bit positions from the same DIMM within 24 hours of each other, and all the addresses are in the same relative checkword (that is, the AFARs are all the same modulo 64). [Note: This means at least 4 CEs; two from one bit position, with unique addresses, and two from another, also with unique addresses, and the lower 6 bits of all the addresses are the same.]

4.2 Rule 4B. Analysis of DIMM failure rates and returns has indicated that there is an unacceptably high rate of Rule 4B false positives. Therefore, the original Rule 4B is considered obsolete.

5. Rule 5: For Solaris 8 and 9 systems with page retirement (Solaris 8, patch level 108528-24 or later; Solaris 9, patch level 112233-11 or later), as well as for UltraSPARC II-based systems running Solaris 10 and later:
5.1 Rule 5A.
5.1.1 Definition: the term "faulted page" denotes a page scheduled for retirement, whether or not that retirement has succeeded.
5.1.2 Replace the DIMM when the system indicates either of the following: (a) the DIMM has accumulated 512 or more faulted pages AND the bad reader/writer check (defined below) FAILS; OR (b) the DIMM has accumulated 128 or more faulted pages AND (<physical address of highest faulted page> - <physical address of lowest faulted page>) / (<number of faulted pages> - 1) > 512KB AND the bad reader/writer check FAILS.
5.1.3 The bad reader/writer check is defined as follows:
5.1.3.1 For this DIMM and any other DIMM in the system, if they each have at least 4 ereports at unique addresses (unique per DIMM; depending upon the system design each DIMM could have the same address in an ereport) which have the same symbol position, AND if the number of pages faulted on the DIMM with the smaller number of pages faulted is greater than 1/16 times the number of pages faulted on the DIMM with the greater number of pages faulted, then the bad reader/writer check SUCCEEDS.
5.1.3.2 If, for all sets of this DIMM and any other DIMM in the system, the number of pages faulted on the DIMM with the smaller number of pages faulted is not greater than 1/16 times the number of pages faulted on the DIMM with the greater number of pages faulted OR the two DIMMs do not each have four Correctable Errors (CEs) at unique per-DIMM addresses at the same symbol position, then the bad reader/writer check FAILS.
5.1.3.3 If the bad reader/writer check SUCCEEDS, then other possible causes of CEs have to be ruled out by a qualified Sun Support specialist before replacing any DIMMs. [Note: Determining these factors is aided by the cediag diagnostic tool set.]
5.2 Rule 5B: If more than 120 non-intermittent CEs are reported against one bit position of one AFAR in 24 hours.
6. Rule 6: For older Solaris releases and patch levels, when Solaris reports more than 24 non-intermittent CEs in 24 hours from a single DIMM. If more than one DIMM has experienced more than 24 non-intermittent CEs in 24 hours, then other possible causes of CEs have to be ruled out by a qualified Sun Support specialist before replacing any DIMMs.

7. Limitations: Prior to Solaris 10, retired pages are returned to service whenever a system is rebooted, and will be re-retired if and when Solaris encounters CEs from them again. POST may fail a DIMM that contained retired pages; if it does, replace the DIMM at that time.
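
The numeric conditions in Rules 4A and 5A above can be sketched in code. The following is a minimal illustration only, not the logic used by cediag or by Solaris fault management. The record layouts and the names rule_4a_met, Dimm, bad_rw_check_succeeds and rule_5a_replace are hypothetical. It assumes the caller has already limited the Rule 4A input to CEs from one DIMM that fall within 24 hours of each other, and it reads "the same symbol position" in 5.1.3.1 as "both DIMMs have at least 4 unique-address ereports at one common symbol position".

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List, Tuple

KB = 1024


def rule_4a_met(ces: List[Tuple[int, int]]) -> bool:
    """ces: (AFAR, bit position) pairs for CEs from ONE DIMM, already
    limited by the caller to CEs within 24 hours of each other."""
    # Group addresses by relative checkword (AFAR modulo 64), then by bit.
    groups = defaultdict(lambda: defaultdict(set))
    for afar, bit in ces:
        groups[afar % 64][bit].add(afar)
    # Need two or more bit positions, each seen at two or more addresses,
    # all inside the same relative checkword.
    return any(
        sum(len(addrs) >= 2 for addrs in bits.values()) >= 2
        for bits in groups.values()
    )


@dataclass
class Dimm:
    # Physical addresses of faulted pages (hypothetical layout).
    faulted_pages: List[int] = field(default_factory=list)
    # (physical address, symbol position) for each ereport.
    ereports: List[Tuple[int, int]] = field(default_factory=list)


def _symbols_with_4_addrs(d: Dimm) -> set:
    """Symbol positions reported at 4 or more unique addresses on this DIMM."""
    addrs = defaultdict(set)
    for addr, sym in d.ereports:
        addrs[sym].add(addr)
    return {sym for sym, a in addrs.items() if len(a) >= 4}


def bad_rw_check_succeeds(dimm: Dimm, others: List[Dimm]) -> bool:
    """Rule 5.1.3: True means another DIMM shows a comparable pattern."""
    mine = _symbols_with_4_addrs(dimm)
    for other in others:
        small, large = sorted((len(dimm.faulted_pages), len(other.faulted_pages)))
        # "smaller count greater than 1/16 of larger count", kept in integers.
        if mine & _symbols_with_4_addrs(other) and small * 16 > large:
            return True
    return False


def rule_5a_replace(dimm: Dimm, others: List[Dimm]) -> bool:
    """Rule 5.1.2: True when the numeric conditions for replacement hold."""
    n = len(dimm.faulted_pages)
    if n < 128 or bad_rw_check_succeeds(dimm, others):
        return False  # below threshold, or other causes must be ruled out first
    if n >= 512:
        return True
    # Average spacing between faulted pages across the affected range.
    spacing = (max(dimm.faulted_pages) - min(dimm.faulted_pages)) / (n - 1)
    return spacing > 512 * KB
```

A False result from rule_5a_replace does not by itself clear the DIMM: when the bad reader/writer check SUCCEEDS, Rule 5.1.3.3 still requires a qualified Sun Support specialist to rule out other causes of CEs.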

Copyright: Sun Microsystems, Inc. Original version: Nov. 17, 2004
Updated March 16, 2006
Updated January 13, 2010 (Updated Rule 5A, added Rule 5B)
Updated March 5, 2010 (removed Rule 4B)
Updated March 11, 2010 (modified Rules 5.1.3.1 and 5.1.3.2)
Updated March 12, 2010 (spelling correction: “depending”)
Updated June 23, 2010 (Corrected typo in 5.1.2 to remove duplicated text)


cediag(1M) diagnostic tool download and reference:

When deploying the cediag tool, follow the instructions in <Document:1003867.1> Memory DIMM Replacement Tool - cediag FAQ, which also lists the patches from which cediag can be obtained.

To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in an appropriate My Oracle Support Community, Oracle Sun Technologies Community.


Internal Comments

DIMM Replacement & Related Links

Quality Communications Office DIMM Directory:    https://onestop.sfbay.sun.com/qco/dimm/index_dimm.shtml
Note that any article in this directory dated prior to June 23, 2010 predates the most recent DIMM Policy changes.

The policy, as shown above, is the most recent version (dated June 23, 2010).
Definitions and Error Explanations: <Document:1004729.1> Introduction to Solaris[TM] Operating System CE/UE/ECC/CBB/CBI/DBB/DBI Error Messages

Refer all questions and comments to: memory_quality_steering_committee@sun.com


NOTE: DIMMs displaying consistent Mtag CE errors on the Sun Fire[TM] 12K/15K/E20K/E25K should be replaced and will not be reported on by cediag.

ARCHIVED RULES - ORIGINAL RULES FROM PRIOR POLICY [November 17, 2004, March 16, 2006] ARE BELOW

Prior rule info is saved here for reference only, since CEDIAG Tool v1.3.2 for Solaris 8 & 9 and FMA for Solaris 10 are using the old/existing Rules 4 & 5 until new software patches are released implementing the revised rules.

ARCHIVED 4.2 Rule 4B.

For all UltraSPARC II-based systems and all other systems without Predictive Self-Healing (Solaris 9 and earlier), whenever Solaris reports two or more CEs from two or more different physical addresses on each of three or more different outputs from the same DRAM within 24 hours of each other, as long as the three outputs do not all correspond to the same relative bit position in their respective checkwords.
[Note: This means at least 6 CEs; two from one DRAM output signal, with unique addresses, two from another output from the same DRAM, also with unique addresses, and two more from yet another output from the same DRAM, again with unique addresses, as long as the three outputs do not all correspond to the same relative bit position in their respective checkwords.]

ARCHIVED 5. RULE-5.

For Solaris 8 and 9 systems with page retirement (Solaris 8, patch level 108528-24 or later; Solaris 9, patch level 112233-11 or later), as well as for UltraSPARC II-based systems running Solaris 10 and later, when the system indicates that the page retirement limit of 0.1% of physical memory has been reached and denotes one and only one DIMM as suspect (i.e., it has accumulated 130 or more non-intermittent CEs). If more than one DIMM is marked as suspect, then other possible causes of CEs have to be ruled out by a qualified Sun Support specialist before replacing any DIMMs. [Note: Determining these factors is aided by the cediag diagnostic tool set.] In the unlikely event that the system indicates that the page retirement limit has been reached but no DIMM is marked as suspect, contact a Sun Support specialist for assistance in determining any necessary action.

END OF ARCHIVED SECTION.

UltraSPARC, II, III, IV, IV+, DIMM, Replacement, Policy, Memory
Previously Published As 79928


Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.