Sun Fire T1000 and T2000 DIMMs with CEs (correctable errors) are being unnecessarily flagged by POST as faulty.

Asset ID:	1-73-1001026.1
Update Date:	2010-08-25
Keywords:

Solution Type FAB (standard) Sure

Related Items


Sun Fire T2000 Server
Sun Fire T1000 Server

Related Categories


GCS>Sun Microsystems>Sun FAB>Standard>Reactive

Previously Published As
201353

Product
Sun Fire T2000 Server
Sun Fire T1000 Server

Bug Id
<SUNBUG: 6334560>

Impact

Sun Fire T1000 and T2000 DIMMs with CEs (correctable errors) are being unnecessarily flagged by POST as faulty. This POST policy has led to a high fallout of DIMMs in the field and to excessive DIMM returns, caused by CEs encountered during the extended POST memory tests.

This FAB minimizes the opportunity for POST to report memory faults that are fully and transparently handled by the PSH (Predictive Self-Healing, also known as FMA) features of Solaris.

Contributing Factors

On Sun Fire T1000 and T2000 systems with firmware prior to 6.3.0, the default setting of diag_level is "max".

Symptoms

DIMMs with CEs (correctable errors) are being unnecessarily flagged by POST as faulty.

Below is an example error message of a POST fault for a single DIMM. If this error message occurred when POST was run with diag_level set to max, the DIMM was probably flagged by POST unnecessarily.

sc> showfaults -v
ID Time           FRU               Fault
1 OCT 13 12:47:27 MB/CMP0/CH0/R0/D0 MB/CMP0/CH0/R0/D0 deemed faulty and disabled

Root Cause

On the T1000/T2000, when POST encounters a single CE, the associated DIMM is declared faulty and half of the system's memory is deconfigured and unavailable to Solaris. Since PSH (Predictive Self-Healing) is the primary means for detecting errors and diagnosing faults on the Niagara platforms, this policy is too aggressive (reference CR 6334560).

Workaround

The following procedures are recommended as a workaround to this issue:

1. Normal Operation

For normal operation, set diag_level to "min". POST min mode provides a sanity check to ensure the system will boot. Once Solaris is up, PSH provides run time diagnosis of faults. Normal operation applies to any boot of the system except hardware upgrades or repairs (as described in section 2 below).

a) Use the ALOM command "setsc diag_level min" to set POST to min mode. Also, make sure that diag_mode is in the normal state with the ALOM command "setsc diag_mode normal". Note that with FW prior to release 6.3.0, the default setting of diag_level is "max". Therefore, on such firmware the ALOM "setdefaults" command will return POST to max mode.

b) With the FW release 6.3.0 (or later), the default setting of diag_level is min.

Note: When upgrading an existing system with FW prior to version 6.3.0, the existing POST settings will not be changed by the firmware upgrade. Therefore, if the settings have not yet been changed on the system, follow the procedure in 1.a above to change to the recommended POST settings.

c) For systems shipped with FW release 6.3.0 or later, the default setting of POST is min so no action is required, as long as there have been no changes to the default POST settings.

Note: Any faulty FRU reported by POST in min mode should be replaced. Once the FRU is replaced, follow the procedure in section 2 below "Hardware Upgrades or Repairs".
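
The firmware rule in section 1 can be expressed as a small check. The following is only a sketch: it assumes the firmware release is available as a dotted numeric string such as "6.3.0", and the function names are hypothetical, not part of ALOM or Solaris.

```python
def parse_fw_version(text):
    """Turn a dotted release string such as '6.3.0' into a tuple of ints.

    Assumption: the release is purely numeric and dot-separated.
    Short forms such as '6.3' are padded with zeros to three fields.
    """
    parts = [int(p) for p in text.strip().split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def default_diag_level(fw_release):
    """Return the default diag_level for a firmware release, per this FAB.

    Releases before 6.3.0 default to 'max'; 6.3.0 and later default to 'min'.
    Note that a firmware upgrade does not change existing POST settings,
    so this is the value that applies to factory settings or after
    the ALOM 'setdefaults' command, not necessarily the current value.
    """
    return "min" if parse_fw_version(fw_release) >= (6, 3, 0) else "max"
```

A system for which default_diag_level() returns "max", or whose settings were established under such a release, is a candidate for the manual change in 1.a.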

2. Hardware Upgrades or Repairs

It is recommended that POST max mode (diag_level=max) be used to validate hardware upgrades or repairs. After completing the upgrade or repair and prior to booting the system, set POST to max mode using the ALOM command "setsc diag_level max".

a) If the validation completes successfully, return POST to min mode. Use the ALOM command "setsc diag_level min".

b) If the validation does not complete successfully and POST faults a SINGLE DIMM (see example #1) that was not part of the hardware upgrade or repair, it is likely that POST has encountered a CE on the DIMM that will be handled by PSH (this can be validated by examining POST output). In this case, re-enable the DIMM and re-run POST in min mode as described below:

- Re-enable the DIMM via the ALOM command "enablecomponent <name of DIMM>"
- Set POST to min mode. Use the ALOM command "setsc diag_level min".
- If POST continues to fault the DIMM, it should be replaced.

For any other case (e.g. multiple DIMM faults (see example #2), or the faulty DIMM was part of the hardware upgrade/repair), the faulty FRU(s) identified by POST should be replaced.

Note: The above procedure is not recommended following a software or firmware upgrade or any other reboot of the system that is not intended to validate a hardware change or debug a hardware problem. These boots/reboots should have POST diag_level set to min as described above under "Normal Operation".
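
The outcome handling in section 2 can be summarized as a short decision function. This is a sketch under stated assumptions: DIMM FRU names are assumed to follow the pattern seen in the examples in this document (MB/CMPn/CHn/Rn/Dn), and the function name and the returned action strings are hypothetical.

```python
import re

# Assumption: DIMM FRU names look like the ones shown in this document.
DIMM_NAME = re.compile(r"^MB/CMP\d+/CH\d+/R\d+/D\d+$")


def after_max_validation(post_faulted, touched_by_repair):
    """Suggest the next step after a diag_level=max validation run.

    post_faulted:      FRU names that POST reported as faulty
    touched_by_repair: FRU names that were part of the upgrade or repair
    """
    faulted = list(post_faulted)
    if not faulted:
        # 2.a: validation completed successfully
        return "return to min"
    is_single_dimm = len(faulted) == 1 and DIMM_NAME.match(faulted[0]) is not None
    if is_single_dimm and faulted[0] not in set(touched_by_repair):
        # 2.b: likely a CE that PSH will handle; replace only if
        # POST in min mode continues to fault the DIMM
        return "re-enable and retest in min"
    # Any other case: multiple DIMMs, a non-DIMM FRU, or a DIMM
    # that was itself part of the upgrade or repair
    return "replace"
```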

3. POST faults reported with diag_level at max

For systems booted with diag_level at max, where it was not intended to validate a hardware upgrade or repair as described in section 2 above, any fault reported by POST should be examined to ensure that it would not have been transparently handled by Solaris PSH.

Use the following procedure to examine the fault:

a) If the FRU(s) reported by POST is not a DIMM or is more than a single DIMM, then replace the FRU(s).

b) If the FRU reported by POST is a single DIMM and the same DIMM had also been reported faulty by FMA/PSH, then replace the DIMM (see example #3).

c) If the FRU reported by POST is a single DIMM and the same DIMM had not been reported by FMA/PSH, then follow the steps in 2.b above to determine whether to replace the DIMM.

After completing this procedure, it is recommended that diag_level be set to min as described in section 1 for "Normal Operation".
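
Steps a) through c) of section 3 can be sketched as follows. The DIMM name pattern is an assumption taken from the examples in this document, and the function name and return values are hypothetical; the function only restates the procedure and does not query ALOM or FMA.

```python
import re

# Assumption: DIMM FRU names look like the ones shown in this document.
DIMM_NAME = re.compile(r"^MB/CMP\d+/CH\d+/R\d+/D\d+$")


def triage_unplanned_max_fault(post_faulted, fma_faulted):
    """Map a POST fault seen at diag_level=max to a step of section 3.

    post_faulted: FRU names that POST reported as faulty (at least one)
    fma_faulted:  FRU names that FMA/PSH has reported as faulty
    Returns a (step, action) pair.
    """
    faulted = list(post_faulted)
    if not faulted:
        raise ValueError("section 3 applies only when POST reported a fault")
    if len(faulted) != 1 or DIMM_NAME.match(faulted[0]) is None:
        # a) not a DIMM, or more than a single DIMM
        return ("a", "replace")
    if faulted[0] in set(fma_faulted):
        # b) the same single DIMM was also reported by FMA/PSH
        return ("b", "replace")
    # c) single DIMM not reported by FMA/PSH: follow 2.b
    return ("c", "re-enable and retest in min")
```

Whatever the result, diag_level is then set back to min as described under "Normal Operation".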

Examples:

Example #1: POST fault for a single DIMM

sc> showfaults -v
ID Time           FRU               Fault
1 OCT 13 12:47:27 MB/CMP0/CH0/R0/D0 MB/CMP0/CH0/R0/D0 deemed faulty and disabled

Example #2: POST fault for multiple DIMMs (this example shows two DIMMs on the same channel/rank, which in most cases is a UE)

sc> showfaults -v
ID Time           FRU               Fault
1 OCT 13 12:47:27 MB/CMP0/CH0/R0/D0 MB/CMP0/CH0/R0/D0 deemed faulty and disabled
2 OCT 13 12:47:27 MB/CMP0/CH0/R0/D1 MB/CMP0/CH0/R0/D1 deemed faulty and disabled

Example #3: FMA fault and a POST fault on the same DIMM (the DIMM should be replaced as it exceeded the FMA page retire threshold)

sc> showfaults -v
ID Time           FRU               Fault
0 SEP 09 11:09:26 MB/CMP0/CH0/R0/D0 Host detected fault,
MSGID:SUN4V-8000-DX UUID: 7ee0e46b-ea64-6565-e684-e996963f7b86
1 OCT 13 12:47:27 MB/CMP0/CH0/R0/D0 MB/CMP0/CH0/R0/D0 deemed faulty and disabled
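
When working from captured console output, the listings above can be split into POST-reported and host-reported FRUs with a short script. This is a minimal sketch that assumes the "showfaults -v" layout is exactly as in these examples (numeric ID, month, day, time, FRU, fault text, with an optional wrapped continuation line) and that the phrases "deemed faulty and disabled" and "Host detected fault" distinguish POST faults from host (FMA/PSH) faults. Other firmware releases may format the output differently, and the function names are hypothetical.

```python
import re

# One fault entry: ID, "MON DD HH:MM:SS", FRU name, then the fault text.
ENTRY = re.compile(
    r"^\s*(\d+)\s+([A-Z]{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+(.*)$"
)


def parse_showfaults(text):
    """Parse captured 'showfaults -v' output into a list of dicts."""
    entries = []
    for line in text.splitlines():
        stripped = line.strip()
        match = ENTRY.match(line)
        if match:
            entries.append({
                "id": int(match.group(1)),
                "time": match.group(2),
                "fru": match.group(3),
                "fault": match.group(4).strip(),
            })
        elif entries and stripped and not stripped.startswith(("sc>", "ID ")):
            # Wrapped continuation line (e.g. MSGID/UUID): append to last entry
            entries[-1]["fault"] += " " + stripped
    return entries


def split_by_source(entries):
    """Return (post_frus, host_frus) based on the fault text."""
    post = [e["fru"] for e in entries if "deemed faulty and disabled" in e["fault"]]
    host = [e["fru"] for e in entries if "Host detected fault" in e["fault"]]
    return post, host
```

For Example #3, both lists would contain MB/CMP0/CH0/R0/D0, which corresponds to case 3.b (replace the DIMM).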

Previously Published As
102671

Internal Contributor/submitter
Arnold.Epstein@Sun.COM, Dencho.Kojucharov@Sun.COM, Robert.Balfour@Sun.COM

Internal Eng Business Unit Group
SSG WGS (Workgroup Systems)

Internal Eng Responsible Engineer
Steve.Trullo@Sun.COM

Internal Services Knowledge Engineer
Joe.Davis@Sun.COM

Internal Kasp FAB Legacy ID
102671

Internal Sun Alert & FAB Admin Info
Critical Category:
Significant Change Date: 2006-10-20
Avoidance: Service Procedure
Responsible Manager: Steve.Doherty@Sun.COM
Original Admin Info: WF - Initial draft done on Oct/16
WF - published on Oct/20
WF - updated Solution section per Dencho and republished on Oct/23

Product_uuid
41b7bc41-2581-11da-99bc-080020a9ed93|Sun Fire T2000 Server
79ad78b9-961d-11d9-9adf-080020a9ed93|Sun Fire T1000 Server

Attachments

This solution has no attachment