Document Audience: | INTERNAL |
Document ID: | I0839-1 |
Title: | DIMM error messages on Sun Fire V880 systems with Solaris 8 Updates 5 and 6 make it difficult to identify the failing DIMM |
Copyright Notice: | Copyright © 2005 Sun Microsystems, Inc. All Rights Reserved |
Update Date: | 2004-01-07 |
---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------
FIELD INFORMATION NOTICE
(For Authorized Distribution by SunService)
FIN #: I0839-1
Synopsis: DIMM error messages on Sun Fire V880 systems with Solaris 8 Updates 5 and 6 make it difficult to identify the failing DIMM
Create Date: Jun/17/02
Keywords:
DIMM error messages on Sun Fire V880 systems with Solaris 8 Updates 5 and 6 make it difficult to identify the failing DIMM
SunAlert: No
Top FIN/FCO Report: Yes
Products Reference: DIMM on Sun Fire V880
Product Category: Server / SW Admin
Product Affected:
Mkt_ID Platform Model Description Serial Number
------ -------- ----- ----------- -------------
Systems Affected
----------------
- A30 ALL Sun Fire V880 -
X-Options Affected
------------------
- - - - -
Parts Affected:
Part Number Description Model
----------- ----------- -----
- - -
References:
BugId: 4491362 - CE error reporting in Daktari and Cherrystone is
ambiguous.
PatchId: 108528-13 or higher: SunOS 5.8: kernel update patch.
Issue Description:
DIMM error messages on Sun Fire V880 systems running Solaris 8 U5
(07/01) or Solaris 8 U6 (10/01) without Patch 108528-13 do not provide
the exact location of the failing DIMM. This serviceability issue
could cause support personnel to replace the wrong DIMMs and lead to
unnecessary service calls and system down time.
The Sun Fire V880 platform is capable of having up to 4 dual-CPU/Memory
Boards installed into the chassis. Memory DIMMs are installed into the
CPU/Memory Boards themselves, not onto the System Board. With Solaris
8 U5 (07/01) and Solaris 8 U6 (10/01) the error messaging only reports
a DIMM connector slot (e.g. JXXXX), but not the CPU/Memory Board where
that DIMM resides. CPU/Memory Boards plug into the System Board in one
of four slots.
When an error occurs and Solaris logs the message, the service person
is given a DIMM slot number, but not the CPU/Memory Board on which that
slot is located. As a result, the DIMM may be pulled from that slot on
every installed CPU/Memory Board. DIMMs may also be mistakenly pulled
from the board where the "victim" CPU (the CPU which reported the
event) is located, which may not be where the failing DIMM actually
resides. Either mistake can cause the wrong DIMM or DIMMs to be
replaced. It is believed that this has contributed to the high rate of
NTF (No Trouble Found) for DIMMs replaced on the V880.
The following examples were taken from a system running Solaris 8 U6.
The messages are the same under Solaris 8 U5.
Here is a single bit error that was corrected. It shows that J8000 is
the failing DIMM. However, it does not give the location of the
CPU/Memory Board on which J8000 resides.
May 16 04:05:02 bm006 SUNW,UltraSPARC-III: [ID 423498 kern.notice]
NOTICE: [AFT0] Corrected system bus (CE) Event on CPU7 at TL=0,
errID 0x00000052.228491d4
May 16 04:05:02 bm006 AFSR 0x00000002.000001c3 AFAR
0x00000020.c40330b0
May 16 04:05:02 bm006 Fault_PC 0x19490 Esynd 0x01c3 J8000 <------*
May 16 04:05:02 bm006 SUNW,UltraSPARC-III: [ID 266361 kern.notice]
[AFT0] errID 0x00000052.228491d4 Corrected Memory Error on J3200
is Intermittent
May 16 04:05:02 bm006 SUNW,UltraSPARC-III: [ID 466347 kern.notice]
[AFT0] errID 0x00000052.228491d4 Data Bit 72 was in error and
corrected
Here is an example of a multi-bit problem which results in a UE error.
Multiple CPUs across different CPU/Memory Boards report the failure.
The reporting CPU will not always be where the failing DIMM is located.
May 16 04:06:16 bm006 SUNW,UltraSPARC-III: [ID 774876 kern.warning]
WARNING: [AFT1] Uncorrectable system bus (UE) Event on CPU7
User Data Access at TL=0, errID 0x00000063.74c9f1d0
May 16 04:06:16 bm006 AFSR 0x00000004.000000b1 AFAR
0x00000080.c802c390
May 16 04:06:16 bm006 Fault_PC 0x19490 Esynd 0x00b1 J3100 J3101
J3201 J3200
May 16 04:06:16 bm006 SUNW,UltraSPARC-III: [ID 820243 kern.notice]
[AFT1] errID 0x00000063.74c9f1d0 More than four Bits were in
error
May 16 04:06:16 bm006 SUNW,UltraSPARC-III: [ID 920920 kern.warning]
WARNING: [AFT1] Uncorrectable system bus (UE) Event on CPU3
User Data Access at TL=0, errID 0x00000063.74ca36f4
May 16 04:06:16 bm006 AFSR 0x00000004.000000b1 AFAR
0x00000080.c802c390
May 16 04:06:16 bm006 Fault_PC 0x19590 Esynd 0x00b1 J3100 J3101
J3201 J3200
May 16 04:06:16 bm006 SUNW,UltraSPARC-III: [ID 774076 kern.notice]
[AFT1] errID 0x00000063.74ca36f4 More than four Bits were in
error
May 16 04:06:16 bm006 SUNW,UltraSPARC-III: [ID 147098 kern.warning]
WARNING: [AFT1] Uncorrectable system bus (UE) Event on CPU6
User Data Access at TL=0, errID 0x00000063.74ca70c4
May 16 04:06:16 bm006 AFSR 0x00000004.000000b1 AFAR
0x00000080.c802c390
May 16 04:06:16 bm006 Fault_PC 0x19490 Esynd 0x00b1 J3100 J3101
J3201 J3200
May 16 04:06:16 bm006 SUNW,UltraSPARC-III: [ID 356796 kern.notice]
[AFT1] errID 0x00000063.74ca70c4 More than four Bits were in
error
May 16 04:06:16 bm006 SUNW,UltraSPARC-III: [ID 679866 kern.warning]
WARNING: [AFT1] Uncorrectable system bus (UE) Event on CPU2
User Data Access at TL=0, errID 0x00000063.74ca876c
May 16 04:06:16 bm006 AFSR 0x00000004.000000b1 AFAR
0x00000080.c802c390
May 16 04:06:16 bm006 Fault_PC 0x19910 Esynd 0x00b1 J3100 J3101
J3201 J3200
May 16 04:06:16 bm006 SUNW,UltraSPARC-III: [ID 565709 kern.notice]
[AFT1] errID 0x00000063.74ca876c More than four Bits were in
error
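The UE example above shows four CPUs reporting the same AFAR. As a
rough illustration only, the Python sketch below groups reporting CPUs
by AFAR so that repeated reports of one fault are easier to spot. The
function name and the regular expression are hypothetical and are
based solely on the message layout shown in this FIN.

```python
import re
from collections import defaultdict

# Pairs each "Event on CPUn" with the AFAR that follows it.
# re.S lets the match span the line wrapping seen in the log excerpts.
EVENT = re.compile(
    r"Event on CPU(\d+).*?AFAR\s+(0x[0-9a-fA-F]+\.[0-9a-fA-F]+)", re.S
)

def reporters_by_afar(log_text):
    """Return {AFAR string: [reporting CPU numbers]} for a log excerpt."""
    groups = defaultdict(list)
    for cpu, afar in EVENT.findall(log_text):
        groups[afar.lower()].append(int(cpu))
    return dict(groups)
```

Several CPUs listed under one AFAR indicate a single memory location
seen by multiple reporters, not several independent failures.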
This DIMM serviceability issue has been addressed in Solaris 8 U7
(02/02) and in Kernel PatchId 108528-13 or greater. For those customers
who cannot upgrade to the 02/02 release, it is strongly recommended
that PatchId 108528-13 or greater be installed to alleviate this issue.
Once Patch 108528-13 or Solaris 8 Update 7 is installed, the memory
error messages will include the location of the failing DIMM, including
the CPU/Memory Board. Here is an example of a DIMM error from Solaris
8 U7. Note that the CPU/Memory Board is listed in the error message.
May 23 04:01:37 cl304 SUNW,UltraSPARC-III+: [ID 375193 kern.notice]
NOTICE: [AFT0] Corrected system bus (CE) Event on CPU0 at TL=0,
errID 0x00000043.b4f99f78
May 23 04:01:37 cl304 AFSR 0x00000002<CE>.000000f8 AFAR
0x00000040.ff431100
May 23 04:01:37 cl304 Fault_PC 0x10031120 Esynd 0x00f8 Slot B:
J2901
May 23 04:01:37 cl304 SUNW,UltraSPARC-III+: [ID 737772 kern.notice]
[AFT0] errID 0x00000043.b4f99f78 Corrected Memory Error on
Slot B: J2901 is Sticky
^^^^^^
May 23 04:01:37 cl304 SUNW,UltraSPARC-III+: [ID 594816 kern.notice]
[AFT0] errID 0x00000043.b4f99f78 Data Bit 56 was in error and
corrected
The error message above points to DIMM J2901 on the CPU/Memory Board in
Slot B.
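For anyone scanning message files with a script, the following Python
sketch shows one way to pull the slot and DIMM out of the patched
message format. It is an illustration, not a supported tool: the
function name and pattern are hypothetical and are derived only from
the single Solaris 8 U7 example above.

```python
import re

# Matches the slot-qualified form shown in the Solaris 8 U7 example.
# \s+ tolerates the line wrapping used in this FIN.
CE_WITH_SLOT = re.compile(
    r"Corrected Memory Error on\s+Slot\s+([A-D]):\s+(J\d+)\s+is\s+(\w+)"
)

def parse_ce_location(message):
    """Return (slot, dimm, classification), or None if no slot is reported."""
    match = CE_WITH_SLOT.search(message)
    return match.groups() if match else None
```

A result of None on a CE message suggests the system is still logging
the older format, which names only the DIMM connector.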
Implementation:
---
| | MANDATORY (Fully Pro-Active)
---
---
| X | CONTROLLED PRO-ACTIVE (per Sun Geo Plan)
---
---
| | REACTIVE (As Required)
---
Corrective Action:
The following recommendation is provided as a guideline for authorized
Enterprise Services Field Representatives who may encounter the above
mentioned problem.
To aid in diagnosing DIMM failures on Sun Fire V880 systems, perform
the following actions.
NOTE: It is recommended to run Extended POST after any failing DIMMs are
replaced to verify that the memory is OK. If time permits, also
run the Memory Test from VTS.
1. Install Solaris 8 Kernel PatchId 108528-13 or greater, or upgrade the
OS to Solaris 8 Update 7 (02/02) or later.
OR
2. If it is not possible to install PatchId 108528-13 or upgrade to
Solaris 8 U7, use the procedure below to determine the location of
failed DIMMs on the V880 platform.
DIMM FAILURE ANALYSIS PROCEDURE
-------------------------------
A. Correctable Error (CE):
--------------------------
Use the first error message shown in the Issue Description, which is
a CE error. Two pieces of data are needed to determine where the
failing DIMM is located: the AFAR associated with the DIMM, and the
Memory Configuration that OBP sets up prior to booting the system. The
AFAR can be gathered from the error message. The AFAR is
0x00000020.c40330b0 as shown below.
May 16 04:05:02 bm006 SUNW,UltraSPARC-III: [ID 423498 kern.notice]
NOTICE: [AFT0] Corrected system bus (CE) Event on CPU7 at TL=0,
errID 0x00000052.228491d4
May 16 04:05:02 bm006 AFSR 0x00000002.000001c3
AFAR 0x00000020.c40330b0
^^^^^^^^^^^^^^^^^^^^^^^^
May 16 04:05:02 bm006 Fault_PC 0x19490 Esynd 0x01c3 J8000
May 16 04:05:02 bm006 SUNW,UltraSPARC-III: [ID 266361 kern.notice]
[AFT0] errID 0x00000052.228491d4 Corrected Memory Error on J3200
is Intermittent
May 16 04:05:02 bm006 SUNW,UltraSPARC-III: [ID 466347 kern.notice]
[AFT0] errID 0x00000052.228491d4 Data Bit 72 was in error and
corrected
Next, collect the Memory Configuration. There are two possible methods
to gather this information, 1 or 2 below.
1. Use 'cfgadm -av | grep "base address"' from Solaris. This should
be captured from the system which is producing the DIMM error. Do
NOT rely on the output below or any other system's output. Memory
configurations can vary from system to system depending on their
own unique memory layout.
The following output is displayed:
SBa::memory connected configured ok base
address 0x2000000000, 1048576 KBytes total, unconfigurable
SBb::memory connected configured ok base
address 0x4000000000, 1048576 KBytes total, unconfigurable
SBc::memory connected configured ok base
address 0X6000000000, 1048576 KBytes total, unconfigurable
SBd::memory connected configured ok base
address 0X8000000000, 1048576 KBytes total, unconfigurable
This shows the base address associated with each CPU/Memory Board.
SBa: CPU/Memory Board in Slot A, base address 0x2000000000
SBb: CPU/Memory Board in Slot B, base address 0x4000000000
SBc: CPU/Memory Board in Slot C, base address 0x6000000000
SBd: CPU/Memory Board in Slot D, base address 0x8000000000
2. Perform the following three steps from OBP to gather the OBP
memory Configuration. The memory configuration must be captured from
the system which is producing the DIMM error. Do NOT rely on the
output below or any other system's output. Memory configurations
can vary from system to system depending on their own unique memory
layout.
Set up the OBP to gather the Memory Configuration:
ok setenv diag-switch? true
ok setenv diag-level min
ok reset-all
The system key switch could also be set to the "diag" position,
or the NVRAM variable "diag-level" could be set to "max" (ok
setenv diag-level max). The quickest way would be the 3 steps
above.
Here is output from the OBP showing the Memory Configuration:
03:57:51 Memory Configuration:
03:57:51 CPU0 Bank0 128 + 128 + 128 + 128 : 512MB @ 2000000000
8-way #0
03:57:51 CPU0 Bank1 128 + 128 + 128 + 128 : 512MB @ 2000000000
8-way #2
03:57:52 CPU0 Bank2 128 + 128 + 128 + 128 : 512MB @ 2000000000
8-way #4
03:57:52 CPU0 Bank3 128 + 128 + 128 + 128 : 512MB @ 2000000000
8-way #6
03:57:52 CPU1 Bank0 128 + 128 + 128 + 128 : 512MB @ 4000000000
8-way #0
03:57:52 CPU1 Bank1 128 + 128 + 128 + 128 : 512MB @ 4000000000
8-way #2
03:57:52 CPU1 Bank2 128 + 128 + 128 + 128 : 512MB @ 4000000000
8-way #4
03:57:52 CPU1 Bank3 128 + 128 + 128 + 128 : 512MB @ 4000000000
8-way #6
03:57:52 CPU2 Bank0 128 + 128 + 128 + 128 : 512MB @ 2000000000
8-way #1
03:57:52 CPU2 Bank1 128 + 128 + 128 + 128 : 512MB @ 2000000000
8-way #3
03:57:52 CPU2 Bank2 128 + 128 + 128 + 128 : 512MB @ 2000000000
8-way #5
03:57:52 CPU2 Bank3 128 + 128 + 128 + 128 : 512MB @ 2000000000
8-way #7
03:57:52 CPU3 Bank0 128 + 128 + 128 + 128 : 512MB @ 4000000000
8-way #1
03:57:52 CPU3 Bank1 128 + 128 + 128 + 128 : 512MB @ 4000000000
8-way #3
03:57:52 CPU3 Bank2 128 + 128 + 128 + 128 : 512MB @ 4000000000
8-way #5
03:57:53 CPU3 Bank3 128 + 128 + 128 + 128 : 512MB @ 4000000000
8-way #7
03:57:53 CPU4 Bank0 128 + 128 + 128 + 128 : 512MB @ 6000000000
8-way #0
03:57:53 CPU4 Bank1 128 + 128 + 128 + 128 : 512MB @ 6000000000
8-way #2
03:57:53 CPU4 Bank2 128 + 128 + 128 + 128 : 512MB @ 6000000000
8-way #4
03:57:53 CPU4 Bank3 128 + 128 + 128 + 128 : 512MB @ 6000000000
8-way #6
03:57:53 CPU5 Bank0 128 + 128 + 128 + 128 : 512MB @ 8000000000
8-way #0
03:57:53 CPU5 Bank1 128 + 128 + 128 + 128 : 512MB @ 8000000000
8-way #2
03:57:53 CPU5 Bank2 128 + 128 + 128 + 128 : 512MB @ 8000000000
8-way #4
03:57:53 CPU5 Bank3 128 + 128 + 128 + 128 : 512MB @ 8000000000
8-way #6
03:57:53 CPU6 Bank0 128 + 128 + 128 + 128 : 512MB @ 6000000000
8-way #1
03:57:53 CPU6 Bank1 128 + 128 + 128 + 128 : 512MB @ 6000000000
8-way #3
03:57:53 CPU6 Bank2 128 + 128 + 128 + 128 : 512MB @ 6000000000
8-way #5
03:57:53 CPU6 Bank3 128 + 128 + 128 + 128 : 512MB @ 6000000000
8-way #7
03:57:53 CPU7 Bank0 128 + 128 + 128 + 128 : 512MB @ 8000000000
8-way #1
03:57:54 CPU7 Bank1 128 + 128 + 128 + 128 : 512MB @ 8000000000
8-way #3
03:57:54 CPU7 Bank2 128 + 128 + 128 + 128 : 512MB @ 8000000000
8-way #5
03:57:54 CPU7 Bank3 128 + 128 + 128 + 128 : 512MB @ 8000000000
8-way #7
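Either capture reduces to the same table of one base address per
slot. The Python sketch below illustrates this under stated
assumptions: the function names are hypothetical, the patterns are
based only on the sample output in this FIN, and the CPU-to-slot table
is the fixed V880 assignment listed in this FIN (Slot A: CPU0 and
CPU2, Slot B: CPU1 and CPU3, Slot C: CPU4 and CPU6, Slot D: CPU5 and
CPU7).

```python
import re

# CPU-to-slot assignment as stated in this FIN.
CPU_SLOT = {0: "A", 2: "A", 1: "B", 3: "B", 4: "C", 6: "C", 5: "D", 7: "D"}

def bases_from_cfgadm(text):
    """Return {slot: base address} from 'cfgadm -av | grep "base address"'."""
    pattern = re.compile(
        r"SB([a-d])::memory.*?base\s+address\s+(0[xX][0-9a-fA-F]+)", re.S
    )
    return {slot.upper(): int(addr, 16) for slot, addr in pattern.findall(text)}

def bases_from_obp(text):
    """Return {slot: base address} from OBP 'Memory Configuration' lines."""
    pattern = re.compile(r"CPU(\d+)\s+Bank\d+.*?@\s*([0-9a-fA-F]+)")
    bases = {}
    for cpu, addr in pattern.findall(text):
        # OBP prints the address in hex without a 0x prefix.
        bases[CPU_SLOT[int(cpu)]] = int(addr, 16)
    return bases
```

Feed these functions only output captured from the system under
diagnosis; the addresses printed in this FIN are examples.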
Now, all of the data has been collected to associate the failing DIMM
with the correct CPU/Memory Board.
The AFAR is 0x00000020.c40330b0. Take the two digits immediately to
the left of the "."; in this case they are 20. The value 20 corresponds
to the base address 2000000000 seen in the Memory Configuration output
from the OBP. 2000000000 is associated with CPUs 0 and 2, which are on
the CPU/Memory Board in Slot A.
Here is how the CPUs map out to each slot on the System Board:
Slot A CPU0 & CPU2
Slot B CPU1 & CPU3
Slot C CPU4 & CPU6
Slot D CPU5 & CPU7
This mapping is hard coded. For example, if there were only one
CPU/Memory Board, located in Slot B, its CPUs would show up as 1 and 3.
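The lookup described above can be sketched in Python as follows. This
is a minimal illustration, not a supported tool. The function names
are hypothetical, and EXAMPLE_BASES holds the base addresses of the
example system in this FIN only. The sketch follows the FIN's rule of
matching on the digits to the left of the ".", which assumes the
upper word of the AFAR equals the upper word of the board's base
address, as it does in the example.

```python
# Base addresses of the example system in this FIN. Replace with the
# values captured from the system under diagnosis.
EXAMPLE_BASES = {
    "A": 0x2000000000,
    "B": 0x4000000000,
    "C": 0x6000000000,
    "D": 0x8000000000,
}

def afar_upper_word(afar):
    """Return the part of an AFAR left of the '.', as an integer."""
    high, _low = afar.lower().replace("0x", "", 1).split(".")
    return int(high, 16)

def slot_for_afar(afar, slot_bases):
    """Return the slot whose base address matches the AFAR, or None."""
    # 0x00000020 shifted left by 32 bits gives 0x2000000000.
    base = afar_upper_word(afar) << 32
    for slot, slot_base in slot_bases.items():
        if slot_base == base:
            return slot
    return None
```

With the example table, slot_for_afar("0x00000020.c40330b0",
EXAMPLE_BASES) returns "A", which agrees with the manual result above.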
B. Uncorrectable Error (UE):
----------------------------
May 16 04:06:16 bm006 SUNW,UltraSPARC-III: [ID 920920 kern.warning]
WARNING: [AFT1] Uncorrectable system bus (UE) Event on CPU3 User
Data Access at TL=0, errID 0x00000063.74ca36f4
May 16 04:06:16 bm006 AFSR 0x00000004.000000b1 AFAR
0x00000080.c802c390
May 16 04:06:16 bm006 Fault_PC 0x19590 Esynd 0x00b1 J3100 J3101
J3201 J3200
May 16 04:06:16 bm006 SUNW,UltraSPARC-III: [ID 774076 kern.notice]
[AFT1] errID 0x00000063.74ca36f4 More than four Bits were in
error
The AFAR above is 0x00000080.c802c390. Take the two digits to the left
of the "."; in this case they are 80. This error came from the same
system as the CE example, so the same memory configuration applies.
80 corresponds to 8000000000, which relates to CPUs 5 and 7, or the
CPU/Memory Board in Slot D. Notice that the error message was generated
from CPU3. However, the failing DIMMs are not on the same board as
CPU3. They are on the CPU/Memory Board in Slot D.
Also note that this is a UE error where more than four bits were in
error. The failure cannot be isolated to one DIMM in this case, so all
four DIMMs (J3100 J3101 J3201 J3200) located on the CPU/Memory Board in
Slot D must be replaced.
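To make the difference between the reporting CPU and the failing board
explicit, the Python sketch below returns both slots for one logged
event. It is illustrative only: the function name is hypothetical, the
patterns follow the message layout shown in this FIN, and BASE_SLOT
holds the example system's values, keyed by the digits to the left of
the "." in the AFAR.

```python
import re

# CPU-to-slot assignment as stated in this FIN.
CPU_SLOT = {0: "A", 2: "A", 1: "B", 3: "B", 4: "C", 6: "C", 5: "D", 7: "D"}

# Upper AFAR word to slot for the example system only. Replace with
# values derived from the system under diagnosis.
BASE_SLOT = {0x20: "A", 0x40: "B", 0x60: "C", 0x80: "D"}

def reporter_and_fault_slots(message):
    """Return (slot of the reporting CPU, slot holding the failing DIMMs)."""
    cpu = int(re.search(r"Event on CPU(\d+)", message).group(1))
    afar = re.search(r"AFAR\s+0x([0-9a-fA-F]{8})\.[0-9a-fA-F]{8}", message)
    high = int(afar.group(1), 16)
    return CPU_SLOT[cpu], BASE_SLOT.get(high)
```

For the UE message above the two slots differ, which is why the DIMMs
must not be pulled from the board holding the reporting CPU.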
Comments:
None
----------------------------------------------------------------------------
Implementation Footnote:
i) In case of MANDATORY FINs, Enterprise Services will attempt to
contact all affected customers to recommend implementation of
the FIN.
ii) For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical
support teams will recommend implementation of the FIN (to their
respective accounts), at the convenience of the customer.
iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the
need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network
browser as follows:
SunWeb Access:
--------------
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/
* From there, select the appropriate link to query or browse the FIN and
FCO Homepage collections.
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/
* From there, select the appropriate link to browse the FIN or FCO index.
Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to finfco-manager@Sun.COM
--------------------------------------------------------------------------