Asset ID: 1-77-1022237.1
Update Date: 2010-06-17
Solution Type: Sun Alert Sure Solution
1022237.1: Sun Storage 7x00 2009.Q3 Software Release May Result in an Incorrect Diagnosis of CPU Correctable Error
Related Items
- Sun Storage 7410 Unified Storage System
- Sun Storage 7110 Unified Storage System
- Sun Storage 7210 Unified Storage System
- Sun Storage 7310 Unified Storage System
Related Categories
- GCS>Sun Microsystems>Sun Alert>Criteria Category>Availability
- GCS>Sun Microsystems>Sun Alert>Release Phase>Resolved
PreviouslyPublishedAs
278130
Bug Id
SUNBUG: 6853745
Product
Sun Storage 7000 Unified Storage System
Sun Storage 7110 Unified Storage System
Sun Storage 7210 Unified Storage System
Sun Storage 7310 Unified Storage System
Sun Storage 7410 Unified Storage System
Date of Resolved Release
02-Mar-2010
Impact
On the 2009.Q3 Software Release for Sun Storage 7000/7110/7210/7310/7410,
a CPU may be incorrectly diagnosed as faulty, resulting in unnecessary
hardware replacement. This issue also causes performance degradation.
Contributing Factors
This issue may occur on the following releases:
- Sun Storage Software release 2009.Q3.0.0 through 2009.Q3.4.0 for Sun Storage 7000/7110/7210/7310/7410
Note: This issue occurs in the event of CPU correctable errors.
To determine whether you have an affected release, use a browser to
connect to the appliance management BUI on port 215 (https://applianceIP:215).
Click the Sun logo at the top left.
A window will be displayed showing support data for the system. The
Operating System version can be found near the end of the list. The
version corresponding to the above releases appears immediately after
the "@" in the Operating System line.
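The release-range check above can be sketched in a few lines of Python. This is only an illustration: it assumes the release has already been written in the "2009.Q3.4.0" form used in this alert, and the names parse_release and is_affected are hypothetical. The string shown after the "@" in the BUI may be formatted differently, in which case it must be converted first.

```python
import re

# Affected range as stated in this alert (inclusive on both ends).
FIRST_AFFECTED = (2009, 3, 0, 0)
LAST_AFFECTED = (2009, 3, 4, 0)


def parse_release(text):
    """Turn a release string such as '2009.Q3.4.0' into a comparable tuple.

    Assumption: the release is written year.Qn.minor.micro, as in this alert.
    """
    m = re.fullmatch(r"(\d{4})\.Q([1-4])\.(\d+)\.(\d+)", text.strip())
    if m is None:
        raise ValueError(f"unrecognized release string: {text!r}")
    return tuple(int(g) for g in m.groups())


def is_affected(release):
    """Return True if the release falls inside the affected range."""
    return FIRST_AFFECTED <= parse_release(release) <= LAST_AFFECTED


if __name__ == "__main__":
    for rel in ("2009.Q3.0.0", "2009.Q3.4.0", "2009.Q3.4.1", "2010.Q1.0.0"):
        print(rel, "affected" if is_affected(rel) else "not affected")
```

Comparing tuples of integers avoids the pitfalls of comparing version strings character by character.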
Symptoms
The exact fault differs depending on the type of correctable error
received, but in every case it indicates that one of the cores of a CPU
is faulty. This will generate an ASR event, alert, or active problem on
the system. The message ID will always match the form "GMCA-XXXX-XX".
The alert, log message, or active problem, "a level 2 cache on this
cpu is faulty", can be found in the following locations in the browser
interface:
- Maintenance/Logs/Alert
- Maintenance/Logs/System
- Maintenance/Problems
Workaround
To avoid the issue until the resolution can be applied, manually mark
the CPU as repaired through the CLI or BUI.
To do this in the BUI:
* Navigate to Maintenance:Problems
* Select the CPU fault
* Click the Mark Repaired button
Resolution
This issue is addressed in the following releases:
- Sun Storage 7x00 2009.Q3.4.1
- Sun Storage 7x00 2010.Q1.0.0 or later
Information on the above upgrades can be found at:
http://wikis.sun.com/display/FishWorks/Software+Updates
Note: If a CPU has already been incorrectly diagnosed as faulty, it
will still need to be manually marked as repaired via the CLI or BUI
after the upgrade. Please see Workaround above.
Modification History:
17-Jun-2010: Updated to include Sun Storage 7110/7210/7310/7410
Internal Comments (for SAs)
Root Cause
This is due to a defect in the diagnosis software, which inappropriately
"replays" correctable errors every 10 seconds. This
turns a single correctable error, a normally benign event, into
what appears to be a pathological problem with the CPU. This
was fixed in Solaris by the following CR:
6853745 Same ereport is generated every 10 seconds automatically ...
This issue has been resolved by pulling the above CR into
the 2009.Q3.4.1 release, and the fix is already present in the
upcoming 2010.Q1 release.
For Support personnel:
To distinguish between this false diagnosis and a truly bad CPU,
the following steps must be taken:
1. The customer must be running a software release between 2009.Q3.0.0
and 2009.Q3.4.0.
2. The fault must have a message ID of the form "GMCA-XXXX-XX".
3. Take the UUID of the fault (found in the active problems page)
and run the following command from the Solaris shell:
fmdump -V -u <uuid> -e | \
egrep 'ereport|IA32_MCi_STATUS|IA32_MCi_ADDR'
This command can also be run against a support bundle by changing
to the 'fm' directory and adding 'fltlog' at the end of the fmdump
command (before the pipe).
4. Determine if the output consists of the same CPU ereport replayed
every 10 seconds. An example of a bad diagnosis is:
Jan 25 2010 01:14:21.506739770 ereport.cpu.generic-x86.l2cache
class = ereport.cpu.generic-x86.l2cache
IA32_MCi_STATUS = 0x940001000000010a
IA32_MCi_ADDR = 0xa07c900
Jan 25 2010 01:14:11.506975918 ereport.cpu.generic-x86.l2cache
class = ereport.cpu.generic-x86.l2cache
IA32_MCi_STATUS = 0x940001000000010a
IA32_MCi_ADDR = 0xa07c900
Jan 25 2010 01:14:01.507217171 ereport.cpu.generic-x86.l2cache
class = ereport.cpu.generic-x86.l2cache
IA32_MCi_STATUS = 0x940001000000010a
IA32_MCi_ADDR = 0xa07c900
Note that the ereports appear approximately every 10 seconds, and
they contain identical payloads.
If the customer is running different software, or the ereports do
not match the above pattern, the CPU is truly faulty and must be
replaced.
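The check in step 4 can be sketched as a short Python script that reads the filtered fmdump output and tests whether it is one ereport repeated at a steady interval. This is an illustration, not a supported tool. The function names, the one-second tolerance, and the minimum of three events are assumptions chosen here, and the parser only understands the output layout shown in the example above.

```python
import re
from datetime import datetime, timedelta

MONTHS = {m: i for i, m in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Header line, e.g. "Jan 25 2010 01:14:21.506739770 ereport.cpu.generic-x86.l2cache"
HEADER = re.compile(
    r"^\s*(" + "|".join(MONTHS) + r")\s+(\d{1,2})\s+(\d{4})\s+"
    r"(\d{2}):(\d{2}):(\d{2})\.(\d+)\s+(ereport\.\S+)")
# Payload line, e.g. "IA32_MCi_ADDR = 0xa07c900"
FIELD = re.compile(r"^\s*(\S+)\s*=\s*(\S+)")


def parse_ereports(text):
    """Parse the egrep-filtered fmdump output into a list of events."""
    events = []
    for line in text.splitlines():
        h = HEADER.match(line)
        if h:
            mon, day, year, hh, mm, ss, frac, cls = h.groups()
            when = datetime(int(year), MONTHS[mon], int(day),
                            int(hh), int(mm), int(ss))
            when += timedelta(seconds=float("0." + frac))
            events.append({"time": when, "class": cls, "payload": {}})
            continue
        f = FIELD.match(line)
        if f and events:
            events[-1]["payload"][f.group(1)] = f.group(2)
    return events


def looks_like_replay(events, interval=10.0, tolerance=1.0, min_events=3):
    """True if all events are identical and spaced about `interval` seconds.

    tolerance and min_events are assumptions, not values from the alert.
    """
    if len(events) < min_events:
        return False
    ordered = sorted(events, key=lambda e: e["time"])
    first = ordered[0]
    same = all(e["class"] == first["class"] and
               e["payload"] == first["payload"] for e in ordered)
    gaps = [(b["time"] - a["time"]).total_seconds()
            for a, b in zip(ordered, ordered[1:])]
    steady = all(abs(g - interval) <= tolerance for g in gaps)
    return same and steady


# Example output copied from step 4 of this alert.
SAMPLE = """\
Jan 25 2010 01:14:21.506739770 ereport.cpu.generic-x86.l2cache
class = ereport.cpu.generic-x86.l2cache
IA32_MCi_STATUS = 0x940001000000010a
IA32_MCi_ADDR = 0xa07c900
Jan 25 2010 01:14:11.506975918 ereport.cpu.generic-x86.l2cache
class = ereport.cpu.generic-x86.l2cache
IA32_MCi_STATUS = 0x940001000000010a
IA32_MCi_ADDR = 0xa07c900
Jan 25 2010 01:14:01.507217171 ereport.cpu.generic-x86.l2cache
class = ereport.cpu.generic-x86.l2cache
IA32_MCi_STATUS = 0x940001000000010a
IA32_MCi_ADDR = 0xa07c900
"""

if __name__ == "__main__":
    print("replayed ereport:", looks_like_replay(parse_ereports(SAMPLE)))
```

The events are sorted before the gaps are measured because the output may be listed newest first, as in the example. A positive result only covers step 4; the release and message ID conditions in steps 1 and 2 still have to be confirmed separately.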
keywords: amber
Internal Contributor/submitter
Renee.Bennett@sun.com
Internal Eng Responsible Engineer
Eric.Schrock@sun.com
Internal Services Knowledge Engineer
karen.edwards@sun.com
Internal Eng Business Unit Group
Attachments
This solution has no attachment