Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-77-1000009.1
Update Date:2011-02-25
Keywords:

Solution Type  Sun Alert Sure

Solution  1000009.1 :   T3/6120/6320/6920 Array firmware 3.2.4 WITHDRAWN  


Related Items
  • Sun Storage 6320 System
  • Sun Storage 6120 Array
  • Sun Storage 6920 System
Related Categories
  • GCS>Sun Microsystems>Sun Alert>Criteria Category>Availability
  • GCS>Sun Microsystems>Sun Alert>Release Phase>Resolved
  • GCS>Sun Microsystems>Sun Alert>Criteria Category>Data Loss

Previously Published As
200012


Product
Sun StorageTek 6120 Array
Sun StorageTek 6120/6320 Controller Firmware 3.2
Sun StorageTek 6320 System
Sun StorageTek 6920 System

Bug Id
<SUNBUG: 6455175>

Date of Workaround Release
17-AUG-2006

Date of Resolved Release
07-DEC-2006

Impact

Upgrading an SE6x20 controller to firmware 3.2.4 may leave the array unable to recover from later drive failures: after a failed drive is replaced, the replacement drive does not come online and the array remains in a degraded state. A second drive failure while in this state may result in the loss of customer data.


Contributing Factors

This issue occurs in the following releases:

  • Sun StorEdge T3B Arrays with Array firmware 3.2.4 (as delivered in patch 116930-05)
  • Sun StorEdge 6120 Arrays with Array firmware 3.2.4 (as delivered in patch 116931-20)
  • Sun StorEdge 6320 Arrays with 6020 Array firmware 3.2.4 (6320 Release 1.3.2)
  • Sun StorEdge 6920 Arrays with 6020 Array firmware 3.2.4 (6920 Release 3.0.1.19)
  • Sun StorEdge 6920 Arrays with 6020 Array firmware 3.2.4 (6920 Release 3.0.1.20)

Note: This issue affects only firmware 3.2.4. See the Resolution section below.


Symptoms

The most prevalent symptom is that a failed drive remains in a "fault disabled" state when viewed from StorADE. The problem becomes more evident when that drive is replaced, because the new drive also fails to come online.

The following messages mark the original drive failure and indicate that the array may be exposed to this issue. On T3B and 6120 arrays, they appear in the array syslog. On 6320 and 6920 arrays, they appear in the messages.array file, which can be viewed using the StorADE Solution Extract utility.

Example #1

Dec 23 14:08:21 ISR1[1]: N: u1d11 SCSI Disk Error Occurred (path = 0x1)
Dec 23 14:08:21 ISR1[1]: N: u1d11 Sense Key = 0x1, Asc = 0x5d, Ascq = 0x0
Dec 23 14:08:21 ISR1[1]: N: u1d11 Sense Data Description = Failure Prediction Threshold Exceeded
Dec 23 14:08:21 ISR1[1]: N: u1d11 SVD_CHECK_ERROR: prediction err: 01/5D

Example #2

Jul 29 22:32:51 LPCT[1]: W: u1d03 Not present
Jul 29 22:34:35 ISR1[1]: E: Drive u1d03 Additional errors sense 5 asc 36 ascq 0 detected during copy recon. Drive is disabled
Jul 29 22:34:35 ISR1[1]: W: u1d03 SCSI error occurred: Not Ready (sense key = 0x2). Logical Unit Not Ready, Initializing CMD Required.
Jul 29 22:34:35 ISR1[1]: E: Drive u1d03 Additional errors sense 2 asc 4 ascq 2 detected during copy recon. Drive is disabled
Jul 29 22:34:36 MNXT[1]: N: u1d01 System area recon fail due to write error
Jul 29 22:34:36 MNXT[1]: W: u1d03 could not create system area
Jul 29 22:34:37 LPCT[1]: N: u1d03 Bypassed on loop 2
Jul 29 22:34:36 MNXT[2]: N: u1d03 Unable to access the drive (err = 3)
Jul 29 22:34:37 MNXT[2]: N: u1d03 Unable to access the drive (err = 3)
Jul 29 22:34:39 LPCT[1]: N: u1d03 Bypassed on loop 1
Jul 29 22:34:38 MNXT[2]: N: u1d03 Unable to access the drive (err = 3)
Jul 29 22:34:40 MNXT[2]: N: u1d03 Unable to access the drive (err = 3)
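
As a rough illustration of how the messages above might be picked out of a saved array syslog or messages.array extract, the following Python sketch searches each line for three of the message fragments shown in the examples and groups the matches by drive ID. It is not a Sun or StorADE tool. The pattern names, the scan_messages function and the sample list are assumptions made for this example, and the set of patterns is not an exhaustive signature for this issue.

```python
import re

# Message fragments copied from the examples in this alert.
# The dictionary keys are illustrative labels, not Sun terminology.
PATTERNS = {
    "prediction_error": re.compile(r"SVD_CHECK_ERROR: prediction err: 01/5D"),
    "sysarea_create_failed": re.compile(r"could not create system area"),
    "drive_inaccessible": re.compile(r"Unable to access the drive"),
}

# Drive IDs in the examples look like u1d11 or u1d03 (unit number, drive number).
DRIVE_RE = re.compile(r"\b(u\d+d\d+)\b")


def scan_messages(lines):
    """Return {drive_id: set of matched pattern names} for the given log lines."""
    hits = {}
    for line in lines:
        drive = DRIVE_RE.search(line)
        if drive is None:
            continue  # no drive ID on this line, nothing to attribute
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.setdefault(drive.group(1), set()).add(name)
    return hits


# Lines taken from Example #1 and Example #2 above.
SAMPLE = [
    "Dec 23 14:08:21 ISR1[1]: N: u1d11 SVD_CHECK_ERROR: prediction err: 01/5D",
    "Jul 29 22:34:36 MNXT[1]: W: u1d03 could not create system area",
    "Jul 29 22:34:36 MNXT[2]: N: u1d03 Unable to access the drive (err = 3)",
    "Jul 29 22:34:37 LPCT[1]: N: u1d03 Bypassed on loop 2",
]

if __name__ == "__main__":
    for drive_id, names in sorted(scan_messages(SAMPLE).items()):
        print(drive_id, sorted(names))
```

A match only suggests that the array should be checked against this alert. As noted elsewhere in this document, the condition can also arise without any message being logged.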

 


Workaround

Do not upgrade to the 3.2.4 array firmware for the above arrays.

For customers already running firmware 3.2.4, the workaround and recovery for this issue include downgrading the array firmware to the previous revision. Contact Sun Services for the proper recovery procedure for your model of array.


Resolution

This issue is addressed in the following releases:

  • Sun StorEdge T3B Arrays with Array firmware 3.2.5 (as delivered in patch 116930-06 or later)
  • Sun StorEdge 6120 Arrays with Array firmware 3.2.5 (as delivered in patch 116931-21 or later)
  • Sun StorEdge 6320 Arrays with 6020 Array firmware 3.2.5
  • Sun StorEdge 6920 Arrays with 6020 Array firmware 3.2.5

Note: If the failed drive has already been replaced before the upgrade, the replacement drive will spin up properly when the new firmware is booted. If the failed drive has not yet been replaced, normal disk replacement will succeed once the upgrade is complete.

Moving forward, although the 01/5d errors may still occur from time to time, normal supported drive replacement procedures will function as expected.
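
The patch revisions named in the Contributing Factors and Resolution sections can be captured in a small lookup. The Python sketch below is one hedged way to do that. The ALERT_PATCHES table and the classify_patch function are hypothetical names for this example, and only the T3B and 6120 patch IDs are covered, because the 6320 and 6920 entries in this alert are identified by release rather than by patch. Revisions earlier than the affected one are reported as not listed, since this alert does not describe them.

```python
# Patch base -> (revision delivering firmware 3.2.4, first revision delivering 3.2.5),
# as stated in this alert.
ALERT_PATCHES = {
    "116930": (5, 6),    # T3B:  116930-05 is affected, 116930-06 or later is fixed
    "116931": (20, 21),  # 6120: 116931-20 is affected, 116931-21 or later is fixed
}


def classify_patch(patch_id):
    """Classify a patch ID such as '116930-05' against the revisions in this alert."""
    base, _, rev_text = patch_id.partition("-")
    if base not in ALERT_PATCHES or not rev_text.isdigit():
        return "not covered by this alert"
    affected_rev, fixed_rev = ALERT_PATCHES[base]
    rev = int(rev_text)
    if rev >= fixed_rev:
        return "resolved"
    if rev == affected_rev:
        return "affected"
    return "earlier revision, not listed in this alert"


if __name__ == "__main__":
    for patch in ("116930-05", "116930-06", "116931-20", "116931-22"):
        print(patch, "->", classify_patch(patch))
```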



Modification History
Date: 18-AUG-2006
  • Updated Relief/Workaround section

Date: 07-DEC-2006
  • Updated Resolution section.
  • State: Resolved


References

<SUNPATCH: 116930-06>
<SUNPATCH: 116931-21>

Previously Published As
102579
Internal Comments



The problem in v3.2.4 will affect both T4 and T3B.
The fix in v3.2.5 should cover both of them too.
   Jim Gou - Engenio Storage Group - LSI Logic
   Sun Office phone: (510)936-2727


The problem is caused by the use of a Mask Bit, previously unused in the array firmware, to solve an earlier bug (Bug 6366175 and Bug 6359654; see Sun Alert 102193). Under certain error conditions, a "Mask Bit" in the sysarea is set during a drive failure. Array f/w releases 3.2.3 and earlier did not utilize the Mask Bit even though it was set.



Array f/w 3.2.4 now utilizes the Mask Bit. If 3.2.4 sees the Mask Bit set for a drive, that drive will be put into a fault/disabled state regardless of the actual condition.



The drive may in fact be 100% functional.



We know of various ways the Mask Bit can get set. They are:



1) In array f/w 3.1.x, a 01/5D error (Sense Key 0x1, ASC 0x5D) will set the Mask Bit. In these releases the drive fault is most likely legitimate. The drive CAN be swapped successfully and brought back to a Ready/Enabled state. However, the Mask Bit will remain set. Upon upgrading to 3.2.4, which could be weeks or months later, the problem will be triggered.



2) In array f/w 3.2.x, various BEFIT functions will also set the Mask Bit. These conditions are known to be associated with "Failure Prediction Threshold Exceeded" messages. However, in some cases, NO message appears when the Mask Bit is set.



Once the Mask Bit is set, the drive will be fault/disabled. Replacement drives will remain fault/disabled.



3) In Array f/w 3.2.4, a device failed by BEFIT, legitimately, could also set the Mask Bit. Replacement drives will remain fault/disabled. Please be aware, the Mask Bit will only be set under a limited set of conditions.
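
The behavior described above can be illustrated with a toy model. The Python sketch below is not array firmware code and does not reflect the actual sysarea layout. The Slot class and the helper functions are hypothetical and encode only what this section states: the Mask Bit is kept in the sysarea rather than on the drive, releases other than 3.2.4 do not act on it, and 3.2.4 forces the drive into fault/disabled whenever the bit is set. How 3.2.5 handles the bit internally is not described here, so the model simply treats it like any release that does not act on the bit.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    """One drive position. The Mask Bit is modelled as per-slot sysarea state."""
    mask_bit: bool = False
    drive_healthy: bool = True


def parse_version(text):
    """Turn '3.2.4' into (3, 2, 4) for comparison."""
    return tuple(int(part) for part in text.split("."))


def record_drive_failure(slot):
    """A drive failure of the kind described above: drive is bad, Mask Bit is set."""
    slot.drive_healthy = False
    slot.mask_bit = True


def replace_drive(slot):
    """Swap in a good drive. The Mask Bit in the sysarea is left untouched."""
    slot.drive_healthy = True


def slot_state(slot, firmware):
    """State reported for the slot under the given firmware release (toy model)."""
    if parse_version(firmware) == (3, 2, 4) and slot.mask_bit:
        # 3.2.4 acts on the Mask Bit regardless of the drive's real condition.
        return "fault/disabled"
    return "ready/enabled" if slot.drive_healthy else "fault/disabled"


if __name__ == "__main__":
    slot = Slot()
    record_drive_failure(slot)
    replace_drive(slot)
    print("3.2.3:", slot_state(slot, "3.2.3"))
    print("3.2.4:", slot_state(slot, "3.2.4"))
```

In this model, a slot whose drive was replaced under 3.2.3 or earlier reports ready/enabled until the array is upgraded to 3.2.4, which corresponds to case 1 above.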



Field and Solution Center Engineers are required to open an escalation into PTS-STORAGE for the proper corrective action and to reference this Sun Alert.



NOTE 1: Avoidance and recovery of this condition require a downgrade of the array firmware (to 3.2.3 for T3B/6120/6320, or 3.2.2 for 6920), in addition to the steps indicated by the assigned PTS engineer. Please defer to the PTS engineer's instructions before taking any action.



NOTE 2: As stated in the bug, the other possible workaround is to initialize the sysarea of the array, effectively destroying the data, and requiring volume/vdisk/volslice rebuild and restore. As with the firmware downgrade, Field and Solution Center Engineers are required to open an escalation into PTS-STORAGE, for the proper corrective action.


Internal Contributor/submitter
paul.mazzarella@sun.com, Curtis.Decotis@Sun.COM,

Internal Eng Business Unit Group
NWS (Network Storage)

Internal Eng Responsible Engineer
Jim.Gou@Sun.COM

Internal Services Knowledge Engineer
karen.edwards@sun.com

Internal Escalation ID
1-18532964,1-18718532,1-18776043,1-18893736,1-18894157

Internal Resolution Patches
116930-06, 116931-21

Internal Sun Alert Kasp Legacy ID
102579

Internal Sun Alert & FAB Admin Info
Critical Category: Data Loss, Availability ==> Diagnosis, Availability ==> Regression
Significant Change Date: 2006-08-17, 2006-12-07
Avoidance: Patch, Upgrade
Responsible Manager: Michael.Hayward@lsil.com
Original Admin Info: [WF 17-Aug-2006, karened: publishing after MUCH discussion of affected product. ]

[WF 16-Aug-2006, karened: created and will send to sunalert_review]

Internal SA-FAB Eng Submission
-------- Original Message --------
Subject: Draft Sun Alert: Array firmware 3.2.4 WITHDRAWN
Date: Wed, 16 Aug 2006 15:52:29 -0400
From: Curtis DeCotis
To: Paul Mazzarella , Karen Edwards , Pete Babich
CC: steven Kent , david treen , Bob De Guc

Karen,

Here is the first draft of the SA that Paul and I wrote. Since I'm on
the PTS review team, I took the liberty of assigning the review to myself.

I have CC'd Steve Kent to alert him to this fact, as this is slightly
shy of the normal process for escalations, but would be handled in a
similar fashion.

The information in the Internal Comments section may be better served in a
new FAB. The draft submitted by Bob De Guc is a good choice for this draft,
and it is my opinion that BOTH documents be created to advise the field and
customer simultaneously.

If the FAB is created, we'll need to remove the content in the comments
section, or add a line referencing the new FAB ID, in addition to listing
the FAB in that section.


Please feel free to contact me with any questions.

Regards,

Curtis


***Begin Draft SA****



Synopsis: T3/6120/6320/6920 Array firmware 3.2.4 WITHDRAWN

Category: [ X] Availability
          [ ] Diagnosis
          [ ] HA-Failure
          [ ] Pervasive (reported by four or more external customers in Bug, Escalation, Radiance)
          [ X] Regression
          [ ] Severe
          [ ] Data Loss
          [ ] Security
              [ ] Issue is PRIVATE and is being co-ordinated with a proposed publication date of dd/mm/yy or to-be-defined
              [ ] Issue only known inside sun or by a friendly customer
              [ ] Issue is already public

See Security Instructions:
http://tns.central/sa/proginfo/22908.html#Section5

Product:

T3B/6120/6320/6920 Array firmware


BugID: 6455175

Avoidance: [ ] Workaround
[ ] Binary
[ ] T-Patch
[ ] Patch
[ ] Upgrade
[ ] FCO
[ ] HW
[ ] None (Preliminary)

State: [X] Preliminary
[ ] Workaround
[ ] Resolved


1. Impact:

Upgrading a SE6x20 controller to firmware 3.2.4 exposes a bug that could
result in future drive failures that cannot be replaced. The replacement
drive will never come online and thus remain in a degraded state. A second
drive failure could result in the loss of customer data.


2. Contributing Factors:

Sun StorEdge T3B Arrays with Array firmware 3.2.4 (patch 116930-05)
Sun StorEdge 6120 Arrays with Array firmware 3.2.4 (patch 116931-05)
Sun StorEdge 6320 Arrays with 6020 Array firmware 3.2.4 (6320 Release 1.3.2)
Sun StorEdge 6920 Arrays with 6020 Array firmware 3.2.4 (6920 Release 3.0.1.19)
Sun StorEdge 6920 Arrays with 6020 Array firmware 3.2.4 (6920 Release 3.0.1.20)

3. Symptoms:

The most prevalent symptom is that a failed drive remains in a "fault
disabled" state, when viewed from StorADE. It becomes more evident after
replacing the same drive with a new component.

The following messages mark the original drive failure, and indicate that
users may run into this issue. These messages can be found in the array
syslog for T3B and 6120 arrays. They are also found in the messages.array
for 6320 and 6920 arrays, viewed using StorADE.

Example #1

Dec 23 14:08:21 ISR1[1]: N: u1d11 SCSI Disk Error Occurred (path = 0x1)
Dec 23 14:08:21 ISR1[1]: N: u1d11 Sense Key = 0x1, Asc = 0x5d, Ascq = 0x0
Dec 23 14:08:21 ISR1[1]: N: u1d11 Sense Data Description = Failure Prediction Threshold Exceeded
Dec 23 14:08:21 ISR1[1]: N: u1d11 SVD_CHECK_ERROR: prediction err: 01/5D

Example #2

Jul 29 22:32:51 LPCT[1]: W: u1d03 Not present
Jul 29 22:34:35 ISR1[1]: E: Drive u1d03 Additional errors sense 5 asc 36 ascq 0 detected during copy recon. Drive is disabled
Jul 29 22:34:35 ISR1[1]: W: u1d03 SCSI error occurred: Not Ready (sense key = 0x2). Logical Unit Not Ready, Initializing CMD Required.
Jul 29 22:34:35 ISR1[1]: E: Drive u1d03 Additional errors sense 2 asc 4 ascq 2 detected during copy recon. Drive is disabled
Jul 29 22:34:36 MNXT[1]: N: u1d01 System area recon fail due to write error
Jul 29 22:34:36 MNXT[1]: W: u1d03 could not create system area
Jul 29 22:34:37 LPCT[1]: N: u1d03 Bypassed on loop 2
Jul 29 22:34:36 MNXT[2]: N: u1d03 Unable to access the drive (err = 3)
Jul 29 22:34:37 MNXT[2]: N: u1d03 Unable to access the drive (err = 3)
Jul 29 22:34:39 LPCT[1]: N: u1d03 Bypassed on loop 1
Jul 29 22:34:38 MNXT[2]: N: u1d03 Unable to access the drive (err = 3)
Jul 29 22:34:40 MNXT[2]: N: u1d03 Unable to access the drive (err = 3)


4. Relief/Workaround:

Customers are advised to avoid upgrading to the 3.2.4 array firmware for
the aforementioned arrays.

Should you experience this issue there are two possible workarounds:

A) Removing and recreating the volume containing the affected drive(s).
This will destroy all data on the volume, and require data to be restored.

B) Downgrade the array firmware to 3.2.3. It is recommended that you
contact Sun Services for the proper procedure for your array.

5. Resolution:

A final resolution is pending completion.

6. Internal Section:

Escalation IDs:1-18532964,1-18718532,1-18776043,1-18893736,1-18894157
Pending Patches:
Resolution Patches:
FIN:
FCO:
Submitter: Paul Mazzarella
Responsible Engineer:
Responsible Manager:
PTS/Engineering organization:

[ ] SSG WGS (Workgroup Systems)
[ ] SSG NSN (Netra Systems and Networking)
[ ] SSG ES (Enterprise Systems)
[ ] SSG SW (Platform Software)
[ ] SSG PNP (Processor)
[ ] NSG (Network Systems Group)
[ X] NWS (Network Storage)
[ ] OP/N1 RPE (Operating Platforms/N1 Revenue Product Engin.)
[ ] JPSE (Java Platform Sustaining Engineering)
[ ] JWSSE (Java Web Services Sustaining Engineering)
[ ] USG (User Software Group)
[ ] SSG HS (Horizontal Systems - T2000/Ontario)

Distribution: [ ] Public SunSolve
[ X] Contract SunSolve


Comments:

The problem is caused by the use of a Mask Bit, previously unused in the
array firmware, to solve an earlier bug (Bug 6366175 and Bug 6359654; see
Sun Alert 102193). Under certain error conditions a "Mask Bit" in the
sysarea will be set during a drive failure. Array f/w releases 3.2.3 and
earlier did not utilize the Mask Bit even though it was set. Array f/w
3.2.4 now utilizes the Mask Bit. If 3.2.4 sees the Mask Bit set for a
drive, that drive will be put into a fault/disabled state regardless of
the actual condition. The drive may in fact be 100% functional.

We know of various ways the Mask Bit can get set. They are:

1) In array f/w 3.1.x, the 01/5D Sense Key will trigger the Mask Bit. In
these releases the drive fault is most likely legitimate. The drive CAN be
swapped successfully and brought back to a Ready/Enabled state. However,
the Mask Bit will remain set. Upon upgrading to 3.2.4, which could be
weeks or months later, the problem will be triggered.

2) In array f/w 3.2.x, various BEFIT functions will also set the Mask
Bit. These conditions are known to be associated with "Failure Prediction
Threshold Exceeded" messages. However, in some cases, NO message appears
when the Mask Bit is set. Once the Mask Bit is set, the drive will be
fault/disabled. Replacement drives will remain fault/disabled.

3) In Array f/w 3.2.4, a device failed by BEFIT, legitimately, could
also set the Mask Bit. Replacement drives will remain fault/disabled.
Please be aware, the Mask Bit will only be set under a limited set of
conditions.

For Recovery instructions, please file an escalation into PTS-STORAGE.
Recovery from this issue requires a significant outage due to a firmware
downgrade.



PTS Reviewer (approved by): Curtis DeCotis
Product_uuid
2cd2e7d2-2980-11d7-9c3f-c506fe37b7ef|Sun StorageTek 6120 Array
3ad132b5-248c-11d9-8d5c-080020a9ed93|Sun StorageTek 6120/6320 Controller Firmware 3.2
4de60cc2-a08e-4610-b8bf-6a1881cb59c6|Sun StorageTek 6320 System
67794720-356d-11d7-8ef2-ce2ac2bc9136|Sun StorageTek 6920 System


  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.