Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type Sun Alert Sure Solution 1000009.1 : T3/6120/6320/6920 Array firmware 3.2.4 WITHDRAWN
PreviouslyPublishedAs 200012

Product
Sun StorageTek 6120 Array
Sun StorageTek 6120/6320 Controller Firmware 3.2
Sun StorageTek 6320 System
Sun StorageTek 6920 System

Bug Id <SUNBUG: 6455175>
Date of Workaround Release 17-AUG-2006
Date of Resolved Release 07-DEC-2006

Impact
Upgrading a SE6x20 controller to firmware 3.2.4 may result in future drive failures in which the failed drive cannot be successfully replaced. The replacement drive does not come online and remains in a degraded state. If a second drive failure occurs, it may result in the loss of customer data.

Contributing Factors
This issue occurs in the following releases:
Note: this issue is only with firmware 3.2.4. Please see Resolution section below.

Symptoms
The most prevalent symptom is that a failed drive remains in a "fault disabled" state when viewed from StorADE. It becomes more evident after replacing the same drive with a new drive component.

The following messages mark the original drive failure and are an indication of this issue. These messages can be found in the array syslog for T3B and 6120 arrays. They are also found in the messages.array file for the 6320 and 6920 arrays when viewed using StorADE Solution Extract utility.

Example #1
    Dec 23 14:08:21 ISR1[1]: N: u1d11 SCSI Disk Error Occurred (path = 0x1)
    Dec 23 14:08:21 ISR1[1]: N: u1d11 Sense Key = 0x1, Asc = 0x5d, Ascq = 0x0
    Dec 23 14:08:21 ISR1[1]: N: u1d11 Sense Data Description = Failure Prediction Threshold Exceeded
    Dec 23 14:08:21 ISR1[1]: N: u1d11 SVD_CHECK_ERROR: prediction err: 01/5D

Example #2
    Jul 29 22:32:51 LPCT[1]: W: u1d03 Not present
    Jul 29 22:34:35 ISR1[1]: E: Drive u1d03 Additional errors sense 5 asc 36 ascq 0 detected during copy recon. Drive is disabled
    Jul 29 22:34:35 ISR1[1]: W: u1d03 SCSI error occurred: Not Ready (sense key = 0x2). Logical Unit Not Ready, Initializing CMD Required.
    Jul 29 22:34:35 ISR1[1]: E: Drive u1d03 Additional errors sense 2 asc 4 ascq 2 detected during copy recon. Drive is disabled
    Jul 29 22:34:36 MNXT[1]: N: u1d01 System area recon fail due to write error
    Jul 29 22:34:36 MNXT[1]: W: u1d03 could not create system area
    Jul 29 22:34:37 LPCT[1]: N: u1d03 Bypassed on loop 2
    Jul 29 22:34:36 MNXT[2]: N: u1d03 Unable to access the drive (err = 3)
    Jul 29 22:34:37 MNXT[2]: N: u1d03 Unable to access the drive (err = 3)
    Jul 29 22:34:39 LPCT[1]: N: u1d03 Bypassed on loop 1
    Jul 29 22:34:38 MNXT[2]: N: u1d03 Unable to access the drive (err = 3)
    Jul 29 22:34:40 MNXT[2]: N: u1d03 Unable to access the drive (err = 3)
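
As an illustration only, the symptom messages above could be searched for in a saved copy of the array syslog or messages.array file with a short script. The sketch below is not a Sun tool: the function name, the pattern names and the choice of which messages to flag are assumptions made for this example, and a match only suggests that a drive deserves a closer look.

```python
import re
from collections import defaultdict

# Patterns copied from the example messages in the Symptoms section.
# Which messages are worth flagging is an assumption of this sketch.
SYMPTOM_PATTERNS = {
    "failure_prediction": re.compile(r"Failure Prediction Threshold Exceeded"),
    "prediction_err_01_5d": re.compile(r"SVD_CHECK_ERROR: prediction err: 01/5D"),
    "disabled_during_recon": re.compile(r"detected during copy recon\. Drive is disabled"),
    "sysarea_create_failed": re.compile(r"could not create system area"),
}

# Drive identifiers in the examples look like u1d11 or u1d03.
DRIVE_ID = re.compile(r"\bu\d+d\d+\b")


def find_suspect_drives(lines):
    """Return {drive_id: sorted list of symptom names} for matching log lines."""
    hits = defaultdict(set)
    for line in lines:
        drive = DRIVE_ID.search(line)
        if drive is None:
            continue  # no drive identifier on this line
        for name, pattern in SYMPTOM_PATTERNS.items():
            if pattern.search(line):
                hits[drive.group(0)].add(name)
    return {drive: sorted(names) for drive, names in hits.items()}
```

The function accepts any iterable of text lines, so an open file object for the saved log can be passed directly. Drives it reports should still be confirmed against the array state shown in StorADE.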

Workaround
Do not upgrade to the 3.2.4 array firmware for the above arrays.

For customers already at firmware 3.2.4, an array firmware downgrade to the previous revision is available as part of the workaround and recovery of this issue. Contact Sun Services for the proper recovery procedure for your model of array.

Resolution
This issue is addressed in the following releases:
Note: Once the upgrade is completed, if the failed drive has already been replaced, it will spin up properly when the new firmware is booted. If the failed drive has not been replaced, normal disk replacement will be successful. Moving forward, although the 01/5d errors may still occur from time to time, normal supported drive replacement procedures will function as expected.

Modification History
Date: 18-AUG-2006
Date: 07-DEC-2006

References
<SUNPATCH: 116930-06>
<SUNPATCH: 116931-21>

Previously Published As 102579

Internal Comments
The problem is caused by the use of a Mask Bit, previously unused in the array firmware, to solve an earlier bug (Bug 6366175 and Bug 6359654; see Sun Alert 102193).

Under certain error conditions a "Mask Bit" in the sysarea will be set during a drive failure. Array f/w releases 3.2.3 and earlier did not utilize the Mask Bit even though it was set. Array f/w 3.2.4 now utilizes the Mask Bit. If 3.2.4 sees the Mask Bit set for a drive, that drive will be put into a fault/disabled state regardless of the actual condition. The drive may in fact be 100% functional.

We know of various ways the Mask Bit can get set. They are:

1) In array f/w 3.1.x, the 01/5D Sense Key will trigger the Mask Bit. In these releases the drive fault is most likely legitimate. The drive CAN be swapped successfully and brought back to a Ready/Enabled state. However, the Mask Bit will remain set. Upon upgrading to 3.2.4, which could be weeks or months later, the problem will be triggered.

2) In array f/w 3.2.x, various BEFIT functions will also set the Mask Bit. These conditions are known to be associated with "Failure Prediction Threshold Exceeded" messages. However, in some cases, NO message appears when the Mask Bit is set. Once the Mask Bit is set, the drive will be fault/disabled. Replacement drives will remain fault/disabled.

3) In Array f/w 3.2.4, a device failed by BEFIT, legitimately, could also set the Mask Bit. Replacement drives will remain fault/disabled.

Please be aware, the Mask Bit will only be set under a limited set of conditions.

Field and Solution Center Engineers are required to open an escalation into PTS-STORAGE for the proper corrective action, and to reference this Sun Alert.

NOTE 1: Avoidance and recovery of this condition require the downgrade of the array firmware (3.2.3 for T3B/6120/6320, 3.2.2 for 6920), in addition to steps indicated by the assigned PTS engineer. Please defer to the PTS Engineer instructions prior to taking any action.
NOTE 2: As stated in the bug, the other possible workaround is to initialize the sysarea of the array, effectively destroying the data, and requiring volume/vdisk/volslice rebuild and restore. As with the firmware downgrade, Field and Solution Center Engineers are required to open an escalation into PTS-STORAGE for the proper corrective action.

Internal Contributor/submitter paul.mazzarella@sun.com, Curtis.Decotis@Sun.COM
Internal Eng Business Unit Group NWS (Network Storage)
Internal Eng Responsible Engineer Jim.Gou@Sun.COM
Internal Services Knowledge Engineer karen.edwards@sun.com
Internal Escalation ID 1-18532964,1-18718532,1-18776043,1-18893736,1-18894157
Internal Resolution Patches 116930-06, 116931-21
Internal Sun Alert Kasp Legacy ID 102579

Internal Sun Alert & FAB Admin Info
Critical Category: Data Loss, Availability ==> Diagnosis, Availability ==> Regression
Significant Change Date: 2006-08-17, 2006-12-07
Avoidance: Patch, Upgrade
Responsible Manager: Michael.Hayward@lsil.com
Original Admin Info:
[WF 17-Aug-2006, karened: publishing after MUCH discussion of affected product.]
[WF 16-Aug-2006, karened: created and will send to sunalert_review]

Internal SA-FAB Eng Submission

-------- Original Message --------
Subject: Draft Sun Alert: Array firmware 3.2.4 WITHDRAWN
Date: Wed, 16 Aug 2006 15:52:29 -0400
From: Curtis DeCotis
To: Paul Mazzarella
CC: steven Kent

Karen,

Here is the first draft of the SA that Paul and I wrote. Since I'm on the PTS review team, I took the liberty of assigning the review to myself. I have CC'd Steve Kent to alert him to this fact, as this is slightly shy of the normal process for escalations, but would be handled in a similar fashion.

The information in the Internal Comments section may be better served in a new FAB. The draft submitted by Bob De Guc is a good choice for this draft, and it is my opinion that BOTH documents be created to advise the field and customer simultaneously.
If the FAB is created, we'll need to remove the content in the comments section, or add a line referencing the new FAB ID, in addition to listing the FAB in that section.

Please feel free to contact me with any questions.

Regards,
Curtis

***Begin Draft SA****

Synopsis: T3/6120/6320/6920 Array firmware 3.2.4 WITHDRAWN

Category:
[ X] Availability
[ ] Diagnosis
[ ] HA-Failure
[ ] Pervasive (reported by four or more external customers in Bug, Escalation, Radiance)
[ X] Regression
[ ] Severe
[ ] Data Loss
[ ] Security
    [ ] Issue is PRIVATE and is being co-ordinated with a proposed publication date of dd/mm/yy or to-be-defined
    [ ] Issue only known inside sun or by a friendly customer
    [ ] Issue is already public
    See Security Instructions: http://tns.central/sa/proginfo/22908.html#Section5

Product: T3B/6120/6320/6920 Array firmware
BugID: 6455175

Avoidance:
[ ] Workaround
[ ] Binary
[ ] T-Patch
[ ] Patch
[ ] Upgrade
[ ] FCO
[ ] HW
[ ] None (Preliminary)

State:
[X] Preliminary
[ ] Workaround
[ ] Resolved

1. Impact:
Upgrading a SE6x20 controller to firmware 3.2.4 exposes a bug that could result in future drive failures that cannot be replaced. The replacement drive will never come online and thus remain in a degraded state. A second drive failure could result in the loss of customer data.

2. Contributing Factors:
Sun StorEdge T3B Arrays with Array firmware 3.2.4 (patch 116930-05)
Sun StorEdge 6120 Arrays with Array firmware 3.2.4 (patch 116931-05)
Sun StorEdge 6320 Arrays with 6020 Array firmware 3.2.4 (6320 Release 1.3.2)
Sun StorEdge 6920 Arrays with 6020 Array firmware 3.2.4 (6920 Release 3.0.1.19)
Sun StorEdge 6920 Arrays with 6020 Array firmware 3.2.4 (6920 Release 3.0.1.20)

3. Symptoms:
The most prevalent symptom is that a failed drive remains in a "fault disabled" state, when viewed from StorADE. It becomes more evident after replacing the same drive with a new component.
The following messages mark the original drive failure, and indicate that users may run into this issue. These messages can be found in the array syslog for T3B and 6120 arrays. They are also found in the messages.array for 6320 and 6920 arrays, viewed using StorADE.

Example #1
    Dec 23 14:08:21 ISR1[1]: N: u1d11 SCSI Disk Error Occurred (path = 0x1)
    Dec 23 14:08:21 ISR1[1]: N: u1d11 Sense Key = 0x1, Asc = 0x5d, Ascq = 0x0
    Dec 23 14:08:21 ISR1[1]: N: u1d11 Sense Data Description = Failure Prediction Threshold Exceeded
    Dec 23 14:08:21 ISR1[1]: N: u1d11 SVD_CHECK_ERROR: prediction err: 01/5D

Example #2
    Jul 29 22:32:51 LPCT[1]: W: u1d03 Not present
    Jul 29 22:34:35 ISR1[1]: E: Drive u1d03 Additional errors sense 5 asc 36 ascq 0 detected during copy recon. Drive is disabled
    Jul 29 22:34:35 ISR1[1]: W: u1d03 SCSI error occurred: Not Ready (sense key = 0x2). Logical Unit Not Ready, Initializing CMD Required.
    Jul 29 22:34:35 ISR1[1]: E: Drive u1d03 Additional errors sense 2 asc 4 ascq 2 detected during copy recon. Drive is disabled
    Jul 29 22:34:36 MNXT[1]: N: u1d01 System area recon fail due to write error
    Jul 29 22:34:36 MNXT[1]: W: u1d03 could not create system area
    Jul 29 22:34:37 LPCT[1]: N: u1d03 Bypassed on loop 2
    Jul 29 22:34:36 MNXT[2]: N: u1d03 Unable to access the drive (err = 3)
    Jul 29 22:34:37 MNXT[2]: N: u1d03 Unable to access the drive (err = 3)
    Jul 29 22:34:39 LPCT[1]: N: u1d03 Bypassed on loop 1
    Jul 29 22:34:38 MNXT[2]: N: u1d03 Unable to access the drive (err = 3)
    Jul 29 22:34:40 MNXT[2]: N: u1d03 Unable to access the drive (err = 3)

4. Relief/Workaround:
Customers are advised to avoid upgrading to the 3.2.4 array firmware for the aforementioned arrays. Should you experience this issue there are two possible workarounds:
A) Removing and recreating the volume containing the affected drive(s). This will destroy all data on the volume, and require data to be restored.
B) Downgrade the array firmware to 3.2.3.
It is recommended that you contact Sun Services for the proper procedure for your array.

5. Resolution:
A final resolution is pending completion.

6. Internal Section:
Escalation IDs: 1-18532964,1-18718532,1-18776043,1-18893736,1-18894157
Pending Patches:
Resolution Patches:
FIN:
FCO:
Submitter: Paul Mazzarella
Responsible Engineer:
Responsible Manager:

PTS/Engineering organization:
[ ] SSG WGS (Workgroup Systems)
[ ] SSG NSN (Netra Systems and Networking)
[ ] SSG ES (Enterprise Systems)
[ ] SSG SW (Platform Software)
[ ] SSG PNP (Processor)
[ ] NSG (Network Systems Group)
[ X] NWS (Network Storage)
[ ] OP/N1 RPE (Operating Platforms/N1 Revenue Product Engin.)
[ ] JPSE (Java Platform Sustaining Engineering)
[ ] JWSSE (Java Web Services Sustaining Engineering)
[ ] USG (User Software Group)
[ ] SSG HS (Horizontal Systems - T2000/Ontario)

Distribution:
[ ] Public SunSolve
[ X] Contract SunSolve

Comments:
The problem is caused by the use of a Mask Bit, previously unused in the array firmware, to solve an earlier bug (Bug 6366175 and Bug 6359654; see Sun Alert 102193).

Under certain error conditions a "Mask Bit" in the sysarea will be set during a drive failure. Array f/w releases 3.2.3 and earlier did not utilize the Mask Bit even though it was set. Array f/w 3.2.4 now utilizes the Mask Bit. If 3.2.4 sees the Mask Bit set for a drive, that drive will be put into a fault/disabled state regardless of the actual condition. The drive may in fact be 100% functional.

We know of various ways the Mask Bit can get set. They are:

1) In array f/w 3.1.x, the 01/5D Sense Key will trigger the Mask Bit. In these releases the drive fault is most likely legitimate. The drive CAN be swapped successfully and brought back to a Ready/Enabled state. However, the Mask Bit will remain set. Upon upgrading to 3.2.4, which could be weeks or months later, the problem will be triggered.

2) In array f/w 3.2.x, various BEFIT functions will also set the Mask Bit.
These conditions are known to be associated with "Failure Prediction Threshold Exceeded" messages. However, in some cases, NO message appears when the Mask Bit is set. Once the Mask Bit is set, the drive will be fault/disabled. Replacement drives will remain fault/disabled.

3) In Array f/w 3.2.4, a device failed by BEFIT, legitimately, could also set the Mask Bit. Replacement drives will remain fault/disabled.

Please be aware, the Mask Bit will only be set under a limited set of conditions.

For Recovery instructions, please file an escalation into PTS-STORAGE. Recovery from this issue requires a significant outage due to a firmware downgrade.

PTS Reviewer (approved by): Curtis DeCotis

Product_uuid
2cd2e7d2-2980-11d7-9c3f-c506fe37b7ef|Sun StorageTek 6120 Array
3ad132b5-248c-11d9-8d5c-080020a9ed93|Sun StorageTek 6120/6320 Controller Firmware 3.2
4de60cc2-a08e-4610-b8bf-6a1881cb59c6|Sun StorageTek 6320 System
67794720-356d-11d7-8ef2-ce2ac2bc9136|Sun StorageTek 6920 System

References
SUNPATCH:116930-06
SUNPATCH:116931-21

Attachments
This solution has no attachment