Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-77-1000369.1
Update Date: 2011-03-04
Keywords:

Solution Type: Sun Alert Sure Solution

1000369.1: Insufficient Information for Recovery From Double Drive Failure for Sun StorEdge 33x0/35xx Arrays


Related Items
  • Sun Storage 3510 FC Array
  • Sun Storage 3310 Array
  • Sun Storage 3511 SATA Array
  • Sun Storage 3320 SCSI Array
Related Categories
  • GCS>Sun Microsystems>Sun Alert>Criteria Category>Data Loss
  • GCS>Sun Microsystems>Sun Alert>Release Phase>Resolved

Previously Published As
200491


Product
Sun StorageTek 3310 SCSI Array
Sun StorageTek 3510 FC Array
Sun StorageTek 3320 SCSI Array
Sun StorageTek 3511 SATA Array

Bug Id
<SUNBUG: 6364526>

Date of Workaround Release
12-JAN-2006

Date of Resolved Release
26-APR-2006

Impact

In the "Troubleshooting" section (8.5) of the "Sun StorEdge 3000 Family Installation, Operation, and Service Manual," (part number 816-7300-17) instructions for "Recovering From Fatal Drive Failure" are incomplete. Failure to use the correct procedure for this condition (as outlined in the "Workaround" section, step 6) may result in data integrity issues.

Note: Please also see related Sun Alert 102126 - "Recovery Behavior From Fatal Drive Failure May Lead to Data Integrity Issues"


Contributing Factors

This issue can occur on the following platforms, with all current releases of controller firmware:

  • Sun StorEdge 3310 SCSI array
  • Sun StorEdge 3320 SCSI array
  • Sun StorEdge 3510 FC array
  • Sun StorEdge 3511 SATA array

Existing documentation for "Recovering From Fatal Drive Failure" (section 8.5) is incomplete.

This issue concerns incomplete reconstruction of logical drives and logical drives in a "dead" state, that is, where the logical drive status is "FATAL FAIL", indicating that more than one drive is bad.


Symptoms

If a reset is issued to the array before the correct drive(s) are pulled from a dead logical drive, the failed drive(s) may assume the role of valid disks in the logical drive after the reset. (These drives may contain stale data, depending on I/O activity to the array.)

Section 8.5 of the "Troubleshooting" chapter of service manual 816-7300-17 describes "Recovering From Fatal Drive Failure" with the following steps:

1. Discontinue all I/O activity immediately.

2. Cancel the beeping alarm from the controller firmware's "Main Menu" by choosing "System Functions - Mute Beeper".

3. Physically check that all drives are firmly seated in the array and that none have been partially or completely removed.

4. In the firmware "Main Menu", choose "View and Edit Logical drives" and look for:

    Status: FAILED DRV (one failed drive)
    Status: FATAL FAIL (two or more failed drives)

5. Highlight the logical drive, press Return, and choose "View scsi Drives".

If two physical drives have an issue, one drive will show a BAD status and one drive will show a MISSING status. The MISSING status is a reminder that one of the drives might be a "false" failure; the status does not indicate which drive that might be. (A host-side cross-check of these statuses is sketched after these steps.)

In the next step (6), the controller should be shut down to flush its cache, followed by a reset of the controller.

Note: The workaround in this Sun Alert details how to identify/pull drives based on the RAID level. These steps are not mentioned in the current documentation.

Before proceeding with step 6, it is recommended that you pull one or more drives, depending on the RAID level of the logical drive and on whether the first drive failure can be determined.

6. From the "Main Menu", choose "System Functions - Shutdown Controller" and then choose "Yes" to confirm that you want to shut down the controller.

The controller can then be reset.
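
If the Sun StorEdge CLI (sccli) is installed on an attached host, the same status information can be cross-checked from the host rather than from the firmware menus. The following is a minimal sketch only; it assumes your installed sccli version provides the "show logical-drives" and "show disks" subcommands, and the device path shown is a placeholder for your own array device:

    # Select the array by its device path (placeholder - substitute your own)
    sccli /dev/rdsk/c2t0d0s2 show logical-drives
    # Look for a logical drive whose status is FATAL FAIL (two or more failed drives)

    sccli /dev/rdsk/c2t0d0s2 show disks
    # Note which physical drives are reported BAD or MISSING, and their channel/ID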


Workaround

A) For a logical drive configured for RAID 3 or RAID 5, if the order of drive failure in a double drive failure can be determined, use the following steps:

1. Pull only the original (first) failed drive; the first failure can be determined from the controller event log (see the sketch after these steps). Also note the location of the other bad drive(s).

2. Reset the controller: use the "Shutdown Controller" menu option and choose "Yes" when the "Reset Controller?" prompt is displayed. When the system comes back up, view the logical drives and verify that the status has changed from "FATAL FAIL" to "degraded".

3. If the logical drive has changed to "degraded," run fsck(1M) or equivalent.

(If the status is still "FATAL FAIL", you might have lost all data on the logical drive, and it might be necessary to re-create the logical drive. Follow step 8 in the Troubleshooting section 8.5 for this case).

4. After the fsck(1M) completes successfully, reinsert the pulled drive, OR replace it with a new (good) drive if the event log indicates the drive should be replaced.

5. Rebuild the logical drive.
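
As an illustration of steps 1 and 2 above, the following hedged sketch shows how the controller event log and logical drive status might be reviewed from the host with sccli, assuming the "show events" and "show logical-drives" subcommands exist in your CLI version (the device path is again a placeholder):

    # Step 1: review the controller event log to determine which drive failed first,
    # then pull only that drive and note the slot of the other bad drive(s)
    sccli /dev/rdsk/c2t0d0s2 show events | more

    # Step 2: after the controller shutdown and reset, confirm that the logical
    # drive status has changed from FATAL FAIL to degraded before running fsck(1M)
    sccli /dev/rdsk/c2t0d0s2 show logical-drives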

B) For a logical drive configured for RAID 1, if there is only one bad drive in a paired set, pull that drive and proceed to step 2 below. If both drives in a paired set have failed, follow these steps:

1. Pull the original (first) failed drive in each failed RAID 1 pair; the first failure can be determined from the controller event log. Also note the location of the other bad drive(s).

2. Reset the controller: use the "Shutdown Controller" menu option and choose "Yes" when the "Reset Controller?" prompt is displayed. When the system comes back up, view the logical drives and verify that the status has changed from "FATAL FAIL" to "degraded".

3. If the logical drive has changed to "degraded", run fsck(1M) or equivalent.

If the status is still "FATAL FAIL", you might have lost all data on the logical drive, and it might be necessary to re-create the logical drive. Follow step 8 in the Troubleshooting section 8.5 for this case.

4. After the fsck(1M) completes successfully, reinsert the pulled drive, OR replace it with a new (good) drive if the event log indicates the drive should be replaced.

5. Rebuild the logical drive.

C) For a logical drive configured for RAID 5, if there is one missing drive, or if there are multiple failures with only one failure in a paired set, that drive can be replaced; otherwise, only the first missing drive should be replaced.

If it cannot be determined which drive failed first, the file system on the array should be checked after the reset, as there may be data inconsistencies.

Note: It is important that you check your recovered data using the application or host-based tools following a "FATAL FAIL" recovery.
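
As a host-based example of that final check, the sketch below runs a non-destructive fsck(1M) pass first, then a repair pass, and only then mounts the file system for application-level verification. The device and mount point are placeholders; substitute the slices that reside on the recovered logical drive:

    # Report inconsistencies without modifying the file system
    fsck -n /dev/rdsk/c4t40d0s6

    # Repair pass, then mount and verify the data with application or host-based tools
    fsck -y /dev/rdsk/c4t40d0s6
    mount /dev/dsk/c4t40d0s6 /mnt
    # e.g. compare key files against a known-good backup or run application consistency checks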


Resolution

A final resolution has been completed with the updated instructions in the "Sun StorEdge 3000 Family Installation, Operation, and Service Manual" (PN: 816-7300-19 or later) at http://docs.sun.com/app/docs?q=7300-19.



Modification History
Date: 26-APR-2006

26-Apr-2006:

  • Updated Contributing Factors and Resolution Sections


Previously Published As
102098
Internal Comments


 



This issue will be resolved in a controller firmware release expected around the second quarter of 2006.



The following Sun Alerts have information about other known issues for the 3000 series products:



102011 - Sun StorEdge 33x0/3510 Arrays May Report a Higher Incidence of Drive Failures With Firmware 4.1x SMART Feature Enabled



102067 - Sun Cluster 3.x Nodes May Panic Upon Controller Failure/Replacement Within Sun StorEdge 3510/3511 Arrays



102086 - Failed Controller Condition May Cause Data Integrity Issues



102098 - Insufficient Information for Recovery From Double Drive Failure for Sun StorEdge 33x0/35xx Arrays



102126 - Recovery Behavior From Fatal Drive Failure May Lead to Data Integrity Issues



102127 - Performance Degradation Reported in Controller Firmware Releases 4.1x on Sun StorEdge 3310/351x Arrays for All RAID Types and Certain Patterns of I/O



102128 - Data Inconsistencies May Occur When Persistent SCSI Parity Errors are Generated Between the Host and the SE33x0 Array



102129 - Disks May be Marked as Bad Without Explanation After "Drive Failure," "Media Scan Failed" or "Clone Failed" Events



Note: One or more of the above Sun Alerts may require a Sun Spectrum Support Contract to log in to a SunSolve Online account.



A new BugID (pending) is being filed against the existing "Troubleshooting" documentation for correction/addition of this information.



Note: Sun Alert 57690 also instructs you to pull the drive that is reported BAD before shutting down the controller and issuing a reset, in the case where a single drive failure during logical drive rebuild hangs at 99%.



If the cache is set to write-back and a write operation is in progress at the moment of the second drive failure, there is a chance of data loss.



See Sun Alert 57690 at http://sunsolve.central.sun.com/search/document.do?assetkey=1-26-101610-1 (Sun Alert 57690 for customers requires a valid Sun Service Plan with login and password rights).



Per the latest update to CR 4967518, in which the firmware does not track the drive failure and the drive can scan back in as OK, this will be addressed in controller firmware version 4.14:



The code has been modified to support bad-drive tracking. During the controller initialization process, the firmware compares the config-data LD ID of a USED drive with an existing one; as long as a match exists, the USED drive is marked BAD. No check is performed at automatic drive scan (on Fibre Channel/SATA), which means a BAD drive will be scanned in as a USED drive when it is inserted into the system. On SCSI systems, a bad drive can be manually scanned in.


Internal Contributor/submitter
Sue.Copeland@sun.com

Internal Eng Business Unit Group
NWS (Network Storage)

Internal Eng Responsible Engineer
Bagher.Vahdatinia@sun.com

Internal Services Knowledge Engineer
david.mariotto@sun.com

Internal Sun Alert Kasp Legacy ID
102098

Internal Sun Alert & FAB Admin Info
Critical Category: Data Loss
Significant Change Date: 2006-01-12, 2006-04-26
Avoidance: Upgrade
Responsible Manager: null
Original Admin Info: [WF 26-Apr-2006, Dave M: docs fixed, rerelease resolved, updated CF/Res]
[WF 14-Apr-2006, Dave M: updated in anticipation of FW 4.25 release, per NWS and PTS engs]
[WF 12-Jan-2006, Dave M: OK for release]
[WF 04-Jan-2006, Dave M: final edit before sending to tech review]
[WF 03-Jan-2006, Dave M: update for exec review]
[WF 13-Dec-2005, dave m: sending for review]
[WF 12-Dec-2005, Dave M: draft created]
Product_uuid
3db30178-43d7-4d85-8bbe-551c33040f0d|Sun StorageTek 3310 SCSI Array
58553d0e-11f4-11d7-9b05-ad24fcfd42fa|Sun StorageTek 3510 FC Array
95288bce-56d3-11d8-9e3a-080020a9ed93|Sun StorageTek 3320 SCSI Array
9fdbb196-73a6-11d8-9e3a-080020a9ed93|Sun StorageTek 3511 SATA Array

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.