Disks May be Marked as Bad Without Explanation After "Drive Failure," "Media Scan Failed" or "Clone Failed" Events

Asset ID:	1-77-1000856.1
Update Date:	2011-03-04
Keywords:

Solution Type Sun Alert Sure

Solution 1000856.1 : Disks May be Marked as Bad Without Explanation After "Drive Failure," "Media Scan Failed" or "Clone Failed" Events

Related Items


Sun Storage 3510 FC Array
Sun Storage 3310 Array
Sun Storage 3511 SATA Array
Sun Storage 3320 SCSI Array

Related Categories


GCS>Sun Microsystems>Sun Alert>Criteria Category>Data Loss
GCS>Sun Microsystems>Sun Alert>Release Phase>Resolved

Previously Published As
201137

Product
Sun StorageTek 3310 SCSI Array
Sun StorageTek 3510 FC Array
Sun StorageTek 3320 SCSI Array
Sun StorageTek 3511 SATA Array

Bug Id
<SUNBUG: 6346306>

Date of Workaround Release
12-JAN-2006

Date of Resolved Release
15-JUN-2006

Impact

On rare occasions, Sun StorEdge 3310, 3320, 3510 or 3511 arrays running current 4.x firmware versions may report one (or, exceptionally, more than one) disk drive as "bad" with no warning or explanation other than a "Drive Failure," "Media Scan Failed" or "Clone Failed" message. If the system administrator does not quickly notice such an event affecting a single disk drive (especially in array configurations without a "hot spare" disk drive assigned), the array is susceptible to data loss if any other disk drive is similarly reported as "bad" by the array controller.

In exceptionally rare cases, more than one disk drive might be reported as "bad" within a short space of time. Most RAID configurations cannot cope with the loss of more than one disk from an LD (Logical Drive) within a short space of time (i.e. before reconstruction to a hot spare has completed). More than one disk drive being marked as "bad" within a single LD could therefore lead to data loss if that LD transitions to a status of "Fatal Fail."

Contributing Factors

This issue can occur with all current 4.x releases of controller firmware on the following platforms:

Sun StorEdge 3310 SCSI array without firmware 4.15F (as delivered in patch 113722-15)
Sun StorEdge 3320 SCSI array without firmware 4.15G (as delivered in patch 113730-01)
Sun StorEdge 3510 FC array without firmware 4.15F (as delivered in patch 113723-15)
Sun StorEdge 3511 SATA array without firmware 4.15F (as delivered in patch 113724-09)

Note: This behavior has not been seen with earlier 3.x firmware revisions.

Symptoms

Any array controller event that shows "Drive Failure", "Media Scan Failed" or "Clone Failed" without any immediately preceding warning messages for the disk drive mentioned in that event is an occurrence of this issue, as in the following example:

    Sun Dec 11 16:10:17 2005
    [Primary]       Notification
    NOTICE: Media Scan of CHL:2 ID:8 Completed
    Tue Dec 13 01:27:42 2005
    [Primary]       Alert
    LG:1 Logical Drive ALERT: CHL:2 ID:7  Drive Failure
    Tue Dec 13 01:27:45 2005
    [Primary]       Notification
    LG:1 Logical Drive NOTICE: Starting Rebuild
    Tue Dec 13 10:04:21 2005
    [Primary]       Notification
    Rebuild of Logical Drive 1 Completed

In the above example, there were no warning messages between Sun Dec 11 16:10:17 and Tue Dec 13 01:27:42, when the disk drive at ID:7 was reported as "Drive Failure," and there is no explanation of why the controller marked the drive as "bad". In this case (as in most cases), only a single disk was reported as "Drive Failure." This array had a hot spare disk drive configured, so the rebuild started immediately and completed some hours later, as expected.

In the worst case, a series of events like the following example may be seen, leading to data loss:

    Wed Dec  7 09:30:09 2005
    [Primary]       Notification
    On-Line Initialization of Logical Drive 1 Completed
    Thu Dec 15 13:59:48 2005
    [Primary]       Alert
    LG:0 Logical Drive ALERT: CHL:2 ID:0  Drive Failure
    Thu Dec 15 13:59:51 2005
    [Primary]       Notification
    LG:0 Logical Drive NOTICE: Starting Rebuild
    Thu Dec 15 14:00:29 2005
    [Primary]       Alert
    LG:0 Logical Drive ALERT: CHL:2 ID:2  Drive Failure
    Thu Dec 15 14:00:29 2005
    [Primary]       Alert
    LG:0 Logical Drive ALERT: Rebuild Failed

Again, there had been several days (8 days in this case) without error messages. Then, at Thu Dec 15 13:59:48, the disk drive at ID:0 was reported as "Drive Failure" without further explanation. The array had a "hot spare" disk drive configured, and the rebuild started as expected. However, at Thu Dec 15 14:00:29, a second disk drive at ID:2 was also reported as "Drive Failure". Because the rebuild onto the "hot spare" disk drive had not yet completed when the second disk was reported as "Drive Failure", LD:0 had lost 2 disk drives (IDs 0 and 2) from its RAID-5 configuration and no longer had enough redundancy to continue. Host access to that LD was lost.

For comparison, the following "Drive Failure" event is not an example of this issue, because the events logged before it clearly show that the disk drive had a genuine fault:

    Sun Dec  6 17:20:19 2005
    [Primary]     Warning
    CHL:2 ID:0  Drive ALERT: Drive HW Error (4C4)
    Sun Dec  6 17:20:19 2005
    [Primary]     Warning
    CHL:2 ID:0  Drive ALERT: Drive HW Error (4C4)
    Sun Dec  6 17:20:19 2005
    [Primary]       Alert
    LG:1 Logical Drive ALERT: CHL:2 ID:0  Drive Failure
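
The distinction drawn in this section (a failure event with or without a preceding warning for the same drive) can be sketched as a small log filter. This is an illustrative sketch only, not a Sun tool. It assumes the event log has been saved as text in the three-line layout shown above (timestamp, controller and severity, message). The function names and the one-hour look-back window are assumptions made for this example; the alert does not define how recent a warning must be to count as "immediately preceding".

```python
import re
from datetime import datetime, timedelta

# Excerpts taken from the examples in this alert.
UNEXPLAINED = """\
Tue Dec 13 01:27:42 2005
[Primary]       Alert
LG:1 Logical Drive ALERT: CHL:2 ID:7  Drive Failure
"""

GENUINE = """\
Sun Dec  6 17:20:19 2005
[Primary]     Warning
CHL:2 ID:0  Drive ALERT: Drive HW Error (4C4)
Sun Dec  6 17:20:19 2005
[Primary]       Alert
LG:1 Logical Drive ALERT: CHL:2 ID:0  Drive Failure
"""

DRIVE = re.compile(r"CHL:(\d+) ID:(\d+)")
FAILURE_TEXT = ("Drive Failure", "Media Scan Failed", "Clone Failed")


def parse_events(text):
    """Yield (timestamp, severity, message) from three-line event records."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    for i in range(0, len(lines) - 2, 3):
        # Collapse repeated spaces ("Dec  7") before parsing the timestamp.
        stamp = datetime.strptime(" ".join(lines[i].split()),
                                  "%a %b %d %H:%M:%S %Y")
        severity = lines[i + 1].split()[-1]
        yield stamp, severity, lines[i + 2]


def unexplained_failures(events, window=timedelta(hours=1)):
    """Return failure events with no Warning for the same drive in `window`.

    `window` is an assumed look-back period, not a value from the alert.
    """
    warnings = []  # (timestamp, (channel, id)) of earlier Warning events
    flagged = []
    for stamp, severity, message in events:
        match = DRIVE.search(message)
        if match is None:
            continue  # event does not name a specific drive
        drive = match.groups()
        if severity == "Warning":
            warnings.append((stamp, drive))
        elif any(text in message for text in FAILURE_TEXT):
            explained = any(
                d == drive and timedelta(0) <= stamp - w <= window
                for w, d in warnings
            )
            if not explained:
                flagged.append((stamp, drive, message))
    return flagged


for stamp, drive, message in unexplained_failures(parse_events(UNEXPLAINED)):
    print(stamp, "CHL:%s ID:%s" % drive, "has no preceding warning")
```

Run against the GENUINE excerpt, the same function returns an empty list, because the "Drive HW Error" warnings for CHL:2 ID:0 precede the failure.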

Workaround

There is currently no way to predict or prevent a disk drive being marked as "bad" by the array controller with one of the messages "Drive Failure", "Media Scan Failed" or "Clone Failed," and with no immediately preceding messages to explain or justify why that disk drive has been marked as "bad". Therefore, until further notice, treat these events as genuine disk drive failures (as some of them probably are, although others may not be).

Resolution

This issue is addressed on the following platforms:

Sun StorEdge 3310 SCSI array with firmware 4.15F (as delivered in patch 113722-15 or later)
Sun StorEdge 3320 SCSI array with firmware 4.15G (as delivered in patch 113730-01 or later)
Sun StorEdge 3510 FC array with firmware 4.15F (as delivered in patch 113723-15 or later)
Sun StorEdge 3511 SATA array with firmware 4.15F (as delivered in patch 113724-09 or later)
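
As a rough aid, the list above can be encoded as a lookup table and compared against the patch revision known to be applied to an array. This is a sketch under stated assumptions: it does not query the array, it assumes the patch ID uses the base-revision form shown in this alert (for example 113723-15), and it assumes that a higher revision of the same base patch includes the fix, as the "or later" wording suggests. The function name and the model keys are made up for this example.

```python
# Minimum patch revision carrying the fix, per array model (from the list above).
FIXED_PATCHES = {
    "3310": ("113722", 15, "4.15F"),
    "3320": ("113730", 1, "4.15G"),
    "3510": ("113723", 15, "4.15F"),
    "3511": ("113724", 9, "4.15F"),
}


def has_fix(model, patch_id):
    """Return True if `patch_id` is at or above the fixed revision for `model`.

    `model` is one of the keys of FIXED_PATCHES (a naming choice made for
    this sketch); `patch_id` is a string such as "113723-15".
    """
    base, _, revision = patch_id.partition("-")
    fixed_base, fixed_revision, _firmware = FIXED_PATCHES[model]
    if base != fixed_base:
        raise ValueError(
            "patch %s is not the firmware patch for model %s" % (patch_id, model)
        )
    return int(revision) >= fixed_revision


print(has_fix("3510", "113723-15"))
print(has_fix("3511", "113724-08"))
```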

Modification History

25-APR-2006: Updated Contributing Factors and Resolution sections

15-JUN-2006: Updated Contributing Factors and Resolution sections

References

<SUNPATCH: 113723-15>
<SUNPATCH: 113722-15>
<SUNPATCH: 113730-01>
<SUNPATCH: 113724-09>

Previously Published As
102129

Internal Comments

The patches for these firmware releases were developed across all products (all arrays impacted by these issues). Therefore, some of the patch READMEs may not reflect the BugID listed in the SunAlert, but the firmware patch listed for each product does in fact remedy the issue for the platforms specified.

The following Sun Alerts have information about other known issues for the 3000 series products:

102011 - Sun StorEdge 33x0/3510 Arrays May Report a Higher Incidence of Drive Failures With Firmware 4.1x SMART Feature Enabled

102067 - Sun Cluster 3.x Nodes May Panic Upon Controller Failure/Replacement Within Sun StorEdge 3510/3511 Arrays

102086 - Failed Controller Condition May Cause Data Integrity Issues

102098 - Insufficient Information for Recovery From Double Drive Failure for Sun StorEdge 33x0/35xx Arrays

102126 - Recovery Behavior From Fatal Drive Failure May Lead to Data Integrity Issues

102127 - Performance Degradation Reported in Controller Firmware Releases 4.1x on Sun StorEdge 3310/351x Arrays for All RAID Types and Certain Patterns of I/O

102128 - Data Inconsistencies May Occur When Persistent SCSI Parity Errors are Generated Between the Host and the SE33x0 Array

102129 - Disks May be Marked as Bad Without Explanation After "Drive Failure," "Media Scan Failed" or "Clone Failed" Events

Note: One or more of the above Sun Alerts may require a Sun Spectrum Support Contract to login to a SunSolve Online account.

Multiple disks being marked as "bad" in quick succession (which caused the whole LD to be lost) has been seen after what was _suspected_ to be back-end FC loop issues on an SE3510. However, we cannot be sure whether this was the cause, because sccli "diag error channel [2|3] target all" data was not captured at the relevant times, and because of bug 6352109, where we now know that sense key 0xB events (such as FC CRC errors from disks) are not reported by the array. In addition, due to bug 6357118, FC CRC events on the back-end loops can cause the controller firmware to fail the drive that reports them. In other words, the disk drive that reports a back-end loop problem may be failed (i.e. marked as "bad") by the controller firmware, even though it is very unlikely that the reporting disk drive was the cause of the problem.

The combination of bugs 6352109 and 6346306 prevents us from getting to the true root cause of previous cases, as insufficient data was logged by the array for these events. However, if several disks are being marked as "bad" with just "Drive Failure", "Media Scan Failed" or "Clone Failed" events on an SE3510, it is worthwhile checking whether the FC error counters are incrementing. Use the sccli diag command above to gather a set of counters from both back-end loops 2 and 3; use the array for a period of time (e.g. some hours or a day or two); then gather the FC error counters from both back-end loops again and compare the two sets to see if any are incrementing, especially the right-hand CRC Error column. Please contact PTS if further advice is required about SE3510 back-end loop troubleshooting.
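
The before-and-after comparison described above can be sketched as follows. This is only an illustration of the comparison step. It does not parse real sccli output, whose exact layout is not reproduced in this alert. Each snapshot is assumed to have been transcribed by hand into a dictionary keyed by (channel, target, counter name). The counter name "crc_error" and all of the values below are invented for the example.

```python
def incrementing_counters(before, after):
    """Return {(channel, target, counter): increase} for counters that grew.

    A key that is missing from `before` is treated as having started at 0.
    """
    grown = {}
    for key, new_value in after.items():
        delta = new_value - before.get(key, 0)
        if delta > 0:
            grown[key] = delta
    return grown


# Hypothetical snapshots, NOT real array data: (channel, target, counter) -> count
first_snapshot = {
    (2, 0, "crc_error"): 0,
    (2, 1, "crc_error"): 3,
    (3, 0, "crc_error"): 0,
}
second_snapshot = {
    (2, 0, "crc_error"): 0,
    (2, 1, "crc_error"): 17,
    (3, 0, "crc_error"): 0,
}

for (channel, target, counter), delta in sorted(
        incrementing_counters(first_snapshot, second_snapshot).items()):
    print("channel %d target %d: %s increased by %d"
          % (channel, target, counter, delta))
```

Any counter listed by such a comparison is a candidate for further back-end loop investigation. A counter that stays flat between the two snapshots is not reported.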

There is currently no way to predict or prevent a disk drive being marked as "bad" by the array controller with one of the messages from the synopsis ("Drive Failure", "Media Scan Failed" or "Clone Failed"), and with no immediately preceding messages to explain or justify why that disk drive has been marked as "bad". Therefore, until further notice, treat these events as genuine disk drive failures (since some of them probably are, although others may not be).

The event messages from the array controller firmware will be improved in a future revision, to provide additional explanation of the reasons for a disk drive being marked as "bad".

Note: While these events cannot be prevented and cannot currently be explained, there are some best practices that can help to lessen their impact in some cases if they do occur. For example:

- Set up and configure Sun StorEdge 3000 Family Configuration Service and Sun StorEdge 3000 Family Diagnostic Reporter software to monitor each array, and to notify the system administrator quickly of any events that require attention.

- Configure at least one "spare" disk drive in each array, so that the array can automatically start a reconstruction immediately after a "Drive Failure" event.

Enabling SMART ("Self-Monitoring, Analysis and Reporting Technology", also called "Predictive Failure") reporting can also allow the array to recognise any disk drives that have started to fail, but have not yet failed completely, and to proactively fail them after cloning them to a spare disk drive in the array. This cloning functionality also minimises the performance impact that would otherwise occur, especially with a RAID-5 LD, if the array waited for the failing disk drive to fail completely and then started a full reconstruction.

However, as explained in SunAlert 102011, when SMART is enabled on an array for the first time after a period of running with it disabled, there may be one or more disk drives which had wanted to report a SMART event but were unable to do so due to SMART being disabled. Then, as soon as SMART is enabled, they do so. This can be alarming, if it happens.

Overall, SMART reporting is normally a benefit, and any disk drive that logs a SMART event in the array controller's event log was genuinely going to fail at some point in the future. Identifying (either manually or automatically) and replacing any disk drives that report a SMART event before they fail completely will help to minimise the chances of data loss if a "Drive Failure" event occurs for another disk drive.

To enable SMART reporting and proactive cloning of disks which report a predictive failure (SMART) event on an array, using sccli:

- Set "Periodic Drive Check Time" to 30 seconds using:

     # sccli <array identifier> set drive-parameters polling-interval 30s

- Set SMART to "Detect and Perpetual Clone" using:

     # sccli <array identifier> set drive-parameters smart detect-perpetual-clone

This "detect-perpetual-clone" SMART setting is recommended because, as explained in the RAID Firmware 4.1x User's Guide:

"If the drive whose failure has been predicted continues to work successfully and another drive in the same logical drive fails, the clone drive performs as a standby spare drive and starts to rebuild the failed drive immediately. This helps prevent a fatal drive error if yet another drive fails."

However, there may be a small performance penalty once a disk drive has reported a SMART event, and the perpetual cloning operation is occurring.
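
Where the two sccli settings above need to be applied to several arrays, the command lines can be generated rather than typed by hand. The following is a dry-run sketch only. It builds and prints the two command lines given in this section and does not execute anything. The function name and the "ARRAY_ID" placeholder are assumptions for the example; substitute whatever identifier sccli expects for the array in question.

```python
import shlex


def smart_clone_commands(array_identifier):
    """Build the two sccli command lines from this section as argument lists."""
    base = ["sccli", array_identifier, "set", "drive-parameters"]
    return [
        base + ["polling-interval", "30s"],           # Periodic Drive Check Time
        base + ["smart", "detect-perpetual-clone"],   # Detect and Perpetual Clone
    ]


# "ARRAY_ID" is a placeholder, not a real device name.
for argv in smart_clone_commands("ARRAY_ID"):
    print(shlex.join(argv))
```

Printing the commands first allows them to be reviewed before they are run on a production array.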

For more information on SMART and the other options for that setting, see the Sun StorEdge 3000 Family RAID Firmware 4.1x User's Guide at:

http://www.sun.com/products-n-solutions/hardware/docs/html/817-3711-14/ch09_scsidrives.html#pgfId-1021002

or search for manual part number 817-3711 at http://docs.sun.com

Also see the Sun StorEdge 3000 Family CLI 2.x User's Guide at:

http://www.sun.com/products-n-solutions/hardware/docs/html/817-4951-14/04_channel.html#pgfId-1015924

The "Sun StorEdge 3000 Family CLI 2.x User's Guide" also gives general information about running sccli to administer the array.

Internal Contributor/submitter
sam.gibson@sun.com

Internal Eng Business Unit Group
NWS (Network Storage)

Internal Eng Responsible Engineer
sam.gibson@sun.com

Internal Services Knowledge Engineer
david.mariotto@sun.com

Internal Escalation ID
1-12530278, 1-13177848

Internal Resolution Patches
113723-15, 113722-15, 113730-01, 113724-09

Internal Sun Alert Kasp Legacy ID
102129

Internal Sun Alert & FAB Admin Info
Critical Category: Data Loss
Significant Change Date: 2006-01-12, 2006-06-15
Avoidance: Patch
Responsible Manager: sue.copeland@sun.com
Original Admin Info: [WF 15-Jun-2006, DaveM: rerelease, patches released, resolved]
[WF 12-Jun-2006, Dave M: updating for FW patch releases, rerelease when patches are published]
[WF 25-Apr-2006, Dave M: updated for patch release, republish]
[WF 14-Apr-2006, Dave M: updated in anticipation of FW 4.15F release, per NWS and PTS engs]
[WF 04-Jan-2006, Dave M: final edit before sending for tech review]
[WF 02-Jan-2006, Dave M: draft created]
Product_uuid
3db30178-43d7-4d85-8bbe-551c33040f0d|Sun StorageTek 3310 SCSI Array
58553d0e-11f4-11d7-9b05-ad24fcfd42fa|Sun StorageTek 3510 FC Array
95288bce-56d3-11d8-9e3a-080020a9ed93|Sun StorageTek 3320 SCSI Array
9fdbb196-73a6-11d8-9e3a-080020a9ed93|Sun StorageTek 3511 SATA Array

