Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Sun Alert

Sure Solution 1000018.1: Recovery Behavior From Fatal Drive Failure May Lead to Data Integrity Issues
Previously Published As: 200021

Product:
Sun StorageTek 3310 SCSI Array
Sun StorageTek 3510 FC Array
Sun StorageTek 3320 SCSI Array
Sun StorageTek 3511 SATA Array

Bug Id: <SUNBUG: 5095223>
Date of Workaround Release: 12-JAN-2006
Date of Resolved Release: 15-JUN-2006

Impact

The "Sun StorEdge 3000 Family Installation, Operation, and Service Manual - Sun StorEdge 3510 FC Array" states in Section 8.5, "Recovering From Fatal Drive Failure," that you can recover from a "Status: FATAL FAIL" condition (two or more failed drives) by simply resetting the controller or powering off the array. This behavior can lead to data integrity issues.

Due to the current internal resource handling, all cached data (including uncommitted write data) for a logical drive is discarded if and when the logical drive enters the "FATAL FAIL" state. In the event of a fatally failed logical drive (two or more drive failures in a RAID 3 or RAID 5 logical drive), the current recovery process is to reset the controller, which causes one of the failed drives to be included back into the logical drive and changes the logical drive state to "Degraded". If a global spare is assigned, the logical drive will rebuild. If a global spare is not assigned, the user can assign a spare and rebuild the logical drive. If there were incomplete write operations at the time of the drive failure, this procedure could create inconsistent data.

The "Sun StorEdge 3000 Family Installation, Operation, and Service Manual" (part number 816-7300-17) can be found on docs.sun.com at http://docs.sun.com/app/docs?q=7300-17

Note: Please also see the related Sun Alert 102098 - "Insufficient Information for Recovery From Double Drive Failure for Sun StorEdge 33x0/35xx Arrays"
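The following is a minimal, purely conceptual sketch (Python, not controller firmware) of the sequence described above: a write acknowledged to the host but still held only in controller cache is discarded when the logical drive enters "FATAL FAIL", yet the logical drive is returned to "Degraded" after a reset, so the host sees the drive come back with stale data. All class, state, and variable names are illustrative assumptions.

```python
# Conceptual model only -- illustrates why the automatic FATAL FAIL recovery
# described above can leave stale data behind. Not actual controller firmware.

class LogicalDrive:
    def __init__(self, members):
        self.members = set(members)      # healthy member disks
        self.failed = []                 # failed disks, in failure order
        self.state = "GOOD"
        self.write_cache = []            # writes acknowledged to the host
        self.on_disk = []                # writes actually committed to disk

    def host_write(self, block):
        """Host write is acknowledged once it reaches controller cache."""
        self.write_cache.append(block)

    def flush(self):
        self.on_disk.extend(self.write_cache)
        self.write_cache.clear()

    def drive_failure(self, disk):
        self.members.discard(disk)
        self.failed.append(disk)
        if len(self.failed) == 1:
            self.state = "DEGRADED"      # first failure: drive marked BAD (persistent)
        else:
            self.state = "FATAL FAIL"    # second failure in a RAID 3/5 parity group
            self.write_cache.clear()     # uncommitted write data is discarded

    def controller_reset(self):
        """On reset the array re-includes a MISSING drive if it can."""
        if self.state == "FATAL FAIL" and self.failed:
            self.members.add(self.failed.pop())   # last (MISSING) drive returns
            self.state = "DEGRADED"               # LD is usable again, but ...

ld = LogicalDrive({"d0", "d1", "d2", "d3"})
ld.host_write("block A")      # acknowledged to the host, still only in cache
ld.drive_failure("d0")
ld.drive_failure("d1")        # FATAL FAIL: "block A" is silently dropped
ld.controller_reset()         # LD returns as "Degraded" with stale data
print(ld.state, ld.on_disk)   # -> DEGRADED []  (the acknowledged write is gone)
```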
Contributing Factors

This issue can occur on the following platforms:

Sun StorageTek 3310 SCSI Array
Sun StorageTek 3510 FC Array
Sun StorageTek 3320 SCSI Array
Sun StorageTek 3511 SATA Array

for all current releases of controller firmware.

Symptoms

The first drive to fail in a logical drive is persistently marked as BAD, while subsequent drives that fail (before the first drive has been fully reconstructed) are marked as MISSING temporarily. If the multiple failed drives are members of the same parity group, the owning logical drive is marked "FATAL FAIL", and any existing uncommitted write data is discarded in order to recover data cache resources.

Upon reset, the array will attempt to recover MISSING drives automatically and, if possible, will restore the logical drive to "Degraded" status. The logical drive is restored, if possible, whether or not any uncommitted write data was discarded.

The exposure window is mainly centered on whole-site power outages that occur after the second drive failure, which can cause user applications to be restarted automatically at the same time the array resets. This situation increases the probability that the server/application will not notice that the logical drive went away and then returned with stale data.

Workaround

The risk of data loss can be minimized by ensuring that an unused hot spare is available and/or that the first failed drive is replaced as soon as possible. This ensures that the rebuild process can start and finish as soon as possible, and reduces the exposure window as much as possible.

Unmapping the logical drive while it is in the "FATAL FAIL" state should prevent any hosts from automatically attempting to use the logical drive after a reset.

If the logical drive is recovered from a "FATAL FAIL" state, it is recommended that the application(s) that use the logical drive run the appropriate data integrity verification utility (e.g. fsck, chkdsk) before putting the logical drive back into use (a sketch of this step appears after the Resolution section below).

Note: A clean filesystem check only guarantees the filesystem structure and does NOT guarantee user data validity. The proper use of data integrity features offered by modern databases, file systems and other applications will help ensure that user applications catch any potential data loss and can invoke higher-level recovery procedures, thereby minimizing the effects.

Resolution

This issue is addressed on the platforms listed above via the controller firmware patches listed in the References section below.
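As recommended in the Workaround section above, a logical drive recovered from "FATAL FAIL" should be integrity-checked before it is returned to service. Below is a minimal sketch of that step, assuming a UFS-style filesystem checked with fsck and hypothetical device and mount-point paths; the actual device names, filesystem type, and any application-level verification (e.g. database consistency checks) depend on the environment.

```python
# Minimal sketch: verify a recovered logical drive's filesystem before remounting it.
# Device paths and mount point below are hypothetical examples.
import subprocess
import sys

RAW_DEVICE = "/dev/rdsk/c1t0d0s6"    # hypothetical raw device for the recovered LD
BLOCK_DEVICE = "/dev/dsk/c1t0d0s6"   # corresponding block device
MOUNT_POINT = "/mnt/recovered"       # hypothetical mount point

def check_and_mount():
    # Run a full filesystem check; '-y' answers repair prompts automatically.
    fsck = subprocess.run(["fsck", "-y", RAW_DEVICE])
    if fsck.returncode != 0:
        # fsck returned a non-zero status (it found problems or could not fix them);
        # treat this conservatively and do not remount automatically.
        sys.exit("fsck reported problems on %s; manual recovery required" % RAW_DEVICE)

    subprocess.run(["mount", BLOCK_DEVICE, MOUNT_POINT], check=True)
    print("Filesystem check passed; %s mounted on %s" % (BLOCK_DEVICE, MOUNT_POINT))
    print("Note: a clean fsck does not guarantee user data validity; run any")
    print("application-level integrity checks (e.g. database verification) as well.")

if __name__ == "__main__":
    check_and_mount()
```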
Modification History

Date: 25-APR-2006
Date: 15-JUN-2006
References

<SUNPATCH: 113723-15>
<SUNPATCH: 113722-15>
<SUNPATCH: 113730-01>
<SUNPATCH: 113724-09>

Previously Published As: 102126

Internal Comments

The patches for these firmware releases were developed across all products (all arrays impacted by these issues). Therefore, some of the patch READMEs may not reflect the Bug ID listed in the Sun Alert, but the firmware patch listed for each product does in fact remedy the issue for the platforms specified. The recovery process for the logical drive will become a manual process.

The following Sun Alerts have information about other known issues for the 3000 series products:

102011 - Sun StorEdge 33x0/3510 Arrays May Report a Higher Incidence of Drive Failures With Firmware 4.1x SMART Feature Enabled
102067 - Sun Cluster 3.x Nodes May Panic Upon Controller Failure/Replacement Within Sun StorEdge 3510/3511 Arrays
102086 - Failed Controller Condition May Cause Data Integrity Issues
102098 - Insufficient Information for Recovery From Double Drive Failure for Sun StorEdge 33x0/35xx Arrays
102126 - Recovery Behavior From Fatal Drive Failure May Lead to Data Integrity Issues
102127 - Performance Degradation Reported in Controller Firmware Releases 4.1x on Sun StorEdge 3310/351x Arrays for All RAID Types and Certain Patterns of I/O
102128 - Data Inconsistencies May Occur When Persistent SCSI Parity Errors are Generated Between the Host and the SE33x0 Array
102129 - Disks May be Marked as Bad Without Explanation After "Drive Failure," "Media Scan Failed" or "Clone Failed" Events

Note: One or more of the above Sun Alerts may require a Sun Spectrum Support Contract to log in to a SunSolve Online account.

Bug 5095223 indicates that this is invalid behavior and that recovery needs to be a manual process rather than an automatic one. Disk failed-state information is not persistent across a power cycle when a logical drive fatal fail occurs. This is currently by design, to allow a user to potentially recover from a spurious event that caused multiple drive failures to occur. This can be especially useful in multi-enclosure configurations where cabling errors can occur. Any single drive failure is recorded in the private region of each disk drive that is a member of a logical drive. Multiple drive failures are not recorded, allowing a user to possibly recover from the failed logical drive with a simple reboot of the controller. Although this behavior can result in potential data loss as described, it can also save the user, or field personnel, from a full rebuild and restore due to common cabling mistakes.
Internal Contributor/Submitter: sue.copeland@sun.com
Internal Eng Business Unit Group: NWS (Network Storage)
Internal Eng Responsible Engineer: Bagher.Vahdatinia@sun.com
Internal Services Knowledge Engineer: david.mariotto@sun.com
Internal Escalation ID: 1-71137216, 1-7813973
Internal Resolution Patches: 113723-15, 113722-15, 113730-01, 113724-09
Internal Sun Alert Kasp Legacy ID: 102126

Internal Sun Alert & FAB Admin Info:
Critical Category: Data Loss
Significant Change Date: 2006-01-12, 2006-06-15
Avoidance: Patch, Workaround
Responsible Manager: sue.copeland@sun.com
Original Admin Info:
[WF 12-Jun-2006, Dave M: updated for 4.15 FW, re-release when patch is published to SS]
[WF 25-Apr-2006, Dave M: update for patch release, republished]
[WF 14-Apr-2006, Dave M: updated in anticipation of FW 4.15F release, per NWS and PTS engs]
[WF 12-Jan-2006, Dave M: ready for release]
[WF 05-Jan-2006, Dave M: review completed, Chessin changes added, all docs in this series on hold for Exec approval pending 1/12]
[WF 04-Jan-2006, Dave M: final edits before sending to review]
[WF 02-Jan-2006, Dave M: draft created]

Product_uuid:
3db30178-43d7-4d85-8bbe-551c33040f0d | Sun StorageTek 3310 SCSI Array
58553d0e-11f4-11d7-9b05-ad24fcfd42fa | Sun StorageTek 3510 FC Array
95288bce-56d3-11d8-9e3a-080020a9ed93 | Sun StorageTek 3320 SCSI Array
9fdbb196-73a6-11d8-9e3a-080020a9ed93 | Sun StorageTek 3511 SATA Array

Attachments
This solution has no attachment