Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Problem Resolution Sure Solution

1006649.1: Sun StorEdge[TM] 3000 Arrays: 3.2x and 4.x Firmware Differences in Handling Media Errors
Previously Published As: 209273

This document clarifies how the firmware behaves when it encounters a bad block on a disk, also known as a media error or an "Unrecoverable Read Error".
Applies to:
Sun Storage 3310 Array
Sun Storage 3320 SCSI Array
Sun Storage 3510 FC Array
Sun Storage 3511 SATA Array
All Platforms

Symptoms
To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Community - Storage Disk 3000 Series RAID Arrays.
The following event is logged:

[1113]: StorEdge Array SN#xxxx CH2 ID10: SCSI Drive ALERT: bad block encountered (02h, 03h,11/00)

This applies to all products in the Sun StorEdge[TM] 3000 family.

Changes
As drive capacities grow to meet demand, the density of the stored data grows with them, so there is a greater chance of encountering bad blocks. Array vendors each handle this in their own way. This document applies to redundant RAID implementations, primarily RAID 5, and describes how the firmware handles bad blocks on the drives.

Consider the following scenario on a RAID 5 Logical Drive:
1. One drive fails.
2. This triggers the hot spare, and a rebuild starts.
3. The rebuild finds a media error on another member drive.

Cause
For StorEdge[TM] 3000 Arrays with 3.2x Firmware:
The first time a bad block is encountered on a member disk while a rebuild is in progress, the rebuild fails. If the serial/telnet menu is in use when this happens, the firmware prompts whether to continue the rebuild even though there is a bad block. If the answer is yes, the rebuild continues to completion, provided no other errors occur. For the block that has the "unrecoverable media error", the firmware zeroes out the ECC of that block, writes a special pattern there, and then continues the rebuild until it completes.

For StorEdge[TM] 3000 Arrays with 4.x Firmware:
The firmware automatically continues the rebuild when a bad block is encountered on a member drive while the rebuild is in progress. On 4.x firmware, this "specially marked bad sector of the individual disk" also represents a "Logical Drive Bad Block", which is reported when the host next tries to access that area of the Logical Drive. If the host tries to read this block, the event log records the following event:

LG:2 NOTIFY:Logical Drive BAD Block Encountered 000000200
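The two event formats quoted in this document (the per-drive "SCSI Drive ALERT" and the "Logical Drive BAD Block" notification) can be told apart mechanically when reviewing a saved event log. The following is a minimal Python sketch, not a supported tool. The regular expressions are modeled only on the two sample lines shown here, so real logs may need looser patterns, and the function name `classify_event` is hypothetical. The block value is kept as text because the document does not state its numeric base.

```python
import re

# Patterns modeled ONLY on the two sample events quoted in this document.
# Assumption: real event logs follow the same wording and field order.
DRIVE_ALERT = re.compile(
    r"CH(?P<channel>\d+)\s+ID(?P<target>\d+):\s*SCSI Drive ALERT:\s*"
    r"bad block encountered\s*\((?P<detail>[^)]*)\)",
    re.IGNORECASE,
)
LD_BAD_BLOCK = re.compile(
    r"LG:(?P<ld>\d+)\s+NOTIFY:\s*Logical Drive BAD Block Encountered\s+"
    r"(?P<block>[0-9A-Fa-f]+)",
    re.IGNORECASE,
)


def classify_event(line):
    """Return a dict describing the event, or None if the line matches neither pattern."""
    m = DRIVE_ALERT.search(line)
    if m:
        # Media error reported against one physical drive (channel/target ID).
        return {
            "kind": "drive_media_error",
            "channel": int(m.group("channel")),
            "target": int(m.group("target")),
            "detail": m.group("detail").strip(),
        }
    m = LD_BAD_BLOCK.search(line)
    if m:
        # Bad block reported against the Logical Drive; no physical drive is named.
        return {
            "kind": "logical_drive_bad_block",
            "logical_drive": int(m.group("ld")),
            "block": m.group("block"),  # kept as text: numeric base not stated
        }
    return None


if __name__ == "__main__":
    samples = [
        "[1113]: StorEdge Array SN#xxxx CH2 ID10: SCSI Drive ALERT: "
        "bad block encountered (02h, 03h,11/00)",
        "LG:2 NOTIFY:Logical Drive BAD Block Encountered 000000200",
    ]
    for s in samples:
        print(classify_event(s))
```

Separating the two kinds matters because the recovery differs: a drive-level alert points at a physical disk, while a Logical Drive bad block requires the host to rewrite the affected area.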
Notice that the Logical Drive event mentions no specific disk, only the Logical Drive that contains that disk. To recover from this, the host has to issue a write to that area. If there is a file system on this Logical Drive, one option is to run fsck and see whether that works. If there is no file system, the Logical Drive Bad Block can be located with a dd to /dev/null. After the file or block is located, take the appropriate recovery steps (for example, recover from backup or re-write the data).

Explanation of Controller Behavior:
For bad blocks encountered on a member drive while a rebuild is in progress, the controller erases the ECC bytes for that block, so any subsequent read results in an unrecoverable ECC error. The controller also writes a unique pattern in the block so that the firmware can identify it as a controller-generated bad block. Before this feature was implemented in 4.x, an unrecoverable media error on a surviving disk in an LD would result in a Rebuild Failure or require active intervention to allow the rebuild to continue past the bad block.

Solution
Case study:
As an example, consider the following events, which are taken from a customer case. The customer is running 4.15F firmware on a StorEdge[TM] 3510, and the following messages are logged in the event logs:

Wed Jul 5 14:26:13 2006
Wed Jul 5 14:26:13 2006
...

Notice that no specific drive is reporting the error, so this should NOT be confused with a media error on a particular drive. It is a bad block on the LD, and the host should also get a read error when accessing this block. This can be checked by running format->analyze->read on this LD:

analyze> read
pass 0
Medium error during read: block 948949760 (0x388fd300) (948949760)

Please note that the block number reported by format->analyze->read is the same as the block number reported by the 3510 in the event log. To recover from this, find the file residing on this block and restore that file.
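The read test described above (dd to /dev/null, or format->analyze->read) amounts to reading every block and noting which reads fail. The sketch below illustrates that idea in Python under stated assumptions: a 512-byte block size, a Unix host where `os.pread` is available, and hypothetical function names (`find_unreadable_blocks`, `scan_device`, `same_block`). It is read-only, far slower than dd with a large block size, and is meant as an illustration rather than a replacement for the tools named in this document.

```python
import os

BLOCK_SIZE = 512  # assumption: 512-byte blocks


def find_unreadable_blocks(read_block, total_blocks):
    """Call read_block(n) for n in 0..total_blocks-1 and collect every n that raises OSError."""
    bad = []
    for n in range(total_blocks):
        try:
            read_block(n)
        except OSError:
            # The host sees a read error (e.g. a medium error) for this block.
            bad.append(n)
    return bad


def scan_device(path, total_blocks, block_size=BLOCK_SIZE):
    """Read-only scan of `path` (a device node or file); returns the failing block numbers."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return find_unreadable_blocks(
            lambda n: os.pread(fd, block_size, n * block_size),
            total_blocks,
        )
    finally:
        os.close(fd)


def same_block(decimal_text, hex_text):
    """Check that a decimal and a hexadecimal block number refer to the same block."""
    return int(decimal_text, 10) == int(hex_text, 16)


if __name__ == "__main__":
    # Cross-check the two notations printed by format->analyze->read in the case study.
    print(same_block("948949760", "0x388fd300"))
```

The `same_block` helper is useful when one tool prints a block number in decimal and another in hexadecimal, as the format output in the case study does for block 948949760 (0x388fd300).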
If the application is a database, the DBA should be able to identify the table residing on this block, and only that table needs to be restored.

Note: In short, the host needs to write to this block in order to make the block reusable.

Typically, a drive has latent disk errors that can only be detected when the affected disk sector is accessed. These latent disk errors can be caught early if the drives are accessed continuously, which can be accomplished by enabling media-scan to scrub the disks continuously.
[For NRAID or RAID 0, if a bad block is encountered, the LD is effectively dead and there is no way of recovering other than having the host issue a "write" to that block or restoring the file sitting on that bad block.]
Sense Key:0x03, Sense Code:0x11, rebuild, double, drive, failure, 3510, 3310, 3320, 3511, 4.11, 4.13, 4.15, 3.25, 3.27, 4.21, firmware, bad, block, media, scan, 4.15, parity, regenerate, RAID, disk

Previously Published As 85181

Change History
Date: 2010-11-11
User Name: sue.copeland@sun.com
Action: Currency & Update

Date: 2007-11-13
User Name: 7058
Action: Approved

Attachments
This solution has no attachment