Document Audience: | INTERNAL |
Document ID: | I0876-3 |
Title: | Patch 112276-06 (Firmware 2.01.03) and later for Sun StorEdge T3+ (T3B) Arrays resolves several disk error handling issues. SunAlert: Yes |
Copyright Notice: | Copyright © 2005 Sun Microsystems, Inc. All Rights Reserved |
Update Date: | 2003-06-27 |
---------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------
FIELD INFORMATION NOTICE
(For Authorized Distribution by SunService)
FIN #: I0876-3
Synopsis: Patch 112276-06 (Firmware 2.01.03) and later for Sun StorEdge T3+ (T3B) Arrays resolves several disk error handling issues.
Create Date: Jun/27/03
SunAlert: Yes
Top FIN/FCO Report: Yes
Products Reference: Sun StorEdge T3+/3910/3960/6910/6960
Product Category: StorEdge / SW Admin
Product Affected:
Systems Affected:
-----------------
Mkt_ID Platform Model Description Serial Number
------ -------- ----- ----------- -------------
- ANYSYS - System Platform Independent -
X-Options Affected:
-------------------
Mkt_ID Platform Model Description Serial Number
------ -------- ----- ----------- -------------
- T3+ ALL T3+ StorEdge Array -
- 3910 ALL Sun StorEdge 3910 Array -
- 3960 ALL Sun StorEdge 3960 Array -
- 6910 ALL Sun StorEdge 6910 Array -
- 6960 ALL Sun StorEdge 6960 Array -
Parts Affected:
Part Number Description Model
----------- ----------- -----
- - -
References:
BugId: 4697868 - disk in raid 5 on T3+ failed and Oracle database
crashed in clustered config.
4707617 - Unrecovered Read Error during vol verify fix
operation not corrected
FIN: I0936-1 - Special pre-installation procedures are required
to prevent loss of volume access with F/W Update
patch 109115-12 (FW 1.18.1) on Sun StorEdge
T3 (T3A) Arrays.
I0966-1 - Best practices guidelines are available for
StorEdge T3/T3+ arrays which encounter
"disk error 03" messages.
PatchId: 112276-06 - T3+ 2.01.03: System Firmware Update.
112276-07 - T3+ 2.01.03: System Firmware Update.
Sun Alert: 52562 - Special Firmware Installation Procedures Are
Required to Prevent Loss of Volume Access on
StorEdge T3/T3+ Arrays.
Issue Description:
-------------------------------------------------------------------------
| CHANGE HISTORY |
| ============== |
| |
| FIN I0876-3 from I0876-2 |
| |
| DATE MODIFIED: May 21, 2003 |
| |
| UPDATES: PROBLEM DESCRIPTION, CORRECTIVE ACTION |
| |
| PROBLEM DESCRIPTION: |
| -------------------- |
| . NOTE on the purpose of this FIN has been added |
| . Modified entire Problem Description section to |
| include new features 'vol verify fix' & others |
| . Removed SPECIAL NOTE section to generate |
| a separate FIN to address its contents |
| . Revised WARNING section |
| |
| |
| CORRECTIVE ACTION: |
| ------------------ |
| . Revised entire corrective action section |
| . Added Post-Install instructions related to |
| 'vol verify fix' command |
| |
-------------------------------------------------------------------------
NOTE: Because significant time and effort may be required, or may be
avoided entirely with proper planning, this FIN MUST be read and
understood from beginning to end.
Sun StorEdge T3+ arrays with firmware versions prior to 2.01.03 may be
susceptible to loss of availability. This situation can occur when
certain disk errors are encountered.
Depending upon the
  1) configuration,
  2) application, and
  3) type of volume manager
in use, the host may
  1) continue to retry read/write operations,
  2) unmount the volume, or
  3) cause the application to time out.
If a disk drive experiences one of the following errors:
   "Sense Key = 0x4"
   "Sense Key = 0x01, Asc = 0x5d"
then
the T3+ will repeatedly retry read/write operations. This may appear to
the host as if the T3+ is not responding.
This issue has been resolved with firmware patch 112276-06 and later.
With firmware version 2.01.03 and above, the error handling of these
particular disk errors has been enhanced to appropriately disable the
affected drive.
If more than one disk in a given RAID volume reports these particular
errors, then ALL of the drives that report these errors WILL be
disabled. This will result in the volume being unmounted and will cause
a loss of access to data.
To avoid the potential loss of access to data, special pre-installation
and post-installation procedures are required with this patch and are
detailed below.
>>>>> WARNING! <<<<<
Patch 112276-06 is for the T3+ (T3B) ONLY. Do not install this patch on
a T3 (T3A). Use the "ver" command to see if you have a T3+ (T3B), as
shown below.
hws27-41:/:<8>ver
T3B Release 2.01.01 2002/07/30 19:16:42 (10.4.27.41)
Copyright (C) 1997-2001 Sun Microsystems, Inc.
All Rights Reserved.
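A minimal host-side sketch (not a supported tool) to confirm the array is
a T3B and to flag firmware below 2.01.03, assuming the 'ver' output shown
above has been captured to a local file with the hypothetical name ver.out:

    # Relies on the zero-padded x.xx.xx release format printed by 'ver'.
    rel=$(awk '/^T3B Release/ {print $3}' ver.out)
    if [ -z "$rel" ]; then
        echo "No 'T3B Release' line found -- do NOT install patch 112276-06"
    elif [ "$(printf '%s\n' "$rel" 2.01.03 | sort | head -1)" != "2.01.03" ]; then
        echo "Firmware $rel is below 2.01.03 -- this FIN applies"
    else
        echo "Firmware $rel is 2.01.03 or above"
    fi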
Plan to allocate the necessary time and effort for the entire upgrade
process based on the need to complete the following non-trivial tasks:
1) Backup all volume data
2) Review all available syslogs for drive errors
3) Run the 'vol verify' procedures and any standard corrective
actions
4) Replace all failed drives
5) a) Upgrade all T3+ arrays
b) Upgrade 3910/3960/6910/6960 SP images, including all T3+ arrays
6) Run the 'vol verify fix' procedures
Without first reviewing the T3+ syslog file(s) for possible drive errors
and then taking the necessary pro-active** action, installing patch
112276 may result in drives becoming disabled which can lead to a loss
of volume access. If prior to the installation of patch 112276-06, any
drives have exhibited the errors listed in this FIN, those drives WILL be
disabled when patch 112276-06 or newer is installed.
**Pro-active means having spare disks available and immediately replacing
drives that have the errors listed in this document, prior to installing
patch 112276-06.
>>>>> END WARNING <<<<<<<
The affected systems, listed in the 'PRODUCTS AFFECTED:' section above,
include any StorEdge T3+ (T3B) array that does not have firmware version
2.01.03 or above. This firmware is available in patch 112276-06. See
FIN I0936-1 for a similar issue with the T3 (T3A) array.
For T3+ arrays with firmware versions lower than 2.01.03, the T3+ syslog
file may show multiple error messages of the following types:
A. More than one "Sense Key = 0x4" error on one specific drive.
Example:
Jun 05 06:16:14 ISR1[2]: W: u2d5 SCSI Disk Error Occurred (path = 0x0)
Jun 05 06:16:14 ISR1[2]: W: Sense Key = 0x4, Asc = 0x15, Ascq = 0x1
Jun 06 08:36:19 ISR1[2]: W: u2d5 SCSI Disk Error Occurred (path = 0x0)
Jun 06 08:36:19 ISR1[2]: W: Sense Key = 0x4, Asc = 0x15, Ascq = 0x1
AND/OR
B. A single "Sense Key = 0x1, Asc = 0x5d" error on one specific drive.
Example:
Jul 31 16:19:22 ISR1[1]: N: u1d3 SCSI Disk Error Occurred (path = 0x1)
Jul 31 16:19:22 ISR1[1]: N: Sense Key = 0x1, Asc = 0x5d, Ascq = 0x0
Jul 31 16:19:22 ISR1[1]: N: Sense Data Description = Failure Prediction
Threshold Exceeded
NOTE: Patch 112276-06 provides enhancements to the 'vol verify' and
'vol verify fix' commands as described below:
1. Previously, the 'vol verify' command terminated at the first
occurrence of a disk error. The code has been modified to scan
the whole volume for any errors or parity mismatches, even when a disk
error of type 'Media error' (Sense Key = 0x3, Asc = 0x11, Ascq = 0x0)
is encountered.
2. Previously, the 'vol verify fix' command terminated at the first
occurrence of a disk 'Media error'. The code has been modified to
regenerate the valid data from other disks in the volume, whenever
any one disk in a given volume encounters a disk 'Media error'
(Sense Key = 0x3, Asc = 0x11). This is done by performing an alternate
stripe operation to construct the good data from other drives, writing
it back to the bad block, and then letting the disk perform an auto
reallocation. If it is not possible to correct the error, the drive
is marked as "failed" and the 'vol verify fix' command will terminate
at that point. Otherwise, it will continue to scan the entire volume.
Sample disk media errors:
Feb 10 02:37:49 ISR1[1]: W: u1d8 SCSI Disk Error Occurred (path = 0x0)
Feb 10 02:37:49 ISR1[1]: W: Sense Key = 0x3, Asc = 0x11, Ascq = 0x0
Feb 10 02:37:49 ISR1[1]: W: Sense Data Description = Unrecovered Read Error
Feb 10 02:37:49 ISR1[1]: W: Valid Information = 0x12257ea
Implementation:
---
| | MANDATORY (Fully Proactive)
---
---
| X | CONTROLLED PROACTIVE (per Sun Geo Plan)
---
---
| | REACTIVE (As Required)
---
Corrective Action:
The following recommendation is provided as a guideline for authorized
Sun Services Field Representatives who may encounter the above-mentioned
issue.
1. For All T3+ (T3B) arrays:
. Follow the pre-installation instructions detailed below.
. Install patch 112276-06 or later and strictly follow the procedures
listed in the 'Patch Installation Instructions' section of the patch
document.
. Follow the post-installation instructions detailed below.
2. For All 3910/3960/6910/6960:
. Log in to the SP and type: "cat /etc/motd".
. Upgrade the Service Processor Image to rev 2.3.1 or above by accessing
the image and following the README file for upgrade information at:
http://edist.central
http://futureworld.central/WSTL/PROJECTS/SPImage/Src/web/Downloads.shtml
. Once the Service Processor upgrade is completed, the T3+ controller
firmware must be upgraded as a separate process. This will be patch
112276-06 or later, and will be contained in the particular SP image.
. Follow the pre-installation instructions detailed below.
. Install patch 112276-06 or later and strictly follow the procedures
listed in the 'Patch Installation Instructions' section of the patch
document. The procedures are also explained in the image README file
and in the 'Sun StorEdge(tm) 3900 and 6900 Series' Reference and
Service Manuals.
. Follow the post-installation instructions detailed below.
I. PATCH PRE-INSTALL INSTRUCTIONS: (T3+/3910/3960/6910/6960)
----------------------------------
1. ftp the 'syslog' file from the T3+ where the firmware patch will
be installed.
2. Save this 'syslog' file to a local directory on the host system and
run the following command:
% egrep -i
'0x5D|Threshold|0x15|0x4|Mechanical|Positioning|Exceeded|Disk Error' syslog
(This search command can be modified to suit site requirements; see
the per-drive summary sketch following the error examples below.)
3. More than one disk may have these error codes in the syslog. If this
is the case, take a backup of all the files residing on the volumes/slices.
These errors are fatal, but they may still allow some I/O requests to
continue. This is an emergency situation, because the volume may become
unavailable due to the presence of these errors. Disks with 0x1/0x5d and
0x4/32 errors must be replaced because they are about to fail or have
already failed. In that case, a volume backup may fail because of a dual
disk failure.
4. After the backup, the volumes should be recreated and reinitialized
before restoring the data from the backup. This will reassign all bad
blocks from the volumes.
5. If a backup cannot be taken (a situation that is strongly
discouraged), the 'vol verify' command should be run repeatedly
until it runs fully to completion.
6. An alternate solution to the 'vol verify' command is described in
the 'Work Around:' section of BugID 4707617.
7. Ensure the volume is now in an optimal working state without any
drives disabled and then continue with the patch install.
Error Examples:
Here 'u2d5' and 'u1d3' show the locations of the drives.
test_host% egrep -i
'0x5D|Threshold|0x15|0x4|Mechanical|Positioning|Exceeded|Disk Error'
syslog
Jun 05 06:16:14 ISR1[2]: W: u2d5 SCSI Disk Error Occurred (path = 0x0)
Jun 05 06:16:14 ISR1[2]: W: Sense Key = 0x4, Asc = 0x15, Ascq = 0x1
Jun 05 06:16:14 ISR1[2]: W: Sense Data Description = Mechanical
Positioning Error
Jun 06 08:36:19 ISR1[2]: W: u2d5 SCSI Disk Error Occurred (path = 0x0)
Jun 06 08:36:19 ISR1[2]: W: Sense Key = 0x4, Asc = 0x15, Ascq = 0x1
Jun 06 08:36:19 ISR1[2]: W: Sense Data Description = Mechanical
Positioning Error
AND/OR
Jul 31 16:19:22 ISR1[1]: N: u1d3 SCSI Disk Error Occurred (path = 0x1)
Jul 31 16:19:22 ISR1[1]: N: Sense Key = 0x1, Asc = 0x5d, Ascq = 0x0
Jul 31 16:19:22 ISR1[1]: N: Sense Data Description = Failure Prediction
Threshold Exceeded
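As referenced in pre-install step 2, the following is a hedged sketch
(standard egrep/awk on the host; 'syslog' is the file saved in step 2) that
summarizes how many of these error lines each drive has reported:

    # Tally, per drive (u<unit>d<disk>), the matching lines that name a drive.
    egrep -i '0x5D|Threshold|0x15|0x4|Mechanical|Positioning|Exceeded|Disk Error' syslog |
        awk '{ for (i = 1; i <= NF; i++)
                   if ($i ~ /^u[0-9]+d[0-9]+$/) count[$i]++ }
             END { for (d in count) print d, count[d] }'

Any drive listed with a non-zero count (such as u2d5 or u1d3 above) should
be reviewed and, per the WARNING section, pro-actively replaced if it shows
the errors listed in this FIN, before patch 112276-06 is installed.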
II. PATCH INSTALL:
------------------
Based on the 'CORRECTIVE ACTION:' section listed above, download and
install T3+ patch 112276-06 or the 3910/3960/6910/6960 SP image with
the bundled T3+ patch 112276-06.
NOTE: The patch README explains the pre-install specifics. It may be
helpful to review the entire process: finding the patch online,
downloading it, unpacking it, and then reviewing the different
documents provided with the patch. A brief illustration follows.
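As a general illustration only (the patch README remains authoritative,
and the archive name used here is an assumption based on the patch ID),
unpacking a downloaded patch on the host and locating its documentation
might look like:

    # Assumed download name; adjust to match the file actually downloaded.
    unzip 112276-06.zip
    cd 112276-06
    ls README* *.txt 2>/dev/null    # review all provided documents first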
III. PATCH POST-INSTALL INSTRUCTIONS:(T3+/3910/3960/6910/6960)
-------------------------------------
1. Run the 'vol verify fix' command and any standard corrective actions.
Refer to the release notes for v2.01.03 or BugID 4707617 for
additional reference.
2. It will be essential to continue to run the 'vol verify fix'
procedure EVERY 30 days to maintain the health of your drives.
This must become an integral part of the ongoing maintenance
necessary for the continued reliability and accessibility of the
storage arrays. Failure to do so will continue to raise the
unjustified costs associated with the high rate of drives being
replaced and then determined to be NTF (No Trouble Found).
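One hedged way to help enforce this schedule (a sketch only; the mail
alias is hypothetical, and the 'vol verify fix' command itself must still
be run on each array) is a monthly cron reminder on an administrative host:

    # Roughly every 30 days: 06:00 on the first of each month.
    0 6 1 * * echo "Run 'vol verify fix' on all redundant T3+ volumes (FIN I0876-3)" | mailx -s "T3+ vol verify fix reminder" storage-admins@example.com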
Note: As is always the case, the use of RAID protected disk subsystems does
not eliminate the need for regular, verified data backups.
Any RAID disk subsystem can survive only a certain number of failures,
depending on the RAID level and other factors, before valid data is no
longer available. A RAID subsystem like a T3+ (with a redundant RAID level
other than RAID-0) is designed to survive any typical single failure while
still being able to supply valid data to the host. However, RAID
subsystems are not designed to survive all cases of two failures being
present at the same time, even if those two failures result from different
causes and occurred at different times.
Therefore it is necessary to quickly identify that there is only a single
problem, and to correct it, BEFORE a second problem occurs. A second
problem could potentially cause a loss of data accessibility in that array.
Designing the system configuration to hold multiple copies of data in
different arrays can improve availability by reducing the likelihood of a
total loss of data access.
One of the ways in which disk drives are not perfectly reliable is that
one part of the media holding some data may be unreadable while the rest
of the disk media is readable. Such events are, within limits (i.e. AFR,
MTBF), considered to be normal occurrences. However, the only time it is
possible to determine which parts of the disk are readable is when
something actually attempts to read that area of the disk.
1. In order to ensure the continued data availability for all T3+ arrays,
it is essential to run the 'vol verify fix' command on all redundant T3
volumes. The 'vol verify fix' command should be run immediately after
completing the patch installation and then on a strongly recommended
schedule of every 30 days thereafter.
The recommended process for running "vol verify fix" is:
a) Evaluate the best time, frequency and command line options at
your site, for running 'vol verify fix'.
b) Before running the command, ensure that you have a full, verified
backup for each of the T3+ volumes on which you will be running
'vol verify fix'.
c) At a suitable time, preferably when the I/O load to the T3+ is
minimal, run 'vol verify fix' against each redundant volume in
turn.
d) Either while it is running, or after it has completed, check the
T3+ syslog (or remote log, if this is configured). This is to
confirm that there were no instances of any RAID mirror
data/parity mismatches being detected by the "vol verify fix"
process.
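If remote logging from the T3+ to a host is configured, step d) can be
done in real time with a sketch such as the following (the log file path
is hypothetical and depends on how syslog forwarding is set up at your
site; the message strings come from the examples later in this section):

    # Watch the forwarded T3+ log for 'vol verify fix' mismatch activity.
    tail -f /var/adm/t3.log | egrep 'Verify failed|Attempting to fix|is fixed in'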
2. The 'vol verify fix' command has two essential purposes:
A. Every disk block in a redundant T3+ volume is read.
The T3+ does not know about host filesystems, unused space or OS
partition layouts etc., and so 'vol verify fix' is able to read
every disk block in the T3+ volume.
Any disk which has blocks that are found to be unreadable will have
the unreadable data reconstructed by the T3+ firmware from other
disks in that volume. This reconstructed data will then be
rewritten to the original disk to replace the previously unreadable
data. This is what normally occurs in response to a host read.
The areas of the T3+ volume which have never been read (e.g. unused
space in a filesystem), or areas which have only been written but
never read (e.g. unarchived database redo logs), are not otherwise
checked automatically. Depending upon the host usage, application,
data layout, etc., it is unlikely or even impossible for every block
to be read in the course of normal host activity.
B. Parity blocks are checked to verify the expected values that are
calculated from the data blocks.
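As general background only (this illustrates the standard RAID-5 XOR
relationship, not the T3+ firmware internals), the parity for a stripe is
the bitwise XOR of its data blocks, which is also what makes reconstruction
of a single unreadable block possible:

    # Illustrative values only; ksh/bash arithmetic.
    d1=0xA5 d2=0x3C d3=0x0F
    p=$(( d1 ^ d2 ^ d3 ))           # parity written for the stripe
    rebuilt_d2=$(( p ^ d1 ^ d3 ))   # rebuild d2 from parity + other blocks
    printf 'parity=0x%X rebuilt_d2=0x%X\n' "$p" "$rebuilt_d2"   # rebuilt_d2=0x3C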
If the parity blocks do not match the expected value, then 'vol
verify fix' rewrites the parity blocks to contain the corrected
value. However, even though the parity and data are now consistent
(as a result of the 'vol verify fix' being run), the fact that the
data blocks and the parity blocks were ever inconsistent means that
the data in that particular volume CANNOT be relied on to be valid.
It is essential to understand that rewriting the parity blocks to be
consistent with the data blocks will NOT cause data integrity
problems. But neither does it guarantee to fix data integrity
problems or to provide valid data. It only means that a data
integrity problem was present on that volume PRIOR to running the
'vol verify fix' command. Such mirror data/parity mismatches are
VERY rare events, but it is possible for them to occur as a result
of rare hardware failures or undocumented administration procedures.
3. As part of the process of running 'vol verify fix', it is absolutely
necessary to review the T3+ syslog (or remote log) entries for the time
period that the 'vol verify fix' was running. This is to ensure there
are no mirror data/parity mismatches reported. This can be done by
monitoring the remote T3+ log in real time during the 'vol verify fix',
or by checking the T3+ syslog after the 'vol verify fix' has completed.
Mirror data/parity error message examples:
RAID-1
------
nws-encl51+52:/:<12>vol verify r1
Jun 23 17:00:01 WXFT[1]: N: u1ctr Verify failed on block AF
Jun 23 17:00:01 WXFT[1]: N: u1ctr Verify failed on block B0
Jun 23 17:00:01 WXFT[1]: N: u1ctr Verify failed on block B1
nws-encl51+52:/:<12>vol verify r1 fix
Jun 23 17:08:16 WXFT[1]: N: u1ctr Mirror block AF is fixed in vol (r1)
Jun 23 17:08:16 WXFT[1]: N: u1ctr Attempting to fix block B0 in vol (r1)
Jun 23 17:08:16 WXFT[1]: N: u1ctr Mirror block B0 is fixed in vol (r1)
Jun 23 17:08:16 WXFT[1]: N: u1ctr Attempting to fix block B1 in vol (r1)
Jun 23 17:08:16 WXFT[1]: N: u1ctr Mirror block B1 is fixed in vol (r1)
RAID-5
------
Jun 06 17:06:26 sh01[1]: N: vol verify v0 fix
Jun 06 17:06:28 sh01[1]: N: Volume v0 verification started
Jun 06 17:06:30 WXFT[1]: N: u1ctr Attempting to fix parity on stripe 0
in vol (v0)
Jun 06 17:06:30 WXFT[1]: N: u1ctr Parity on stripe 0 is fixed in vol(v0)
Jun 06 17:06:30 WXFT[1]: N: u1ctr Attempting to fix parity on stripe 1
in vol (v0)
Jun 06 17:06:30 WXFT[1]: N: u1ctr Parity on stripe 1 is fixed in vol(v0)
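A hedged post-run check that covers both RAID levels (again assuming the
T3+ syslog, or a forwarded copy of it, is available on the host; the
message strings come from the examples above):

    # Any output indicates a mirror data/parity mismatch was detected/fixed.
    egrep 'Verify failed|Attempting to fix|is fixed in' syslog

No output for the period during which 'vol verify fix' was running means
no mismatches were reported.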
4. If a mirror data/parity mismatch is reported as having been corrected,
then the data in that T3+ volume cannot be relied upon to be valid. If
this happens, it is strongly recommended to:
a) use your application to verify the validity of the data (if such
a procedure is available), and/or
b) restore from your latest verified backup (or follow your equivalent
local procedure), and/or
c) provide details of the history of the affected array (i.e. any
changes that have been made) to your service provider and request
advice on how to proceed.
It is NOT recommended to ignore any mirror data/parity mismatch warning
messages.
After running 'vol verify fix', one warning and only one warning will
ever be written for each stripe that has a mirror data/parity mismatch.
So unless there is an underlying mechanism which causes further mirror
data/parity mismatches on that particular stripe, the process of running
'vol verify fix' corrects the mismatch and there will never be a need
for any additional warnings on that stripe.
Therefore, in the very unlikely event that you see a mirror data/parity
mismatch warning message, and you did not expect to see it, we STRONGLY
recommend that you perform one or more of the above three actions.
Comments:
None
============================================================================
Implementation Footnote:
i) In case of MANDATORY FINs, Sun Services will attempt to contact
all affected customers to recommend implementation of the FIN.
ii) For CONTROLLED PROACTIVE FINs, Sun Services mission critical
support teams will recommend implementation of the FIN (to their
respective accounts), at the convenience of the customer.
iii) For REACTIVE FINs, Sun Services will implement the FIN as the
need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network
browser as follows:
SunWeb Access:
--------------
* Access the top level URL of http://sdpsweb.central/FIN_FCO/
* From there, select the appropriate link to query or browse the FIN and
FCO Homepage collections.
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Central/
* From there, select the appropriate link to browse the FIN or FCO index.
Internet Access:
----------------
* Access the top level URL of https://spe.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to finfco-manager@Sun.COM
--------------------------------------------------------------------------