Document Audience: INTERNAL
Document ID: I0876-3
Title: Patch 112276-06 (Firmware 2.01.03) and later for Sun StorEdge T3+ (T3B) Arrays resolves several disk error handling issues. SunAlert: Yes
Copyright Notice: Copyright © 2005 Sun Microsystems, Inc. All Rights Reserved
Update Date: 2003-06-27

---------------------------------------------------------
            - Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                        FIELD INFORMATION NOTICE
               (For Authorized Distribution by SunService)
FIN #: I0876-3
Synopsis: Patch 112276-06 (Firmware 2.01.03) and later for Sun StorEdge T3+ (T3B) Arrays resolves several disk error handling issues. SunAlert: Yes
Create Date: Jun/27/03
SunAlert: Yes
Top FIN/FCO Report: Yes
Products Reference: Sun StorEdge T3+/3910/3960/6910/6960
Product Category: StorEdge / SW Admin
Product Affected: 
Systems Affected:
-----------------  
Mkt_ID   Platform   Model   Description                  Serial Number
------   --------   -----   -----------                  -------------
  -      ANYSYS       -     System Platform Independent        -


X-Options Affected:
-------------------
Mkt_ID       Platform   Model   Description              Serial Number
------       --------   -----   -----------              -------------
  -           T3+        ALL    T3+ StorEdge Array             -
  -           3910       ALL    Sun StorEdge 3910 Array        -
  -           3960       ALL    Sun StorEdge 3960 Array        -
  -           6910       ALL    Sun StorEdge 6910 Array        -
  -           6960       ALL    Sun StorEdge 6960 Array        -
Parts Affected: 
Part Number     Description           Model
-----------     -----------           -----
     -               -                  -
References: 
BugId:      4697868   - disk in raid 5 on T3+ failed and Oracle database 
                        crashed in clustered config.

            4707617   - Unrecovered Read Error during vol verify fix
                        operation not corrected

FIN:        I0936-1   - Special pre-installation procedures are required 
                        to prevent loss of volume access with F/W Update 
                        patch 109115-12 (FW 1.18.1) on Sun StorEdge 
                        T3 (T3A) Arrays.
                        
            I0966-1   - Best practices guidelines are available for 
                        StorEdge T3/T3+ arrays which encounter 
                        "disk error 03" messages.                        

PatchId:    112276-06 - T3+ 2.01.03: System Firmware Update.
            112276-07 - T3+ 2.01.03: System Firmware Update.

Sun Alert:  52562     - Special Firmware Installation Procedures Are 
                        Required to Prevent Loss of Volume Access on 
                        StorEdge T3/T3+ Arrays.
Issue Description: 
-------------------------------------------------------------------------
| CHANGE HISTORY                                                          |
| ==============                                                          |
|                                                                         |
|  FIN I0876-3 from I0876-2                                               |
|                                                                         |
|  DATE MODIFIED: May 21, 2003                                            |
|                                                                         |
|  UPDATES:  PROBLEM DESCRIPTION, CORRECTIVE ACTION                       |
|                                                                         |
|    PROBLEM DESCRIPTION:                                                 |
|    --------------------                                                 |
|                        . NOTE on the purpose of this FIN has been added |
|                        . Modified entire Problem Description section to |
|                          include new features 'vol verify fix' & others |
|                        . Removed SPECIAL NOTE section to generate       |
|                          a separate FIN to address its contents         |
|                        . Revised WARNING section                        |
|                                                                         |
|                                                                         |
|    CORRECTIVE ACTION:                                                   |
|    ------------------                                                   |
|                        . Revised entire corrective action section       |
|                        . Added Post-Install instructions related to     |
|                          'vol verify fix' command                       |
|                                                                         |
 -------------------------------------------------------------------------

NOTE:  The work described in this FIN may require significant time and
       effort, some of which can be avoided entirely with preparation. 
       This FIN MUST therefore be read and understood from beginning to end. 

Sun StorEdge T3+ arrays with firmware versions prior to 2.01.03 may be
susceptible to loss of availability. This situation can occur when
certain disk errors are encountered. 

Depending upon the

    1) configuration,
    2) application,
    3) type of volume manager,

in use, the host may

    1) continue to retry read/write operations,
    2) unmount the volume, or
    3) cause the application to time out.

If a disk drive experiences one of the following errors,

    "Sense Key = 0x4",
    "Sense Key = 0x1, Asc = 0x5d"

the T3+ will repeatedly retry read/write operations. This may appear to
the host as if the T3+ is not responding.

This issue has been resolved with firmware patch 112276-06 and later.
With firmware version 2.01.03 and above, the error handling of these
particular disk errors has been enhanced to appropriately disable the
affected drive.

If more than one disk in a given RAID volume reports these particular
errors, then ALL of the drives that report these errors WILL be
disabled. This will result in the volume being unmounted and will cause
a loss of access to data.

To avoid the potential loss of access to data, special pre-installation
and post-installation procedures are required with this patch and are
detailed below. 


>>>>> WARNING! <<<<<

    Patch 112276-06 is for the T3+ (T3B) ONLY. Do not install this patch on
    a T3 (T3A).  Use the "ver" command to see if you have a T3+ (T3B), as
    shown below.

         hws27-41:/:<8>ver

         T3B Release 2.01.01 2002/07/30 19:16:42 (10.4.27.41)
         Copyright (C) 1997-2001 Sun Microsystems, Inc.
         All Rights Reserved. 

    Plan to allocate the necessary time and effort for the entire upgrade 
    process based on the need to complete the following non-trivial tasks:
         
         1) Backup all volume data
         2) Review all available syslogs for drive errors
         3) Run the 'vol verify' procedures and any standard corrective
            actions
         4) Replace all failed drives
         5) a) Upgrade all T3+ arrays 
            b) Upgrade 3910/3960/6910/6960 SP images, including all T3+ arrays
         6) Run the 'vol verify fix' procedures

    Without first reviewing the T3+ syslog file(s) for possible drive errors
    and then taking the necessary pro-active** action, installing patch 
    112276 may result in drives becoming disabled which can lead to a loss
    of volume access. If prior to the installation of patch 112276-06, any 
    drives have exhibited the errors listed in this FIN, those drives WILL be 
    disabled when patch 112276-06 or newer is installed.
	    
    **Pro-active means having spare disks available and immediately replacing
      drives that have the errors listed in this document, prior to installing 
      patch 112276-06.

>>>>> END WARNING <<<<<<<


The affected systems, listed in the 'PRODUCTS AFFECTED:' section above, 
include any StorEdge T3+ (T3B) array that does not have firmware version
2.01.03 or above.  This firmware is available in patch 112276-06. See 
FIN I0936-1 for a similar issue with the T3 (T3A) array.
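
The applicability check described above can be sketched in Python. This is
only an illustration, not a Sun-supplied tool: the function names are
hypothetical, and the parser assumes the 'ver' banner always has the form
shown in the WARNING section ("<model> Release <version> ..."). Any model
string other than "T3B" is treated as not applicable.

```python
import re

MIN_FIXED = (2, 1, 3)  # firmware 2.01.03, first delivered in patch 112276-06

def parse_ver(output):
    """Return (model, version_tuple) from 'ver' output, or None if unrecognized."""
    m = re.search(r"(\S+)\s+Release\s+(\d+(?:\.\d+)+)", output)
    if m is None:
        return None
    version = tuple(int(part) for part in m.group(2).split("."))
    return m.group(1), version

def needs_patch(output):
    """True only for a T3B whose firmware is older than 2.01.03."""
    parsed = parse_ver(output)
    if parsed is None:
        raise ValueError("unrecognized 'ver' output")
    model, version = parsed
    return model == "T3B" and version < MIN_FIXED

# Banner line copied from the example in this FIN.
sample = "T3B Release 2.01.01 2002/07/30 19:16:42 (10.4.27.41)"
print(parse_ver(sample))
print(needs_patch(sample))
```

Comparing the version as a tuple of integers avoids the pitfalls of string
comparison when a field has leading zeros.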

For T3+ arrays with firmware versions lower than 2.01.03, the T3+ syslog 
file may show multiple error messages of the following types:

   A. More than one "Sense Key = 0x4" error on one specific drive.
      Example:

      Jun 05 06:16:14 ISR1[2]: W: u2d5 SCSI Disk Error Occurred (path = 0x0)
      Jun 05 06:16:14 ISR1[2]: W: Sense Key = 0x4, Asc = 0x15, Ascq = 0x1
      Jun 06 08:36:19 ISR1[2]: W: u2d5 SCSI Disk Error Occurred (path = 0x0)
      Jun 06 08:36:19 ISR1[2]: W: Sense Key = 0x4, Asc = 0x15, Ascq = 0x1

   AND/OR

   B. A single "Sense Key = 0x1, Asc = 0x5d" error on one specific drive.
      Example:

      Jul 31 16:19:22 ISR1[1]: N: u1d3 SCSI Disk Error Occurred (path = 0x1)
      Jul 31 16:19:22 ISR1[1]: N: Sense Key = 0x1, Asc = 0x5d, Ascq = 0x0
      Jul 31 16:19:22 ISR1[1]: N: Sense Data Description = Failure Prediction
                                  Threshold Exceeded
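
The two conditions above (A: more than one "Sense Key = 0x4" error on a
drive, B: a single "Sense Key = 0x1, Asc = 0x5d" error) could be checked
against a saved syslog with a short Python sketch such as the one below.
The function name is hypothetical. It also assumes that each "Sense Key"
line belongs to the most recent "SCSI Disk Error Occurred" line logged by
the same task (for example ISR1[2]), which matches the examples in this FIN
but is not guaranteed for every syslog.

```python
import re
from collections import Counter

DRIVE_RE = re.compile(r"(\S+\[\d+\]): \w: (u\d+d\d+) SCSI Disk Error Occurred")
SENSE_RE = re.compile(
    r"(\S+\[\d+\]): \w: Sense Key = 0x([0-9a-fA-F]+), Asc = 0x([0-9a-fA-F]+)"
)

def flag_drives(lines):
    """Return {drive: reason} for drives that match condition A or B."""
    last_drive = {}        # task tag -> drive named in its latest error line
    hw_errors = Counter()  # drive -> number of Sense Key 0x4 errors seen
    flagged = {}
    for line in lines:
        m = DRIVE_RE.search(line)
        if m:
            last_drive[m.group(1)] = m.group(2)
            continue
        m = SENSE_RE.search(line)
        if not m or m.group(1) not in last_drive:
            continue
        drive = last_drive[m.group(1)]
        key, asc = int(m.group(2), 16), int(m.group(3), 16)
        if key == 0x4:
            hw_errors[drive] += 1
            if hw_errors[drive] > 1:        # condition A
                flagged[drive] = "multiple Sense Key 0x4 errors"
        elif key == 0x1 and asc == 0x5D:    # condition B
            flagged[drive] = "Sense Key 0x1 / Asc 0x5d"
    return flagged

# Lines copied from the examples in this FIN.
sample = [
    "Jun 05 06:16:14 ISR1[2]: W: u2d5 SCSI Disk Error Occurred (path = 0x0)",
    "Jun 05 06:16:14 ISR1[2]: W: Sense Key = 0x4, Asc = 0x15, Ascq = 0x1",
    "Jun 06 08:36:19 ISR1[2]: W: u2d5 SCSI Disk Error Occurred (path = 0x0)",
    "Jun 06 08:36:19 ISR1[2]: W: Sense Key = 0x4, Asc = 0x15, Ascq = 0x1",
    "Jul 31 16:19:22 ISR1[1]: N: u1d3 SCSI Disk Error Occurred (path = 0x1)",
    "Jul 31 16:19:22 ISR1[1]: N: Sense Key = 0x1, Asc = 0x5d, Ascq = 0x0",
]
for drive, reason in flag_drives(sample).items():
    print(drive, reason)
```

If the result names more than one drive, treat the situation as described
in the pre-install instructions before installing the patch.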


NOTE:  Patch 112276-06 provides enhancements to the 'vol verify' and
       'vol verify fix' commands as described below:

  1. Previously, the 'vol verify' command terminated at the first
     occurrence of a disk error.  The code has been modified to scan
     the whole volume for any errors or parity mismatches, even when a disk 
     error of type 'Media error' (Sense Key = 0x3, Asc = 0x11, Ascq = 0x0) 
     is encountered.
 
  2. Previously, the 'vol verify fix' command terminated at the first
     occurrence of a disk 'Media error'.  The code has been modified to
     regenerate the valid data from other disks in the volume, whenever 
     any one disk in a given volume encounters a disk 'Media error' 
     (Sense key = 0x3, ASC = 0x11). This is done by performing an alternate
     stripe operation to construct the good data from other drives, writing 
     it back to the bad block, and then letting the disk perform an auto 
     reallocation. If it is not possible to correct the error, the drive
     is marked as "failed" and the 'vol verify fix' command will terminate 
     at that point. Otherwise, it will continue to scan the entire volume.
     
     Sample disk media errors:
   
     Feb 10 02:37:49 ISR1[1]: W: u1d8 SCSI Disk Error Occurred (path = 0x0)
     Feb 10 02:37:49 ISR1[1]: W: Sense Key = 0x3, Asc = 0x11, Ascq = 0x0
     Feb 10 02:37:49 ISR1[1]: W: Sense Data Description = Unrecovered Read Error
     Feb 10 02:37:49 ISR1[1]: W: Valid Information = 0x12257ea
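
The verify and fix behavior described in this NOTE can be illustrated with
generic RAID-5 XOR parity arithmetic. The Python sketch below is not the
T3+ firmware algorithm, whose internals are not described in this FIN. The
function names, block sizes, and stripe contents are made up for
illustration only.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def parity_ok(data_blocks, parity):
    """Verify step: stored parity must equal the XOR of the data blocks."""
    return xor_blocks(data_blocks) == parity

def rebuild_block(surviving_blocks, parity):
    """Fix step: regenerate one unreadable data block from the rest of the stripe."""
    return xor_blocks(list(surviving_blocks) + [parity])

# A made-up stripe with three data blocks and one parity block.
stripe = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(stripe)
assert parity_ok(stripe, parity)

# Pretend the middle block returned an unrecovered read error
# (Sense Key = 0x3, Asc = 0x11) and rebuild it from the other members.
recovered = rebuild_block([stripe[0], stripe[2]], parity)
assert recovered == stripe[1]
```

Reconstruction works only when a single member of the stripe is unreadable,
which is why a second failing drive in the same volume is so dangerous.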
Implementation: 
           ---
          |   |   MANDATORY (Fully Proactive)
           ---


           ---
          | X |   CONTROLLED PROACTIVE (per Sun Geo Plan)
           ---


           ---
          |   |   REACTIVE (As Required)
           ---
Corrective Action: 
The following recommendation is provided as a guideline for authorized
Sun Services Field Representatives who may encounter the above
mentioned issue.

1. For All T3+ (T3B) arrays:

   . Follow the pre-installation instructions detailed below.
 
   . Install patch 112276-06 or later and strictly follow the procedures 
     listed in the 'Patch Installation Instructions' section of the patch 
     document.

   . Follow the post-installation instructions detailed below.

2. For All 3910/3960/6910/6960:
     
   . Log in to the SP and type: "cat /etc/motd". 
     
   . Upgrade the Service Processor Image to rev 2.3.1 or above by accessing
     the image and following the README file for upgrade information at:
      
       http://edist.central
       http://futureworld.central/WSTL/PROJECTS/SPImage/Src/web/Downloads.shtml
     
   . Once the Service Processor upgrade is completed, the T3+ controller 
     firmware must be upgraded as a separate process. This will be patch 
     112276-06 or later, and will be contained in the particular SP image.

   . Follow the pre-installation instructions detailed below.
 
   . Install patch 112276-06 or later and strictly follow the procedures 
     listed in the 'Patch Installation Instructions' section of the patch 
     document. The procedures are also explained in the image README file 
     and in the 'Sun StorEdge(tm) 3900 and 6900 Series' Reference and 
     Service Manuals.

   . Follow the post-installation instructions detailed below. 


I. PATCH PRE-INSTALL INSTRUCTIONS: (T3+/3910/3960/6910/6960)
----------------------------------

  1. ftp the 'syslog' file from the T3+ where the firmware patch will
     be installed.
   
  2. Save this 'syslog' file to a local directory on the host system and 
     run the following command:

     % egrep -i \
       '0x5D|Threshold|0x15|0x4|Mechanical|Positioning|Exceeded|Disk Error' syslog
     
       (This search command can be modified if site requirements are varied)     

  3. There is a chance that more than one disk will have these error codes 
     in the syslog. If this is the case, take a backup of all the files 
     residing on the volume/slices. These errors are fatal but may still 
     allow some I/O requests to continue. This is an emergency situation 
     because the volume may become unavailable due to the presence of these 
     errors. Disks with 0x1/0x5d and 0x4/32 errors must be replaced because 
     they are about to fail or have already failed. Be aware that in this 
     situation a volume backup may itself fail because of dual disk failure.

  4. After the backup, the volumes should be recreated and reinitialized 
     before restoring the data from the backup. This will reassign all bad
     blocks from the volumes.       

  5. If a backup cannot be taken (a situation that is strongly discouraged),
     the 'vol verify' command should be rerun until it runs to completion. 

  6. An alternate solution to the 'vol verify' command is described in 
     the 'Work Around:' section of BugID 4707617.

  7. Ensure the volume is now in an optimal working state without any 
     drives disabled and then continue with the patch install. 

Error Examples:
  
   Here 'u2d5' and 'u1d3' show the locations of the drives.

   test_host% egrep -i \
      '0x5D|Threshold|0x15|0x4|Mechanical|Positioning|Exceeded|Disk Error' syslog

      Jun 05 06:16:14 ISR1[2]: W: u2d5 SCSI Disk Error Occurred (path = 0x0)
      Jun 05 06:16:14 ISR1[2]: W: Sense Key = 0x4, Asc = 0x15, Ascq = 0x1
      Jun 05 06:16:14 ISR1[2]: W: Sense Data Description = Mechanical 
                                  Positioning Error
      Jun 06 08:36:19 ISR1[2]: W: u2d5 SCSI Disk Error Occurred (path = 0x0)
      Jun 06 08:36:19 ISR1[2]: W: Sense Key = 0x4, Asc = 0x15, Ascq = 0x1
      Jun 06 08:36:19 ISR1[2]: W: Sense Data Description = Mechanical 
                                  Positioning Error
                                  
   AND/OR

      Jul 31 16:19:22 ISR1[1]: N: u1d3 SCSI Disk Error Occurred (path = 0x1)
      Jul 31 16:19:22 ISR1[1]: N: Sense Key = 0x1, Asc = 0x5d, Ascq = 0x0
      Jul 31 16:19:22 ISR1[1]: N: Sense Data Description = Failure Prediction
                                   Threshold Exceeded


II. PATCH INSTALL:
------------------

Based on the 'CORRECTIVE ACTION:' section listed above, download and
install T3+ patch 112276-06 or the 3910/3960/6910/6960 SP image with
the bundled T3+ patch 112276-06.

NOTE: Refer to the pre-install section of the patch README for specifics.
      It may be helpful to review the entire process first: find the
      patch online, download it, unpack it, and then review the
      documents provided with the patch.


III. PATCH POST-INSTALL INSTRUCTIONS:(T3+/3910/3960/6910/6960)
-------------------------------------

  1. Run the 'vol verify fix' command and any standard corrective actions.
     Refer to the release notes for v2.01.03 or BugID 4707617 for 
     additional reference.

  2. It will be essential to continue to run the 'vol verify fix' 
     procedure EVERY 30 days to maintain the health of your drives.  

     This must become an integral part of the ongoing maintenance
     necessary for the continued reliability and accessibility of the
     storage arrays.  Failure to do so will continue to raise the
     unjustified costs associated with the high rate of drives being
     replaced and then determined to be NTF (No Trouble Found).
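
One way to keep track of the 30-day interval is sketched below in Python.
This is only an illustration: the volume names, the dates, and the idea of
keeping a per-volume record of the last completed run are assumptions, not
part of any Sun-supplied tool.

```python
from datetime import date, timedelta

VERIFY_INTERVAL = timedelta(days=30)  # interval recommended in this FIN

def next_verify_due(last_run):
    """Date by which the next 'vol verify fix' run should start."""
    return last_run + VERIFY_INTERVAL

def overdue_volumes(last_runs, today):
    """Return the sorted names of volumes whose 30-day interval has elapsed."""
    return sorted(name for name, last in last_runs.items()
                  if today > next_verify_due(last))

# Hypothetical record of the most recent completed run for each volume.
last_runs = {"v0": date(2003, 6, 27), "v1": date(2003, 5, 1)}
print(next_verify_due(last_runs["v0"]))
print(overdue_volumes(last_runs, today=date(2003, 7, 1)))
```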
     
     
III. PATCH POST-INSTALL INSTRUCTIONS:(T3+/3910/3960/6910/6960)
-------------------------------------

Note: As is always the case, the use of RAID protected disk subsystems does 
      not eliminate the need for regular, verified data backups.

    Any disk RAID subsystem can survive only a certain number of failures,
    depending on the RAID level and other factors, before valid data is no
    longer available. A RAID subsystem like a T3+ (with a redundant RAID level
    other than RAID-0) is designed to survive any typical single failure while 
    still being able to supply valid data to the host.  However, RAID 
    subsystems are not designed to survive all cases of two failures being 
    present at the same time, even if those two failures result from different
    causes and the initial failures occurred at different times.
  
    Therefore it is necessary to quickly identify that there is only a single
    problem, and to correct it, BEFORE a second problem occurs.  A second 
    problem could potentially cause a loss of data accessibility in that array.
    System configuration design, to hold multiple copies of data in different 
    arrays, can improve availability by reducing the likelihood of a total loss
    of data access.
  
    One of the ways in which disk drives are not perfectly reliable is that 
    one part of the media holding some data may be unreadable, while the rest 
    of the disk media is readable.  Such events are, within limits (i.e. AFR,
    MTBF), considered to be normal occurrences.  However, the only time that it
    is possible to determine which parts of the disk are readable is when 
    something actually attempts a read to that area on the disk.


   1. In order to ensure the continued data availability for all T3+ arrays,
      it is essential to run the 'vol verify fix' command on all redundant T3+ 
      volumes. The 'vol verify fix' command should be run immediately after 
      completing the patch installation and then on a strongly recommended 
      schedule of every 30 days thereafter.

      The recommended process for running "vol verify fix" is:
  
         a) Evaluate the best time, frequency and command line options at 
            your site, for running 'vol verify fix'. 
  
         b) Before running the command, ensure that you have a full, verified
            backup for each of the T3+ volumes on which you will be running 
            'vol verify fix'.
 
         c) At a suitable time, preferably when the I/O load to the T3+ is 
            minimal, run 'vol verify fix' against each redundant volume in 
            turn.
  
         d) Either while it is running, or after it has completed, check the
            T3+ syslog (or remote log, if this is configured). This is to
            confirm that there were no instances of any RAID mirror 
            data/parity mismatches being detected by the "vol verify fix" 
            process. 

   2. The 'vol verify fix' command has two essential purposes:

      A. Every disk block in a redundant T3+ volume is read. 

         The T3+ does not know about host filesystems, unused space or OS
         partition layouts etc., and so 'vol verify fix' is able to read 
         every disk block in the T3+ volume.
 
         Any disk which has blocks that are found to be unreadable will have
         the unreadable data reconstructed by the T3+ firmware from other 
         disks in that volume.  This reconstructed data will then be 
         rewritten to the original disk to replace the previously unreadable
         data. This is what normally occurs in response to a host read. 

         The areas of the T3+ volume which have never been read (e.g. unused
         space in a filesystem) or areas which have only been written but 
         never read (e.g. unarchived database redo logs) are currently not 
         checked automatically. Depending upon the host usage, application,
         data layout, etc., it is unlikely or even impossible that every 
         block will be read in normal operation.
  
      B. Parity blocks are checked to verify the expected values that are
         calculated from the data blocks.
 
         If the parity blocks do not match the expected value, then 'vol 
         verify fix' rewrites the parity blocks to contain the corrected 
         value. However, even though the parity and data are now consistent 
         (as a result of the 'vol verify fix' being run), the fact that the 
         data blocks and the parity blocks were ever inconsistent means that
         the data in that particular volume CANNOT be relied on to be valid.

         By rewriting the parity blocks to be consistent with the data 
         blocks, it is essential to understand that this process will NOT 
         cause data integrity problems. But neither does it guarantee to 
         fix data integrity problems or to provide valid data.  It only 
         means that a data integrity problem was present on that volume 
         PRIOR to running the 'vol verify fix' command.  Such mirror 
         data/parity mismatches are VERY rare events, but it is possible 
         for them to occur as a result of rare hardware failures or 
         undocumented administration procedures.
  
   3. As part of the process of running 'vol verify fix', it is absolutely 
      necessary to review the T3+ syslog (or remote log) entries for the time 
      period that the 'vol verify fix' was running. This is to ensure there 
      are no mirror data/parity mismatches reported. This can be done by 
      monitoring the remote T3+ log in real time, during the 'vol verify fix'
      or by checking the T3+ syslog after the 'vol verify fix' has completed.

      Mirror data/parity error message examples:

       RAID-1
       ------

       nws-encl51+52:/:<12>vol verify r1
       Jun 23 17:00:01 WXFT[1]: N: u1ctr Verify failed on block AF
       Jun 23 17:00:01 WXFT[1]: N: u1ctr Verify failed on block B0
       Jun 23 17:00:01 WXFT[1]: N: u1ctr Verify failed on block B1

       nws-encl51+52:/:<12>vol verify r1 fix
       Jun 23 17:08:16 WXFT[1]: N: u1ctr Mirror block AF is fixed in vol (r1)
       Jun 23 17:08:16 WXFT[1]: N: u1ctr Attempting to fix block B0 in vol (r1)
       Jun 23 17:08:16 WXFT[1]: N: u1ctr Mirror block B0 is fixed in vol (r1)
       Jun 23 17:08:16 WXFT[1]: N: u1ctr Attempting to fix block B1 in vol (r1)
       Jun 23 17:08:16 WXFT[1]: N: u1ctr Mirror block B1 is fixed in vol (r1)


       RAID-5
       ------

       Jun 06 17:06:26 sh01[1]: N: vol verify v0 fix
       Jun 06 17:06:28 sh01[1]: N: Volume v0 verification started
       Jun 06 17:06:30 WXFT[1]: N: u1ctr Attempting to fix parity on stripe 0 
                                   in vol (v0)
       Jun 06 17:06:30 WXFT[1]: N: u1ctr Parity on stripe 0 is fixed in vol(v0)
       Jun 06 17:06:30 WXFT[1]: N: u1ctr Attempting to fix parity on stripe 1 
                                   in vol (v0)
       Jun 06 17:06:30 WXFT[1]: N: u1ctr Parity on stripe 1 is fixed in vol(v0)
  
  
    4. If a mirror data/parity mismatch is reported as having been corrected, 
       then the data in that T3+ volume cannot be relied upon to be valid. If 
       this happens, it is strongly recommended to:
  
         a) use your application to verify the validity of the data (if this 
            procedure is available) and/or
  
         b) restore from your latest verified backup (or follow your equivalent
            local procedure) and/or
  
         c) provide details of the history for the affected array (i.e. any 
            changes that have been made) to your service provider and request 
            advice on how to proceed.
  
       It is NOT recommended to ignore any mirror data/parity mismatch warning 
       messages.  

       After running 'vol verify fix', one warning and only one warning will 
       ever be written for each stripe that has a mirror data/parity mismatch. 
       So unless there is an underlying mechanism which causes further mirror
       data/parity mismatches on that particular stripe, the process of running
       'vol verify fix' corrects the mismatch and there will never be a need 
       for any additional warnings on that stripe.

       Therefore, in the very unlikely event that you see a mirror data/parity
       mismatch warning message, and you did not expect to see it, we STRONGLY 
       recommend that you perform one or more of the above three actions.
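
The log review described in steps 3 and 4 above could be assisted by a
short Python sketch like the one below. The function name is hypothetical
and the regular expressions are derived only from the RAID-1 and RAID-5
message examples in this FIN; other firmware releases may word these
messages differently, so this is not a substitute for reading the log.

```python
import re
from collections import defaultdict

FIXED_RE = re.compile(
    r"(Mirror block|Parity on stripe) (\S+) is fixed in vol\s*\((\w+)\)"
)
FAILED_RE = re.compile(r"Verify failed on block (\S+)")

def mismatch_report(lines):
    """Return (fixed, detected): corrected mismatches per volume, and blocks
    reported by a plain 'vol verify' (those messages do not name a volume)."""
    fixed = defaultdict(list)
    detected = []
    for line in lines:
        m = FIXED_RE.search(line)
        if m:
            kind = "mirror" if m.group(1).startswith("Mirror") else "parity"
            fixed[m.group(3)].append((kind, m.group(2)))
            continue
        m = FAILED_RE.search(line)
        if m:
            detected.append(m.group(1))
    return dict(fixed), detected

# Lines copied from the examples in this FIN.
sample = [
    "Jun 23 17:00:01 WXFT[1]: N: u1ctr Verify failed on block AF",
    "Jun 23 17:08:16 WXFT[1]: N: u1ctr Mirror block AF is fixed in vol (r1)",
    "Jun 23 17:08:16 WXFT[1]: N: u1ctr Attempting to fix block B0 in vol (r1)",
    "Jun 23 17:08:16 WXFT[1]: N: u1ctr Mirror block B0 is fixed in vol (r1)",
    "Jun 06 17:06:30 WXFT[1]: N: u1ctr Parity on stripe 0 is fixed in vol(v0)",
]
fixed, detected = mismatch_report(sample)
for volume, items in fixed.items():
    print(volume, items)
print(detected)
```

Any volume that appears in the result had a mirror data/parity mismatch
corrected, so its data should be validated as described in step 4.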
Comments: 
None

============================================================================
Implementation Footnote: 
i)   In case of MANDATORY FINs, Sun Services will attempt to contact   
     all affected customers to recommend implementation of the FIN. 
   
ii)  For CONTROLLED PROACTIVE FINs, Sun Services mission critical    
     support teams will recommend implementation of the FIN  (to their  
     respective accounts), at the convenience of the customer. 

iii) For REACTIVE FINs, Sun Services will implement the FIN as the   
     need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network 
browser as follows:
 
SunWeb Access:
-------------- 
* Access the top level URL of http://sdpsweb.central/FIN_FCO/

* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.
 
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Central/

* From there, select the appropriate link to browse the FIN or FCO index.

Internet Access:
----------------
* Access the top level URL of https://spe.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to finfco-manager@Sun.COM
--------------------------------------------------------------------------
Status: active