Document Audience:INTERNAL
Document ID:A0199-2
Title:Seagate Cheetah 3 FCAL disk drives in A5x00 and T3 Storage Arrays may experience higher than expected failure rate.
Copyright Notice:Copyright © 2007 Sun Microsystems, Inc. All Rights Reserved
Update Date:Tue Sep 02 00:00:00 MDT 2003

----------------------------------------------------------------------------
            - Sun Proprietary/Confidential: Internal Use Only -
----------------------------------------------------------------------------

                             FIELD CHANGE ORDER
            (For Authorized Distribution by Enterprise Services)
            
FCO #: A0199-2
Status: inactive
Synopsis: Seagate Cheetah 3 FCAL disk drives in A5x00 and T3 Storage Arrays may experience higher than expected failure rate.
Date: Sep/02/2003
SunAlert: No
Top FIN/FCO Report: Yes
Products Reference: Seagate FCAL Cheetah 3 Disk Drive
Product Category: Storage / Disk
Product Affected: 
Systems Affected:
 
Mkt_ID   Platform    Model   Description                 Serial Number
------   --------    -----   -----------                 -------------
-         Anysys      All    System Platform Independent       -

X-Options Affected:
 
Mkt_ID   Platform    Model   Description            Serial Number
------   --------    -----   -----------            -------------
  -        T3         All    T3 StorEdge Array            -
  -        T3+        All    T3+ StorEdge Array           -
  -       A5200       All    A5200 StorEdge Array         -
  -       A5100       All    A5100 StorEdge Array         -
Parts Affected: 
Part Number   Description                         Model
-----------   -----------                         -----
540-4440-01   18GB 10K FCAL Disk (T3 FRU)         ST318203FC (Cheetah 3)
540-4673-01   18GB 10K FCAL Disk (SPUD Bracket)   ST318203FC (Cheetah 3)
540-4367-01   36GB 10K FCAL Disk (T3 FRU)         ST136403FC (Cheetah 3)
540-4191-01   18GB 10K FCAL Disk (SPUD Bracket)   ST318203FC (Cheetah 3)
540-3869-01   9GB 10K FCAL Disk (SPUD Bracket)    ST39103FC  (Cheetah 3)
540-4192-01   36GB 10K FCAL Disk (SPUD Bracket)   ST136403FC (Cheetah 3)


(SCSI Devices)
Type   Vendor    Model     SerialNumber(Min)   SerialNumber(Max)   Firmware
----   ------    -------   -----------------   -----------------   --------
Disk   Seagate   ST318203FC       -                   -            < DF4A
Disk   Seagate   ST136403FC       -                   -            < DF4A
Disk   Seagate   ST39103FC        -                   -            < DF4A
References: 
ESC: 537137               
  ESC: 536084
  ESC: 534571   
  ESC: 535693     
  DPCO: 312   
  DPCO: 322   
  PatchID: 111535-03 (or latest revision)
Issue Description: 
Change History 
--------------

A0199-2 (3)

 Date Modified: Aug/07/2003
 Updates: SYNOPSIS, PROBLEM DESCRIPTION, CORRECTIVE ACTION
 . SYNOPSIS: shortened by removing "when left in an "idle" state for
 	     extended periods".
 . PROBLEM DESCRIPTION: simplified and clarified entire section
 . CORRECTIVE ACTION: simplified and clarified entire section
 
---

A0199-2 (2)

 Date Modified: Apr/17/2003
 Updates: AFFECTED PARTS: Firmware
 . AFFECTED PARTS: Changed Firmware from "< D94A" to "< DF4A"

---

A0199-2 (1)

 Date Modified: Sep/27/2002
 Updates: SYNOPSIS, REFERENCES, PROBLEM DESCRIPTION, SPECIAL CONSIDERATIONS
 . SYNOPSIS: Added reference to A5x00 and T3 Storage Arrays
 . REFERENCES: PatchID revision change from -01 to -03
 . PROBLEM DESCRIPTION: Added Note about internal system drives
 . SPECIAL CONSIDERATIONS: PatchID rev change from -01 to -03

--------------

Seagate Cheetah 3 FCAL disk drives may experience higher than expected
failure rate.  The experienced failure mode is Offline or Not Ready.

The root cause of this issue is low-resistance, intermittent shorts
between adjacent pins in flash ROM memory chips that use a red
phosphorus inorganic fire retardant molding compound.  These shorts
are caused by electro-migration of metal across adjacent pins of
these memory chips.

The typical scenario involves drives that go into an "idle" state
with power applied.  A small percentage of these drives may have
experienced this failure already, but it will not be evident until
a data transfer is initiated.  Until then the failed drive appears
to be working and may even respond to a Test Unit Ready command.

Note: Due to the operating profile of internal system drives (minimal 
      idle time), these drives are not exposed to this issue.

New firmware has been developed that decreases the risk of this
electro-migration occurring by exercising the drive FLASH ROM,
thereby avoiding idle time bias voltage for FLASH ROM pin pairs.
Installing the firmware will also "shake-out" any Seagate Cheetah 3
drives sitting in an idle-failed state, or in a marginal,
close-to-failure state.

Note: The FW also releases two drive spin down "watch dog reset"
fixes.  These fixes are:   

   - Watchdog Timer Timeout Due to Failure to Busy The Interface
   
   This fix prevents an unexpected timeout and drive 
   spin-down from happening due to a "lost" pause in 
   the link services routine.
   
   - Watchdog Timer Timeout Due to Failure to Busy The Interface While 
     Aborting Commands
   
   This fix clears a queued command flag left set 
   after receiving an abort command. This caused the 
   watchdog timer to timeout and the drive to spin 
   down and reset.


As such, it is recommended that all customers install the new FW.

This FCO explains how to determine potential customer exposure, how to
install new firmware that will run an I/O check (which will "wake-up"
any Seagate Cheetah 3 drives out of an "idle" state), and how to plan
for possible failed drives that will be evident after the I/O check is
run.

Please follow the directions listed in the SPECIAL INSTRUCTIONS and
CORRECTIVE ACTION sections of this FCO.  

For customers that are considered secure sites and do not allow the
removal of disk drives from the site, please refer to the COMMENTS
section at the end of this FCO.

Example Error Messages
----------------------

T3 Purple Error Messages (Typical):

May 07 22:42:29 ISR1[1]: N: u1d6 SVD_RETRY: Retries Exhausted (ccb =
0x1384034)
May 07 22:42:29 ISR1[1]: N: u1d6 SVD_DONE: Command Error = 0x4
May 07 22:42:29 ANNT[1]: W: u1d6: Failed
May 07 22:42:31 LPCT[1]: N: u1d6: Bypassed on loop 1
May 07 22:42:31 ISR1[1]: N: u1d6 SVD_RETRY: Retries Exhausted (ccb =
0x13918f4)
May 07 22:42:31 ISR1[1]: N: u1d6 SVD_DONE: Command Error = 0x4
May 07 22:42:31 snmp[1]: N: u1d6 ioctl disk failed err=1
May 07 22:42:31 LPCT[1]: N: u1d6: Bypassed on loop 2

Mar 23 23:01:16 ISR1[2]: N: u1d7 SVD_DONE: Command Error = 0x3
Mar 23 23:01:16 ISR1[2]: W: u1d7 SCSI Disk Error Occurred (path = 0x0)
Mar 23 23:01:16 ISR1[2]: W: Sense Key = 0x2, Asc = 0x4, Ascq = 0x2
Mar 23 23:01:16 ISR1[2]: W: Sense Data Description = Logical Unit Not Ready,

A5x00 Photon error messages (Typical):

socal3: port 0: Fibre Channel is OFFLINE
WARNING: /sbus@b,0/SUNW,socal@0,0/sf@1,0/ssd@w2100002037aef8c8,0 (ssd164):
SCSI transport failed: reason 'tran_err': retrying command

socal3: port 1: Fibre Channel is OFFLINE
WARNING: /sbus@3,0/SUNW,socal@0,0/sf@1,0/ssd@w21000020378837f2,0 (ssd171):
SCSI transport failed: reason 'timeout': giving up

May 20 10:59:58 gmi2wap011 scsi: [ID 107833 kern.notice] Sense Key: Not Ready
May 20 10:59:58 gmi2wap011 scsi: [ID 107833 kern.notice] ASC: 0x4 (), ASCQ:
0x1, FRU: 0x2
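
A minimal sketch of how the T3 messages shown above could be scanned for
failed or bypassed drives follows.  It is illustrative only and is not part
of patch 111535-03 or any Sun tool: the function name scan_t3_syslog and
the regular expressions are assumptions based solely on the sample lines in
this FCO, and real syslog output may differ.

```python
import re
from collections import defaultdict

# Patterns assumed from the sample T3 messages in this FCO.
FAILED_RE = re.compile(r"W: (u\d+d\d+): Failed")
BYPASS_RE = re.compile(r"N: (u\d+d\d+): Bypassed on loop (\d+)")


def scan_t3_syslog(lines):
    """Return (set of failed drive IDs, {drive ID: [bypassed loop numbers]})."""
    failed = set()
    bypassed = defaultdict(list)
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            failed.add(match.group(1))
        match = BYPASS_RE.search(line)
        if match:
            bypassed[match.group(1)].append(int(match.group(2)))
    return failed, dict(bypassed)


# Sample lines copied from the T3 messages above.
sample = [
    "May 07 22:42:29 ANNT[1]: W: u1d6: Failed",
    "May 07 22:42:31 LPCT[1]: N: u1d6: Bypassed on loop 1",
    "May 07 22:42:31 ISR1[1]: N: u1d6 SVD_DONE: Command Error = 0x4",
    "May 07 22:42:31 LPCT[1]: N: u1d6: Bypassed on loop 2",
]
failed, bypassed = scan_t3_syslog(sample)
```

With the sample lines, drive u1d6 is reported as failed and as bypassed on
loops 1 and 2.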

Specific environmental conditions which exacerbate drive failures are:

 . exposure to high humidity
 . drive idle time

Drive idle or constant dwell, with power on, provides a voltage bias
which promotes electro-migration of silver/copper between the failing,
adjacent pins on the chip package.

A Sun legal approved Customer Letter can be found at the following URL:

 http://sdpsweb.EBay/FIN_FCO/FCO/FCO_A0199-1_Dir/CustomerLetter.sxw

Note: To view document click on the above URL, then save to your local
     disk using your Netscape 'file' button and select 'save as', then
     open file locally using StarOffice.

All field RSLs were purged via DPCO 312 starting on June 19, 2002.
Corrective action was made available by releasing patch 111535-03
on September 24, 2002.
Parts Affected: 
July 30, 2003
Implementation: 
---
|   |   MANDATORY (Fully Pro-Active)
 ---

 ---
| X |   CONTROLLED PRO-ACTIVE (per Sun Geo Plan)
 ---

 ---
|   |   UPON FAILURE
 ---
Replacement Time Estimate: 
2 hours
Special Considerations: 
This FCO will have a timezone phased release based on material
readiness as follows:

US/Canada       August 30, 2002
EMEA            September 11, 2002
Ltn America     September 11, 2002
APac/Japan      September 18, 2002


IMPORTANT: Implementation of this FCO must be performed in two steps:

1. Controlled Proactive Patch installation - PatchID 111535-03 (or later).
   
2. Replacement of failed disks.  Only disks identified as faulty
   by the installed patch should be replaced.
   
   Proactive swap of disk drives is NOT authorized by this FCO.

 ###########################################################

     ***  Seagate Cheetah 3 Disk Drive FW Install  ***

 ###########################################################

The FCO corrective action consists of a short term immediate action, and a
longer term monitoring action.  Prior to performing the firmware install
ensure you:

        A. Read the complete FCO Instructions

        B. Understand the disk drive reliability screen

        C. Have advised your customer of possible results from installing
           the firmware provided via this FCO

        D. Are prepared, from a time and material perspective (meaning the
           customer has provided an adequate maintenance window, and you
           have staged a sufficient number of spares on site) prior to
           implementing the on-site firmware upgrade.

+ Short Term Immediate Action:
  ----------------------------

OVERVIEW:

        A. PREPARE FOR DISK DRIVE FW UPGRADE:

                A1. IDENTIFY SUSPECT DRIVE POPULATION AT CUSTOMER SITE
                A2. CALCULATE NUMBER OF SPARES REQUIRED TO SUPPORT INSTALL
                    OF HEALTHCHECK SCRIPT AND FIRMWARE

        B. PLAN/PREPARE FOR FIRMWARE INSTALL

                B1. INSTALL NEW FIRMWARE
                B2. IDENTIFY ANY LATENT (HIDDEN) FAILED DRIVES
                B3. SWAP ANY FAILED DRIVES WITH SPARE DRIVES
                B4. RERUN THE FW SCRIPT TO UPDATE THE FW OF ANY REPLACED DRIVES
                B5. VERIFY PROPER OPERATION OF SYSTEM. 


+ Long Term Monitoring Action:
  ---------------------------

        A. MONITOR THE SITE FOR CHEETAH 3 DRIVE FAILURES
        B. IF FAILURE RATE REMAINS ABOVE EXPECTED LEVELS ESCALATE SERVICE ORDER
           PER LOCAL GUIDELINES (reference #6 for failure rates)

NOTE:
-----

The firmware provides several fixes, including two fixes for 'watch dog reset'
or drive spin down failures.  The FW also includes an on-drive function
to exercise flash ROM chip pins during drive idle periods.  This fix
provides frequent signals across pin pairs, thus reducing or eliminating
contributing factors to the pin-to-pin copper migration failure mechanism.

Engineering review of the initial customer sites reporting failed
Cheetah 3 disk drives revealed the following:

        - Engineering failure analysis showed that Cheetah 3 drives with
          failed flash ROM chips ALL displayed a similar drive usage pattern:

                - Idle time followed by;
                - A sudden increase in I/O activity.

        - Each customer site contained a small percentage of drives that
          had been in an idle mode.

        - Idle drives can be the function of:

                - data base access; the data is there, but is not accessed
                - striping format and read/writes
                - test systems

Due to the engineering findings:

        - Be advised that your customer may have suspect drives.
        - Drives with the highest probability of failing due to this mechanism
          are drives that have been sitting powered on and idle.
        - Installing the FW will "wake-up" idle drives by performing thousands
          of simulated I/Os.  Running these I/Os will reveal any Seagate 
          Cheetah 3 drive sitting in an idle-failed state, and will also 
          "shake-out" marginal drives.
        - Before running the Firmware script it is critical that you 
          evaluate the site, complete analysis, and have an appropriate number 
          of spares on site when performing the firmware install.

NOTE:

The FW provided via this FCO will "wake up" and fail disk drives that
have been in an idle-failed state.  Likewise the "wake up" algorithm
is designed to shake out "latent" failures (that is, failed drives
that have not yet reported failures). The very act of "waking up" the
drive population and downloading new firmware will, in addition to
identifying latent failed drives, accelerate the failure of those
drives that are marginal. While no definite time line to failure has
been identified for these "marginal" drives, there will be a period of
time during which drives will continue to fail. The rate at which they
fail will be significantly lower than the pre-FCO failure rate, but
still above expected norms. This rate could last for a few months as
the accelerated failures appear. The failure rate is then expected to
return to a more normal level for the remainder of the disk
population's life.


A. PREPARE FOR DISK DRIVE RELIABILITY SCREEN:

A1. IDENTIFY SUSPECT DRIVE POPULATION AT CUSTOMER SITE
------------------------------------------------------------------------

            ***  Seagate Cheetah 3 Disk Drive FW install  ***

------------------------------------------------------------------------
#1.  Define the customer's install base by part number.  Use explorer or
     T3 extractors to identify the part numbers.

Product/Model         Part Number  Capacity  Drive Model  Quantity   Months of
                                                                     Service
--------------------  -----------  --------  -----------  ---------  ---------
Servers/E3500         540-3869-01  (9.1GB)   ST39103FC    _________  _________
Servers/E3500         540-4191-01  (18.2GB)  ST318203FC   _________  _________
Servers/E3500         540-4673-01  (18.2GB)  ST318203FC   _________  _________
Servers/280R          540-4191-01  (18.2GB)  ST318203FC   _________  _________
Servers/SB 1000/2000  540-4673-01  (18.2GB)  ST318203FC   _________  _________
A5000 Array           540-4192-01  (36.4GB)  ST136403FC   _________  _________
A5100 Array           540-4192-01  (36.4GB)  ST136403FC   _________  _________
A5200 Array           540-3869-01  (9.1GB)   ST39103FC    _________  _________
T3                    540-4440-01  (18.2GB)  ST318203FC   _________  _________
T3                    540-4367-01  (36.4GB)  ST136403FC   _________  _________

    NOTE: You must run explorer script to identify all Seagate drives.
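
Once the drive models and firmware revisions have been collected, the
suspect check can be sketched as below.  This is an illustrative sketch
only, not a Sun utility: the inventory list and the function name
is_suspect are hypothetical, and comparing firmware revisions as
hexadecimal numbers is an assumption based on the "< DF4A" entries in the
Parts Affected table.

```python
# Models and firmware threshold taken from the Parts Affected table.
AFFECTED_MODELS = {"ST39103FC", "ST318203FC", "ST136403FC"}
FIXED_FIRMWARE = "DF4A"


def is_suspect(model, firmware):
    """Return True for an affected model whose firmware is below DF4A.

    Assumption: revisions compare correctly as hexadecimal numbers.
    """
    if model.strip().upper() not in AFFECTED_MODELS:
        return False
    return int(firmware.strip(), 16) < int(FIXED_FIRMWARE, 16)


# Hypothetical inventory of (model, firmware) pairs, for example
# transcribed by hand from explorer or T3 extractor output.
inventory = [
    ("ST318203FC", "D94A"),   # affected model, older firmware
    ("ST136403FC", "DF4A"),   # affected model, already at DF4A
    ("ST39103FC", "D94A"),    # affected model, older firmware
    ("OTHER-MODEL", "0000"),  # hypothetical drive not covered by this FCO
]
suspects = [drive for drive in inventory if is_suspect(*drive)]
```

Here two of the four hypothetical drives would be counted in the suspect
population.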


A2. CALCULATE NUMBER OF SPARES REQUIRED TO SUPPORT INSTALL OF DRIVE FIRMWARE
-----------------------------------------------------------------------------



Part Number       Model   Drive        Forecast Factor  Results:
                          Population   Failures %       Drives required on
                                                        site to install new
                                                        firmware
-----------     --------  -----------  --------------   --------------

540-3869-01/    (9.1GB)    _________    2%              _____________
ST39103FC
540-4191-01/    (18.2GB)   _________    2%              _____________
ST318203FC
540-4673-01/    (18.2GB)   _________    2%              _____________
ST318203FC
540-4192-01/    (36.4GB)   _________    2%              _____________
ST136403FC
540-4440-01/    (18.2GB)   _________    2%              _____________
ST318203FC
540-4367-01/    (36.4GB)   _________    2%              _____________
ST136403FC

FORMULA: (Drive Population) x (Forecast Factor Failures) = Drives required for
                                                           on site

EXAMPLE:

Customer has 400 9GB drives, 300 18GB drives, and 500 36GB drives.

Part Number       Model   Drive        Forecast Factor  Results:
                          Population   Failures         Drives required on
                                                        site to install new
                                                        firmware
------------   --------   ------        ----------      -----------------

540-3869-01     (9.1GB)    400          2% or .02                8

540-4440-01     (18.2GB)   300          2% or .02                6

540-4367-01     (36.4GB)   500          2% or .02               10

in this case:

(Drive Population) x (Forecast Factor Failures) = Drives required on-site

        400        x    .02             =                        8

        300        x    .02             =                        6

        500        x    .02             =                       10
                                                        -------------------

                Total number of spares to have on-site          24

Advise your customer that these results equal the forecasted number of
marginal state disk drives at the site.  This is the number of
failed drives that may be discovered when installing the new firmware.
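
The spares calculation above can be sketched in Python as follows.  This
is a worked illustration of the FORMULA using the populations from the
EXAMPLE; the function name spares_required is hypothetical, and rounding
up when the result is not a whole number is an assumption (the FCO example
only uses populations that give whole numbers).

```python
FORECAST_PERCENT = 2  # 2% forecast failure factor from the worksheet


def spares_required(population, percent=FORECAST_PERCENT):
    """Drive population x forecast factor, rounded up to whole drives."""
    # Integer ceiling division avoids floating point rounding surprises.
    return (population * percent + 99) // 100


# Drive populations from the EXAMPLE above, keyed by part number.
populations = {
    "540-3869-01": 400,  # 9.1GB
    "540-4440-01": 300,  # 18.2GB
    "540-4367-01": 500,  # 36.4GB
}
per_part = {part: spares_required(count) for part, count in populations.items()}
total_spares = sum(per_part.values())
```

For the example site this gives 8, 6, and 10 spares per part number, for a
total of 24 spares on-site.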


B. PLAN/PREPARE FOR FIRMWARE INSTALL


B1. INSTALL NEW FIRMWARE - PatchID 111535-03 (or latest revision)
-----------------------------------------------------------------

#1. Ensure the number of spares required to support the firmware install
    is available and on-site.

#2. Install Cheetah 3 disk drive FW.

Reference: Patch-ID# 111535-03 (or latest revision)
Keywords:  ST39103FC ST318203FC ST136403FC 9GB 18GB 36GB disk
           firmware FC-AL

B2. IDENTIFY ANY LATENT (HIDDEN) FAILED DRIVES / MONITOR FOR ERRORS
-------------------------------------------------------------------

#3. Monitor systems for reported drive failures.

B3. SWAP ANY FAILED DRIVES WITH SPARE DRIVES
--------------------------------------------

#4. Replace failed drives as required.


B4. RERUN THE FW SCRIPT TO UPDATE THE FW OF ANY REPLACED DRIVES
---------------------------------------------------------------

Reference: Patch-ID# 111535-03 (or latest revision)
Keywords:  ST39103FC ST318203FC ST136403FC 9GB 18GB 36GB disk
           firmware FC-AL
           
Only required for any replaced drives.


B5. VERIFY PROPER OPERATION OF SYSTEM
-------------------------------------

#5. Restore target configuration and data (if necessary), and verify
    proper operation of system.
    
FW INSTALLATION NOTES:

When the FW script is finished, you can examine the log file:
        /111535-03/fco/logs.
        
Looking at the log file you may discover unexpected error messages.
Don't be alarmed. One of the scripts used by this patch uses adb to set
the ssd_error_level system variable so that more messages are logged
temporarily. This results in the system capturing all sense key messages.
The script turns debugging off again, so you will not see these errors
under normal conditions.  See bug 4758975 for more info.

Since the error level is turned on to report every sense key, you have
to filter errors out. For information on which errors you should pay
attention to, refer to the following online document:

http://webhome.central/harish/cgi-bin/help.cgi?SEARCH_STRING=cdb+failed+decoding+scsi+sense&GoBtn=GO

Looking at the above link, you can decode sense data.

- Sense keys 1 & 2 are informational-only sense key messages.

- Sense key 3 & 4 are more severe, especially sense key 4 (Hardware error).
  Also look at the Error level: Fatal or Retryable.
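
The filtering rule above can be sketched as follows.  This is a minimal
sketch under stated assumptions: the regular expression matches the T3
style "Sense Key = 0x..." text shown in the Example Error Messages, the
actual layout of the patch log file is not documented here and may need a
different pattern, and the names classify and severe_lines are
hypothetical.

```python
import re

# Assumed line format, taken from the T3 sample message in this FCO.
SENSE_RE = re.compile(r"Sense Key = 0x([0-9a-fA-F]+)")


def classify(sense_key):
    """Map a numeric sense key to the severity described in this FCO."""
    if sense_key in (1, 2):
        return "info"      # info only per this FCO
    if sense_key in (3, 4):
        return "severe"    # 4 is Hardware error
    return "other"         # not discussed in this FCO


def severe_lines(lines):
    """Return only the lines whose sense key is 3 or 4."""
    hits = []
    for line in lines:
        match = SENSE_RE.search(line)
        if match and classify(int(match.group(1), 16)) == "severe":
            hits.append(line)
    return hits


sample = [
    # Copied from the Example Error Messages section.
    "Mar 23 23:01:16 ISR1[2]: W: Sense Key = 0x2, Asc = 0x4, Ascq = 0x2",
    # Hypothetical line, invented for illustration only.
    "Mar 23 23:05:00 ISR1[2]: W: Sense Key = 0x4, Asc = 0x44, Ascq = 0x0",
]
to_review = severe_lines(sample)
```

Only the hypothetical sense key 4 line is kept for review; the sense key 2
line is filtered out as informational.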

You should identify disks that have sense key 4's. Since these errors
are seen during the upgrade process, we can't tell if they are
pre-upgrade or post-upgrade.  After completing the upgrade, run dd or
dex to do some heavy I/O to these disks and check whether the same
disks report any sense key 3 or 4. Since the extra error logging is
turned off by then, you will not see extra errors, i.e. failed CDB.
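
The FCO names dd or dex for this I/O exercise.  Purely as an illustration
of the idea, a read-only sequential pass can be sketched in Python as
below.  It is a stand-in, not a replacement for dd or dex: the function
name read_exercise is hypothetical, no device path is given here because
that is site specific, and the demonstration runs against a temporary file
rather than a disk.

```python
import os
import tempfile


def read_exercise(path, chunk_size=1024 * 1024, max_bytes=None):
    """Sequentially read from path; return (bytes_read, error_or_None)."""
    total = 0
    try:
        with open(path, "rb", buffering=0) as dev:
            while max_bytes is None or total < max_bytes:
                want = chunk_size
                if max_bytes is not None:
                    want = min(chunk_size, max_bytes - total)
                chunk = dev.read(want)
                if not chunk:
                    break  # end of file or device
                total += len(chunk)
    except OSError as exc:
        # An I/O error here is the point at which the system logs
        # should be checked for sense key 3 or 4 on the same disk.
        return total, exc
    return total, None


# Demonstration against a temporary 256 KiB file, not a real disk.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * (256 * 1024))
    demo_path = tmp.name
demo_bytes, demo_error = read_exercise(demo_path, chunk_size=64 * 1024)
partial_bytes, _ = read_exercise(demo_path, chunk_size=64 * 1024,
                                 max_bytes=100000)
os.remove(demo_path)
```

The sketch only reads, so it does not modify data; whether a read-only pass
is sufficient load for a given site is a judgment left to the engineer.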

Now use the procedures described in this FCO to determine if this site
is still having drive issues.


+ Long Term Monitoring Action:
  ---------------------------

#1. MONITOR THE SITE FOR CHEETAH 3 FCAL DRIVE FAILURES.

#2. SWAP DRIVES (IF REQUIRED).

#3. SUBMIT DRIVES FOR CPAS AND OPEN AN ESCALATION IF FAILURE
    RATE REMAINS ABOVE EXPECTED LEVELS.

           Use the table below to evaluate if drive failure
           rates are above expectations:
           
                |----------------|----------|
                | # of drives    | 12 months|
                |  on site       |          |
                |================|==========|
                |     300        |    6     |
                |----------------|----------|
                |     400        |    8     |
                |----------------|----------|
                |     500        |   10     |
                |----------------|----------|
                |    1,000       |   20     |
                |----------------|----------|
                |    1,500       |   30     |
                |----------------|----------|
                |total  failures |          |
                |----------------|----------|
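
Every row of the table above works out to 2% of the drives on site over 12
months.  A minimal sketch of the comparison follows; treating 2% as the
general rule for populations not listed in the table is an assumption
inferred from those rows, and the function names are hypothetical.

```python
# 2% per 12 months, inferred from the table rows (6/300, 8/400, ...).
EXPECTED_PERCENT_PER_12_MONTHS = 2


def expected_failures(drives_on_site):
    """Expected number of failures in 12 months for a given population."""
    return drives_on_site * EXPECTED_PERCENT_PER_12_MONTHS / 100


def above_expected(drives_on_site, failures_in_12_months):
    """True if the observed 12 month failure count exceeds the table value."""
    # Integer comparison, equivalent to failures > 2% of drives.
    return failures_in_12_months * 100 > (
        drives_on_site * EXPECTED_PERCENT_PER_12_MONTHS
    )


# Rows from the table above: (drives on site, 12 month value).
table_rows = [(300, 6), (400, 8), (500, 10), (1000, 20), (1500, 30)]
reproduces_table = all(
    expected_failures(drives) == failures for drives, failures in table_rows
)
```

For example, a site with 400 drives and 9 failures in 12 months would be
above the table value of 8 and would qualify for escalation under step #3.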
Corrective Action: 
IMPORTANT! Please follow the Disk Drive FW Install process
listed under the Special Consideration section of this FCO
prior to implementing any other corrective action.

Implementation of this FCO must be performed in two steps:

1. Controlled Proactive Patch installation - PatchID 111535-03 (or later).
   
2. Replacement of failed disks.  Only disks identified as faulty
   by the installed patch should be replaced.
   
*** Proactive swap of disk drives is NOT authorized by this FCO. ***

Upon failure, replace as follows:

replace 540-3869-01 with 540-3869-01 (having DPCO 312 label)
replace 540-4191-01 with 540-4191-01 (having DPCO 312 label)
replace 540-4673-01 with 540-4673-01 (having DPCO 312 label)
replace 540-4192-01 with 540-4192-01 (having DPCO 312 label)
replace 540-4440-01 with 540-4440-01 (having DPCO 312 label)
replace 540-4367-01 with 540-4367-01 (having DPCO 312 label)
Comments: 
SECURE SITE ACTIVITY:

Please follow standard, local procedures for Secure Site replacements.
Billing Type: 
Warranty: Sun will provide parts at no charge under Warranty
           Service. On-Site Labor Rates are based on how the
           system was initially installed.

 Contract: Sun will provide parts at no charge. On-Site Labor Rates
           are based on the type of service contract.

 Non Contract: Sun will provide parts at no charge. Installation by
               Sun is available based on the On-Site Labor Rates
               defined in the Price List.

--------------------------------------------------------------------------
Implementation Footnote: 
________________________

i)   In case of Mandatory FCOs, Sun Services will attempt to contact
      all known customers to recommend the part upgrade.

ii)  For controlled proactive swap FCOs, Sun Services mission critical
     support teams will initiate proactive swap efforts for their respective
     accounts, as required.

iii) For Replace upon Failure FCOs, Sun Services partners will implement
     the necessary corrective actions as and when they are required.

--------------------------------------------------------------------------

All released FINs and FCOs can be accessed using your favorite network
browser as follows:

SunWeb Access:
______________

* Access the top level URL of http://sdpsweb.Central/FIN_FCO/

* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.

SunSolve Online Access:
_______________________

* Access the SunSolve Online URL at http://sunsolve.Central/

* From there, select the appropriate link to browse the FIN or FCO index.

Internet Access:
_______________

* Access the top level URL of https://spe.sun.com

--------------------------------------------------------------------------
General:
________

Send questions or comments to finfco-manager@sun.com

---------------------------------------------------------------------------