Document Audience:	INTERNAL
Document ID:	A0222-1
Title:	Seagate Cheetah 4 FCAL and SCSI disk drives may experience higher than expected failure rate.
Copyright Notice:	Copyright © 2007 Sun Microsystems, Inc. All Rights Reserved
Update Date:	Wed Jan 04 00:00:00 MST 2006

__________________________________________________________________

***  Sun Confidential:  Internal Use and Authorized VARs Only  ***
__________________________________________________________________

This message including any attachments is confidential information
of Sun Microsystems, Inc.  Disclosure, copying or distribution is
prohibited without permission of Sun.  If you are not the intended
recipient, please reply to the sender and then delete this message.
__________________________________________________________________

                             FIELD CHANGE ORDER
            (For Authorized Distribution by Enterprise Services)

FCO #: A0222-1

Status: inactive

Synopsis: Seagate Cheetah 4 FCAL and SCSI disk drives may experience higher than expected failure rate.

Date: Jan/04/2006

SunAlert: No

Top FIN/FCO Report: No

Products Reference: Seagate Cheetah 4 FCAL and SCSI disk drives

Product Category: Storage / Disk

Product Affected:

Systems Affected:

Mkt_ID   Platform   Model   Description        
------   --------   -----   -----------        
-        ALL        -       System Platform Independent     


X-Options Affected:

Mkt_ID      Platform         Model    Description       
------      --------         -----    -----------	
-           T3                ALL     StorEdge T3              
-           T3+               ALL     StorEdge T3+              
-           D240              ALL     StorEdge D240             
-           D1000             ALL     StorEdge D1000            
-           MultiPack         ALL     StorEdge MultiPack        
-           UniPack           ALL     StorEdge UniPack          
-           st A1000/D1000    ALL     Netra st A1000/D1000      
-           ct 400/800        ALL     Netra ct 400/800          
-           st D130           ALL     Netra st D130             
-           A5100             ALL     StorEdge A5100            
-           A5200             ALL     StorEdge A5200            
-           A5000             ALL     StorEdge A5000

Parts Affected:

Part Number  Description  Model       Type  Vendor   Firmware
-----------  -----------  ----------  ----  -------  --------
540-4519-XX  73GB FCAL    ST173404FC  Disk  Seagate  A726
540-4191-XX  18GB FCAL    ST318304FC  Disk  Seagate  0726
540-4673-XX  18GB FCAL    ST318304FC  Disk  Seagate  0726
540-4440-XX  18GB FCAL    ST318304FC  Disk  Seagate  A726
540-4177-XX  18GB SCSI    ST318404LC  Disk  Seagate  4203
540-4178-XX  18GB SCSI    ST318404LC  Disk  Seagate  4203
540-4401-XX  18GB SCSI    ST318404LC  Disk  Seagate  4203
390-0038-XX  18GB SCSI    ST318404LC  Disk  Seagate  4203
540-4367-XX  36GB FCAL    ST336704FC  Disk  Seagate  A726
540-4525-XX  36GB FCAL    ST336704FC  Disk  Seagate  0726
540-3881-XX  9GB SCSI     ST39204LC   Disk  Seagate  4203
390-0037-XX  9GB SCSI     ST39204LC   Disk  Seagate  4203
540-3966-XX  9GB SCSI     ST39204LC   Disk  Seagate  4203

References:

   ESC:  544378, 543716, 543461
   FCO:  A0199-2
   PatchID:  113667-01 or later
   PatchID:  113668-04 or later
   DPCO:  390

Issue Description:

Seagate Cheetah 4 FCAL and SCSI disk drives may experience a higher than
expected failure rate.  The observed failure mode is Offline or Not Ready.

The typical scenario involves drives that go into an "idle" state with power
applied.  A small percentage of these drives may have experienced this failure
already, but it will not be evident until a data transfer is initiated.  Until
then the failed drive appears to be working and may even respond to a Test
Unit Ready command.

Note: Due to the operating profile of internal system drives (minimal
      idle time), these drives are minimally exposed to this issue.


The following are typical T3 Purple Error Messages:

  May 07 22:42:29 ISR1[1]: N: u1d6 SVD_RETRY: Retries Exhausted (ccb =
  0x1384034)
  May 07 22:42:29 ISR1[1]: N: u1d6 SVD_DONE: Command Error = 0x4
  May 07 22:42:29 ANNT[1]: W: u1d6: Failed
  May 07 22:42:31 LPCT[1]: N: u1d6: Bypassed on loop 1
  May 07 22:42:31 ISR1[1]: N: u1d6 SVD_RETRY: Retries Exhausted (ccb =
  0x13918f4)
  May 07 22:42:31 ISR1[1]: N: u1d6 SVD_DONE: Command Error = 0x4
  May 07 22:42:31 snmp[1]: N: u1d6 ioctl disk failed err=1
  May 07 22:42:31 LPCT[1]: N: u1d6: Bypassed on loop 2

  Mar 23 23:01:16 ISR1[2]: N: u1d7 SVD_DONE: Command Error = 0x3
  Mar 23 23:01:16 ISR1[2]: W: u1d7 SCSI Disk Error Occurred (path = 0x0)
  Mar 23 23:01:16 ISR1[2]: W: Sense Key = 0x2, Asc = 0x4, Ascq = 0x2
  Mar 23 23:01:16 ISR1[2]: W: Sense Data Description = Logical Unit Not Ready,

The following are typical A5x00 Photon error messages:

  socal3: port 0: Fibre Channel is OFFLINE
  WARNING: /sbus@b,0/SUNW,socal@0,0/sf@1,0/ssd@w2100002037aef8c8,0 (ssd164):
  SCSI transport failed: reason 'tran_err': retrying command

  socal3: port 1: Fibre Channel is OFFLINE
  WARNING: /sbus@3,0/SUNW,socal@0,0/sf@1,0/ssd@w21000020378837f2,0 (ssd171):
  SCSI transport failed: reason 'timeout': giving up

  May 20 10:59:58 gmi2wap011 scsi: [ID 107833 kern.notice] Sense Key: Not
  Ready
  May 20 10:59:58 gmi2wap011 scsi: [ID 107833 kern.notice] ASC: 0x4 (), ASCQ:
  0x1, FRU: 0x2


The root cause of this issue is low-resistance, intermittent shorts
between adjacent pins in flash ROM memory chips that use a red phosphorus
inorganic fire retardant molding compound.  These shorts are caused by
electro-migration of metal across adjacent pins of these memory chips.

All field RSLs were purged via DPCO 390 starting on May 29, 2003.  Corrective
action was made available by releasing patches 113668-04 (or later) and
113667-01 (or later).

Implementation:

 ---
|   |   MANDATORY (Fully Pro-Active)
 ---

 ---
|   |   CONTROLLED PRO-ACTIVE (per Sun Geo Plan)
 ---

 ---
| X |   UPON FAILURE
 ---

Replacement Time Estimate:

2.0 hours

Special Considerations:

A. How to determine potential customer exposure:

Not all failures can be tied to this failure mode.  Prior to implementing
any fix, the account team must determine two factors:

  1. Determine that the customer is seeing a higher than expected
     replacement/pull rate.

  2. Determine the failure mode for suspect failed drives via the
     CPAS process. Submit no more than five (5) drives for this process
     per CPAS.

B. Assuming your customer is seeing a higher than expected failure rate,
   and CPAS results indicate failures are due to failed Atmel chips, this
   FCO explains how to install new firmware to address the issue.

The magnitude of exposure for this failure mechanism is very small and
customers may not see it at all.  The drive's specification for AFR (Annual
Failure Rate) is 1.10% which equals 800K hours MTBF, and the drive's AFR is
currently measured in the field at just over 1.10%.
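
The relationship between the 800K hours MTBF and the 1.10% AFR figure can be
checked with the usual approximation AFR = power-on hours per year / MTBF.
The following Python sketch is illustrative only.  The function name and the
8,760 hours-per-year constant (continuous power-on) are assumptions made for
this example and are not part of this FCO.

```python
HOURS_PER_YEAR = 8760  # assumption: drive powered on 24 hours x 365 days


def afr_from_mtbf(mtbf_hours):
    """Approximate annual failure rate (as a fraction) for a given MTBF.

    Uses AFR ~= hours per year / MTBF, which is a reasonable approximation
    when the MTBF is much larger than one year of power-on time.
    """
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return HOURS_PER_YEAR / mtbf_hours


if __name__ == "__main__":
    afr = afr_from_mtbf(800_000)
    print(f"Approximate AFR for 800K hours MTBF: {afr * 100:.3f}%")
```

The result is just under 1.1%, which is consistent with the specification
quoted above.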

As such, we are recommending that the account team install FCO firmware only
after CPAS results indicate drives are failing due to Atmel chip failures.

The following recommendation is provided as a guideline for authorized
Enterprise Services Field Representatives who may encounter the above
mentioned issue.

Please adhere to the following guidelines:

Proceed only if the customer site is seeing Unit Not Ready or Not Ready
failures AND reporting a high replacement/pull rate of the above disk drive
part numbers (Seagate Cheetah 4 only).

1. If the site is experiencing 'Drive Not Ready' along with a high
   pull rate, route all failed disk drives through the Sun Customer
   Quality CPAS (Corrective & Preventive Action System) process for
   failure analysis.  The CPAS request form must be filled out
   completely and accompanied by explorer scripts (reference Note
   below). Submit no more than five (5) drives for this process.

   Information on the CPAS Program can be found at the following webpage:

      http://gsops.central/mcrcca/CPAS/

2. Open an escalation with PTS to establish an action plan.

The Action Plan is to install drive specific Firmware only if the following
conditions are triggered:

For all products:

   a) Through the CPAS process, the root cause is identified as the
      "Atmel chip" failure mode.

NOTE:  Open a CPAS request only if the following criteria are met.  Include
this information in the CPAS request.  CPAS requests without this failure
information will be rejected.

   b) High Pull rate and DNR "drive not ready" being experienced.

NOTE:  A high pull rate is defined as greater than 3% Annual Pull Rate.
Annual Pull Rate is calculated by projecting at least 3 months of failure
data onto an annual basis.  For example, 15 drives are pulled for
replacement over a three month period.  This equals an average of 5 drives
per month.  Multiply the monthly average by 12; thus 5 x 12 = 60.

   c) Compare expected to forecasted.  Example, forecasted = 60 drive pulls.

To determine the expected number of pulls at a 3% annual pull rate, multiply
the total Cheetah 4 population by 3%.  Example, install base = 1,000 Cheetah 4
drives.  1,000 x 3% = 30.  This is the expected annual number of pulls.

   d) Compare expected with actual.

In this example, we would expect 30 and have forecasted 60.  The site is
experiencing a high pull rate: 60 versus an expected 30.
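
The pull rate comparison above can be sketched in Python as follows.  This is
a minimal illustration of the arithmetic only; the function and parameter
names are hypothetical, and the 3% threshold and the 3 month minimum are
taken from the NOTE above.

```python
def forecast_annual_pulls(pulls, months):
    """Project observed drive pulls onto a 12 month basis."""
    if months < 3:
        # The FCO asks for at least 3 months of failure data.
        raise ValueError("need at least 3 months of failure data")
    return pulls / months * 12


def expected_annual_pulls(install_base, threshold=0.03):
    """Pulls per year that correspond to the threshold pull rate."""
    return install_base * threshold


def is_high_pull_rate(pulls, months, install_base, threshold=0.03):
    """True if the forecasted annual pulls exceed the expected number."""
    forecast = forecast_annual_pulls(pulls, months)
    return forecast > expected_annual_pulls(install_base, threshold)


if __name__ == "__main__":
    # Example from this FCO: 15 pulls in 3 months, 1,000 drives installed.
    print("forecasted:", forecast_annual_pulls(15, 3))
    print("expected:  ", expected_annual_pulls(1000))
    print("high pull rate:", is_high_pull_rate(15, 3, 1000))
```

Using the same numbers as the example above, the forecast is 60 pulls against
an expected 30, so the site qualifies as having a high pull rate.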


Additional Notes:

The FW provided via this FCO will "wake up" and fail disk drives that have
been in an idle-failed state.  Likewise the "wake up" algorithm is designed to
shake out "latent" failures (that is, failed drives that have not yet reported
failures).  The very act of "waking up" the drive population and downloading
new firmware will, in addition to identifying latent failed drives, accelerate
the failure of those drives that are marginal.  While no definite time line to
failure has been identified for these "marginal" drives, there will be a
period of time during which drives will continue to fail.  The rate at which
they fail will be significantly lower than the pre-FCO failure rate, but still
above expected norms.  This rate could last for a few months as the
accelerated failures continue to appear.  The failure rate is then expected to
return to a more normal level for the remainder of the disk population's life.


---------------------------------
PLAN/PREPARE FOR FIRMWARE INSTALL
---------------------------------

A. INSTALL NEW FIRMWARE - PatchIDs

   1. Ensure that the number of spares required to support the firmware
      install is available on-site (plan for 2% fallout).

   2. Install Cheetah 4 disk drive FW as follows:

      FCAL: 113668-04 (or later)
      SCSI: 113667-01 (or later)
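
For step 1, the number of spares implied by the 2% fallout figure can be
estimated as shown below.  This Python sketch is only a planning aid under
the assumption that a fractional drive is rounded up to a whole spare; the
function name is hypothetical and the rounding rule is not stated in this
FCO.

```python
def spares_needed(drive_count, fallout_percent=2):
    """Whole spares to stage for a firmware install.

    Integer ceiling division avoids floating point rounding surprises.
    Rounding up is an assumption made for this sketch.
    """
    if drive_count < 0:
        raise ValueError("drive_count must not be negative")
    return (drive_count * fallout_percent + 99) // 100


if __name__ == "__main__":
    for count in (120, 500, 1000):
        print(count, "drives ->", spares_needed(count), "spares")
```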

B. IDENTIFY ANY LATENT (HIDDEN) FAILED DRIVES / MONITOR FOR ERRORS

Monitor systems for reported drive failures and replace failed drives as
required.  After replacing a drive, rerun the FW script to update the
firmware of the replaced drive.

The rerun is required only for replaced drives.

C. VERIFY PROPER OPERATION OF SYSTEM

Restore target configuration and data (if necessary), and verify proper
operation of system.


---------------------------
FIRMWARE INSTALLATION NOTES
---------------------------

When the FW script is finished you can examine the log file:

   /"patch id"/fco/logs

Looking at the log file you may discover unexpected error messages.  Don't be
alarmed.  One of the scripts used by this patch uses adb to set the
ssd_error_level system variable so that more messages are logged temporarily.
This results in the system capturing all sense key messages.  The script then
turns this debugging off, so you will not see these errors under normal
conditions.  See bug 4758975 for more info.

Since the error level is turned on to report every sense key, you have to
filter errors out.  For information on which errors you should pay attention
to, refer to the following online document:

   
   http://webhome.central/harish/cgi-bin/help.cgi?SEARCH_STRING=cdb+failed+decoding+scsi+sense&GoBtn=GO

Looking at the above link, you can decode sense data.

   - Sense keys 1 & 2 are informational only.

   - Sense keys 3 & 4 are more severe, especially sense key 4 (Hardware error).
     Also look at the Error level: Fatal or Retryable.

You should identify disks that report sense key 4.  Since these errors are
seen during the upgrade process, we can't tell if they are pre-upgrade or
post-upgrade.  After completing the upgrade, run dd or dex to do some heavy
IO to these disks and see if the same disks report any sense key 3 or 4
errors.  Since the extra error logging is turned off by then, you will not
see extra errors, i.e. failed CDB.
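
Filtering a log for the more severe sense keys can be sketched in Python as
shown below.  This is a hedged illustration, not a Sun-supplied tool.  It
assumes that log lines carry the sense key either in the numeric form seen
in the T3 messages earlier in this FCO ("Sense Key = 0x2") or as a name
after "Sense Key:".  The name table is an assumption and may need adjusting
to match the text your logs actually contain.

```python
import re
import sys

# Sense keys this FCO calls out as more severe.
SEVERE_KEYS = {3, 4}

# Assumption: textual sense key names as they might appear in host logs.
NAME_TO_KEY = {
    "medium error": 3,
    "media error": 3,
    "hardware error": 4,
}

# Matches "Sense Key = 0x2" as well as "Sense Key: <name>".
SENSE_RE = re.compile(r"Sense Key\s*[=:]\s*(0x[0-9A-Fa-f]+|[A-Za-z ]+)")


def sense_key(line):
    """Return the numeric sense key found in a log line, or None."""
    match = SENSE_RE.search(line)
    if match is None:
        return None
    value = match.group(1).strip().lower()
    if value.startswith("0x"):
        return int(value, 16)
    return NAME_TO_KEY.get(value)  # None for names not in the table


def severe_lines(lines):
    """Return (sense_key, line) pairs for lines reporting key 3 or 4."""
    hits = []
    for line in lines:
        key = sense_key(line)
        if key in SEVERE_KEYS:
            hits.append((key, line.rstrip("\n")))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the path of the log file to examine as the first argument.
    with open(sys.argv[1], errors="replace") as log:
        for key, line in severe_lines(log):
            print(f"sense key {key}: {line}")
```

Lines reporting sense keys 1 and 2 are skipped, which leaves only the entries
that the guidance above says to pay attention to.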

Now use the procedures described in this FCO to determine if this site is
still having drive issues.

---------------------------
Long Term Monitoring Action
---------------------------

A. MONITOR THE SITE FOR CHEETAH 4 DRIVE FAILURES.

B. SWAP DRIVES (IF REQUIRED).

C. SUBMIT DRIVES FOR CPAS AND OPEN AN ESCALATION IF FAILURE
   RATE REMAINS ABOVE EXPECTED LEVELS.

Use the table below to evaluate if drive failure rates are above expectations:

                 |----------------|----------|
                 | # of drives    | 2 months |
                 |  on site       |          |
                 |================|==========|
                 |     300        |    9     |
                 |----------------|----------|
                 |     400        |   12     |
                 |----------------|----------|
                 |     500        |   15     |
                 |----------------|----------|
                 |    1,000       |   30     |
                 |----------------|----------|
                 |    1,500       |   45     |
                 |----------------|----------|
                 |total  failures |          |
                 |----------------|----------|
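
Each value in the table equals 3% of the number of drives on site.  For
populations that are not listed, the same figure can be derived as in the
Python sketch below.  That the 3% relationship extends to unlisted
populations is an inference from the listed rows, not a statement made by
this FCO, and the function name is hypothetical.

```python
def table_threshold(drives_on_site):
    """Failure count matching the table rows (3% of drives on site)."""
    if drives_on_site < 0:
        raise ValueError("drives_on_site must not be negative")
    return drives_on_site * 3 / 100


if __name__ == "__main__":
    # Reproduce the rows listed in the table above.
    for drives in (300, 400, 500, 1000, 1500):
        print(f"{drives:>6} drives -> {table_threshold(drives):g}")
```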

Corrective Action:

IMPORTANT! Please follow the Disk Drive FW Install process
listed above prior to implementing any other corrective action.

Implementation of this FCO must be performed in three steps:

1. Implement "Controlled Proactive" patch installation using
   the instructions in the SPECIAL CONSIDERATIONS section above.

2. Replace affected failed disks, part numbers listed above in
   PROBLEM DESCRIPTION, with ones that have a DPCO 390.A label.

3. For defective drives NOT being routed through the CPAS process, be sure
   to write "FCO A0222-1" on the Defective Material Tag (DMT) for quicker
   processing.

Comments:

None

Change History 
--------------

Date Modified: Jan/04/2006
Updates: AFFECTED PARTS, Change History, Footer
. AFFECTED PARTS: Some drives had incorrect firmware levels listed.  All
                  were listed as 0726.  This information has been corrected.
. Change History: Moved from beginning of ISSUE DESCRIPTION section to below
                  the COMMENTS section.
. Footer: Updated footer section to latest version.

Date Modified: Sep/01/2004
Updates: AFFECTED PARTS
. AFFECTED PARTS: removed model ST336704LC, part numbers 540-4521, 540-4520,
		  540-4689, and 390-0050 - firmware patch not developed for
		  this model as product is meeting all reliability goals.

Date Modified: Apr/07/2004
Updates: AFFECTED PARTS
. AFFECTED PARTS: changed 540-4191-XX model number from ST318203FC to
                  ST318304FC

________________________________________________________________________

NOTE: FCO Tracking Instructions for Radiance/SPWeb:
--------------------------------------------------

If a Radiance case involves the application of an FCO to solve a customer
issue, please complete the following steps in Radiance/SPWeb prior to
closing the case:
 
    o Select "Field Change Order" in the REFERENCE TYPE field.

    o Enter FCO ID number in the REFERENCE ID field.
      For example: A0222-1.

If possible, include additional details in the REFERENCE SUMMARY field
(e.g., upgrade complete, customer declined, etc.)
________________________________________________________________________

Implementation Notes
--------------------

In case of "Mandatory" FCOs, Sun Services will attempt to contact
all known customers to recommend proactive implementation.

For "Controlled Proactive" FCOs, Sun Services mission critical
support teams will initiate proactive implementation efforts for
their respective accounts, as required.

For "Upon Failure" FCOs, Sun Services and partners will implement
the necessary corrective actions as the need arises.

The CIC process must be used for proactive hardware replacement
requests when an FCO is classified as "Upon Failure".


Billing Information
-------------------

Warranty: Sun will provide parts at no charge under Warranty
          Service.  On-Site Labor Rates are based on specified
          Warranty deliverables for the affected product.

Contract: Sun will provide parts at no charge.  On-Site Labor Rates
          are based on the type of service contract.

Non Contract: Sun will provide parts at no charge.  Installation by
              Sun is available based on the On-Site Labor Rates
              defined in the Price List.

________________________________________________________________________

All FCO documents are accessible via Internal SunSolve.  Type "sunsolve"
in a browser and follow the prompts to Search Collections.

For questions on this document, please email:

        finfco-manager@Sun.com

The FCO homepage is available at:

        http://tns.central/FCO/

For more information on how to submit a FCO, go to:

        http://pronto.central/fco.html

To access the Service Partner Exchange, use:

        https://spe.sun.com

________________________________________________________________________
