Document Audience: | INTERNAL |
Document ID: | A0222-1 |
Title: | Seagate Cheetah 4 FCAL and SCSI disk drives may experience higher than expected failure rate. |
Copyright Notice: | Copyright © 2007 Sun Microsystems, Inc. All Rights Reserved |
Update Date: | Wed Jan 04 00:00:00 MST 2006 |
__________________________________________________________________
*** Sun Confidential: Internal Use and Authorized VARs Only ***
__________________________________________________________________
This message including any attachments is confidential information
of Sun Microsystems, Inc. Disclosure, copying or distribution is
prohibited without permission of Sun. If you are not the intended
recipient, please reply to the sender and then delete this message.
__________________________________________________________________
FIELD CHANGE ORDER
(For Authorized Distribution by Enterprise Services)
FCO #: A0222-1
Status: inactive
Synopsis: Seagate Cheetah 4 FCAL and SCSI disk drives may experience higher than expected failure rate.
Date: Jan/04/2006
SunAlert: No
Top FIN/FCO Report: No
Products Reference: Seagate Cheetah 4 FCAL and SCSI disk drives
Product Category: Storage / Disk
Product Affected:
Systems Affected:
Mkt_ID Platform Model Description
------ -------- ----- -----------
- ALL - System Platform Independent
X-Options Affected:
Mkt_ID Platform Model Description
------ -------- ----- -----------
- T3 ALL StorEdge T3
- T3+ ALL StorEdge T3+
- D240 ALL StorEdge D240
- D1000 ALL StorEdge D1000
- MultiPack ALL StorEdge MultiPack
- UniPack ALL StorEdge UniPack
- st A1000/D1000 ALL Netra st A1000/D1000
- ct 400/800 ALL Netra ct 400/800
- st D130 ALL Netra st D130
- A5100 ALL StorEdge A5100
- A5200 ALL StorEdge A5200
- A5000 ALL StorEdge A5000
Parts Affected:
Part Number Description Model Type Vendor Firmware
----------- ----------- ----- ---- ------ --------
540-4519-XX 73GB FCAL ST173404FC Disk Seagate A726
540-4191-XX 18GB FCAL ST318304FC Disk Seagate 0726
540-4673-XX 18GB FCAL ST318304FC Disk Seagate 0726
540-4440-XX 18GB FCAL ST318304FC Disk Seagate A726
540-4177-XX 18GB SCSI ST318404LC Disk Seagate 4203
540-4178-XX 18GB SCSI ST318404LC Disk Seagate 4203
540-4401-XX 18GB SCSI ST318404LC Disk Seagate 4203
390-0038-XX 18GB SCSI ST318404LC Disk Seagate 4203
540-4367-XX 36GB FCAL ST336704FC Disk Seagate A726
540-4525-XX 36GB FCAL ST336704FC Disk Seagate 0726
540-3881-XX 9GB SCSI ST39204LC Disk Seagate 4203
390-0037-XX 9GB SCSI ST39204LC Disk Seagate 4203
540-3966-XX 9GB SCSI ST39204LC Disk Seagate 4203
References:
ESC: 544378, 543716, 543461
FCO: A0199-2
PatchID: 113667-01 or later
PatchID: 113668-04 or later
DPCO: 390
Issue Description:
Seagate Cheetah 4 FCAL and SCSI disk drives may experience higher than
expected failure rate. The experienced failure mode is Offline or Not Ready.
The typical scenario involves drives that go into an "idle" state with power
applied. A small percentage of these drives may have experienced this failure
already, but it will not be evident until a data transfer is initiated. Until
then the failed drive appears to be working and may even respond to a Test
Unit Ready command.
Note: Due to the operating profile of internal system drives (minimal
idle time), these drives are minimally exposed to this issue.
The following are typical T3 Purple Error Messages:
May 07 22:42:29 ISR1[1]: N: u1d6 SVD_RETRY: Retries Exhausted (ccb = 0x1384034)
May 07 22:42:29 ISR1[1]: N: u1d6 SVD_DONE: Command Error = 0x4
May 07 22:42:29 ANNT[1]: W: u1d6: Failed
May 07 22:42:31 LPCT[1]: N: u1d6: Bypassed on loop 1
May 07 22:42:31 ISR1[1]: N: u1d6 SVD_RETRY: Retries Exhausted (ccb = 0x13918f4)
May 07 22:42:31 ISR1[1]: N: u1d6 SVD_DONE: Command Error = 0x4
May 07 22:42:31 snmp[1]: N: u1d6 ioctl disk failed err=1
May 07 22:42:31 LPCT[1]: N: u1d6: Bypassed on loop 2
Mar 23 23:01:16 ISR1[2]: N: u1d7 SVD_DONE: Command Error = 0x3
Mar 23 23:01:16 ISR1[2]: W: u1d7 SCSI Disk Error Occurred (path = 0x0)
Mar 23 23:01:16 ISR1[2]: W: Sense Key = 0x2, Asc = 0x4, Ascq = 0x2
Mar 23 23:01:16 ISR1[2]: W: Sense Data Description = Logical Unit Not Ready,
The following are typical A5x00 Photon error messages:
socal3: port 0: Fibre Channel is OFFLINE
WARNING: /sbus@b,0/SUNW,socal@0,0/sf@1,0/ssd@w2100002037aef8c8,0 (ssd164):
SCSI transport failed: reason 'tran_err': retrying command
socal3: port 1: Fibre Channel is OFFLINE
WARNING: /sbus@3,0/SUNW,socal@0,0/sf@1,0/ssd@w21000020378837f2,0 (ssd171):
SCSI transport failed: reason 'timeout': giving up
May 20 10:59:58 gmi2wap011 scsi: [ID 107833 kern.notice] Sense Key: Not Ready
May 20 10:59:58 gmi2wap011 scsi: [ID 107833 kern.notice] ASC: 0x4 (), ASCQ: 0x1, FRU: 0x2
The root cause of this issue is low-resistance, intermittent shorts between
adjacent pins in flash ROM memory chips that use a red phosphorus inorganic
fire-retardant molding compound. These shorts are caused by
electro-migration of metal across adjacent pins of these memory chips.
All field RSLs were purged via DPCO 390 starting on May 29, 2003. Corrective
action was made available by releasing patches 113668-04 (or later) and
113667-01 (or later).
Implementation:
---
| | MANDATORY (Fully Pro-Active)
---
---
| | CONTROLLED PRO-ACTIVE (per Sun Geo Plan)
---
---
| X | UPON FAILURE
---
Replacement Time Estimate:
2.0 hours
Special Considerations:
A. How to determine potential customer exposure:
Not all failures can be tied to this failure mode. Prior to implementing
any fix, the account team must determine two factors:
1. Determine that the customer is seeing a higher than expected
replacement/pull rate.
2. Determine the failure mode for suspect failed drives via the
CPAS process. Submit no more than five (5) drives for this process
per CPAS.
B. Assuming your customer is seeing a higher than expected failure rate,
and CPAS results indicate failures are due to failed Atmel chips, this
FCO explains how to install new firmware to address the issue.
The magnitude of exposure for this failure mechanism is very small and
customers may not see it at all. The drive's specification for AFR (Annual
Failure Rate) is 1.10% which equals 800K hours MTBF, and the drive's AFR is
currently measured in the field at just over 1.10%.
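The relationship between the quoted AFR and MTBF figures can be checked with
a short calculation. The sketch below assumes a constant failure rate
(exponential model) and continuous power-on (8,760 hours per year). Neither
assumption is stated in this FCO, and the function names are illustrative
only.

```python
import math

HOURS_PER_YEAR = 8760  # assumption: drives are powered on 24 h x 365 days


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Annual failure rate (as a fraction) under a constant-failure-rate model."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def mtbf_from_afr(afr: float) -> float:
    """Inverse of afr_from_mtbf: MTBF in hours for a given AFR fraction."""
    return -HOURS_PER_YEAR / math.log(1.0 - afr)


if __name__ == "__main__":
    # 800K hours MTBF should come out close to the 1.10% AFR specification.
    print(f"AFR for 800K hours MTBF: {afr_from_mtbf(800_000):.2%}")
    print(f"MTBF for 1.10% AFR: {mtbf_from_afr(0.011):,.0f} hours")
```

Under these assumptions the two figures agree to within rounding, which is
consistent with the specification quoted above.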
As such, we are recommending that the account team install FCO firmware only
after CPAS results indicate drives are failing due to Atmel chip failures.
The following recommendation is provided as a guideline for authorized
Enterprise Services Field Representatives who may encounter the above
mentioned issue.
Please adhere to the following guidelines:
Proceed only if the customer site is seeing Unit Not Ready or Not Ready
failures AND reporting a high replacement/pull rate of the above disk drive
part numbers (Seagate Cheetah 4 only).
1. If the site is experiencing 'Drive Not Ready' along with a high
pull rate, route all failed disk drives through the Sun Customer
Quality CPAS (Corrective & Preventive Action System) process for
failure analysis. The CPAS request form must be filled out
completely and accompanied by explorer scripts (see Note
below). Submit no more than five (5) drives for this process.
Information on the CPAS Program can be found at the following webpage:
http://gsops.central/mcrcca/CPAS/
2. Open an escalation with PTS to establish an action plan.
The Action Plan is to install drive specific Firmware only if the following
conditions are triggered:
For all products:
a) Through the CPAS process, the root cause is identified as the
"Atmel chip" failure mode.
NOTE: Open a CPAS request only if the following criteria are met. Include
this information in the CPAS request. CPAS requests without this failure
information will be rejected.
b) High Pull rate and DNR "drive not ready" being experienced.
NOTE: A high pull rate is defined as greater than 3% Annual Pull Rate.
Annual Pull Rate is calculated by projecting at least 3 months of failure
data onto an annual basis. Example: 15 drives are pulled for replacement over
a three month period. This equals an average of 5 drives per month. Multiply
by 12 to annualize: 5 x 12 = 60 forecasted drive pulls per year.
c) Determine the expected annual pull count at the 3% threshold by
multiplying the total Cheetah 4 population by 3%. Example: install base =
1,000 Cheetah 4 drives. 1,000 x 3% = 30. This is the expected annual pull
count.
d) Compare expected with forecasted.
In this example, we would expect 30 and have forecasted 60. The site is
experiencing a high pull rate: 60 versus an expected 30.
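The pull rate comparison in steps b) through d) can be sketched as a small
calculation. This is an illustration only; the function names are
hypothetical and are not part of any Sun tool. The 3% threshold and the
3 month minimum observation period are taken from the NOTE above.

```python
def forecast_annual_pulls(pulls: int, months_observed: float) -> float:
    """Project observed drive pulls onto a 12 month basis."""
    if months_observed < 3:
        raise ValueError("use at least 3 months of failure data")
    return pulls / months_observed * 12


def expected_annual_pulls(install_base: int, threshold: float = 0.03) -> float:
    """Annual pull count corresponding to the 3% Annual Pull Rate threshold."""
    return install_base * threshold


def is_high_pull_rate(pulls: int, months_observed: float,
                      install_base: int, threshold: float = 0.03) -> bool:
    """True when the forecast exceeds the expected count at the threshold."""
    forecast = forecast_annual_pulls(pulls, months_observed)
    return forecast > expected_annual_pulls(install_base, threshold)


if __name__ == "__main__":
    # Worked example from this FCO: 15 pulls in 3 months, 1,000 drives on site.
    print(forecast_annual_pulls(15, 3))    # forecasted annual pulls
    print(expected_annual_pulls(1000))     # expected annual pulls at 3%
    print(is_high_pull_rate(15, 3, 1000))  # whether the site qualifies
```

With the example figures above, the forecast (60) is twice the expected
count (30), so the site meets the high pull rate criterion.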
Additional Notes:
The FW provided via this FCO will "wake up" and fail disk drives that have
been in an idle-failed state. Likewise the "wake up" algorithm is designed to
shake out "latent" failures (that is, failed drives that have not yet reported
failures). The very act of "waking up" the drive population and downloading
new firmware will, in addition to identifying latent failed drives, accelerate
the failure of those drives that are marginal. While no definite time line to
failure has been identified for these "marginal" drives, there will be a
period of time during which drives will continue to fail. The rate at which
they fail will be significantly lower than the pre-FCO failure rate, but still
above expected norms. This rate could last for a few months as the
accelerated failures continue to appear. The failure rate is then expected to
return to a more normal level for the remainder of the disk population's life.
---------------------------------
PLAN/PREPARE FOR FIRMWARE INSTALL
---------------------------------
A. INSTALL NEW FIRMWARE - PatchIDs
1. Ensure the number of spares required to support the firmware
install are available and on-site (plan for 2% fallout).
2. Install Cheetah 4 disk drive FW as follows:
FCAL: 113668-04 (or later)
SCSI: 113667-01 (or later)
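As a planning aid for step A.1, the 2% fallout figure can be turned into a
spare drive count. The sketch below is an assumption-laden illustration:
the function name is hypothetical, and rounding up to a whole drive is a
conservative choice made here, not something this FCO specifies.

```python
def spares_needed(drive_count: int, fallout_percent: int = 2) -> int:
    """Spare drives to stage on-site, rounded up to a whole drive.

    Integer arithmetic is used so the result is not affected by
    floating-point rounding.
    """
    if drive_count < 0:
        raise ValueError("drive_count must not be negative")
    return (drive_count * fallout_percent + 99) // 100


if __name__ == "__main__":
    for population in (120, 500, 1000):
        print(population, "drives ->", spares_needed(population), "spares")
```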
B. IDENTIFY ANY LATENT (HIDDEN) FAILED DRIVES / MONITOR FOR ERRORS
Monitor systems for reported drive failures and replace failed drives as
required. Rerun the FW script only to update the firmware of any replaced
drives.
C. VERIFY PROPER OPERATION OF SYSTEM
Restore target configuration and data (if necessary), and verify proper
operation of system.
---------------------------
FIRMWARE INSTALLATION NOTES
---------------------------
When the FW script is finished you can examine the log file:
/"patch id"/fco/logs
Looking at the log file you may discover unexpected error messages. Don't be
alarmed. One of the scripts used by this patch sets the ssd_error_level
system variable using adb to temporarily log more messages. This results in
the system capturing all sense key messages. The script turns this extra
logging off again, so you will not see these errors in normal conditions.
See bug 4758975 for more info.
Since the error level is turned on to report every sense key, you have to
filter errors out. For information on which errors you should pay attention
to, refer to the following online document:
http://webhome.central/harish/cgi-bin/help.cgi?SEARCH_STRING=cdb+failed+decoding+scsi+sense&GoBtn=GO
Using the above link, you can decode sense data.
- Sense keys 1 & 2 are informational only.
- Sense keys 3 & 4 are more severe, especially sense key 4 (Hardware error).
Also look at the Error level: Fatal or Retryable.
You should identify disks that report sense key 4 errors. Since these errors
are seen during the upgrade process, we can't tell if they are pre-upgrade or
post-upgrade. After completing the upgrade, run dd or dex to drive heavy I/O
to these disks and see if the same disks report any sense key 3 or 4 errors.
Since the extra error logging is turned off by then, you will not see extra
errors, i.e. failed CDB messages.
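The filtering described above can be sketched as a small script that keeps
only lines reporting sense key 3 or 4. This is an illustration under stated
assumptions: the pattern matches the T3 message form shown in the Issue
Description ("Sense Key = 0x2, ..."), the sense key names follow the SCSI
standard, and the actual format of the patch log file is not shown in this
FCO and may differ. The function name is hypothetical.

```python
import re

# Sense keys discussed in this FCO, with their standard SCSI names.
SENSE_KEYS = {
    0x1: "Recovered Error",
    0x2: "Not Ready",
    0x3: "Medium Error",
    0x4: "Hardware Error",
}
SEVERE = {0x3, 0x4}

# Assumed message form, based on the T3 sample: "Sense Key = 0x2, Asc = ..."
PATTERN = re.compile(r"Sense Key = (0x[0-9A-Fa-f]+)")


def severe_sense_lines(lines):
    """Yield (sense_key, name, line) for lines reporting sense key 3 or 4."""
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            continue
        key = int(match.group(1), 16)
        if key in SEVERE:
            yield key, SENSE_KEYS[key], line.rstrip()


if __name__ == "__main__":
    sample = [
        # Taken from the T3 messages in this FCO (informational, skipped):
        "Mar 23 23:01:16 ISR1[2]: W: Sense Key = 0x2, Asc = 0x4, Ascq = 0x2",
        # Hypothetical line, made up to show a sense key 4 match:
        "Mar 23 23:05:00 ISR1[2]: W: Sense Key = 0x4, Asc = 0x44, Ascq = 0x0",
    ]
    for key, name, line in severe_sense_lines(sample):
        print(f"sense key {key} ({name}): {line}")
```

Solaris host messages report the sense key by name (for example
"Sense Key: Not Ready") rather than by number, so the pattern would need to
be adapted before it is used on those logs.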
Now use the procedures described in this FCO to determine if this site is
still having drive issues.
---------------------------
Long Term Monitoring Action
---------------------------
A. MONITOR THE SITE FOR CHEETAH 4 DRIVE FAILURES.
B. SWAP DRIVES (IF REQUIRED).
C. SUBMIT DRIVES FOR CPAS AND OPEN AN ESCALATION IF FAILURE
RATE REMAINS ABOVE EXPECTED LEVELS.
Use the table below to evaluate if drive failure rates are above expectations:
|---------------------------|
| # of drives    | 2 months |
| on site        |          |
|================|==========|
| 300            | 9        |
|----------------|----------|
| 400            | 12       |
|----------------|----------|
| 500            | 15       |
|----------------|----------|
| 1,000          | 30       |
|----------------|----------|
| 1,500          | 45       |
|----------------|----------|
| total failures |          |
|---------------------------|
Corrective Action:
IMPORTANT! Please follow the Disk Drive FW Install process
listed above prior to implementing any other corrective action.
Implementation of this FCO must be performed in the following steps:
1. Implement "Controlled Proactive" patch installation using
the instructions in the SPECIAL CONSIDERATIONS section above.
2. Replace affected failed disks, part numbers listed above under
Parts Affected, with ones that have a DPCO 390.A label.
3. For defective drives NOT being routed through the CPAS process, be sure
to write "FCO A0222-1" on the Defective Material Tag (DMT) for quicker
processing.
Comments:
None
Change History
--------------
Date Modified: Jan/04/2006
Updates: AFFECTED PARTS, Change History, Footer
. AFFECTED PARTS: Some drives had incorrect firmware levels listed. All
were listed as 0726. This information has been corrected.
. Change History: Moved from beginning of ISSUE DESCRIPTION section to below
the COMMENTS section.
. Footer: Updated footer section to latest version.
Date Modified: Sep/01/2004
Updates: AFFECTED PARTS
. AFFECTED PARTS: removed model ST336704LC, part numbers 540-4521, 540-4520,
540-4689, and 390-0050 - firmware patch not developed for
this model as product is meeting all reliability goals.
Date Modified: Apr/07/2004
Updates: AFFECTED PARTS
. AFFECTED PARTS: chgd 540-4191-XX model number from ST318203FC to ST318304FC
________________________________________________________________________
NOTE: FCO Tracking Instructions for Radiance/SPWeb:
--------------------------------------------------
If a Radiance case involves the application of an FCO to solve a customer
issue, please complete the following steps in Radiance/SPWeb prior to
closing the case:
o Select "Field Change Order" in the REFERENCE TYPE field.
o Enter FCO ID number in the REFERENCE ID field.
For example: A0222-1.
If possible, include additional details in the REFERENCE SUMMARY field
(e.g. upgrade complete, customer declined, etc.)
________________________________________________________________________
Implementation Notes
--------------------
In case of "Mandatory" FCOs, Sun Services will attempt to contact
all known customers to recommend proactive implementation.
For "Controlled Proactive" FCOs, Sun Services mission critical
support teams will initiate proactive implementation efforts for
their respective accounts, as required.
For "Upon Failure" FCOs, Sun Services and partners will implement
the necessary corrective actions as the need arises.
The CIC process must be used for proactive hardware replacement
requests when an FCO is classified as "Upon Failure".
Billing Information
-------------------
Warranty: Sun will provide parts at no charge under Warranty
Service. On-Site Labor Rates are based on specified
Warranty deliverables for the affected product.
Contract: Sun will provide parts at no charge. On-Site Labor Rates
are based on the type of service contract.
Non Contract: Sun will provide parts at no charge. Installation by
Sun is available based on the On-Site Labor Rates
defined in the Price List.
________________________________________________________________________
All FCO documents are accessible via Internal SunSolve. Type "sunsolve"
in a browser and follow the prompts to Search Collections.
For questions on this document, please email:
finfco-manager@Sun.com
The FCO homepage is available at:
http://tns.central/FCO/
For more information on how to submit a FCO, go to:
http://pronto.central/fco.html
To access the Service Partner Exchange, use:
https://spe.sun.com
________________________________________________________________________