Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-73-1000783.1
Update Date: 2010-09-10
Keywords:

Solution Type  FAB (standard) Sure

Solution  1000783.1 :   FCO A0279-1: Sun Fire T1000 with a Rhea HBA installed may experience Data Loss, Data Corruption and System Hangs.  


Related Items
  • Sun Fire T1000 Server
Related Categories
  • GCS>Sun Microsystems>Sun FAB>Hardware Remediation>Mandatory

Previously Published As
201060


Product
Sun Fire T1000 Server

Bug Id
<SUNBUG: 6503429>

Part
  • Part No: 375-3357-02/03
  • Part Description: PCI Express - Dual Ultra-320 SCSI (Rhea HBA)
Xoption
  • Xoption Number: SG-XPCIE2SCSIU320Z
  • Xoption Description: Dual-Channel Ultra320 LVD SCSI PCI Express Adapter

Impact

The Dual Ultra-320 SCSI Rhea card (Sun p/n 375-3357-02/03), when installed in a Sun Fire T1000, generates excessive RTO and LUP events as a result of a PCI Express specification violation.  These events may result in data loss, data corruption and system hangs.


Contributing Factors

This issue only arises when a -02 or -03 Rhea card (p/n 375-3357) is plugged into a T1000 system.

The easiest way to expose the issue is to run the storage tests included in SunVTS.  If a customer has a revision -02 or -03 Rhea card, there will be immediate warnings about excessive RTO and LUP events causing the message queue to overflow and ereports to drop.


Symptoms

Excessive RTO and LUP events, which may result in data loss, data corruption and system hangs.

Under load on the HBA, whether from the SunVTS HBA test or from applications reading from or writing to the storage device(s) attached to the HBA, the error messages and the hang condition appear sooner.

The "Dropping ereports" messages appear first, followed by a system hang.

Sun Fire T1000 systems print the following error messages to the console and save them to the messages files:

"SC Alert: Dropping ereports, message queue is full."
...
Nov 01 08:05:06: 00060029: "Dropping ereports, message queue is full."
Nov 01 08:05:08: 00060029: "Dropping ereports, message queue is full."
Nov 01 08:05:09: 00060029: "Dropping ereports, message queue is full."
...

A dump of the fma logs after rebooting the hung machine shows PCI-E Replay Timeout (RTO) events, which accompany link retraining, and PCI-E link up (LUP) events around the time of the hang:

Nov 01 12:40:30.357937920 ereport.io.fire.pec.rto
Nov 01 12:40:30.404282880 ereport.io.fire.pec.rto
Nov 01 12:05:03.389229600 ereport.io.fire.pec.lup
Nov 01 12:05:03.389229600 ereport.io.fire.pec.lup
Nov 01 12:40:43.300991360 ereport.io.fire.pec.rto
Nov 01 12:40:44.359243200 ereport.io.fire.pec.rto
...
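
To gauge how many of these events a saved dump contains, the excerpt above can be tallied with a short script. The following is a minimal sketch, not a supported tool: it assumes the dump is plain text in which each record line ends with the ereport class, as in the excerpt, and the function name count_pec_events is hypothetical.

```python
from collections import Counter

# Event classes quoted in this FCO, mapped to the short names used in the text.
PEC_CLASSES = {
    "ereport.io.fire.pec.rto": "RTO",
    "ereport.io.fire.pec.lup": "LUP",
}

def count_pec_events(lines):
    """Count RTO and LUP ereports in a text dump of the fma logs.

    Assumption: each record line ends with the ereport class, as in the
    excerpt above. Lines that do not match (including "...") are ignored.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields and fields[-1] in PEC_CLASSES:
            counts[PEC_CLASSES[fields[-1]]] += 1
    return counts

# The six records shown in the Symptoms section.
excerpt = """\
Nov 01 12:40:30.357937920 ereport.io.fire.pec.rto
Nov 01 12:40:30.404282880 ereport.io.fire.pec.rto
Nov 01 12:05:03.389229600 ereport.io.fire.pec.lup
Nov 01 12:05:03.389229600 ereport.io.fire.pec.lup
Nov 01 12:40:43.300991360 ereport.io.fire.pec.rto
Nov 01 12:40:44.359243200 ereport.io.fire.pec.rto
...
""".splitlines()

counts = count_pec_events(excerpt)
print(counts["RTO"], counts["LUP"])  # prints "4 2"
```

The sketch only counts events; it does not define what count is "excessive". That judgment still rests on the accompanying "Dropping ereports" messages and the hang behavior described above.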

Root Cause

A bug has been identified in which the Rhea card generates a large number of RTO and LUP events when installed in T1000 servers.  The root cause is that the Rhea card does not return the expected ACK DLLP on the PCI Express link to the Fire chip on the T1000 within the required number of clocks, which violates the PCI Express specification.  No other SPARC or AMD systems have shown sensitivity to this issue.

Corrective action was made available in Manufacturing via a dash roll of the Rhea card to -04 per ECO# WO_35411 as of March 1, 2007.

Corrective action was made available in Services via GSAP# 3881 as of March 13, 2007.

 

Implementation Target Completion Dates:

AMER: September 12, 2008
APAC: September 12, 2008
EMEA: September 12, 2008

Replacement Time Estimate: 30 minutes


Resolution

Hot Swappable: No

Proactively contact customers showing both the Rhea HBA and a Sun Fire T1000 on the same Sales Order, and replace the 375-3357-02 or -03 with a 375-3357-04 (or above).

For all other customers owning a Rhea HBA with part number 375-3357-02 or -03, replace upon failure or at customer request with part number 375-3357-04 (or above), but only after ascertaining that the replacement Rhea HBA is to be installed into a T1000.

For Rhea HBAs installed in systems other than a T1000, no action is required.
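
The three replacement rules above can be read as a small decision procedure. The sketch below is one possible paraphrase, offered only as an aid to reading the rules. The function name, parameter names and returned strings are all hypothetical and are not part of any Sun tool or process. Dash levels are passed as two-digit strings such as "02".

```python
# Dash levels of p/n 375-3357 affected by this FCO.
AFFECTED_DASH_LEVELS = {"02", "03"}

def remediation_action(dash_level, host_is_t1000,
                       same_sales_order=False, failed_or_requested=False):
    """Return a short action string for one Rhea HBA (p/n 375-3357-<dash_level>).

    Hypothetical parameters:
      host_is_t1000       - the card is (or the replacement will be) in a T1000
      same_sales_order    - the HBA and a T1000 appear on the same Sales Order
      failed_or_requested - the card has failed or the customer asked for a swap
    """
    if dash_level not in AFFECTED_DASH_LEVELS:
        return "no action: dash level not affected"
    if not host_is_t1000:
        return "no action: not installed in a T1000"
    if same_sales_order:
        return "proactively replace with 375-3357-04 or above"
    if failed_or_requested:
        return "replace with 375-3357-04 or above"
    return "replace upon failure or at customer request"

print(remediation_action("03", host_is_t1000=True, same_sales_order=True))
print(remediation_action("02", host_is_t1000=False))
```

The order of the checks reflects the text: the dash level and the host system are settled first, because a card outside a T1000 needs no action regardless of how it was ordered.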

For the Mandatory portion of this FCO, the Customer List is available on SunFIT:

   http://sunfit/

In addition, a Sun Legal approved Customer Letter is available at the following URL:

  http://sdpsweb.central/FIN_FCO/FCO/A0279-1/SPE/Rhea_HBA-Customer_Letter.sxw

How to Determine if a Rhea HBA is Installed:

To determine whether a Rhea HBA is installed in a T1000 system, run the following command and check PCI-E slot 0 for the presence of LSI,1030:

$ prtdiag -v
Location    Type  Slot     Path
------------------------- ---------
MB/PCIE0   PCIE      0     /pci@780/pci@0/scsi@8
scsi-pci1000,30  LSI,1030
MB/PCIE0   PCIE      0     /pci@780/pci@0/scsi@8,1
scsi-pci1000,30  LSI,1030
...

If a Rhea HBA is present, its dash level must then be visually inspected.  Only dash levels -02 and -03 are impacted by this issue.
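
When many systems need checking, the search for LSI,1030 in saved `prtdiag -v` output could be scripted. The following is a minimal sketch under stated assumptions: the output has been captured to text, device paths begin with "/pci@", and the model string appears either on the same line as the path or on the line after it, as in the listing above. The function name find_rhea_candidates is hypothetical. The script cannot report the dash level, which still requires visual inspection of the card.

```python
MODEL = "LSI,1030"  # model string to look for, per the listing above

def find_rhea_candidates(prtdiag_text):
    """Return device paths from saved `prtdiag -v` text whose model is LSI,1030.

    Assumption: a device path is any field starting with "/pci@", and the
    model string follows it on the same line or on a later line.
    """
    paths = []
    last_path = None
    for line in prtdiag_text.splitlines():
        fields = line.split()
        # Remember the most recent device path seen.
        for field in fields:
            if field.startswith("/pci@"):
                last_path = field
        # Pair the model string with that path, then reset.
        if MODEL in fields and last_path is not None:
            paths.append(last_path)
            last_path = None
    return paths

sample = """\
Location    Type  Slot     Path
------------------------- ---------
MB/PCIE0   PCIE      0     /pci@780/pci@0/scsi@8
scsi-pci1000,30  LSI,1030
MB/PCIE0   PCIE      0     /pci@780/pci@0/scsi@8,1
scsi-pci1000,30  LSI,1030
"""

for path in find_rhea_candidates(sample):
    print(path)
```

An empty result means no LSI,1030 device was listed. A non-empty result only identifies candidate cards to pull and inspect for dash level -02 or -03.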


Modification History
Date: 14-SEP-2007
  •  Added "How to Determine..." section.  Added readiness statement under "Hardware Remediation and Material Availability Details" section.   Added Hotswap statement in "Resolution" section.  Added additional details in "Symptoms" section.


Previously Published As
103020
Internal Comments


Sun Alert 102991 is being published to inform customers of this issue/resolution.



For questions or feedback on this asset, send email to FCO_279_Discussion@sunwebcollab.east.sun.com



Hardware Remediation Details

All Regions were materially ready at the time of this asset's publication.


Related Information
  • Other: Sun Alert 102991

Internal Contributor/submitter
Tonya.Flynn@Sun.COM

Internal Eng Business Unit Group
KE Authors

Internal Eng Responsible Engineer
Rodger.Wilson@Sun.COM, Deanna.Demarco@Sun.COM

Internal Services Knowledge Engineer
Joe.Davis@Sun.COM

Internal Escalation ID
1-20665197

Internal Kasp FAB Legacy ID
103020

Internal Sun Alert & FAB Admin Info
Critical Category:
Significant Change Date: 2007-09-13
Avoidance: Hardware Replacement
Responsible Manager: David.Palmer@Sun.COM
Original Admin Info: WF - Finalizing draft for Extended Review. - Joe 7/25/07
WF - made changes per input from TZ review. - Joe 8/2/07
WF - minor mods - awtg Material Ready. - Joe 8/7/07
WF - added discussion alias to Comments section. - Joe 8/8/07
WF - US, Ltn Am and EMEA are Materially Ready, and APac should be
ready by 9/7/07, but TZs want until Tuesday to perform a final
review on FAB and Sun Alert KE wants same time to send the
related Sun Alert thru final review. Will Pub on 9/4. - Joe 8/31/07
WF - sdpsweb.central is finally back up after 10 days, so I can now
publish this FCO. - Joe 9/13/07
Product_uuid
79ad78b9-961d-11d9-9adf-080020a9ed93|Sun Fire T1000 Server

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.