Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type FAB (standard) Sure Solution

1000783.1 : FCO A0279-1: Sun Fire T1000 with a Rhea HBA installed may experience Data Loss, Data Corruption and System Hangs.

PreviouslyPublishedAs 201060
Product Sun Fire T1000 Server
Bug Id <SUNBUG: 6503429>
Part
Impact

The Dual Ultra-320 SCSI Rhea card (Sun p/n 375-3357-02/03), when installed in a Sun Fire T1000, generates excessive RTO and LUP events as a result of a PCI-Express specification violation. These events may result in lost data, data corruption and system hangs.

Contributing Factors

This issue only arises when a -02 or -03 Rhea card (p/n 375-3357) is plugged into a T1000 system. The easiest way to expose the issue is to run the storage tests included in SunVTS. If a customer has a revision -02 or -03 Rhea card, there will be immediate warnings about excessive RTO and LUP events causing the message queue to overflow and ereports to be dropped.

Symptoms

Excessive RTO and LUP events, which may result in lost data, data corruption and system hangs.

With a load on the HBA, from the SunVTS HBA test or from applications doing reads and/or writes to the storage device(s) attached to the HBA, the error messages and the hang condition appear sooner. The "Dropping ereports" messages appear first, followed by a system hang.

Sun Fire T1000 systems will show the following error messages on the console and save them to the messages files:

    "SC Alert: Dropping ereports, message queue is full."
    ...
    Nov 01 08:05:06: 00060029: "Dropping ereports, message queue is full."
    Nov 01 08:05:08: 00060029: "Dropping ereports, message queue is full."
    Nov 01 08:05:09: 00060029: "Dropping ereports, message queue is full."
    ...

A dump of the fma logs after rebooting the hung machine shows PCI-E retraining (Replay Timeout, RTO) and PCI-E link up (LUP) events when the hang occurs:

    Nov 01 12:40:30.357937920 ereport.io.fire.pec.rto
    Nov 01 12:40:30.404282880 ereport.io.fire.pec.rto
    Nov 01 12:05:03.389229600 ereport.io.fire.pec.lup
    Nov 01 12:05:03.389229600 ereport.io.fire.pec.lup
    Nov 01 12:40:43.300991360 ereport.io.fire.pec.rto
    Nov 01 12:40:44.359243200 ereport.io.fire.pec.rto
    ...

Root Cause

A bug has been identified where the Rhea card causes a large number of RTO and LUP events when installed in T1000 servers.
The root cause is that the Rhea card does not return the expected ACK DLLP on the PCI Express link to the Fire chip on the T1000 within the required number of clocks, which violates the PCI Express specification. No other SPARC or AMD systems have shown sensitivity to this issue.

Corrective action was made available in Manufacturing via a dash roll of the Rhea card to -04 per ECO# WO_35411 as of March 1, 2007. Corrective action was made available in Services via GSAP# 3881 as of March 13, 2007.
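The RTO and LUP ereport classes listed under Symptoms can be tallied from a saved text dump of the FMA error log. The following is a minimal sketch, not part of the FCO procedure: it assumes the dump has one event per line with the ereport class as the last field, as in the excerpt above. The function name `count_pec_events` and the `sample` text are illustrative only; the FAB does not state which command produced the dump.

```python
from collections import Counter

# Event classes named in this FAB; every other line is ignored.
WATCHED = ("ereport.io.fire.pec.rto", "ereport.io.fire.pec.lup")


def count_pec_events(log_text):
    """Count RTO and LUP ereport lines in a text dump of the FMA error log.

    Assumes one event per line with the ereport class as the last
    whitespace-separated field (an assumption based on the excerpt in this FAB).
    """
    counts = Counter()
    for line in log_text.splitlines():
        fields = line.split()
        if fields and fields[-1] in WATCHED:
            # Key by the short suffix: "rto" or "lup".
            counts[fields[-1].rsplit(".", 1)[-1]] += 1
    return counts


# Sample lines copied from the Symptoms section of this FAB.
sample = """\
Nov 01 12:40:30.357937920 ereport.io.fire.pec.rto
Nov 01 12:40:30.404282880 ereport.io.fire.pec.rto
Nov 01 12:05:03.389229600 ereport.io.fire.pec.lup
Nov 01 12:05:03.389229600 ereport.io.fire.pec.lup
Nov 01 12:40:43.300991360 ereport.io.fire.pec.rto
Nov 01 12:40:44.359243200 ereport.io.fire.pec.rto
"""

if __name__ == "__main__":
    totals = count_pec_events(sample)
    print("RTO events:", totals["rto"])
    print("LUP events:", totals["lup"])
```

A non-zero count is only a hint that the system may be affected. The FAB does not define a threshold, so the card's part number and dash level still need to be checked.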
Implementation

Target Completion Dates:
AMER: September 12, 2008
APAC: September 12, 2008
EMEA: September 12, 2008

Replacement Time Estimate: 30 minutes

Resolution

Hot Swappable: No

Proactively contact customers showing both the Rhea HBA and a Sun Fire T1000 on the same Sales Order and replace the 375-3357-02 or -03 with a 375-3357-04 (or above).

For all other customers owning a Rhea HBA with part number 375-3357-02 or -03, replace upon failure or at customer request with part number 375-3357-04 (or above), but only after ascertaining that the replacement Rhea HBA is to be installed into a T1000. For Rhea HBAs installed in systems other than a T1000, no action is required.

For the Mandatory portion of this FCO, the Customer List is available on SunFIT. In addition, a Sun Legal approved Customer Letter is available at the URL below:

http://sdpsweb.central/FIN_FCO/FCO/A0279-1/SPE/Rhea_HBA-Customer_Letter.sxw

How to Determine if a Rhea HBA is Installed:

To determine if a Rhea HBA is installed on a T1000 system, run the following command and check PCI-E slot 0 for the presence of LSI,1030:

    $ prtdiag -v
    Location Type Slot Path
    ------------------------- ---------
    MB/PCIE0 PCIE 0 /pci@780/pci@0/scsi@8 scsi-pci1000,30 LSI,1030
    MB/PCIE0 PCIE 0 /pci@780/pci@0/scsi@8,1 scsi-pci1000,30 LSI,1030
    ...

If a Rhea HBA is present, its dash level must then be visually inspected. Only dash levels -02 and -03 are impacted by this issue.

Modification History

Date: 14-SEP-2007
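The `prtdiag -v` check described under "How to Determine if a Rhea HBA is Installed" can be scripted against saved command output. The following is a minimal sketch under stated assumptions: rows are whitespace-separated in the order shown in this FAB (location, type, slot, path, name, model), and the function name `find_lsi1030_slots` is illustrative only. It cannot determine the dash level, which still requires visual inspection of the card.

```python
def find_lsi1030_slots(prtdiag_text):
    """Return (location, slot, device path) for each PCIE row whose model is LSI,1030.

    Assumes the row layout shown in this FAB, for example:
    MB/PCIE0 PCIE 0 /pci@780/pci@0/scsi@8 scsi-pci1000,30 LSI,1030
    """
    hits = []
    for line in prtdiag_text.splitlines():
        fields = line.split()
        if len(fields) >= 6 and fields[1] == "PCIE" and fields[-1] == "LSI,1030":
            hits.append((fields[0], fields[2], fields[3]))
    return hits


# Sample rows copied from the prtdiag excerpt in this FAB.
sample_prtdiag = """\
Location Type Slot Path
------------------------- ---------
MB/PCIE0 PCIE 0 /pci@780/pci@0/scsi@8 scsi-pci1000,30 LSI,1030
MB/PCIE0 PCIE 0 /pci@780/pci@0/scsi@8,1 scsi-pci1000,30 LSI,1030
"""

if __name__ == "__main__":
    for location, slot, path in find_lsi1030_slots(sample_prtdiag):
        print(f"Possible Rhea HBA at {location}, slot {slot}: {path}")
```

A match only indicates that an LSI,1030-based device is present in the listed slot. Whether it is a -02 or -03 Rhea card has to be confirmed from the part number on the card itself.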
Previously Published As 103020

Internal Comments

Sun Alert 102991 is being published to inform customers of this issue/resolution.

For questions or feedback on this asset, send email to FCO_279_Discussion@sunwebcollab.east.sun.com

Hardware Remediation Details

All Regions were materially ready at the time of this asset's publication.

Related Information
Internal Contributor/submitter Tonya.Flynn@Sun.COM
Internal Eng Business Unit Group KE Authors
Internal Eng Responsible Engineer Rodger.Wilson@Sun.COM, Deanna.Demarco@Sun.COM
Internal Services Knowledge Engineer Joe.Davis@Sun.COM
Internal Escalation ID 1-20665197
Internal Kasp FAB Legacy ID 103020

Internal Sun Alert & FAB Admin Info

Critical Category:
Significant Change Date: 2007-09-13
Avoidance: Hardware Replacement
Responsible Manager: David.Palmer@Sun.COM

Original Admin Info:
WF - Finalizing draft for Extended Review. - Joe 7/25/07
WF - made changes per input from TZ review. - Joe 8/2/07
WF - minor mods - awtg Material Ready. - Joe 8/7/07
WF - added discussion alias to Comments section. - Joe 8/8/07
WF - US, Ltn Am and EMEA are Materially Ready, and APac should be ready by 9/7/07, but TZs want until Tuesday to perform a final review on FAB and Sun Alert KE wants same time to send the related Sun Alert thru final review. Will Pub on 9/4. - Joe 8/31/07
WF - sdpsweb.central is finally back up after 10 days, so I can now publish this FCO. - Joe 9/13/07

Product_uuid 79ad78b9-961d-11d9-9adf-080020a9ed93|Sun Fire T1000 Server

Attachments

This solution has no attachment