Document Audience: INTERNAL
Document ID: A0226-1
Title: Sun Fire V480 system may experience a fatal reset and reboot with "ERROR: System "FATAL RESET" from DAR/DCS/CDX".
Copyright Notice: Copyright © 2007 Sun Microsystems, Inc. All Rights Reserved
Update Date: Fri Apr 23 00:00:00 MDT 2004
----------------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
----------------------------------------------------------------------------
FIELD CHANGE ORDER
(For Authorized Distribution by Enterprise Services)
FCO #: A0226-1
Status: inactive
Synopsis: Sun Fire V480 system may experience a fatal reset and reboot with "ERROR: System "FATAL RESET" from DAR/DCS/CDX".
Date: Apr/23/2004
SunAlert: No
Top FIN/FCO Report: No
Products Reference: Sun Fire V480
Product Category: Server / System Component
Product Affected:
Systems Affected:
Mkt_ID Platform Model Description
------ -------- ----- -----------
- A37 All Sun Fire V480
X-Options Affected:
Mkt_ID Platform Model Description
------ -------- ----- -----------
N/A
Parts Affected:
Part Number Description
----------- -----------
501-6733-xx V480 Centerplane
501-5819-xx V480 Centerplane
(SCSI Devices)
Type Vendor Model SerialNumber(Min) SerialNumber(Max) Firmware
---- ------ ------- ----------------- ----------------- --------
N/A
References:
BugID: 4914247
ESC: 545325, 547384, 547500, 547816
FIN: I1016-1
LEAP: 2472
BugID: 4898531
Issue Description:
Change History
--------------
A0226-1:
Date Modified: Apr/23/2004
Updates: PROBLEM DESCRIPTION, CORRECTIVE ACTION
. PROBLEM DESCRIPTION: Added paragraph in Additional Description section
explaining possibility of experiencing a hard hang.
. CORRECTIVE ACTION: Added a second good part number (501-6790-01 or later).
------------------------------------------------------------------------
In an extremely limited number of applications, and with a single system
configuration, the Sun Fire V480 system may experience a fatal reset and
reboot with the following error message:
ERROR: System "FATAL RESET" from DAR/DCS/CDX.
The specific configuration is as follows:
- "ce1" (Cassini ASIC) onboard interface is configured
with PCI cards in 66 MHz PCI slots (slot 0 or 1)
Note 1: From testing we know that high data activity occurring to or from a
PCI card installed (as described above) seems to trigger this issue more
often. There is no way to determine exactly the amount of activity which
may cause this reset.
Note 2: If you have extra Cassini-based network cards installed in the
V480, the onboard "ce1" network port may be numbered something other than
"ce1". For the port numbering to change, you must have one of the two
supported Cassini-based cards installed in the V480 system, P/N 501-5524 or
501-5902.
For example, if one of the Cassini-based cards is installed in one of the
PCI slots, the onboard ce ports will not be numbered "ce0" and "ce1".
Controller IDs are assigned according to the order in which the PCI buses
are numbered. The Sun Fire V480 has two Schizos (at Safari addresses 8 and
9). Each Schizo has 2 PCI buses (at local Schizo addresses 600000 and
700000). This gives 4 PCI buses:
/pci@8,700000 "B" (33 MHz) bus on "first" Schizo
/pci@8,600000 "A" (66 MHz) bus on "first" Schizo
/pci@9,700000 "D" (33 MHz) bus on "second" Schizo
/pci@9,600000 "C" (66 MHz) bus on "second" Schizo
Solaris walks through the device tree and assigns controller IDs in the
order it finds them; on the V480, it finds the buses in the order listed
above. PCI probe list (device build order):
/pci@8,700000/@2 (bus B, PCI slot 2, 33 MHz)
/pci@8,700000/@3 (bus B, PCI slot 3, 33 MHz)
/pci@8,700000/@4 (bus B, PCI slot 4, 33 MHz)
/pci@8,700000/@5 (bus B, PCI slot 5, 33 MHz)
/pci@8,700000/ide@6 (bus B, onboard IDE, DVD-ROM)
/pci@8,600000/@1 (bus A, PCI slot 0, 66 MHz)
/pci@8,600000/@2 (bus A, PCI slot 1, 66 MHz)
/pci@9,700000/ebus@1 (bus D, serial, pmc, rsc, etc.)
/pci@9,700000/usb@1,3 (bus D, USB ports)
/pci@9,700000/network@2 (bus D, ce0, net0, onboard 10/100/1000 Ethernet
interface)
/pci@9,600000/network@1 (bus C, ce1, net1, onboard 10/100/1000 Cassini
interface)
/pci@9,600000/SUNW,qlc@2 (bus C, onboard FC-AL, ISP2200)
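The effect of the probe order on "ce" numbering can be illustrated with a
short Python sketch. It is an illustration only, not a Sun tool: the helper
name assign_ce_instances is hypothetical, each Cassini-based card is assumed
to contribute a single ce instance, and the system is assumed to have no
previously assigned instance numbers. Only the nodes that can hold a ce
device are listed, in the probe order given above.

```python
# (device path, PCI slot number) in probe order; slot is None for onboard ports.
PROBE_ORDER = [
    ("/pci@8,700000/@2", 2),            # bus B, 33 MHz
    ("/pci@8,700000/@3", 3),            # bus B, 33 MHz
    ("/pci@8,700000/@4", 4),            # bus B, 33 MHz
    ("/pci@8,700000/@5", 5),            # bus B, 33 MHz
    ("/pci@8,600000/@1", 0),            # bus A, 66 MHz
    ("/pci@8,600000/@2", 1),            # bus A, 66 MHz
    ("/pci@9,700000/network@2", None),  # bus D, onboard port ("ce0" by default)
    ("/pci@9,600000/network@1", None),  # bus C, onboard port ("ce1" by default)
]


def assign_ce_instances(cassini_slots=()):
    """Return {device path: "ceN"} for the onboard ports and any Cassini cards.

    cassini_slots: PCI slot numbers (0-5) that hold a Cassini-based card.
    Assumption: one ce instance per card, numbered in probe order.
    """
    occupied = set(cassini_slots)
    names = {}
    for path, slot in PROBE_ORDER:
        # Onboard ports always get an instance; slots only when a card is present.
        if slot is None or slot in occupied:
            names[path] = "ce%d" % len(names)
    return names


if __name__ == "__main__":
    for path, name in assign_ce_instances([2]).items():
        print(name, path)
```

Calling assign_ce_instances() with no cards leaves the onboard ports at their
default names, while passing a slot number shifts both onboard ports up by
one.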
So if a Cassini-based card is in slot 2, for instance, it would take the
controller number "ce0". The onboard ports would then become "ce1" and
"ce2" respectively.
Additional Description
----------------------
The error condition that causes this issue is configuration specific. The
root cause analysis indicates that the "ce1" traffic on PCI bus C is the
main contributor to the failure mode and that any card in the 66MHz slots,
in conjunction with "ce1" activity, could trigger the unexpected system
fatal reset.
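The condition described above can be written as a simple check. The Python
sketch below is illustrative only; the function name and its parameters are
hypothetical, and it tests only whether a system is in the affected
configuration, not whether this issue is the actual cause of a reset.

```python
FAST_SLOTS = {0, 1}  # the two 66 MHz PCI slots on the V480


def is_affected_configuration(onboard_cassini_configured, occupied_slots):
    """True when the onboard Cassini port (normally "ce1") is configured
    and at least one PCI card sits in a 66 MHz slot (slot 0 or 1)."""
    fast_slot_in_use = bool(FAST_SLOTS & set(occupied_slots))
    return bool(onboard_cassini_configured) and fast_slot_in_use


if __name__ == "__main__":
    print(is_affected_configuration(True, [1, 3]))   # card in 66 MHz slot 1
    print(is_affected_configuration(True, [2, 3]))   # cards in 33 MHz slots only
```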
When this issue occurs, the system will FATAL RESET. After the reset, the
system reboots and the message "FATAL RESET from DAR/DCS/CDX" will appear
on the system console, along with further failure message output. No core
files are generated and the fatal reset output does not get logged to the
/var/adm/messages file.
Depending on the customer environment, in some rare cases the failure mode
may be different and the system may experience a system hard hang with no
error messages (including console logs).
IMPORTANT NOTE:
If a system experiences a FATAL RESET with this signature, i.e. "ERROR:
System "FATAL RESET" from DAR/DCS/CDX", it should NOT be assumed that the
cause is the one described in this FCO. The error above is somewhat generic
and can have other causes. Further troubleshooting MUST be performed to
determine whether the system in question is experiencing this specific
problem.
Corrective action was made available in manufacturing on January 16, 2004
when the 501-6780-01 Centerplane was phased-in per ECO# WO_28245, to replace
both the 501-5819 and 501-6733 Centerplanes. Corrective action was made
available in Sun Services via LEAP# 2472 on December 17, 2003.
Implementation:
---
| | MANDATORY (Fully Pro-Active)
---
---
| | CONTROLLED PRO-ACTIVE (per Sun Geo Plan)
---
---
| X | UPON FAILURE
---
Replacement Time Estimate:
2.0 hours
Special Considerations:
There are two recommended ways of resolving this issue.
1. Permanent fix is to replace the Centerplane Board with 501-6780-01
(or later) as described in the Corrective Action section below.
2. Workarounds - reconfigure the system in the following manner:
a) Use only the "ce0" onboard network interface. The "ce1" onboard
interface should be left unconfigured, or disabled via ifconfig(1M),
and, if necessary, a second network interface card can be installed.
When the system is not using "ce1", all 6 PCI slots can be used
to install additional PCI cards, according to the customer's needs.
OR
b) When the system is using "ce1", all PCI cards should be moved from
66 MHz PCI slots (slot 0,1) to 33 MHz slots (PCI slots 2,3,4,5) if
this is acceptable to the customer.
The first step is to change the configuration of the system as described
above. If the system continues to fail, then this FCO is not applicable,
and further troubleshooting should commence to determine the true cause of
the failures.
If the system stops failing after changing from the affected configuration,
put the system back into the affected configuration to reproduce and
verify the failure mode. If the system fails again, it can be assumed
that this issue has been experienced and the centerplane should be replaced.
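The verification steps above amount to a small decision procedure, sketched
here in Python. The function name, its parameters, and the returned strings
are hypothetical. The final case (no failure after restoring the affected
configuration) is not covered by this FCO, so the result shown for it is an
assumption.

```python
def triage(fails_after_reconfig, fails_after_restore=None):
    """Suggest the next step from the observed behaviour.

    fails_after_reconfig: True if the system still fails once a workaround
        configuration is in place.
    fails_after_restore: True/False once the affected configuration has been
        restored and retested; None if that retest has not been done yet.
    """
    if fails_after_reconfig:
        return "FCO not applicable; continue troubleshooting"
    if fails_after_restore is None:
        return "restore affected configuration and retest"
    if fails_after_restore:
        return "replace centerplane"
    # Assumption: the FCO does not say what to do in this case.
    return "issue not confirmed"


if __name__ == "__main__":
    print(triage(fails_after_reconfig=False, fails_after_restore=True))
```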
Corrective Action:
If, after following the instructions in the Special Considerations section
above, it is determined that the issue described in this FCO is occurring,
replace the centerplane as follows:
- replace 501-5819-xx with 501-6780-01 (or later)
or 501-6790-01 (or later)
- replace 501-6733-xx with 501-6780-01 (or later)
or 501-6790-01 (or later)
Comments:
None
Billing Type:
Warranty: Sun will provide parts at no charge under Warranty
Service. On-Site Labor Rates are based on how the
system was initially installed.
Contract: Sun will provide parts at no charge. On-Site Labor Rates
are based on the type of service contract.
Non Contract: Sun will provide parts at no charge. Installation by
Sun is available based on the On-Site Labor Rates
defined in the Price List.
--------------------------------------------------------------------------
Implementation Footnote:
________________________
i) In case of Mandatory FCOs, Sun Services will attempt to contact
all known customers to recommend the part upgrade.
ii) For controlled proactive swap FCOs, Sun Services mission critical
support teams will initiate proactive swap efforts for their respective
accounts, as required.
iii) For Replace upon Failure FCOs, Sun Services partners will implement
the necessary corrective actions as and when they are required.
--------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network
browser as follows:
SunSolve Internal Access:
_______________________
* Access the SunSolve Online URL at http://sunsolve.Central/
* From there, select the appropriate link to browse the FIN or FCO index.
Internet Access:
_______________
* Access the top level URL of https://spe.sun.com
FIN/FCO Homepage Access:
_________________________
* Access the top level URL of http://sdpsweb.Central/FIN_FCO/index.html
* From there, select the appropriate link to query or browse the FIN and
FCO Homepage collections.
To submit either a FIN or an FCO, refer to the following URLs for templates
and instructions:
* For FCO: http://pronto.central/fco.html
* For FIN: http://pronto.central/fin.html
--------------------------------------------------------------------------
General:
________
Send questions or comments to finfco-manager@Sun.COM
---------------------------------------------------------------------------