Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-71-1008390.1
Update Date:2011-03-09
Keywords:

Solution Type  Technical Instruction Sure

Solution  1008390.1 :   How to Verify whether a System Reboot is Caused by a Fatal Reset or a Red State Exception  


Related Items
  • Sun Fire V240 Server
  • Sun Fire V440 Server
  • Sun Fire V480 Server
  • Sun Fire V880z Visualization Server
  • Sun Fire V445 Server
  • Sun Fire V890 Server
  • Sun Fire V210 Server
  • Sun Fire V880 Server
  • Sun Fire V490 Server
Related Categories
  • GCS>Sun Microsystems>Servers>Entry-Level Servers

PreviouslyPublishedAs
211473


Applies to:

Sun Fire V880 Server
Sun Fire V890 Server
Sun Fire V210 Server
Sun Fire V240 Server
Sun Fire V440 Server
All Platforms

To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Community, Oracle Entrylevel Servers.

Goal

This document helps determine whether an unexpected or unexplained system reboot was caused by a Fatal Reset error or a Red State Exception (RSE) condition.

Please note that the purpose of this document is to help you identify the root cause. If the symptoms described in this document match what your system is experiencing, you will need to contact qualified engineers at My Oracle Support (MOS). Please reference this document ID number when you contact MOS for assistance.


Solution

Steps to Follow
Unexpected reboots are most often caused by hardware faults, which the system reports as a fatal reset or a red state exception.

When errors like these occur, the OS is abruptly interrupted and cannot log error messages to /var/adm/messages or generate a core file. The system reboots, but the error messages and all related output appear only on the system console (and therefore in the console logs). For further troubleshooting, it is very important to gather the complete console logs from the time of the error (reboot).

1. The system reboot could be due to fatal reset errors. Fatal errors are most often caused by hardware (bad CPU, motherboard switches, I/O bridge, etc.) and result from an 'illegal' hardware state detected by the system. The Fatal Reset error and all related output are logged only to the system console (ttya or RSC). Here are examples of fatal errors caused by a CPU and by the motherboard switch ASICs (the full fatal reset output is too long to include):

ERROR: System Hardware FATAL RESET from CPU0
System State (CPU3 reporting)
ERROR: System "FATAL RESET" from DAR/DCS/CDX
System State (CPU2 reporting)

For systems using the ALOM serial console, the fatal error is reported as:

Fatal Error Reset
SC Alert: Host System has Reset
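
As a rough illustration of checking a saved console capture for the signatures quoted above, the following sketch searches a log file for them. The function name find_fatal_reset and the file name console.log are hypothetical, and the sketch assumes a POSIX grep with the -E option (on older Solaris releases the XPG4 grep may be needed). It only matches the example strings from this document; real fatal reset output may contain other text.

```shell
#!/bin/sh
# find_fatal_reset FILE
# Print (with line numbers) the lines of a saved console log that match
# the fatal reset signatures quoted in this document.
# FILE is a hypothetical saved console capture, e.g. console.log.
find_fatal_reset() {
    grep -n -E 'FATAL RESET|Fatal Error Reset|SC Alert: Host System has Reset' "$1"
}

# Example use (assumes console.log exists):
#   find_fatal_reset console.log || echo "no fatal reset signature found"
```

The function returns a non-zero status when nothing matches, so it can be combined with || as in the commented example.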

When your system reboots after a fatal error, you may also see only a notice in the /var/adm/messages file, like this one:

[ID 796976 kern.notice] System booting after fatal error FATAL Sys Hardware
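
A minimal sketch of searching for this notice is shown here. The function name find_fatal_boot_notice is hypothetical; the search string is taken from the example line above, and the default path is the messages file named in this document.

```shell
#!/bin/sh
# find_fatal_boot_notice [FILE]
# Print the "System booting after fatal error" notices found in FILE.
# FILE defaults to /var/adm/messages; pass another path to check a
# rotated or copied messages file.
find_fatal_boot_notice() {
    grep 'System booting after fatal error' "${1:-/var/adm/messages}"
}

# Example use:
#   find_fatal_boot_notice
#   find_fatal_boot_notice /var/adm/messages.0
```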

Also, prtconf -vp may show the FATAL Sys Hardware message under "reset-reason:":

# prtconf -vp
System Configuration: Sun Microsystems sun4u
Memory size: 8192 Megabytes
System Peripherals (PROM Nodes):
.....................
banner-name: 'Sun Fire 880'
watchdog-enable:
reset-reason: 'FATAL Sys Hardware' <<<<<<<
model: 'SUNW,501-6323'
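
Because prtconf -vp output is long, it can help to pull out just the reset-reason property. The sketch below does that with sed. The function name reset_reason is hypothetical, and the sketch assumes the property appears as reset-reason: followed by a single-quoted value, as in the example output above.

```shell
#!/bin/sh
# reset_reason
# Read `prtconf -vp` output on standard input and print the value of the
# first reset-reason property, without the surrounding quotes.
# Prints nothing if no reset-reason line is present.
reset_reason() {
    sed -n "s/^[[:space:]]*reset-reason:[[:space:]]*'\([^']*\)'.*/\1/p" | head -n 1
}

# Example use on the affected system:
#   prtconf -vp | reset_reason
```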

If the console logs show fatal errors and your system is experiencing them, please contact a qualified engineer at My Oracle Support (MOS) for assistance.

1.a) For the UltraSPARC III/IV platforms (280R, V480/V880, V490/V890) and UltraSPARC IIIi platforms (V210/V240, V440) a trained MOS Engineer has access to important information along with an AFAR decoder tool and will carefully guide you through the steps to resolution.

My Oracle Support can also assist you if you are experiencing V480 Fatal Resets with specific network and I/O configurations.

2. The unexpected reboot could also be due to Red State Exception (RSE) errors. Verify whether the console output contains any RSE errors. An RSE can be triggered by software or hardware, but this condition is most commonly due to a hardware fault (bad DIMM or bad CPU/L2SRAM). The RSE error and all related output are logged only to the system console (ttya or RSC) and are usually reported by one of the CPUs:

ERROR: CPU3 RED State Exception
System State (CPU3 reporting)
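
To see which CPUs reported an RSE in a saved console capture, a sketch along these lines may be useful. The function name rse_cpus and the file name console.log are hypothetical, and the pattern is based only on the example line above, so other RSE message formats would not be matched.

```shell
#!/bin/sh
# rse_cpus FILE
# Print the unique CPU names that appear in
# "ERROR: CPUn RED State Exception" lines of a saved console log.
# FILE is a hypothetical saved console capture, e.g. console.log.
rse_cpus() {
    sed -n 's/^.*ERROR: \(CPU[0-9][0-9]*\) RED State Exception.*$/\1/p' "$1" | sort -u
}

# Example use (assumes console.log exists):
#   rse_cpus console.log
```

The list of CPUs is only a starting point for the discussion with MOS; it does not by itself identify the failing component.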

If your system reboots after an RSE, you may also see only a notice in the /var/adm/messages file, like this one:

[ID 993603 kern.notice] System booting after RED CPU RED-State

Also, prtconf -vp may show the RED CPU RED-State message under "reset-reason:":

# prtconf -vp
System Configuration: Sun Microsystems sun4u
Memory size: 32768 Megabytes
System Peripherals (PROM Nodes):
banner-name: 'Sun Fire 880'
watchdog-enable:
reset-reason: 'RED CPU RED-State' <--- reset-reason
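
The two reset-reason values shown in this document can be told apart with a simple case statement, sketched below. The function name classify_reset_reason and the labels it prints are hypothetical. Only the two values quoted in this document are recognized; any other value is reported as "other" because this document does not describe it.

```shell
#!/bin/sh
# classify_reset_reason VALUE
# Map a reset-reason value to one of three labels.
# Only the two values quoted in this document are recognized.
classify_reset_reason() {
    case "$1" in
        *"FATAL Sys Hardware"*) echo "fatal-reset" ;;
        *"RED CPU RED-State"*)  echo "red-state-exception" ;;
        *)                      echo "other" ;;
    esac
}

# Example use:
#   classify_reset_reason 'RED CPU RED-State'
```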

If the console logs show RSE errors, this is again a critical issue that requires a qualified MOS Support Engineer, so please contact MOS for assistance.

2.a) For the UltraSPARC III/IV platforms (280R, V480/V880, V490/V890) and UltraSPARC IIIi platforms (V210/V240, V440) please contact MOS for assistance.



Product
Sun Fire V890 Server
Sun Fire V880z Visualization Server
Sun Fire V880 Server
Sun Fire V490 Server
Sun Fire V480 Server
Sun Fire V445 Server
Sun Fire V440 Server
Sun Fire V240 Server
Sun Fire V210 Server

Internal Comments
Audited/updated 11/17/09 - Ian.Macdonald@Sun.COM, Entry Level SPARC Content Team Member


Internal Comments:

This document contains normalized content and is managed by the Domain Lead(s) of the respective domains. To notify content owners of a knowledge gap contained in this document, and/or prior to updating this document, please contact the domain engineers that are managing this document via the "Document Feedback" alias(es) listed below:

Normalization Lead: Jim Robbins
Domain Engineer/Lead: Josh Freeman
VSP-SPARC-Normalization@sun.com
REFERENCES:

In case the console logs have fatal errors, reference the following docs:

1.a) For the UltraSPARC III/IV platforms (280R, V480/V880, V490/V890) refer to:
Troubleshooting <Document: 1006524.1> : Sun Fire V880 FATAL Resets.
<Document: 1003588.1> : V480 Fatal Resets with specific network and I/O configurations.

Note: The procedures in <Document: 1006524.1> apply to all V4x0/V8x0 platforms, since they use the same CPU/memory board.

1.b) For the UltraSPARC IIIi platforms (V210/V240, V440) you may use the US3i AFAR decoder tool in conjunction with <Document: 1004903.1> : Event Messages for UltraSPARC-III[R], UltraSPARC-III+[R], UltraSPARC-IIIi[R], UltraSPARC-IV[R] and UltraSPARC-IV+[R] CPU Modules.

In case the console logs have RSE errors, reference the following docs:

2.a) For the UltraSPARC III/IV platforms (280R, V480/V880, V490/V890) refer to:
<Document: 1006530.1> : Troubleshooting Sun Fire V880 RED STATE EXCEPTION.
<Document: 1012214.1> : Troubleshooting Red State Exception Memory Errors.

2.b) For the UltraSPARC IIIi platforms (V210/V240, V440) you may use the US3i AFAR decoder tool in conjunction with <Document: 1004903.1> : Event Messages for UltraSPARC-III[R], UltraSPARC-III+[R], UltraSPARC-IIIi[R], UltraSPARC-IV[R] and UltraSPARC-IV+[R] CPU Modules.

More Reference Material:

Internal Tool: Fatal Reset Decoder

Internal Tool: RED State Exception Decoder


Internal Tool: US3iAFAR Decoder


Sun Alert Document: 1000380.1 Sun Systems Equipped ASICs Version 2.3 or Higher May Experience Either Domain Stop (Dstop), Domain Pause or FATAL RESET Under Heavy I/O

FCO AO226-1 V480 Fatal Resets with specific network and I/O configurations

Sun Alert Document: 1000884.1 Sun Fire V440 and Netra 440 Systems Using a Specific Networking Configuration may Unexpectedly Reset

Troubleshooting <Document: 1012214.1> Troubleshooting Red State Exception Memory Errors
Troubleshooting <Document: 1006524.1> Sun Fire V880 FATAL Resets
Troubleshooting <Document: 1006530.1> Troubleshooting Sun Fire V880 RED STATE EXCEPTION
<Document: 1004903.1> : Event Messages for UltraSPARC-III[R], UltraSPARC-III+[R], UltraSPARC-IIIi[R], UltraSPARC-IV[R] and UltraSPARC-IV+[R] CPU Modules.

normalized, unexplained reboot, console logs, red state, fatal reset, Problem Solved = Identify Fatal Reset or Red State
Previously Published As
91380

Change History
Date: 2008-01-08
User Name: 7058
Action: Update Started
Comment: Updating doc per Jim Koontz and Dencho's approval to make it more suitable for customer viewing.
Version: 0


Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.