Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-71-1008335.1
Update Date:2011-05-27
Keywords:

Solution Type  Technical Instruction Sure

Solution  1008335.1 :   Sun[TM] X64/X86 Guide to System Troubleshooting  


Related Items
  • Sun Fire X4140 Server
  • Sun Ultra 27 Workstation
  • Sun Fire V20z Compute Grid Rack System
  • Sun Ultra 24 Workstation
  • Sun Java Workstation W2100z
  • Sun Fire X4440 Server
  • Sun Ultra 20 Workstation
  • Sun Fire X4150 Server
  • Sun Fire X2200 M2 Server
  • Sun Ultra 20 M2 Workstation
  • Sun Netra CT900 Server
  • Sun Fire X4240 Server
  • Sun Fire X4450 Server
  • Sun Fire V20z Server
  • Sun Fire X2250 Server
  • Sun Netra X4200 M2 Server
  • Sun Fire V40z Server
  • Sun Java Workstation W1100z
  • Sun Fire X4540 Server
  • Sun Fire X4250 Server
  • Sun Netra X4250 Server
  • Sun Ultra 40 Workstation
  • Sun Fire X2100 Server
  • Sun Ultra 40 M2 Workstation
  • Sun Fire X2100 M2 Server
Related Categories
  • GCS>Sun Microsystems>Servers>x64 Servers

PreviouslyPublishedAs
211405


Applies to:

Sun Fire X4240 Server
Sun Fire X4250 Server
Sun Ultra 40 M2 Workstation
Sun Ultra 27 Workstation
Sun Netra X4250 Server
All Platforms

Goal

Description

This document provides a high-level guide to troubleshooting documents for Oracle's Sun x64/x86 product line.

To discuss this information further with Oracle experts and industry peers, review, join, or start a discussion in the My Oracle Support Community - Sun x86 Systems.


Sun System Handbook   Docs   Downloads   Service Processor   Oracle Page

Workstations:
Sun Ultra 20          Docs   Download    None
Sun Ultra 20 M2       Docs   Download    None
Sun Ultra 24          Docs   Download    None
Sun Ultra 27          Docs   Downloads   None
Sun Ultra 40          Docs   Download    None
Sun Ultra 40 M2       Docs   Download    None
Sun Java W1100z       Docs   Download    None
Sun Java W2100z       Docs   Download    None

Servers:
Sun Fire X2100        Docs   Download    SMDC (option)       Sun x86 Systems
Sun Fire X2100 M2     Docs   Download    ELOM
Sun Fire X2200 M2     Docs   Download    ELOM
Sun Fire X2250        Docs   Download    ILOM
Sun Fire X4100        Docs   Download    ILOM
Sun Fire X4100 M2     Docs   Download    ILOM
Sun Fire X4140        Docs   Download    ILOM
Sun Fire X4150        Docs   Download    ILOM
Sun Fire X4200        Docs   Download    ILOM
Sun Fire X4200 M2     Docs   Download    ILOM
Sun Fire X4240        Docs   Download    ILOM
Sun Fire X4250        Docs   Download    ILOM
Sun Fire X4440        Docs   Download    ILOM
Sun Fire X4450        Docs   Download    ILOM
Sun Fire X4500        Docs   Download    ILOM
Sun Fire X4540        Docs   Download    ILOM
Sun Fire X4600        Docs   Download    ILOM
Sun Fire X4600 M2     Docs   Download    ILOM
Sun Fire V20z         Docs   Download    SP
Sun Fire V40z         Docs   Download    SP

Blade Servers:
Sun Blade 1600        Docs   Patches     Switch SC (SSC)     Sun Blade Servers
Sun Blade 6000        Docs   Download    ILOM
Sun Blade 8000        Docs   Download    ILOM

Netra Blades And Servers:
Netra X4200 M2        Docs   Download    ILOM                Sun Netra Carrier-Grade Servers
Netra X4250           Docs   Download    ILOM
Netra X4450           Docs   Download    ILOM
Netra CT900           Docs   Download    ShMM

The product links above contain general information about each specific product. The Sun System Handbook links above contain system specifications, parts lists, documentation, and the list of minimum supported operating systems. System firmware, drivers, and BIOS can be downloaded via the Download link.

Solution

Steps to Follow

Kernel Analysis

A system becomes unresponsive for one of three reasons:
  • Fatal reset (hardware detected)
  • Operating system panic (software detected)
  • Operating system/application hang (not detected)

The following document provides some information about the necessary data to gather:
DocID: 1010911.1 What to send to Sun[TM] after a system panic and/or unexpected reboot

Fatal Reset
Fatal resets are hardware-detected problems. They occur when the central processing unit (CPU) takes a trap that immediately drops the system to the BIOS.
One cause is a watchdog reset, which occurs when the operating system fails to access the watchdog circuitry within its timeout period.
A watchdog reset is really an operating system hang detected by the watchdog timer, so see the Hangs section below for diagnostic techniques.

Other causes of fatal resets are hardware failures, such as loss of input voltage, or other major hardware-related issues. No core file is saved, and the messages file shows normal operation followed by an abrupt system restart (no shutdown messages).

The most important diagnostic information to retrieve is the following, most of which is obtained through the service processor (SP):

  • Console output. This typically contains the reason for the reset, for example "sync flood" (or nothing in the case of total power loss).
  • SP events. These may include sensor-related events, such as an undervoltage condition on one rail, or OEM-specific events such as 0x12's.
  • SP sensor data. This shows whether a sensor reports a persistent problem, such as a voltage regulator or fan failure.
  • SP field replaceable unit (FRU) data. This describes the hardware inventory and configuration to assist with hardware replacement. Collect it to determine whether the system is properly configured (e.g. no partially installed memory bank). The system board page in the Sun System Handbook is a good reference to check against.
  • Explorer or another operating system data collector that captures the messages files and other data.

If the cause of the reset cannot be quickly determined, it is important to run hardware diagnostics, such as a full power-on self-test (POST), the bundled PCcheck, or SunVTS, to determine whether the hardware is stable.

PCcheck and other diagnostic tools can typically be downloaded via the Download link above, or may already be available as part of the BIOS boot menu.

OS Panic
OS panics are software-detected problems. They occur when the operating system detects that the integrity of data is suspect or in danger of being corrupted. If properly configured, the panic routine creates a core dump and writes panic strings to the messages file to assist in fault isolation.

Panics can be caused either by operating system coding errors, which are typically fixed by patches, or by hardware-related problems such as memory uncorrectable errors (UEs).

If the panic is software related, collect the core dump and pass it to Sun's kernel group for analysis.
If it is hardware related, collect the following data so that the problem can be isolated:

  • Explorer or another operating system data collector that captures the messages files and other data. This typically contains the panic messages and a stack trace related to the panic.
  • SP events. These may include sensor-related events, such as an undervoltage condition on one rail, or OEM-specific events such as 0x12's.
  • SP sensor data. This shows whether a sensor reports a persistent problem, such as a voltage regulator or fan failure.
  • SP FRU data. This describes the hardware inventory and configuration to assist with hardware replacement. Collect it to determine whether the system is properly configured (e.g. no partially installed memory bank). The system board page in the Sun System Handbook is a good reference to check against.

If the cause of the panic cannot be quickly determined, it is important to run hardware diagnostics, such as a full power-on self-test (POST), the bundled PCcheck, or SunVTS, to determine whether the hardware is stable. PCcheck and other diagnostic tools can typically be downloaded via the Download link above, or may already be available as part of the BIOS boot menu.

Hangs
A hang is a condition in which some applications may operate properly while others appear dead, yet neither the hardware nor the operating system detects a problem. Hangs are caused by resource deadlocks due to operating system race conditions, or by resource starvation due to one or more applications consuming too many resources. Console messages sometimes indicate the source of the hang, but typically a core dump should be forced so that Sun's kernel group can analyze the data. There is a small possibility that a hang is caused by hardware, but please contact the kernel group first for isolation.

DocID: 1012991.1 How to check if your x64 platform "system hang" actually is a system hang.

This document can be referenced to assist with possible hang situations.

The following operating system diagnostic section should be read to determine how to configure and force core dumps, but forcing a core dump from a hung system is not always possible.

OS Troubleshooting
Sun x86/x64 systems typically support the Solaris[TM], Red Hat Enterprise Linux, SuSE Linux Enterprise Server, and Windows operating systems.
Check the Sun System Handbook to ensure that the operating system in question is supported on that platform.
A good overall operating system document to review is:

DocID: 1019144.1 Data Requirements reference: What data is needed in order to troubleshoot my software or hardware problem?

Solaris:
Six important Solaris documents that discuss procedures and configuration for Solaris panics and hangs are as follows:

DocID: 1012913.1 Troubleshooting Panics, dumps, hangs or crashes in the Solaris[TM] Operating System
DocID: 1001950.1 Troubleshooting Suspected Solaris Operating System Hangs
DocID: 1004506.1 How to force a crash when my machine is hung
DocID: 1001950.1 When to Force a Solaris[TM] System Core File
DocID: 1004530.1 KERNEL: How to enable deadman kernel code
DocID: 1003085.1 Solaris[TM] Operating System: Forcing a kernel core dump on an x86 or x64 system

Red Hat Linux:
Three important Red Hat documents that discuss procedures and configuration for Red Hat panics & hangs are as follows:

DocID: 1005528.1 How to configure Kdump on Red Hat Enterprise Linux 5 systems
DocID: 1006577.1 Red Hat Linux: Diskdump Pre-requisites, install and settings
DocID: 1007699.1 Crash Dump capturing for Red Hat Linux

SuSE Linux:
Two important SuSE documents that discuss procedures and configuration for SuSE panics & hangs are as follows:
DocID: 1108937.1 How to configure Kdump on SuSE Linux Enterprise System 10
DocID: 1010059.1 How to configure LKCD on SuSE Linux Enterprise Systems 8 and 9

Windows:
An important Windows document that discusses procedures and configuration for panics is:
DocID: 1007054.1 How to handle Microsoft Windows panics on x64 platforms

Additional documents that assist in Windows troubleshooting are:
DocID: 1011590.1 How to check for Windows platform disk errors and online/offline status
DocID: 1010936.1 Microsoft Windows and Linux operating systems: How to obtain troubleshooting information

Disk and Redundant Array of Independent/Inexpensive Disks (RAID) Troubleshooting
Disk and RAID problems are sometimes related to the disk/RAID controller firmware and boot configuration.

A good overall document on how to determine the firmware revision on systems running a supported operating system, and how to search for known issues, is:
DocID: 1008396.1 How to Identify Optical and Hard Disk Firmware Revisions for Checking of Known Issues

A good document on boot related issues is:
DocID: 1005506.1 How to verify your boot media exists and is bootable on a Sun Fire[TM] X4100/X4200/X4600 and M2 models Server

Once the version is known, the following document provides information on how to list, create, or delete RAID volumes:
DocID: 1005358.1 Hardware RAID usage on X64 based systems with the LSI SAS1064

The LSI RAID controller firmware requires 64 MB of unpartitioned disk space at the end of the disk for volume management. Back up all data before creating any RAID volume.
LSI RAID status can be obtained via the BIOS, as described in the following:

DocID: 1013107.1 How to Identify BIOS and Solaris[TM] Hardware RAID Status

Disks placed into a RAID volume should be of identical size to avoid problems.
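
The two sizing rules above (a 64 MB reservation at the end of each disk, and identical member sizes) come down to simple arithmetic. The sketch below is illustrative only: the function names and sample sizes are hypothetical, and sizes are assumed to be given in whole megabytes.

```shell
#!/bin/sh
# Illustrative only: function names and sample sizes are hypothetical.
# All sizes are whole megabytes.

LSI_RESERVED_MB=64   # space the LSI firmware needs at the end of each disk

# Print the space left for data on one disk after the reservation.
usable_mb() {
    echo $(( $1 - LSI_RESERVED_MB ))
}

# Succeed (exit status 0) only if every size argument is identical.
same_size() {
    first=$1
    for size in "$@"; do
        [ "$size" -eq "$first" ] || return 1
    done
    return 0
}

usable_mb 70007                                   # prints 69943
same_size 70007 70007 && echo "sizes match"
same_size 70007 140014 || echo "sizes differ"
```

A check like this is only a planning aid. The controller BIOS or the operating system RAID utility remains the authority on whether a volume can be created.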

RAID levels are:
RAID-0: Stripes 2 or more disks to form a larger virtual disk. There is no redundancy, so data is lost on disk failure, but performance is higher because a file is accessed across multiple disks.
RAID-1: Mirrors 2 or more disks to provide redundant copies of the data and prevent data loss on disk failure. Write performance decreases because each file update requires 2 or more writes, but read performance increases because a file can be read from multiple disks.
RAID-01: Mirror of striped disks. A disk failure takes its associated stripe offline.
RAID-10: Stripe of mirrored disks. It can tolerate the loss of two disks, depending on the configuration.
RAID-5: Stripes 3 or more disks with distributed parity, so data loss is prevented if one disk fails. Performance is medium because two writes are performed for each file update, but access is striped across multiple disks.
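
The capacity cost of each level follows directly from the descriptions above. The sketch below assumes identically sized disks, two-way mirroring for RAID-01 and RAID-10, and an even disk count for those two levels. The helper name raid_capacity is hypothetical and is not a Sun or LSI tool.

```shell
#!/bin/sh
# Illustrative sketch: raid_capacity is a hypothetical helper.
# Usage: raid_capacity LEVEL DISK_COUNT DISK_SIZE_GB
# Assumes all disks are the same size and mirrors are two-way.
raid_capacity() {
    level=$1 disks=$2 size=$3
    case "$level" in
        0)      echo $(( disks * size )) ;;        # stripe: all space holds data
        1)      echo "$size" ;;                    # mirror: one disk of data, the rest are copies
        01|10)  echo $(( disks / 2 * size )) ;;    # half the disks hold the second copy
        5)      echo $(( (disks - 1) * size )) ;;  # one disk of space goes to parity
        *)      echo "unknown RAID level: $level" >&2; return 1 ;;
    esac
}

raid_capacity 0 2 146     # prints 292
raid_capacity 1 2 146     # prints 146
raid_capacity 10 4 146    # prints 292
raid_capacity 5 3 146     # prints 292
```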

The Solaris raidctl command reports RAID status and creates and deletes RAID volumes, as described in the following:

DocID: 1013107.1 How to Identify BIOS and Solaris[TM] Hardware RAID Status

Solaris commands that are helpful in disk troubleshooting are as follows:

# /usr/sbin/mount | grep "/ on"

/ on /dev/dsk/c1t0d0s0 read/write/setuid/devices/logging/xattr/onerror=panic/dev=f40040 on Thu Dec 6 11:49:54 2007

# iostat -E

sd0 Soft Errors: 1 Hard Errors: 2 Transport Errors: 0
Vendor: AMI Product: Virtual CDROM Revision: 1.00 Serial No: Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 2 Recoverable: 0
Illegal Request: 1 Predictive Failure Analysis: 0
sd1 Soft Errors: 2 Hard Errors: 0 Transport Errors: 0
Vendor: AMI Product: Virtual Floppy Revision: 1.00 Serial No: Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 2 Predictive Failure Analysis: 0

# iostat -xe

              extended device statistics                  ---- errors ---
device  r/s  w/s  kr/s  kw/s  wait  actv  svc_t  %w  %b  s/w  h/w  trn  tot
sd0     0.0  0.0   0.0   0.0   0.0   0.0    0.0   0   0    1    2    0    3
sd1     0.0  0.0   0.0   0.0   0.0   0.0    0.0   0   0    2    0    0    2
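
When many devices are listed, it can help to filter the iostat -xe output down to the devices that report hard (h/w) or transport (trn) errors. The sketch below is one possible way to do that with awk. The helper name report_disk_errors is hypothetical, and the column positions are assumed to match the layout shown above (the last four columns being s/w, h/w, trn, tot).

```shell
#!/bin/sh
# Illustrative sketch: report_disk_errors is a hypothetical helper.
# It reads `iostat -xe` output on standard input and prints each device
# whose hard (h/w) or transport (trn) error count is non-zero.
report_disk_errors() {
    awk '
        $1 == "device" { in_table = 1; next }   # column header row: data rows follow
        in_table && NF >= 14 {
            hard = $(NF - 2); trn = $(NF - 1)
            if (hard + 0 > 0 || trn + 0 > 0)
                printf "%s hard=%s transport=%s\n", $1, hard, trn
        }
    '
}

# Sample taken from the output shown above. On a live Solaris system:
#   iostat -xe | report_disk_errors
report_disk_errors <<'EOF'
extended device statistics ---- errors ---
device r/s w/s kr/s kw/s wait actv svc_t %w %b s/w h/w trn tot
sd0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 1 2 0 3
sd1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 2 0 0 2
EOF
# prints: sd0 hard=2 transport=0
```

A non-zero count is a prompt to look closer with iostat -E, not proof of a failing disk. In the sample above, sd0 is a virtual CD-ROM device.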

Linux disk issues can be isolated using the following document, which describes how to determine whether a Linux disk is under hardware RAID control:
DocID: 1013003.1 How to Identify if a Linux Operating Environment is Installed on a Hardware RAID Controller

Software RAID is configured using mdadm, as discussed in:
DocID: 1011427.1 How to setup software RAID in Linux

Linux commands that are helpful in disk troubleshooting are as follows:

# /bin/mount | grep "on / "     (Display root mount point)

/dev/sda2 on / type ext3 (rw)

Windows disk status can be checked using information from the following:

DocID: 1011590.1 How to check for Windows platform disk errors and online/offline status

An example of a Windows RAID installation is obtained from:
DocID: 1009559.1 Installing Windows 2003 Server with RAID enabled on Sun Fire[TM] x2100

General Troubleshooting
For problems not covered by the preceding sections, collect the following information:
  • Obtain SP related data in all cases. This can be done via ipmitool (see below), or via the SP's GUI or command line interfaces (if functionality exists; see SP link above).
  • Ensure that the installed operating system is supported per the Sun System Handbook link above.
  • When possible, obtain operating system data collectors such as explorer or other output that records the state of the operating system and file system (including messages files).
  • PCcheck and other diagnostic tools can typically be downloaded via the Download link above, or may already be available as part of the BIOS boot menu.

IPMItool
IPMItool is a very useful tool that can gather information from the ILOM and other service processors (SPs).

Example commands for collecting data are as follows. Replace "ipaddress" with the address of the service processor, not that of the main platform:

ipmitool -H "ipaddress" -U root fru
ipmitool -H "ipaddress" -U root sel elist
ipmitool -H "ipaddress" -U root -v sdr
ipmitool -H "ipaddress" -U root sdr elist
ipmitool -H "ipaddress" -U root sdr list
ipmitool -H "ipaddress" -U root chassis status
ipmitool -H "ipaddress" -U root sunoem led get
ipmitool -H "ipaddress" -U root sensor
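
To save the output of all of the commands above in one pass, they can be wrapped in a small loop. The following is a minimal sketch, not a supported tool: the function name collect_sp_data, the IPMITOOL override variable, and the output file naming are all hypothetical. The ipmitool subcommands are exactly those listed above.

```shell
#!/bin/sh
# Illustrative sketch: collect_sp_data and the IPMITOOL override are
# hypothetical. The subcommands are the ones listed above.
# Usage: collect_sp_data SP_ADDRESS OUTPUT_DIR
collect_sp_data() {
    sp=$1 outdir=$2
    mkdir -p "$outdir" || return 1
    # One subcommand per line; the file name is derived from the subcommand.
    while read -r subcmd; do
        name=$(printf '%s\n' "$subcmd" | sed 's/^-//; s/ /_/g')
        # $subcmd is left unquoted on purpose so that it splits into arguments.
        # </dev/null keeps the tool from consuming the list being read.
        ${IPMITOOL:-ipmitool} -H "$sp" -U root $subcmd \
            > "$outdir/$name.txt" 2>&1 </dev/null
    done <<'EOF'
fru
sel elist
-v sdr
sdr elist
sdr list
chassis status
sunoem led get
sensor
EOF
}

# Example (assumed address and directory):
#   collect_sp_data 192.0.2.10 ./sp-data
```

As with the individual commands, ipmitool will normally prompt for the SP password on each invocation unless the password is supplied another way; see the ipmitool man page for the available options.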






X64, troubleshooting, x86
Previously Published As
88276

Change History
Date: 2009-12-01
User Name: Tony McNamara
Action: Currency check
Comment: Updated to add new products, re-defined descriptions of errors relating to x64 platforms, added new hang section, updated links and added new RAID section
Date: 2010-06-02
User: brian.jackson@oracle.com
Action: Implemented comment
Comment: Updated link to How to configure Kdump on SuSE Linux Enterprise System 10 (Doc ID 1108937.1)


Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.