Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Technical Instruction Sure Solution
1008335.1: Sun[TM] X64/X86 Guide to System Troubleshooting
Previously Published As: 211405
Applies to:
Sun Fire X4240 Server
Sun Fire X4250 Server
Sun Ultra 40 M2 Workstation
Sun Ultra 27 Workstation
Sun Netra X4250 Server
All Platforms

Goal

Description

This document provides a high-level guide to troubleshooting documents for Oracle's Sun x64/x86 product line. To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Community - Sun x86 Systems.

The product links above contain general information about the
specific product. The Sun system handbook links from above contain
system specifications, parts lists, documentation, and the list of
minimum supported operating systems. System firmware, drivers, and
BIOS can be downloaded via the Download link.

Solution

Steps to Follow

Kernel Analysis

A system becomes unresponsive for one of three reasons: a fatal reset, an OS panic, or a hang. Each is described below.
The following document describes the data to gather:
DocID: 1010911.1 What to send to Sun[TM] after a system panic and/or unexpected reboot

Fatal Reset

Fatal resets are hardware-detected problems, caused when the central processing unit (CPU) takes a trap that drops immediately to the BIOS. One cause is a watchdog reset, which occurs when the operating system fails to access the watchdog circuitry within its timeout period. A watchdog reset is really an operating system hang detected by the watchdog timer, so see the Hangs section below for diagnostic techniques. Other fatal resets are caused by hardware failures such as loss of input voltage or other major hardware problems. No core file is saved, and the messages file shows normal operation followed by an abrupt system restart (no shutdown messages). The most important diagnostic information to retrieve, most of which is obtained through the service processor (SP), is the following:
If the cause of the reset cannot be quickly determined, it is important to run hardware diagnostics such as a full power-on self test (POST), the bundled PCcheck, or SunVTS to determine whether the hardware is stable. PCcheck and other diagnostic tools can typically be downloaded via the Download link above, or may already be available from the BIOS boot menu.

OS Panic

OS panics are software-detected problems, caused when the operating system detects that the integrity of data is suspect or in danger of being corrupted. If properly configured, the panic routine creates a core dump and writes panic strings to the messages file to assist in fault isolation. Panics can be caused either by operating system coding errors, which are typically fixed by patches, or by hardware problems such as memory Uncorrectable Errors (UEs). If the panic is software related, collect the core dump and pass it to Sun's kernel group for analysis. If it is hardware related, collect the following data so the problem can be isolated:
If the cause of the reset cannot be quickly determined, it is important to run hardware diagnostics such as a full power-on self test (POST), the bundled PCcheck, or SunVTS to determine whether the hardware is stable. PCcheck and other diagnostic tools can typically be downloaded via the Download link above, or may already be available from the BIOS boot menu.

Hangs

A hang occurs when some applications may operate properly while others appear dead, yet neither the hardware nor the operating system detects a problem. Hangs are caused by resource deadlocks due to operating system race conditions, or by resource deprivation due to one or more applications that consume too many resources. Console messages sometimes indicate the source of the hang, but typically a core dump should be forced so that Sun's kernel group can analyze the data. There is a small possibility that a hang is caused by hardware, but please contact the kernel group first for isolation. The following document can be referenced to assist with possible hang situations:
DocID: 1012991.1 How to check if your x64 platform "system hang" actually is a system hang

Read the following operating system section to determine how to configure and force core dumps. Note that forcing a core dump from a hung system is not always possible.

OS Troubleshooting

Sun x86/x64 systems typically support Solaris[TM], Red Hat Enterprise Linux, SuSE Enterprise Linux and the Windows operating system. Please check the Sun Systems Handbook to ensure that the operating system in question is supported on that platform. A good overall operating system document to review is:
DocID: 1019144.1 Data Requirements reference: What data is needed in order to troubleshoot my software or hardware problem?

Solaris: Six important Solaris documents that discuss procedures and configuration for Solaris panics and hangs are as follows:
DocID: 1012913.1 Troubleshooting Panics, dumps, hangs or crashes in the Solaris[TM] Operating System
DocID: 1001950.1 Troubleshooting Suspected Solaris Operating System Hangs
DocID: 1004506.1 How to force a crash when my machine is hung
DocID: 1001950.1 When to Force a Solaris[TM] System Core File
DocID: 1004530.1 KERNEL: How to enable deadman kernel code
DocID: 1003085.1 Solaris[TM] Operating System: Forcing a kernel core dump on an x86 or x64 system

Red Hat Linux: Three important Red Hat documents that discuss procedures and configuration for Red Hat panics and hangs are as follows:
DocID: 1005528.1 How to configure Kdump on Red Hat Enterprise Linux 5 systems
DocID: 1006577.1 Red Hat Linux: Diskdump Pre-requisites, install and settings
DocID: 1007699.1 Crash Dump capturing for Red Hat Linux

SuSE Linux: Two important SuSE documents that discuss procedures and configuration for SuSE panics and hangs are as follows:
DocID: 1108937.1 How to configure Kdump on SuSE Linux Enterprise System 10
DocID: 1010059.1 How to configure LKCD on SuSE Linux Enterprise Systems 8 and 9

Windows: An important Windows document that discusses procedures and configuration for panics is:
DocID: 1007054.1 How to handle Microsoft Windows panics on x64 platforms
Additional documents that assist in Windows troubleshooting are:
DocID: 1011590.1 How to check for Windows platform disk errors and online/offline status
DocID: 1010936.1 Microsoft Windows and Linux operating systems: How to obtain troubleshooting information

Disk and Redundant Array of Independent/Inexpensive Disks (RAID) Troubleshooting

Disk and RAID problems are sometimes related to the disk/RAID controller firmware and boot configuration.
A good overall document on how to determine the firmware revision on systems with a supported operating system, and how to search for known issues, is:
DocID: 1008396.1 How to Identify Optical and Hard Disk Firmware Revisions for Checking of Known Issues

A good document on boot related issues is:
DocID: 1005506.1 How to verify your boot media exists and is bootable on a Sun Fire[TM] X4100/X4200/X4600 and M2 models Server

Once the version is known, the following document provides information on how to list, create, or delete RAID volumes:
DocID: 1005358.1 Hardware RAID usage on X64 based systems with the LSI SAS1064

The LSI RAID controller firmware requires 64MB of unpartitioned disk space at the end of the disk for volume management. Data should therefore be backed up prior to any RAID creation. LSI related RAID status can be obtained via the BIOS as shown in the following:
DocID: 1013107.1 How to Identify BIOS and Solaris[TM] Hardware RAID Status

Disks placed into a RAID volume should be of identical size to avoid problems. RAID levels are:
RAID-0: Stripe of 2 or more disks to form a larger virtual disk. There is no redundancy, so data is lost on failure, but performance is higher because a file is accessed across multiple disks.
RAID-1: Mirror of 2 or more disks to provide redundant data copies and prevent data loss on disk failure. Write performance decreases because each file update requires 2 or more writes, but read performance increases because a file can be read from multiple disks.
RAID-01: Mirror of striped disks, but a disk failure will offline its associated stripe.
RAID-10: Stripe of mirrored disks, which can tolerate the loss of two disks depending on configuration.
RAID-5: Stripe of 3 or more disks with distributed parity, so data loss is prevented if a disk fails. Medium performance is sustained since two writes are performed for each file update, but access is striped across multiple disks.
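
The capacity trade-off between the RAID levels listed above can be illustrated with a small shell function. This is a sketch only: the function name raid_usable_gb is hypothetical, identical disks are assumed, RAID-01 and RAID-10 are assumed to use two-way mirrors, and the 64MB reserved by the LSI firmware is ignored.

```shell
#!/bin/sh
# Hypothetical helper: print the usable capacity of a RAID volume in GB.
# Assumes identical disks; ignores the 64MB the LSI firmware reserves.
# usage: raid_usable_gb LEVEL DISK_COUNT DISK_SIZE_GB
raid_usable_gb() {
    level=$1; disks=$2; size=$3
    case "$level" in
        0)  # Stripe: all disks hold data, no redundancy.
            [ "$disks" -ge 2 ] || { echo "RAID-0 needs 2 or more disks" >&2; return 1; }
            echo $(( disks * size )) ;;
        1)  # Mirror: every disk holds a full copy of the data.
            [ "$disks" -ge 2 ] || { echo "RAID-1 needs 2 or more disks" >&2; return 1; }
            echo "$size" ;;
        01|10)
            # Two-way mirrors assumed: half of the disks hold copies.
            [ "$disks" -ge 4 ] && [ $(( disks % 2 )) -eq 0 ] || {
                echo "RAID-$level needs an even count of 4 or more disks" >&2; return 1; }
            echo $(( disks / 2 * size )) ;;
        5)  # Distributed parity consumes the equivalent of one disk.
            [ "$disks" -ge 3 ] || { echo "RAID-5 needs 3 or more disks" >&2; return 1; }
            echo $(( (disks - 1) * size )) ;;
        *)  echo "unsupported RAID level: $level" >&2; return 1 ;;
    esac
}
```

For example, raid_usable_gb 5 4 146 prints 438 (four 146GB disks in RAID-5), while raid_usable_gb 10 4 146 prints 292.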
The Solaris raidctl command reports RAID status and supports RAID volume creation and deletion, as described in the following:
DocID: 1013107.1 How to Identify BIOS and Solaris[TM] Hardware RAID Status

Solaris commands that are helpful in disk troubleshooting are as follows:

# /usr/sbin/mount | grep "/ on"
/ on /dev/dsk/c1t0d0s0 read/write/setuid/devices/logging/xattr/onerror=panic/dev=f40040 on Thu Dec 6 11:49:54 2007

# iostat -E
sd0 Soft Errors: 1 Hard Errors: 2 Transport Errors: 0
Vendor: AMI Product: Virtual CDROM Revision: 1.00 Serial No:
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 2 Recoverable: 0
Illegal Request: 1 Predictive Failure Analysis: 0
sd1 Soft Errors: 2 Hard Errors: 0 Transport Errors: 0
Vendor: AMI Product: Virtual Floppy Revision: 1.00 Serial No:
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 2 Predictive Failure Analysis: 0

# iostat -xe
                 extended device statistics         ---- errors ---
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b s/w h/w trn tot
sd0       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0   1   2   0   3
sd1       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0   2   0   0   2

Linux disk issues can be isolated using the following document, which indicates how to determine if a Linux disk is under RAID control:
DocID: 1013003.1 How to Identify if a Linux Operating Environment is Installed on a Hardware RAID Controller
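
On a system with many disks, the error columns of the Solaris iostat -xe output shown earlier in this section can be filtered so that only devices with a non-zero error total are listed. The following is a minimal sketch: the function name disk_errors is hypothetical, and it assumes the 14-column layout of the sample output above (other iostat versions or options may print different columns) and a POSIX-compatible awk.

```shell
#!/bin/sh
# Hypothetical helper: read `iostat -xe` style output on stdin and print
# one line per device whose total error count (last column) is non-zero.
# Assumes the 14-column layout: device ... s/w h/w trn tot.
disk_errors() {
    awk 'NF == 14 && $1 != "device" && $14 > 0 {
             printf "%s soft=%s hard=%s transport=%s total=%s\n", $1, $11, $12, $13, $14
         }'
}
```

A possible invocation is iostat -xe | disk_errors. Header lines are skipped because they either have a different field count or begin with the word "device".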
Software RAID is configured using mdadm as discussed in:
DocID: 1011427.1 How to setup software RAID in Linux

Linux commands that are helpful in disk troubleshooting are as follows:

# /bin/mount | grep "on / " (Display root mount point)
/dev/sda2 on / type ext3 (rw)

Windows disk status can be checked using information from the following:
DocID: 1011590.1 How to check for Windows platform disk errors and online/offline status

An example of a Windows RAID installation is obtained from:
DocID: 1009559.1 Installing Windows 2003 Server with RAID enabled on Sun Fire[TM] x2100

General Troubleshooting

For problems not covered by the prior two sections, collect the following information:
IPMItool is a very useful tool that can gather information from the ILOM and other Service Processors (SPs). Example commands for collecting data are as follows. Replace "ipaddress" with the address of the service processor, not the main platform:

ipmitool -H "ipaddress" -U root fru
ipmitool -H "ipaddress" -U root sel elist
ipmitool -H "ipaddress" -U root -v sdr
ipmitool -H "ipaddress" -U root sdr elist
ipmitool -H "ipaddress" -U root sdr list
ipmitool -H "ipaddress" -U root chassis status
ipmitool -H "ipaddress" -U root sunoem led get
ipmitool -H "ipaddress" -U root sensor

Keywords: X64, troubleshooting, x86
Previously Published As: 88276

Change History
Date: 2009-12-01
User Name: Tony McNamara
Action: Currency check
Comment: Updated to add new products, re-defined descriptions of errors relating to x64 platforms, added new hang section, updated links and added new RAID section
Date: 2010-06-02
User: brian.jackson@oracle.com
Action: Implemented comment
Comment: Updated link to How to configure Kdump on SuSE Linux Enterprise System 10 (Doc ID 1108937.1)

Attachments
This solution has no attachment