Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-75-1005476.1
Update Date:2010-08-30
Keywords:

Solution Type  Troubleshooting Sure

Solution  1005476.1 :   Troubleshooting Level 2 Check Errors (L2CheckError) on Sun Fire[TM] 3800/4800/4810/6800/E2900/E4900/E6900 & Netra[TM] 1280/1290  


Related Items
  • Sun Fire E6900 Server
  • Sun Fire 6800 Server
  • Sun Fire 3800 Server
  • Sun Fire E4900 Server
  • Sun Netra 1280 Server
  • Sun Fire 4800 Server
  • Sun Fire E2900 Server
  • Sun Fire V1280 Server
  • Sun Netra 1290 Server
  • Sun Fire 4810 Server
Related Categories
  • GCS>Sun Microsystems>Servers>Midrange Servers
  • GCS>Sun Microsystems>Servers>Entry-Level Servers
  • GCS>Sun Microsystems>Servers>NEBS-Certified Servers

PreviouslyPublishedAs
207600


Applies to:

Sun Fire E2900 Server
Sun Fire E4900 Server
Sun Fire E6900 Server
Sun Fire 6800 Server
Sun Netra 1290 Server
All Platforms

Purpose

Description

This document provides the steps to troubleshoot Level 2 Check Error events (L2CheckErrors) on Sun Fire[TM] Midrange servers.

Symptoms:
  • A system or one or more domains may be described as having gone down, rebooted unexpectedly, panicked, reset, or similar.

  • The word L2CheckError may be displayed in errors on the System Controller (SC) console or in showlogs output.

  • It may be reported that one domain was rebooted and another unexpectedly reset, panicked, or similar.

  • It may be reported that a System Board (SB), Repeater, I/O Board, CPU(s), Memory DIMMs, or similar component is labeled as faulty or suspect and may be missing or disabled.

  • It may be reported that the system or domain booted after the System Controller (SC) was failed over, rebooted, or reset.

System Type and Configuration:

  • Sun Fire[TM] 3800, 4800, 4810, 6800 Servers

  • Sun Fire[TM] E4900, E6900 Servers

  • Sun Fire[TM] V1280, E2900 Servers

  • Netra[TM] 1280, 1290 Servers

Notes: The system configuration includes at least System Controller Application (ScApp) 5.15.x. An L2CheckError event implicates a device called a Repeater (RP). An RP is a type of board on all systems except the Sun Fire[TM] 3800, where the RPs are located on the system's Backplane/Centerplane.

Assumption:
This document assumes the event encountered is not a repeat event. If this is a repeat event, collect data as outlined in Step 5 below and let the Sun Support Services engineer perform the analysis.



Sun Shared Shell


If you require assistance in collecting the data recommended in this article or require help in diagnosing a system issue, there is a collaborative service tool called Sun Shared Shell which allows Sun Service engineers to remotely view and diagnose customers' systems. Consider using this option to reduce the problem resolution time.

Last Review Date

May 19, 2010

Instructions for the Reader

A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.

Troubleshooting Details

Steps to Follow
Please validate that each troubleshooting step below is true for your environment.

The steps will provide instructions, or a link to a document, for validating the step and taking corrective action as necessary. The steps are ordered in the most appropriate sequence to isolate the issue and identify the proper resolution. Please do not skip a step.

1. Verify the event encountered is an L2CheckError.

  • Generally, the error message will include L2CheckError and/or events that indicate a mismatch against an expected value.

  • An example of what these messages might look like is:

SC-Name:SC> showlogs -d c
Dec 29 10:23:51 sc0 Domain-C.SC: [ID 272780 local5.error] ArAsic reported first error on /N0/SB1
Dec 29 10:23:51 sc0 Domain-C.SC: [ID 807371 local5.error]
/partition1/domain0/SB1/ar0:
    L2CheckError[0x6150] : 0x00608060
        AccIncSyncErr [24:21] : 0x3 accumulated incoming mismatch
                   FE [15:15] : 0x1
           INCSyncErr [08:05] : 0x3 Ports [9:6] incoming mismatched against internal expected incoming
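
The bracketed ranges in the sample above, such as [24:21], appear to be inclusive bit positions within the 32-bit register value. The following Python sketch works under that assumption (it agrees with the sample, but it is not taken from Sun documentation) and extracts each field from the raw value. The helper name extract_field is hypothetical.

```python
def extract_field(value, hi, lo):
    """Return bits hi..lo (inclusive, bit 0 = least significant) of value."""
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)

# Raw register value copied from the sample showlogs output.
reg = 0x00608060

# Field names and bit ranges copied from the same sample.
fields = {
    "AccIncSyncErr": (24, 21),
    "FE": (15, 15),
    "INCSyncErr": (8, 5),
}

decoded = {name: extract_field(reg, hi, lo) for name, (hi, lo) in fields.items()}
for name, val in decoded.items():
    print(f"{name}: {val:#x}")
```

For the sample register value, the decoded fields match the values printed by the SC (0x3, 0x1, and 0x3), which supports the assumed bit numbering.
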

    2. Verify this is not a "known" memory interleaving issue.

    • Reference Sun Alert <Document: 1000448.1 >

    Note: Sun Fire[TM] V1280/E2900 & Netra[TM] 1280/1290 are excluded from this step because they cannot have a multiple domain configuration.

    3. Verify this is not a "known" adjacent domain issue.

    • Reference Sun Alert <Document: 1000008.1 >

    Note: Sun Fire[TM] V1280/E2900 & Netra[TM] 1280/1290 are excluded from this step because they cannot have a multiple domain configuration.

    4. Verify this is not a "known" Dynamic Reconfiguration (DR)/cfgadm issue.

    • Reference Sun Alert <Document: 1001300.1 >

    5. Customers should contact Sun Support Services, mention this document ID, and verify extended Explorer data is available for analysis or be prepared to use Sun Shared Shell to continue diagnosis of the event.

    • Sun Shared Shell is a collaborative service tool which allows Sun Service engineers to remotely view and diagnose customers' systems. Consider using this option to reduce the problem resolution time.

    • If the shared shell option above is not available, the Sun Support Engineer will verify the previous steps have been performed and then perform analysis offline using Explorer data.

      • Reference <Document: 1002383.1 > Sun[TM] Explorer Data Collector

      • See the supportfiles.sun.com User Guide for information on how to upload the data to Sun's FTP site.



    Product
    Sun Netra 1290 Server
    Netra 1280 Server
    Sun Fire V1280 Server
    Sun Fire E6900 Server
    Sun Fire E4900 Server
    Sun Fire E2900 Server
    Sun Fire 6800 Server
    Sun Fire 4810 Server
    Sun Fire 4800 Server
    Sun Fire 3800 Server

    Internal Comments
    Performing Additional Analysis Offline


    Verify that Steps 1-5 in the Steps to Follow section above have been performed prior to commencing with step 6. 

    6. Verify this is not a repeat event.
      A repeat event is an event that:
       - has an identical failure signature and suspect indictment list, or
       - the customer reports or feels is reoccurring on the same system/platform.

    Repeat events require collaboration with the next level of support (Step 11).

    7. Verify that this event is not caused by a power failure on a System Board (SB) or I/O Board (IB).

    A power failure of a System Board or I/O Board can be easily identified by the following message appearing in the System Controller showlogs or showlogs -v domainID output:


    Path broken between CBH and SDC:SB# ----> For an SB fault.
    Path broken between CBH and SDC:IB# ----> For an IB fault.

    If the message shown above is present for a System Board (SB), utilize <Solution 243326: xxxxx> to resolve this issue.

    If the message shown above is present for an I/O Board (IB), utilize <Solution: 229081> to resolve this issue.
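
A minimal Python sketch of the Step 7 check, assuming the "#" in the documented message stands for a board number and that the SC log text has already been saved to a string. The function name find_power_failures and the sample log lines are hypothetical and shown for illustration only.

```python
import re

# Assumption: "#" in the documented message is replaced by a board number.
PATH_BROKEN = re.compile(r"Path broken between CBH and SDC:\s*(SB|IB)(\d+)")

def find_power_failures(log_text):
    """Return (board_type, board_number) for each matching message."""
    return [(m.group(1), int(m.group(2))) for m in PATH_BROKEN.finditer(log_text)]

# Hypothetical log excerpt, not captured from a real system.
sample = """\
Path broken between CBH and SDC:SB0
some unrelated log line
Path broken between CBH and SDC:IB6
"""

hits = find_power_failures(sample)
for board_type, number in hits:
    kind = "System Board" if board_type == "SB" else "I/O Board"
    print(f"Possible power failure on {kind} {board_type}{number}")
```

An empty result means this message was not found. It does not rule out other causes.
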

    8. Verify that the Auto-Diagnosis (AD) Event Message "FRU-LOC" does not say "UNRESOLVED".

    The AD Event Messages are contained in the System Controller (SC) log files (showlogs or showlogs -d). Look for the AD Event Message appropriate to the date/time of the event in question.

    The following example identifies suspects RP3 and SB0:

    Jul 07 21:56:47 sc0 Domain-C.SC:
    [AD] Event: SF6800.ASIC.AR.INC_SYNC_ERR.1024106f
    CSN: 136M2383 DomainID: C ADInfo: 1.SCAPP.19.3
    Time: Wed Jul 07 21:56:38 PDT 2004

    FRU-List-Count: 2; FRU-PN: 5014953; FRU-SN: 013023;  FRU-LOC: RP3

    FRU-PN: 5014362; FRU-SN: 017608; 
    FRU-LOC: /N0/SB0

    Recommended-Action: Service action required

    The following example says "UNRESOLVED":

    Dec 29 10:23:51 systemx Domain-C.SC:
    [ID 436815 local5.error] [AD] Event: SF6800
    CSN: 313H3174 DomainID: C ADInfo: 1.SCAPP.19.3
    Time: Mon Dec 29 10:23:51 CST 2003

    FRU-List-Count: 0; FRU-PN:  ; FRU-SN:  ; 
    FRU-LOC: UNRESOLVED

    Recommended-Action: Service action required

    Collaborate with the next level of support (see Step 11) if the FRU-LOC is UNRESOLVED, or if unable or unsure how to determine this.
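
The FRU-LOC check in Step 8 can be sketched in Python as follows. This is an illustration only: the function names fru_locations and needs_escalation are hypothetical, and treating a message with no FRU-LOC field at all as needing escalation is an assumption based on the "unable or unsure" guidance above. The sample strings are the FRU lines from the two example messages.

```python
import re

def fru_locations(ad_message):
    """Return every FRU-LOC value found in an [AD] event message."""
    return re.findall(r"FRU-LOC:\s*([^\s;]+)", ad_message)

def needs_escalation(ad_message):
    """True when no suspect was named (UNRESOLVED or no FRU-LOC at all)."""
    locs = fru_locations(ad_message)
    return not locs or "UNRESOLVED" in locs

# FRU lines copied from the two example messages in this step.
resolved = """\
FRU-List-Count: 2; FRU-PN: 5014953; FRU-SN: 013023;  FRU-LOC: RP3

FRU-PN: 5014362; FRU-SN: 017608;
FRU-LOC: /N0/SB0
"""

unresolved = """\
FRU-List-Count: 0; FRU-PN:  ; FRU-SN:  ;
FRU-LOC: UNRESOLVED
"""

print(fru_locations(resolved), needs_escalation(resolved))
print(fru_locations(unresolved), needs_escalation(unresolved))
```
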

    9. Identify and replace the Primary Suspect from the 
    AD Event Message "FRU-LOC" indictment. 

    The FRU-LOC (Field Replaceable Unit Location) indictment comprises a list of suspects including SBs, IBs, and RPs.

    Count the number of individual SBs + IBs versus individual
    RPs listed in the AD Event Message and compare the totals to
    the table below.
    ---------------------------------------------------
    Number of       Number of       Primary
    SB & IB         RP              Suspect
    ---------------------------------------------------
    1               1               SB or IB
    1               2 or more       SB or IB
    2 or more       1               RP
    2 or more       2 or more       Collaborate
    ---------------------------------------------------
     
    From the event message example in Step 8 where SB0 and RP3 were implicated, the table identifies that when the number of "SB & IB" and "RP" are both "1", it is the SB or IB which is the primary suspect. In this example, that would be SB0.
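
The table lookup in Step 9 can be sketched in Python as follows. The function name primary_suspect is hypothetical. Two assumptions are made: a FRU-LOC is classified by the prefix (SB, IB, or RP) of its last path component, and any combination the table does not cover (for example, no RP listed) is treated as "Collaborate". The location strings other than RP3 and /N0/SB0 are invented for illustration.

```python
def primary_suspect(fru_locs):
    """Apply the Step 9 table to a list of FRU-LOC strings."""
    # Assumption: the last path component names the board, e.g. "/N0/SB0" -> "SB0".
    names = {loc.rsplit("/", 1)[-1] for loc in fru_locs}
    boards = sorted(n for n in names if n.startswith(("SB", "IB")))
    repeaters = sorted(n for n in names if n.startswith("RP"))

    if not boards or not repeaters:
        return "Collaborate"      # combination not covered by the table
    if len(boards) == 1:
        return boards[0]          # rows 1 and 2: the single SB or IB
    if len(repeaters) == 1:
        return repeaters[0]       # row 3: the single RP
    return "Collaborate"          # row 4: two or more of each

# Suspect list from the example AD Event Message: one SB, one RP.
print(primary_suspect(["RP3", "/N0/SB0"]))
```

With the suspects from the example message, the function returns SB0, which agrees with the worked example above.
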

    Collaborate with the next level of support (see Step 11) if unable or unsure how to determine this.

    10. Verify the problem does not reoccur within 24 hours after replacing the Primary Suspect.

    Replacement procedures are located in the Systems Service Manual for each server by accessing the appropriate system's Hardware link through the Midframe & Midrange Servers Product Documentation Website.
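
The 24-hour check in Step 10 can be sketched in Python as follows, assuming the event timestamps have already been pulled from the SC logs. The function names are hypothetical. The year must be supplied by the caller because the log timestamps shown in this document omit it. All times other than "Dec 29 10:23:51" are invented for illustration.

```python
from datetime import datetime, timedelta

def parse_sc_timestamp(stamp, year):
    """Parse a timestamp such as 'Dec 29 10:23:51'; the log omits the year."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

def recurrences_within(replaced_at, event_times, window=timedelta(hours=24)):
    """Return event times that fall within `window` after the replacement."""
    return [t for t in event_times if replaced_at <= t <= replaced_at + window]

# Hypothetical replacement time and event times.
replaced_at = parse_sc_timestamp("Dec 29 14:00:00", 2003)
events = [
    parse_sc_timestamp("Dec 29 10:23:51", 2003),  # before the replacement
    parse_sc_timestamp("Dec 30 09:15:00", 2003),  # within 24 hours after it
    parse_sc_timestamp("Dec 31 18:00:00", 2003),  # more than 24 hours after it
]

recurred = recurrences_within(replaced_at, events)
print("Recurred within 24 hours:", bool(recurred))
```
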

    11. Verify the latest data is available and collaborate with the next level of support. 

    The information to provide includes:

     - Explorer with the appropriate scextended or 1280extended option, as detailed in How to run Sun data and send to Sun engineer.

     - If Explorer data cannot be collected for whatever reason, see Procedure to collect Sunfire Midrange failure data manually.

     - Detailed listing of all previous service actions, identifying parts replaced and dates of service.

     - Confirmation that the previous steps in this resolution path were performed (unless this is a repeat event).


    Resources for continued troubleshooting:
     
    <Document: 1006221.1 > Sun Fire[TM] Servers: How L2CheckErrors Happen
    <Document: 1009156.1 > SDC Parity errors and SDC L2CheckError discussion
     

    Document Notes

    This document contains normalized content and is managed by the
    Domain Lead(s) of the respective domains. To notify content owners of
    a knowledge gap contained in this document, and/or prior to updating
    this document, please contact the domain engineers that are managing
    this document via the Document Feedback alias listed below: 

    Product Domain/Family: MSG/Serengeti (Hardware Troubleshooting)
                           MSG/Lw8 (Hardware Troubleshooting)

    Support Aliases:
       - Serengeti: serengeti-support@sun.com
       - Lw8: lw8-support@sun.com

    Alias Archives:
       - Serengeti: http://mailfinder3.sfbay/alias/serengeti-support
       - Lw8: http://mailfinder3.sfbay/alias/lw8-support

    Instant Message Forum:
       - GL-ESG

    Call Management Queue:
       - IBIS: GL-ESG
     
     
    Sun Fire, Lightweight8, Serengeti, lw8, Lw8, Level 2 Check Error, 
    l2checkerror, l2check, L2CheckError, crash, reboot, path broken,
    CBH, SDC, normalized

    Previously Published As 88171






Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.