Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-75-1021113.1
Update Date:2011-06-07
Keywords:

Solution Type  Troubleshooting Sure

Solution  1021113.1 :   Sun Storage[TM] Arrays: Troubleshooting RAID Controller Failures  


Related Items
  • Sun Storage 6180 Array
  • Sun Storage 6580 Array
  • Sun Storage Flexline 280 Array
  • Sun Storage 2510 Array
  • Sun Storage 2540 Array
  • Sun Storage 2540-M2 Array
  • Sun Storage 6780 Array
  • Sun Storage 6140 Array
  • Sun Storage Flexline 210 Array
  • Sun Storage 2530 Array
  • Sun Storage 2530-M2 Array
  • Sun Storage Flexline 380 Array
  • Sun Storage 6540 Array
  • Sun Storage 6130 Array
  • Sun Storage Flexline 240 Array
Related Categories
  • GCS>Sun Microsystems>Storage - Disk>Modular Disk - 6xxx Arrays
  • GCS>Sun Microsystems>Storage - Disk>Modular Disk - 2xxx Arrays
  • GCS>Sun Microsystems>Storage - Disk>Modular Disk - Flexline FLX FLA FLC Arrays

PreviouslyPublishedAs
271129


Applies to:

Sun Storage 2510 Array - Version: Not Applicable and later   [Release: N/A and later ]
Sun Storage 2530 Array - Version: Not Applicable and later    [Release: N/A and later]
Sun Storage 2530-M2 Array - Version: Not Applicable and later    [Release: N/A and later]
Sun Storage 2540 Array - Version: Not Applicable and later    [Release: N/A and later]
Sun Storage 2540-M2 Array - Version: Not Applicable and later    [Release: N/A and later]
All Platforms

Purpose

To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Community - 6000 and 2500 Series RAID Arrays

The purpose of this document is to describe how to troubleshoot Sun Storage[TM] RAID controller failures.

Symptoms:
  • Seven Segment Display of controller shows a repeating pattern
    • 88
    • L# (where # is some value)
  • Amber LED on controller
  • Critical Fault for RPA Memory Error (xx.66.1041)
  • Critical Fault for Controller is Offline (xx.66.1028)

Work through the troubleshooting steps below in order and confirm that each one holds for your environment. Each step links to a document that explains how to validate the step and what corrective action to take if needed. The steps are ordered to isolate the issue and identify the proper resolution, so please do not skip a step.

Last Review Date

June 15, 2010

Instructions for the Reader

A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.

Troubleshooting Details

1. Verify Array Critical Faults

Reference <> Verify Sun StorageTek[TM] 2500 and Sun Storage[TM] 6000 Critical Faults via the User Interface

  • If the fault is listed as Controller Offline or RPA Memory Error, go to Step 2.
  • Otherwise, go to Step 11.

2. Verify Array Model

Reference <> Verify Sun Storage[TM] Array Array Type via the User Interface

Array Model:
  • Sun StorageTek[TM] 2510
  • Sun StorageTek[TM] 2530
  • Sun StorageTek[TM] 2540
Instructions:
  • If the Critical Fault from Step 1 was Controller Offline, go to Step 5.
  • If the Critical Fault from Step 1 was RPA Memory Error, go to Step 7.

Array Model:
  • Sun StorageTek[TM] 6140
  • Sun StorageTek[TM] 6540
  • Sun StorageTek[TM] Flexline 380
Instructions:
  • If the Critical Fault from Step 1 was Controller Offline, go to Step 3.
  • If the Critical Fault from Step 1 was RPA Memory Error, go to Step 7.

Array Model:
  • Sun StorEdge[TM] 6130
  • Sun StorageTek[TM] Flexline 240
  • Sun StorageTek[TM] Flexline 280
Instructions:
  • If the Critical Fault from Step 1 was Controller Offline, go to Step 5.
  • If the Critical Fault from Step 1 was RPA Memory Error, go to Step 10.

Array Model:
  • Sun Storage 6180
  • Sun Storage 2530 M2
  • Sun Storage 2540 M2
Instructions:
  • If the Critical Fault from Step 1 was Controller Offline, go to Step 4.
  • If the Critical Fault from Step 1 was RPA Memory Error, go to Step 10.

Array Model:
  • Sun Storage 6580
  • Sun Storage 6780
Instructions:
  • If the Critical Fault from Step 1 was Controller Offline, go to Step 4.
  • If the Critical Fault from Step 1 was RPA Memory Error, go to Step 7.
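
The routing in Step 2 can be written down as a small lookup. The Python sketch below is illustrative only: the function name next_step, the short model keys, and the fault labels "offline" and "rpa" are hypothetical and are not part of any Sun tool.

```python
# Illustrative lookup of the Step 2 routing.
# Model keys, fault labels, and the function name are hypothetical.
ROUTING = {
    # model group: (step for Controller Offline, step for RPA Memory Error)
    ("2510", "2530", "2540"): (5, 7),
    ("6140", "6540", "FLX380"): (3, 7),
    ("6130", "FLX240", "FLX280"): (5, 10),
    ("6180", "2530-M2", "2540-M2"): (4, 10),
    ("6580", "6780"): (4, 7),
}


def next_step(model: str, fault: str) -> int:
    """Return the next troubleshooting step for a model and its Step 1 fault."""
    for models, (offline_step, rpa_step) in ROUTING.items():
        if model in models:
            if fault == "offline":
                return offline_step
            if fault == "rpa":
                return rpa_step
            return 11  # Step 1 sends any other fault to Step 11
    raise ValueError(f"array model not covered by Step 2: {model}")
```

For example, next_step("6140", "offline") returns 3, and next_step("6180", "rpa") returns 10.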


3.  Verify 7-segment Display on 6140/6540/FLX380 array controller.

The user interface does not currently report what is shown on the controller's seven-segment display, which shows the tray ID under optimal conditions. On these array models the display provides additional status for the system. If you are not local to the system, you will need someone on site to read the display. What it shows varies with the array model and the error status.

Reference <> Sun StorageTek[TM] 6140, 6540, and Flexline 380 Array Controller 7-Segment LED

  • If the 7-segment display shows "88", this indicates a possible intermittent issue. Go to Step 5.
  • If the 7-segment display shows L2 or L3 for the controller, the subsystem has placed the controller offline due to persistent memory faults (L2) or a hardware fault (L3). Go to Step 10.
  • If the 7-segment display shows an L-code other than L2 or L3, go to Step 11.
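
The three Step 3 outcomes can be sketched as a short function. This is a minimal illustration that assumes the display value is typed in as a two-character string; the function name step3_next is hypothetical.

```python
def step3_next(display: str) -> int:
    """Map a 6140/6540/FLX380 seven-segment reading to the next step."""
    code = display.strip().upper()
    if code == "88":
        return 5   # possible intermittent issue
    if code in ("L2", "L3"):
        return 10  # persistent memory faults (L2) or hardware (L3)
    if code.startswith("L"):
        return 11  # any other L-code
    # Step 3 does not describe other readings, so refuse to guess.
    raise ValueError(f"reading not covered by Step 3: {display!r}")
```

For example, step3_next("L2") returns 10, and step3_next("L5") returns 11.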

4.  Verify 7-segment Display on 6180/6580/6780/2530M2/2540M2 array controller.

The user interface does not currently report what is shown on the controller's seven-segment display, which shows the tray ID under optimal conditions. On these array models the display provides additional status for the system. If you are not local to the system, you will need someone on site to read the display. What it shows varies with the array model and the error status.

Reference <> Sun Storage[TM] 6x80 and 2500-M2 Array Controller 7-Segment Display

For 2530-M2/2540-M2/6180/6580/6780:
  • If the 7-segment display flashes either OS+ OL+ blank- or SE+ 88+ blank-, this indicates a possible intermittent issue. Go to Step 5.
  • If the 7-segment display flashes OE+ L2+ dash+ CF+ P#+ blank-, SE+ dF+ dash+ CF+ P#+ blank-, or OE+ L3+ blank-, a Controller Processor Memory DIMM has failed due to parity errors (L2), or the system has detected a hardware fault, and the controller has been placed offline. Go to Step 10.
  • If the 7-segment display flashes OE+ L2+ dash+ CF+ d#+ blank- or SE+ dF+ dash+ CF+ d#+ blank-, the Processor Memory on the 6180 has failed due to parity errors, or the system has detected a hardware fault, and the controller has been placed offline. Go to Step 10.

For 6580/6780 only:
  • If the 7-segment display flashes OE+ L2+ dash+ CF+ C#+ blank- or SE+ dF+ dash+ CF+ C#+ blank-, a Controller Data Cache Memory DIMM has failed due to parity errors, and the controller has been placed offline. Reference <> Troubleshooting Sun Storage[TM] 6580/6780 Cache Memory DIMM Faults.
  • If the 7-segment display flashes SE+ dF+ dash+ CF+ H#+ blank-, the Host Interface Card (HIC) in slot # for the controller has either failed or is missing. Reference <> Troubleshooting Sun Storage[TM] 6580/6780 Host Interface Card Faults.
  • If the 7-segment display flashes SE+ L8+ blank+ CF+ Cx+ blank-, the cache configuration does not match the alternate controller's configuration. Reference <> Troubleshooting Sun Storage[TM] 6580/6780 Cache Memory DIMM Faults.

If none of the errors above are displayed, go to Step 10.
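
The Step 4 sequences differ mainly in the component identifier that follows CF (P, d, C, or H). The sketch below is one hypothetical way to classify a sequence typed in exactly as written in this step; the function name classify_sequence and its return strings are assumptions made for illustration.

```python
def classify_sequence(model: str, sequence: str) -> str:
    """Classify a flashing seven-segment sequence as described in Step 4."""
    # Drop the trailing +/- markers and the "dash"/"blank" separators.
    tokens = [t.rstrip("+-") for t in sequence.split()]
    tokens = [t for t in tokens if t not in ("dash", "blank", "")]

    if tokens in (["OS", "OL"], ["SE", "88"]):
        return "Step 5"  # possible intermittent issue

    # The component identifier follows "CF", for example P1, d0, C2, or H1.
    component = ""
    if "CF" in tokens[:-1]:
        component = tokens[tokens.index("CF") + 1][0]

    if model in ("6580", "6780"):
        if component == "C":
            return "Cache Memory DIMM Faults document"
        if component == "H":
            return "Host Interface Card Faults document"

    # Processor memory (P#, d#), L3, and anything unlisted go to Step 10.
    return "Step 10"
```

For example, classify_sequence("6780", "SE+ dF+ dash+ CF+ H1+ blank-") returns "Host Interface Card Faults document", while the same call for a 6180 with a P1 component returns "Step 10".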

5.  Online the RAID controller.

Attempt to place the RAID controller online using the user interface. The symptoms identified so far point to something other than a hardware problem on the RAID controller itself.

Sun Storage[TM] Common Array Manager

Browser
  1. Expand Storage Array in the left window menu tree
  2. Click on your array name
  3. Click on the Service Advisor button in the top right corner of the browser window.
  4. Find and Expand Place a Controller Online in the Troubleshooting and Recovery Section
  5. Select the faulted controller, and follow the instructions in the right hand pane to place the controller online.

Service CLI

Locations:
Solaris: /opt/SUNWSefms/bin
Windows: c:\Program Files\Sun\Common Array Manager\Component\fms\bin
Linux: /opt/sun/private/fms/bin

service -d array_name -c revive -t [a | b]

NOTE: You must specify controller slot location A or B.


Sun StorageTek[TM] SANtricity Storage Manager

GUI
  1. Open the Array Management Window for your array
  2. Select the array controller (will have a red X on it)
  3. Open the Advanced Menu
  4. Select the Recovery Sub-Menu
  5. Select the Place Controller Sub-Menu
  6. Select Online

SMcli

SMcli -n array_name -c "set controller [(a|b)] availability=online;"

NOTE: You must specify controller slot location A or B.

  • If the request to place the controller online fails, go to Step 11.
  • If the request succeeds and the controller stays online for longer than 5 minutes, go to Step 6.
  • If the request succeeds but the controller goes offline again, go to Step 11.
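
For scripted use, the two command lines in Step 5 can be assembled as argument lists. The sketch below only builds the arguments and does not run anything. The function name revive_commands is hypothetical, and it assumes that the square brackets in the SMcli syntax are literal and that the tools accept a lowercase slot letter.

```python
def revive_commands(array_name: str, slot: str) -> dict:
    """Build the Step 5 'place controller online' commands for slot A or B."""
    slot = slot.lower()
    if slot not in ("a", "b"):
        raise ValueError("controller slot must be A or B")
    return {
        # Common Array Manager service CLI, as given in this step
        "cam": ["service", "-d", array_name, "-c", "revive", "-t", slot],
        # SANtricity SMcli, as given in this step
        "santricity": [
            "SMcli", "-n", array_name, "-c",
            f"set controller [{slot}] availability=online;",
        ],
    }
```

The resulting lists can be passed to subprocess.run from the directory that holds the service or SMcli binary. Passing the script command as a single list element avoids shell quoting problems with the brackets and semicolon.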

6.  Reset SOC and RLS counters on the array for monitoring.

You have indicated that the array controller was successfully placed online and made available for longer than 5 minutes.  If this issue is intermittent, the controller may go offline again.  In order to help with diagnosis in the event that this occurs, we need to set baselines for error statistics on the array.

The RLS (Read Link Status) and SOC (Switch On Chip) statistics are collected as part of normal array support data collections, and can be reset to zero as shown below. A controller often goes offline because of a communication issue, and investigating that requires these statistics.

NOTE:  Resetting the counters is not available for 2510, 2530, or 2540 arrays, although the error counters are still collected. If this is your array type, you do not need to run the commands, but you should still follow the instructions at the end of this step on what action to take.

Sun Storage Common Array Manager

Service CLI

Locations:
Solaris: /opt/SUNWSefms/bin
Windows: c:\Program Files\Sun\Common Array Manager\Component\fms\bin
Linux: /opt/sun/private/fms/bin

service -d array_name -c reset -t soc
service -d array_name -c reset -t rls

Sun StorageTek SANtricity Storage Manager

SMcli

SMcli -n array_name -c "reset storageArray RLSBaseline;"
SMcli -n array_name -c "reset storageArray SOCBaseline;"

  • If the controller remains online for longer than 48 hours, continue to monitor the array for 2 weeks. If it is still online after that period, the problem was most likely due to a software error or state inconsistency, and no further action is required. Consider updating the array firmware if a newer version is available.
  • If the array controller is placed offline again in less than 2 weeks, go to Step 11.

7.  Check for additional Critical Faults besides the RPA Memory Error

Due to bugs 6767241 and 6797173, the RPA Memory Error may be false.  Check the list of faults on the array, for any of the following:

REC_LOST_REDUNDANCY_DRIVE(xx.66.1076)
REC_PATH_DEGRADED(xx.66.1032)
  • If any of these faults exist in addition to the RPA Memory Error, the RPA Memory Error may be false. Continue to Step 8 to review your firmware revisions.
  • If none of these faults exist on your array, go to Step 10.
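
The Step 7 check amounts to looking for two companion fault codes next to the RPA Memory Error. A minimal sketch follows, assuming fault codes are available as strings such as "57.66.1041". The "57" prefix is a made-up stand-in for the "xx" shown above, and the function name step7_next is hypothetical.

```python
RPA_SUFFIX = ".66.1041"                        # RPA Memory Error
COMPANION_SUFFIXES = (".66.1076", ".66.1032")  # REC_LOST_REDUNDANCY_DRIVE, REC_PATH_DEGRADED


def step7_next(fault_codes: list) -> int:
    """Return 8 if the RPA Memory Error may be false, otherwise 10."""
    if not any(code.endswith(RPA_SUFFIX) for code in fault_codes):
        raise ValueError("Step 7 applies only when an RPA Memory Error is present")
    # str.endswith accepts a tuple of suffixes.
    has_companion = any(code.endswith(COMPANION_SUFFIXES) for code in fault_codes)
    return 8 if has_companion else 10
```

For example, step7_next(["57.66.1041", "57.66.1032"]) returns 8, and step7_next(["57.66.1041"]) returns 10.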

8.  Verify your array firmware.

Use the following document to check your firmware against the table below:

<> Verify Sun Storage[TM] Array Firmware via the User Interface

Array Model:
  • Sun StorageTek[TM] Flexline 380
  • Sun StorageTek[TM] 6540
  • Sun StorageTek[TM] 6140
Firmware:
  • 07.50.xx.xx
  • 07.60.xx.xx
Action: You are not exposed to the bugs.  The controller should be replaced.  Continue to Step 10.

Array Model:
  • Sun StorageTek[TM] Flexline 380
  • Sun StorageTek[TM] 6540
  • Sun StorageTek[TM] 6140
Firmware:
  • 07.10.xx.xx
  • 07.15.xx.xx
Action: You are exposed to 6767241, which causes false RPA Memory Errors, along with the faults in Step 7.  To correct the condition, go to Step 9.

Array Model:
  • Sun StorageTek[TM] Flexline 380
  • Sun StorageTek[TM] 6540
  • Sun StorageTek[TM] 6140
Firmware:
  • 06.60.xx.xx
  • 06.19.xx.xx
  • 06.16.xx.xx
  • 06.15.xx.xx
Action: You are not exposed to the bugs.  The controller should be replaced.  Continue to Step 10.

Array Model:
  • Sun Storage[TM] 6580/6780
Firmware:
  • 07.50.xx.xx
  • 07.60.xx.xx
Action: You are not exposed to the bugs.  The controller should be replaced.  Continue to Step 10.

Array Model:
  • Sun Storage[TM] 6580/6780
Firmware:
  • 07.30.xx.xx
Action: You are exposed to 6767241, which causes false RPA Memory Errors, along with the faults in Step 7.  To correct the condition, go to Step 9.

Array Model:
  • Sun StorageTek[TM] 2510/2530/2540
Firmware:
  • 07.35.50.10
  • 07.35.55.10
  • 07.35.44.10
Action: You are not exposed to the bugs.  The controller should be replaced.  Continue to Step 10.

Array Model:
  • Sun StorageTek[TM] 2510/2530/2540
Firmware:
  • 07.35.10.10
Action: You are exposed to 6767241, which causes false RPA Memory Errors, along with the faults in Step 7.  To correct the condition, go to Step 9.

Array Model:
  • Sun StorageTek[TM] 2510/2530/2540
Firmware:
  • 06.70.xx.xx
  • 06.17.xx.xx
Action: You are not exposed to the bugs.  The controller should be replaced.  Continue to Step 10.
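
The Step 8 firmware table can be checked with prefix matching. The sketch below is an illustration under stated assumptions: the family keys and the function name step8_next are hypothetical, and a firmware version that the table does not list is rejected rather than guessed.

```python
# Firmware prefixes taken from the Step 8 table. Family keys are hypothetical labels.
EXPOSED = {
    "6140/6540/FLX380": ("07.10.", "07.15."),
    "6580/6780": ("07.30.",),
    "2510/2530/2540": ("07.35.10.10",),
}
NOT_EXPOSED = {
    "6140/6540/FLX380": ("07.50.", "07.60.", "06.60.", "06.19.", "06.16.", "06.15."),
    "6580/6780": ("07.50.", "07.60."),
    "2510/2530/2540": ("07.35.50.10", "07.35.55.10", "07.35.44.10", "06.70.", "06.17."),
}


def step8_next(family: str, firmware: str) -> int:
    """Return 9 if the firmware is exposed to 6767241, or 10 if it is not."""
    if firmware.startswith(EXPOSED[family]):
        return 9   # false RPA Memory Error is possible: power cycle the RAID tray
    if firmware.startswith(NOT_EXPOSED[family]):
        return 10  # not exposed: the controller should be replaced
    raise ValueError(f"firmware {firmware} is not listed for {family}")
```

For example, step8_next("6580/6780", "07.30.21.10") returns 9. The version 07.30.21.10 is only a sample value chosen to match the 07.30.xx.xx row.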

9.  Power Cycle the RAID Controller Tray to clear the false RPA memory error.

This procedure requires an outage, because the surviving controller will hold the faulted controller in a fault state.  Only the RAID Tray needs to be power cycled.

After performing a power cycle of the RAID Tray, review the Critical Fault list in your user interface.

  • If the fault persists, the controller requires replacement. Continue to Step 10.
  • If the fault is cleared, update your firmware to a version in which 6767241 is fixed:
    • 2510/2530/2540 arrays: 07.35.44.10 or later
    • 6580/6780/6140/6540/Flexline 380 arrays: 07.50.08.10 or later
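
Whether an installed firmware level already contains the fix can be checked by comparing version fields numerically. This sketch assumes that "or later" means a field-by-field numeric comparison of the four dotted fields, which this document does not state explicitly. The family keys and function names are hypothetical.

```python
# Minimum firmware levels with the 6767241 fix, as listed in Step 9.
FIXED_IN = {
    "2510/2530/2540": "07.35.44.10",
    "6580/6780/6140/6540/FLX380": "07.50.08.10",
}


def parse_version(version: str) -> tuple:
    """Turn '07.35.44.10' into (7, 35, 44, 10) for numeric comparison."""
    return tuple(int(field) for field in version.split("."))


def has_fix(family: str, firmware: str) -> bool:
    """Return True if the firmware is at or above the fixed level."""
    return parse_version(firmware) >= parse_version(FIXED_IN[family])
```

For example, has_fix("2510/2530/2540", "07.35.10.10") returns False, and has_fix("2510/2530/2540", "07.35.55.10") returns True. The comparison expects a full numeric version and fails on placeholders such as 07.50.xx.xx.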

10.  Have the controller replaced.

You have indicated that the 7-segment display on the array controller, or a critical fault for an RPA Memory Error, indicates that the RAID controller requires replacement.

Please supply:

Critical Fault
7-Segment display
Array Support Data Collection:
  • Reference <> Collecting Support Data for Arrays Using Sun StorageTek[TM] Common Array Manager
  • Reference <> Collecting Support Data for Arrays Using Sun StorageTek[TM] SANtricity Storage Manager
and contact Oracle.

11.  Provide Data for further analysis

At this point you have validated that each troubleshooting step is true for your environment and the issue still exists.  Therefore further troubleshooting is required to identify the issue.

Please provide:

7-Segment Display if available
Array Support Data Collection:
  • Reference <> Collecting Support Data for Arrays Using Sun StorageTek[TM] Common Array Manager
  • Reference <> Collecting Support Data for Arrays Using Sun StorageTek[TM] SANtricity Storage Manager
and contact Oracle.

Internal Comments
This document contains normalized content and is managed by the Domain Lead(s)
of the respective domains. To notify content owners of a knowledge gap
contained in this document, and/or prior to updating this document, please
contact the domain engineers that are managing this document via the “Document
Feedback” alias(es) listed below:

storage-os-disk-mid-domain@sun.com

The Knowledge Work Queue for this article is KNO-STO-MIDRANGE_DISK.

WARNING!!!!

For HIC and Cache DIMM replacements on a 6580 and 6780, the module should be taken offline.

While the array controllers can be removed for replacement of these components, you cannot have the controller out of the array for more than 15-30 minutes or the module will overheat, causing residual damage to the remaining components.

controller, fault, failed, offline, held in reset, reset, cam, santricity, array, type,
6180, 280, 380, 240, 6130, 6140, 6540, 2500, 6000, 2510, 2530, 2540, 6580, 6780, normalized


Change History

2010-06-11
User: DeCotis
Comment:  Updated entries for RPA Memory faults, dimm ts, and hic ts.

2009-11-26
Comment: corrected service cli syntax based on feedback.

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.