Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-71-1006856.1
Update Date:2010-01-20
Keywords:

Solution Type  Technical Instruction Sure

Solution  1006856.1 :   Troubleshooting StorEdge [TM] 351x Redundant Loop Failures  


Related Items
  • Sun Storage 3510 FC Array
  • Sun Storage 3511 SATA Array
Related Categories
  • GCS>Sun Microsystems>Storage - Disk>Modular Disk - 3xxx Arrays

Previously Published As
209520


Description

Symptoms:

  • Redundant loop or path failures

  • Incorrect channel speed

  • Incrementing invalid transmission word and/or CRC errors

  • Failed or missing disks, controller, or IOM

  • Logical drive or multiple drive failures

  • Logical drive rebuilds or initialization may hang

Purpose/scope: This document is a subset of <Document: 1011431.1>: "Troubleshooting Sun StorEdge[TM] 33x0/351x Hardware". The steps below help verify and resolve Fibre Channel redundant path problems.



Steps to Follow
Step 1 - Check the event log or persistent event log and verify that there are no redundant loop failures, which may or may not be accompanied by multiple drive failures on the same loop, by issuing one of the following commands:

sccli> show eventlog
or
sccli> show persistent-eventlog

For example, on the 3510:

sccli> show eventlog

Mon Jul 17 08:06:00 2006

[113f] #9: StorEdge Array SN#8011523 CH2: ALERT: redundant loop failure detected (ALT Surviving CH3) on Jul 17 06:52:59 2006
[113f] #10: StorEdge Array SN#8011523 CH2: ALERT: redundant loop failure detected (ALT Surviving CH3) on Jul 17 08:06:10 2006
[113f] #11: StorEdge Array SN#8011523 CH2: NOTICE: fibre channel loop connection restored on Jul 17 06:53:08 2006
[113f] #12: StorEdge Array SN#8011523 CH2: NOTICE: fibre channel loop connection restored on Jul 17 08:06:34 2006
[113f] #13: StorEdge Array SN#8011523 CH2: ALERT: redundant loop failure detected (ALT Surviving CH3) on Jul 17 08:16:43 2006
...
[2101] #19: LD-ID 436CE267 on StorEdge Array SN#8011523: ALERT: SCSI drive failure (CH2 ID42) on Jul 17 08:16:43 2006
[2101] #20: LD-ID 72BE7D18 on StorEdge Array SN#8011523: ALERT: SCSI drive failure (CH2 ID22) on Jul 17 08:16:46 2006
[2101] #21: LD-ID 00000000 on StorEdge Array SN#8011523: ALERT: SCSI drive failure (CH2 ID5) on Jul 17 08:16:46 2006
[2101] #22: LD-ID 72BE7D18 on StorEdge Array SN#8011523: ALERT: SCSI drive failure (CH2 ID25) on Jul 17 08:16:50 2006
[2101] #23: LD-ID 436CE267 on StorEdge Array SN#8011523: ALERT: SCSI drive failure (CH2 ID43) on Jul 17 08:16:54 2006
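
When the event log is long, tallying alerts per channel can make a failing loop easier to spot. The following is a minimal Python sketch, not part of sccli. It assumes the log text was saved to a string in the message format shown above; the function name summarize_events and the two patterns are hypothetical and may need adjusting for other firmware levels.

```python
import re
from collections import Counter

# Patterns modeled on the sample messages above (an assumption, not an sccli spec).
LOOP_FAIL = re.compile(r"CH(\d+): ALERT: redundant loop failure detected")
DRIVE_FAIL = re.compile(r"ALERT: SCSI drive failure \(CH(\d+) ID(\d+)\)")

def summarize_events(log_text):
    """Count redundant loop failures and collect failed drive IDs per channel."""
    loop_failures = Counter()
    failed_drives = {}
    for line in log_text.splitlines():
        m = LOOP_FAIL.search(line)
        if m:
            loop_failures[int(m.group(1))] += 1
            continue
        m = DRIVE_FAIL.search(line)
        if m:
            failed_drives.setdefault(int(m.group(1)), set()).add(int(m.group(2)))
    return loop_failures, failed_drives

# A few lines copied from the example above.
sample = """\
[113f] #9: StorEdge Array SN#8011523 CH2: ALERT: redundant loop failure detected (ALT Surviving CH3)on Jul 17 06:52:59 2006
[113f] #11: StorEdge Array SN#8011523 CH2: NOTICE: fibre channel loop connection restoredon Jul 17 06:53:08 2006
[2101] #19: LD-ID 436CE267 on StorEdge Array SN#8011523: ALERT: SCSI drive failure (CH2 ID42)on Jul 17 08:16:43 2006
[2101] #20: LD-ID 72BE7D18 on StorEdge Array SN#8011523: ALERT: SCSI drive failure (CH2 ID22)on Jul 17 08:16:46 2006
"""

loops, drives = summarize_events(sample)
for ch in sorted(set(loops) | set(drives)):
    ids = sorted(drives.get(ch, []))
    print(f"CH{ch}: {loops[ch]} loop failure(s), failed drive IDs {ids}")
```

Several drive failures on the same channel as a redundant loop failure point toward a loop problem rather than individual bad disks.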

Step 2 - Issue the sccli> show disks command to verify that multiple drives on the same loop are not in BAD status:

Failure example on 3510:
sccli> show disks
Ch     Id      Size   Speed  LD     Status     IDs                      Rev
(3)    34      N/A    N/A    NONE   BAD        SEAGATE ST336753FSUN36G  0349
                                               S/N 3HX1F0M400007412
                                               WWNN 2000000C505F89FA
(3)    35      N/A    N/A    NONE   BAD        SEAGATE ST336753FSUN36G  0349
                                               S/N 3HX1F09X00007412
                                               WWNN 2000000C505F8AAD
(3)    36      N/A    N/A    NONE   BAD        SEAGATE ST336753FSUN36G  0349
                                               S/N 3HX1F26800007412
                                               WWNN 2000000C505F8715
(3)    37      N/A    N/A    NONE   BAD        SEAGATE ST336753FSUN36G  0349
                                               S/N 3HX1EYJY00007412
                                               WWNN 2000000C505F8A28
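
The check for multiple BAD drives on one loop can be sketched in Python as follows. This is an illustration only: it assumes the show disks output was saved as text with whitespace-separated columns in the order Ch, Id, Size, Speed, LD, Status, as in the example above, and the function name bad_disks_by_channel is hypothetical.

```python
from collections import defaultdict

def bad_disks_by_channel(show_disks_text):
    """Group disk IDs whose Status column reads BAD by their channel field."""
    bad = defaultdict(list)
    for line in show_disks_text.splitlines():
        fields = line.split()
        # A disk row has at least Ch, Id, Size, Speed, LD, Status; the Id is numeric.
        if len(fields) >= 6 and fields[1].isdigit() and fields[5] == "BAD":
            # The channel is kept as text because the sample prints it as "(3)".
            bad[fields[0]].append(int(fields[1]))
    return dict(bad)

# Rows copied from the failure example above.
sample = """\
Ch     Id      Size   Speed  LD     Status     IDs                   Rev
(3) 34 N/A N/A NONE BAD SEAGATE ST336753FSUN36G 0349
S/N3HX1F0M400007412
WWNN2000000C505F89FA
(3) 35 N/A N/A NONE BAD SEAGATE ST336753FSUN36G 0349
"""

result = bad_disks_by_channel(sample)
for channel, ids in result.items():
    if len(ids) > 1:
        print(f"channel {channel}: multiple BAD disks {ids}")
```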

Step 3 - Ensure that the ID switch settings are unique per enclosure and that disk IDs are identified correctly, as described in:
<Document: 1007692.1> : Sun StorEdge[TM] 351x FC Array switch settings and disk Ids.

Step 4 - Verify that the diagnostic Invalid Transmission Word counters for the RAID devices are not increasing by comparing, over time, the output of the following sccli commands for each channel:
- show diag error channel 2
- show diag error channel 3

OR

If sccli is not available, or to capture data during I/O activity:

Check the Fibre Channel Error Statistics using the firmware interface, as described in the "Fibre Channel Error Statistics (FC and SATA Only)" section of the Sun StorEdge[TM] 3000 Family RAID Firmware 4.2x User's Guide.

Monitor the following values for sharp increases on the RAID devices during I/O activity:

InvalTXWord.
Total number of instances of invalid transmission words. This error indicates either an invalid transmit word or disparity error.

InvalCRC.
Total number of instances of invalid CRC, or the number of times a frame was received and the CRC was not as expected.

For example, the following 3510 RAID device has high invalid transmission word counts for the channel 2 controllers (device IDs 14 and 15):

sccli> show diag error channel 2

CH ID TYPE LIP LinkFail LossOfSy LossOfSi PrimErr InvalTxW InvalCRC
------------------------------------------------------------------------
 2  0 DISK  59        0        3        0       0   450311        0
 2  1 DISK  59        0        1        0       0   476834        0
 2  2 DISK  59        0        5        0       0   456602        0
 2  3 DISK  59        0        1        0       0   450818        0
 2  4 DISK  59        0        1        0       0   450556        0
....
 2 39 DISK  59        0        1        0       0   448454        0
 2 40 DISK  59        0        1        0       0   451082        0
 2 41 DISK  59        0        3        0       0   448987        0
 2 42 DISK  59        0        5        0       0   450288        0
 2 43 DISK  59        0        1        0       0   448025        0
 2 44 SES   59        0        0        0       0        0        0
 2 14 RAID  59        0        0        0       0    20863        0
 2 15 RAID  59        0        0        0       0    20840        0


If counters are increasing:
- Investigate the back-end loop device order to determine which device sits just before any device showing high error counts.
- Investigate the device just BEFORE the device reporting high error counts.
- If there are invalid transmission word counts or CRC errors for RAID devices 14 and 15, this may indicate a mis-seated or marginal component.
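
Comparing counters over time is easier with two saved snapshots of the show diag error output. The Python sketch below is an illustration under stated assumptions: rows have the ten whitespace-separated columns shown above, the function names parse_diag and increasing are hypothetical, and the values in the second snapshot are invented for the example.

```python
def parse_diag(text):
    """Map (channel, id, type) to (InvalTxW, InvalCRC) from saved 'show diag error' text."""
    counters = {}
    for line in text.splitlines():
        f = line.split()
        # Data rows have 10 columns and begin with a numeric CH and ID.
        if len(f) == 10 and f[0].isdigit() and f[1].isdigit():
            counters[(int(f[0]), int(f[1]), f[2])] = (int(f[8]), int(f[9]))
    return counters

def increasing(before_text, after_text):
    """List devices whose InvalTxW or InvalCRC grew between the two snapshots."""
    before, after = parse_diag(before_text), parse_diag(after_text)
    rising = []
    for key, (txw, crc) in sorted(after.items()):
        if key in before:
            old_txw, old_crc = before[key]
            if txw > old_txw or crc > old_crc:
                rising.append((key, txw - old_txw, crc - old_crc))
    return rising

# First snapshot: rows copied from the example above.
first = """\
CH ID TYPE LIP LinkFail LossOfSy LossOfSi PrimErr InvalTxW InvalCRC
2 43 DISK 59 0 1 0 0 448025 0
2 44 SES 59 0 0 0 0 0 0
2 14 RAID 59 0 0 0 0 20863 0
"""
# Second snapshot: hypothetical values, made up for illustration only.
second = """\
CH ID TYPE LIP LinkFail LossOfSy LossOfSi PrimErr InvalTxW InvalCRC
2 43 DISK 59 0 1 0 0 448025 0
2 44 SES 59 0 0 0 0 0 0
2 14 RAID 59 0 0 0 0 20990 0
"""

rising = increasing(first, second)
for (ch, dev_id, dev_type), d_txw, d_crc in rising:
    print(f"CH{ch} ID{dev_id} {dev_type}: InvalTxW +{d_txw}, InvalCRC +{d_crc}")
```

Large absolute counts matter less than growth between snapshots taken under I/O, which is why the sketch reports deltas rather than totals.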

Step 5 - Issue the sccli> show channels command and ensure that all of the configured ports are running at the correct speed:

3510 example where Loop B is at an incorrect speed:

sccli> show channels
Ch  Type    Media   Speed   Width  PID / SID
0   Host    FC(L)   2G      Serial 40 / NA
1   Host    FC(L)   2G      Serial 43 / NA
2   Drive   FC(L)   2G      Serial 14 / 15
3   Drive   FC(L)   ASYNC   Serial 14 / 15  <---- Loop B is async
4   Host    FC(L)   2G      Serial 44 / NA
5   Host    FC(L)   2G      Serial 47 / NA
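
The speed check can be sketched in Python as follows. This is a rough illustration, not an sccli feature: it assumes the show channels output was saved as text with the column order shown above, and that 2G is the expected speed because the other channels in the example run at 2G. The function name wrong_speed_channels is hypothetical.

```python
def wrong_speed_channels(show_channels_text, expected="2G"):
    """Return (channel, type, speed) for rows whose Speed column differs from expected."""
    flagged = []
    for line in show_channels_text.splitlines():
        f = line.split()
        # Data rows start with a numeric channel; columns are Ch, Type, Media, Speed, ...
        if len(f) >= 4 and f[0].isdigit():
            channel, ch_type, speed = int(f[0]), f[1], f[3]
            if speed != expected:
                flagged.append((channel, ch_type, speed))
    return flagged

# Rows copied from the example above.
sample = """\
Ch  Type    Media   Speed   Width  PID / SID
0 Host FC(L) 2G Serial 40 / NA
2 Drive FC(L) 2G Serial 14 / 15
3 Drive FC(L) ASYNC Serial 14 / 15
4 Host FC(L) 2G Serial 44 / NA
"""

flagged = wrong_speed_channels(sample)
for channel, ch_type, speed in flagged:
    print(f"channel {channel} ({ch_type}) reports {speed}")
```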


Step 6 - Issue the sccli> show enclosure-status command to ensure that both loop A and loop B are visible.

3510 example (no problem):

sccli> show enclosure-status

Ch Id Chassis Vendor/Product ID Rev PLD WWNN WWPN
-------------------------------------------------------------------------------
2 12 0859A9 SUN StorEdge 3510F A 1080 1000 204000C0FF0859A9 214000C0FF0859A9 Topology: loop(a) Status: OK
3 12 0859A9 SUN StorEdge 3510F A 1080 1000 204000C0FF0859A9 224000C0FF0859A9 Topology: loop(b) Status: OK

In this example, only loop A is visible:

sccli> show enclosure-status
sccli: selected device /dev/rdsk/c3t44d0s2 [SUN StorEdge 3510 SN#0033xx]

Ch Id Chassis Vendor/Product ID Rev PLD WWNN WWPN
-------------------------------------------------------------------------------
2 12 00331B SUN StorEdge 3510F A 1080 1000 204000C0FF00331B 214000C0FF00331B Topology: loop(a)
2 28 003759 SUN StorEdge 3510F D 1080 1000 205000C0FF003759 215000C0FF003759 Topology: loop(a)
2 44 004E0A SUN StorEdge 3510F D 1080 1000 205000C0FF004E0A 215000C0FF004E0A Topology: loop(a)
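
The loop visibility check can be sketched in Python as follows. This is a minimal illustration, assuming the show enclosure-status output was saved as text with one row per line and the Topology field on the same line, as in the examples above. The pattern and the function name missing_loops are hypothetical.

```python
import re
from collections import defaultdict

# Matches rows laid out as in the examples above: Ch, Id, Chassis, ..., "Topology: loop(x)".
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\S+)\s.*Topology:\s*loop\((\w)\)")

def missing_loops(status_text, required=("a", "b")):
    """Return {chassis: [loops not seen]} for each chassis lacking a required loop."""
    seen = defaultdict(set)
    for line in status_text.splitlines():
        m = ROW.match(line)
        if m:
            seen[m.group(3)].add(m.group(4))
    missing = {}
    for chassis, loops in seen.items():
        absent = sorted(set(required) - loops)
        if absent:
            missing[chassis] = absent
    return missing

# Rows copied from the two examples above.
healthy = """\
2 12 0859A9 SUN StorEdge 3510F A 1080 1000 204000C0FF0859A9 214000C0FF0859A9 Topology: loop(a) Status: OK
3 12 0859A9 SUN StorEdge 3510F A 1080 1000 204000C0FF0859A9 224000C0FF0859A9 Topology: loop(b) Status: OK
"""
loop_a_only = """\
2 12 00331B SUN StorEdge 3510F A 1080 1000 204000C0FF00331B 214000C0FF00331B Topology: loop(a)
2 28 003759 SUN StorEdge 3510F D 1080 1000 205000C0FF003759 215000C0FF003759 Topology: loop(a)
"""

print(missing_loops(healthy))
print(missing_loops(loop_a_only))
```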


Step 7 - Issue the following sccli commands and verify that both controllers and all devices are visible on each loop, and verify again that the device IDs are correct:

- show loop-map channel 3
- show loop-map channel 2

For example, on the 3510:

sccli> show loop-map channel 2
sccli: selected devi
PORT    ENCL-ID ENCL-TYPE       LOOP    BYP-STATUS      ATTRIBUTES
---- ------- --------- ---- ---------- SH--------
0 0 RAID LOOP-B Not-Installed --
1 0 RAID LOOP-B Unbypassed --
L 0 RAID LOOP-B Not-Installed --
R 0 RAID LOOP-B Unbypassed --
4 0 RAID LOOP-B Not-Installed --
5 0 RAID LOOP-B Unbypassed --
L 1 JBOD LOOP-B Unbypassed --
R 1 JBOD LOOP-B Not-Installed --
L 2 JBOD LOOP-B Unbypassed --
R 2 JBOD LOOP-B Unbypassed --

Additional show bypass commands can be used to verify device and RAID status:

For example:

sccli> show bypass raid

SLOT LOOP BYP-STATUS
---- ---- ----------
TOP LOOP-A Unbypassed
TOP LOOP-B Unbypassed
BOTTOM LOOP-A Unbypassed
BOTTOM LOOP-B Unbypassed


Refer to the Sun StorEdge[TM] 3000 Family RAID Firmware 4.2x User's Guide for details.
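
Scanning saved show bypass raid output for any slot that is not Unbypassed can be sketched in Python as follows. This is an illustration only: it assumes the three-column layout shown above, the function name bypassed_slots is hypothetical, and the "Bypassed" row in the second sample is invented to exercise the check.

```python
def bypassed_slots(show_bypass_raid_text):
    """Return (slot, loop, status) for rows whose BYP-STATUS is not Unbypassed."""
    problems = []
    for line in show_bypass_raid_text.splitlines():
        f = line.split()
        # Data rows have three fields and a loop name such as LOOP-A or LOOP-B.
        if len(f) == 3 and f[1].startswith("LOOP-") and f[2] != "Unbypassed":
            problems.append((f[0], f[1], f[2]))
    return problems

# Copied from the example above.
sample_ok = """\
SLOT LOOP BYP-STATUS
---- ---- ----------
TOP LOOP-A Unbypassed
TOP LOOP-B Unbypassed
BOTTOM LOOP-A Unbypassed
BOTTOM LOOP-B Unbypassed
"""
# Hypothetical failing row, made up for illustration.
sample_bad = sample_ok.replace("BOTTOM LOOP-B Unbypassed", "BOTTOM LOOP-B Bypassed")

print(bypassed_slots(sample_ok))
print(bypassed_slots(sample_bad))
```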

Step 9 - Check the sccli> show fru output to confirm that no components on the loop, specifically the IOM or controller, are reported as N/A or absent. If JBODs are attached, confirm that they are visible as well.

For example, in the following show fru output from a 3510, the lower IOM (ch3) on the RAID array is not visible:

Name: FC_RAID_IOM
Description: N/A
Part Number: N/A
Serial Number: N/A
Revision: N/A
Initial Hardware Dash Level: N/A
FRU Shortname: N/A
Manufacturing Date: N/A
Manufacturing Location: N/A
Manufacturer JEDEC ID: N/A
FRU Location: LOWER FC RAID IOM SLOT
Chassis Serial Number: 00331B
FRU Status: Absent

The entry should appear as:
Name: FC_RAID_IOM
Description: SE3510 I/O w/SES + RAID Cont 1GB
Part Number: 370-5537
Serial Number: 0029xx
Revision: 02
Initial Hardware Dash Level: 02
FRU Shortname: 370-5537-02
Manufacturing Date: Fri Jun 27 20:52:58 2003
Manufacturing Location: Milpitas,CA,USA
Manufacturer JEDEC ID: 0x0301
FRU Location: LOWER FC RAID IOM SLOT
Chassis Serial Number: 00331B
FRU Status: OK
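
Checking a long show fru listing for absent components can be sketched in Python as follows. This is a minimal illustration, assuming the output was saved as text in the "Key: Value" layout shown above with each record starting at a Name line. The function names parse_fru and not_ok are hypothetical. A value that wraps onto the next line, as can happen with FRU Status, is attached to the preceding key.

```python
def parse_fru(text):
    """Split saved 'show fru' text into one dict per record; a record starts at 'Name:'."""
    records, current, pending_key = [], None, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if ":" in line:
            key, value = (part.strip() for part in line.split(":", 1))
            if key == "Name":
                current = {}
                records.append(current)
            if current is not None:
                current[key] = value
                # Remember a key with an empty value in case the value is on the next line.
                pending_key = key if value == "" else None
        elif current is not None and pending_key:
            current[pending_key] = line
            pending_key = None
    return records

def not_ok(records):
    """Return (name, location, status) for records whose FRU Status is not OK."""
    return [
        (r.get("Name"), r.get("FRU Location"), r.get("FRU Status"))
        for r in records
        if r.get("FRU Status") != "OK"
    ]

# Fields copied from the two entries above.
sample = """\
Name: FC_RAID_IOM
Part Number: N/A
FRU Location: LOWER FC RAID IOM SLOT
FRU Status:
Absent
Name: FC_RAID_IOM
Part Number: 370-5537
Manufacturing Date: Fri Jun 27 20:52:58 2003
FRU Location: LOWER FC RAID IOM SLOT
FRU Status: OK
"""

records = parse_fru(sample)
for name, location, status in not_ok(records):
    print(f"{name} at {location}: {status}")
```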

Step 10 - For the failures described above, troubleshooting steps include:

- Verify that the IOM/controller is not mis-seated or failed. Refer to <Document: 1002641.1>: Troubleshooting the StorEdge [TM] 33x0/351x Controller.
- Verify that cabling is correct. Refer to <Document: 1008193.1>: Troubleshooting StorEdge [TM] 351x Cabling.
- Verify that there are no unused SFPs in drive channels 2 and 3 on each controller.
- Verify that firmware levels for the controller, PLD and SES are at the latest revision.
- Hardware components that may need reseating include the SFP, cable, disk(s) and controller/IOM.

Step 11 - If the hardware fault persists, gather the latest Explorer information and escalate appropriately.
Step 12 - If no problems were found during the course of this document, refer back to <Document: 1011431.1>: Troubleshooting Sun StorEdge 33x0/351x Hardware.


Product
Sun StorageTek 3511 SATA Array
Sun StorageTek 3510 FC Array
Sun StorageTek 3510 2U FC Array

Internal Comments
This document contains normalized content and is managed by the Domain Lead(s) of the respective domains. To notify content owners of a knowledge gap contained in this document, and/or prior to updating this document, please contact the domain engineers who manage this document via the “Document Feedback” alias(es) listed below:

storage-os-disk-low-domain@sun.com




fibre channel, path failure, loop down, loop up, 3510, 3511, 351x, normalized
Previously Published As
89049

Change History
Date: 2010-01-20
User Name: brian.jackson@sun.com
Action: Externalized
Comment: Checked, made contract customer facing
Version: 13
Date: 2007-12-04
User Name: 7058
Action: Approved
Comment: Updates OK to publish
Version: 13
Comment: No changes made as indicated by Vickie, so placing in
final review as there are no changes to review.
It may not be perfect, but with DocBook/Voyager translation bugs, this is the very best I could do. It literally took hours.

Comment: I found a reference to doc ID 76756 which is an internal only doc.
It is about details for a particular command, so moving it to the internal only section of this doc won't cause a huge problem. I'm moving it to the internal only section. This will allow us to move forward with Minnow normalization.


Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.