Sun StorEdge 6920 Array With Drive(s) in Marginal State May Experience I/O Timeout and Loss of Access to Components

Asset ID:	1-77-1000686.1
Update Date:	2011-02-22
Keywords:

Solution Type Sun Alert Sure

Solution 1000686.1 : Sun StorEdge 6920 Array With Drive(s) in Marginal State May Experience I/O Timeout and Loss of Access to Components

Related Items


Sun Storage 6920 System

Related Categories


GCS>Sun Microsystems>Sun Alert>Criteria Category>Availability
 GCS>Sun Microsystems>Sun Alert>Release Phase>Resolved

PreviouslyPublishedAs
200898

Product
Sun StorageTek 6920 System

Bug Id
<SUNBUG: 6359654>, <SUNBUG: 6366175>

Date of Workaround Release
27-FEB-2006

Date of Resolved Release
31-Mar-2008


1. Impact

A marginal drive in a Sun StorEdge 6920 array can adversely affect the array to the point of I/O timeout or loss of access to all its components. The severity of the impact depends on how long the drive has been marginal and on which Data Services (Local Mirror, Snapshot, Remote Replication) are used in the configuration.

2. Contributing Factors

This issue can occur on the following platform:

Sun StorEdge 6920 array running release 3.0.0.25 or higher

To determine the release running on the array, do the following:

1. Log in to the "Storage Automated Diagnostic Environment" web console as the "storage" user.

Sun Storage Automated Diagnostic Environment -> Inventory -> Service Processor

2. Check the "Details" view for the version listed (e.g. 3.0.0.25).
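
Once the version string has been read from the "Details" view, the "3.0.0.25 or higher" condition can be checked mechanically. The following is a minimal sketch in Python, assuming the release is always a dotted sequence of integers. The function names are illustrative and are not part of any Sun tool, and every version string other than 3.0.0.25 is a made-up test input.

```python
def parse_release(release: str) -> tuple:
    """Turn a dotted release string such as '3.0.0.25' into a tuple of ints."""
    return tuple(int(part) for part in release.strip().split("."))


def is_affected(release: str, minimum: str = "3.0.0.25") -> bool:
    """Return True if `release` is at or above the release named in this alert."""
    # Tuples compare element by element, so 3.0.0.100 sorts above 3.0.0.25,
    # which a plain string comparison would get wrong.
    return parse_release(release) >= parse_release(minimum)


# Made-up inputs, for illustration only.
for candidate in ("3.0.0.25", "3.0.1.2", "2.0.5.10"):
    print(candidate, is_affected(candidate))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes when a component has more than one digit.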

3. Symptoms

The host will experience a number of symptoms that manifest as poor performance and/or hardware communication problems with the array. While the performance issue is subjective, the host's /var/adm/messages file (or error log) may show SCSI messages such as the following, taken from a Solaris host:

    Jan 11 21:23:36 myhost
    /scsi_vhci/ssd@g600015d000045300000000000000095b (ssd21):
    Command Timeout on path /pci@5d,600000/SUNW,qlc@1/fp@0,0 (fp0)
    Jan 11 21:23:36 myhost scsi: [ID 243001 kern.warning] WARNING:
    /scsi_vhci/ssd@g600015d000045300000000000000095b (ssd 21):
    Jan 11 21:23:36 myhost SCSI transport failed: reason 'timeout':
    retrying command
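
To gauge how often these messages occur on a host, the log can be scanned for the two phrases that appear in the sample above. This is a rough sketch only: the marker strings are copied from that sample, other Solaris releases or drivers may word the messages differently, and the function name is hypothetical.

```python
from collections import Counter

# Phrases copied from the sample messages above (an assumption: other
# Solaris releases or HBA drivers may use different wording).
TIMEOUT_MARKERS = (
    "Command Timeout on path",
    "SCSI transport failed: reason 'timeout'",
)


def count_timeouts(lines):
    """Count how many lines contain each timeout marker."""
    counts = Counter()
    for line in lines:
        for marker in TIMEOUT_MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts


# Lines taken from the sample above.
sample = [
    "Command Timeout on path /pci@5d,600000/SUNW,qlc@1/fp@0,0 (fp0)",
    "Jan 11 21:23:36 myhost SCSI transport failed: reason 'timeout':",
    "retrying command",
]
for marker, hits in count_timeouts(sample).items():
    print(hits, marker)
```

On a live host, the same function can be given an open file handle for /var/adm/messages instead of the sample list.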

The 6920 itself may show a variety of symptoms as a result of the marginal drive, such as a status of "degraded" when viewed through Configuration Services (Web Console or sscs(1M)). These states depend on whether a StorADE Alarm has been triggered as a result of this event, and on whether one or more Volumes or VDisks are not in an optimal state. Alarms may not be triggered, because the majority of the drive errors are of type "Log". In that case, customers who rely on notifications through Sun Storage Remote Response (SSRR) or native email through StorADE will not be notified of this situation.

The bulk of the notifications are recorded in the StorADE Events Log, which can be viewed by logging in to the StorADE Web Console:

StorADE -> Administration Tab -> Events Log Tab

or by opening the following file in a Solution Extract:

Solution Extract Directory/Storade/Events.log

(See details below for the extract process)

The following are examples of common events seen with this issue:

Event 1 (6020.LogEvent) - Viewed using StorADE Events Log Tab:

    Date                     Event Details  Device    Component    Type
    12/16/2005 18:54:15      Details*       array10   M.disk.u3d8  Log

Viewed in Events.log:

    20051216185415.000006+0000      6020.LogEvent.array_warning
    M.disk.u3d8 4 device_warning(s) found in logfile /var/adm/messages.array
    (related to 6020 array10/192.168.0.50) : (TimeZone GMT)Dec 16 18:53:25
    array10 ISR1[3]: W: u3d08 SCSI error occurred: Medium Error (sense key = 0x3).
    Read Retries Exhausted.
    :: (last event on this subject for next 1 days):   Sev:1   Action:FALSE
    Enc:0x301.5405318.413946        mgmtLevel:C     AgentH:local    Agg:-1
    ECode:6020.LogEvent.array_warning       Topic:
    LDec 16 18:53:25 array10 ISR1[3]: W: u3d08 SCSI error occurred: Medium
    Error (sense key = 0x3). Read Retries Exhausted.
    LDec 16 18:53:40 array10 ISR1[1]: W: u3d08 SCSI error occurred: Medium
    Error (sense key = 0x3). Mechanical Positioning Error.
    LDec 16 18:53:58 array10 ISR1[3]: W: u3d08 SCSI error occurred: Medium
    Error (sense key = 0x3). Read Retries Exhausted.
    LDec 16 18:54:08 array10 ISR1[1]: W: u3d08 SCSI error occurred: Medium
    Error (sense key = 0x3). Mechanical Positioning Error.

Event 2 (6020.LogEvent) - Viewed Using StorADE Events Log Tab:

    Date                     Event Details  Device    Component    Type
    12/16/2005 19:09:58      Details*       array10   M.disk.u3d8  Log

Viewed in Events.log:

    20051216190958.000002+0000 6020.LogEvent.array_warning
    M.disk.u3d8 1 device_warning(s) found in logfile /var/adm/messages.array
    (related to 6020 array10/192.168.0.50):
    (TimeZone GMT)Dec 16 18:51:38 array10 ISR1[3]: W:
    u3d08 SCSI error occurred: Medium Error (sense key = 0x3). Read Retries
    Exhausted.::Sev:1   Action:FALSE    Enc:0x301.5405318.413946  mgmtLevel:C
    AgentH:local    Agg:5   ECode:6020.LogEvent.array_warning  Topic:

Event 3 (dsp.LogEvent) - Viewed using StorADE Events Log Tab:

    Date                     Event Details  Device    Component        Type
    12/16/2005 19:20:06      Details*       dsp00     M.FIBRE_CHANNEL  Log

Viewed in Events.log:

    20051216192006.000000+0000      dsp.LogEvent.log_warning
    M.FIBRE_CHANNEL 2 device_warning(s) found in logfile /var/adm/messages.dsp
    (related to dsp dsp00/192.168.0.10) : Dec 16 18:56:40 dsp00  12/16/2005 18:00:14
    LOG_WARNING(FIBRE_CHANNEL: 2-4)  isp1:
    102 New Command Timeouts have occurred:: Sev:1
    Action:FALSE    Enc:210000015d045300    mgmtLevel:C     AgentH:local    Agg:-1
    ECode:dsp.LogEvent.log_warning  Topic:
    LDec 16 18:56:40 dsp00  12/16/2005 18:00:14 LOG_WARNING  (FIBRE_CHANNEL:
    2-4)  isp1: 102 New Command Timeouts have occurred

"Details" link opens a new page showing the contents after the "Topic:" entry for each event in the log.

Note: The 6020 messages can vary dramatically depending on why the drive is failing or the subsystem load.
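
Because these events are logged rather than raised as alarms, counting the warnings per drive can help single out the marginal drive. The sketch below is one possible approach, not a StorADE feature. It assumes each warning occupies a single line in messages.array (the excerpts above are wrapped for display), and the regular expression and function name are this example's own.

```python
import re
from collections import Counter

# Matches the drive identifier in warnings such as
#   "array10 ISR1[3]: W: u3d08 SCSI error occurred: Medium Error ..."
# The pattern is inferred from the excerpts above and may miss other wording.
DRIVE_WARNING = re.compile(r"W:\s*(u\d+d\d+)\s+SCSI error occurred")


def warnings_per_drive(lines):
    """Return a Counter mapping drive identifiers to SCSI warning counts."""
    counts = Counter()
    for line in lines:
        match = DRIVE_WARNING.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The four warnings from Event 1 above, each joined onto one line.
sample = [
    "Dec 16 18:53:25 array10 ISR1[3]: W: u3d08 SCSI error occurred: Medium Error (sense key = 0x3). Read Retries Exhausted.",
    "Dec 16 18:53:40 array10 ISR1[1]: W: u3d08 SCSI error occurred: Medium Error (sense key = 0x3). Mechanical Positioning Error.",
    "Dec 16 18:53:58 array10 ISR1[3]: W: u3d08 SCSI error occurred: Medium Error (sense key = 0x3). Read Retries Exhausted.",
    "Dec 16 18:54:08 array10 ISR1[1]: W: u3d08 SCSI error occurred: Medium Error (sense key = 0x3). Mechanical Positioning Error.",
]
for drive, hits in warnings_per_drive(sample).most_common():
    print(drive, hits)
```

Note that the raw messages name the drive as u3d08 while the Events Log component column shows M.disk.u3d8; the sketch reports the identifier as it appears in the raw messages.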

A solution extract from StorADE allows a detailed review of the subsystem state. To collect a solution extract:

StorADE -> Administration Tab -> Utilities Tab -> Solution Extract Tab -> Extract

Then, to view the extract:

Solution Extract Tab -> "View Content" (for the date of the extract)

    Storade/Events.log
    Sp/messages/var_adm/messages.dsp
    Sp/messages/var_adm/messages.array

Messages may include the following (from the messages.dsp file):

    Jan 18 13:54:28 dsp00  01/18/2006 20:53:55 LOG_INFO (iSCSI: 3-1)
    Initiator timed-out command: connection=0xa00000d
    Jan 18 13:54:31 dsp00  01/18/2006 20:53:59 LOG_INFO (iSCSI: 4-1)
    Initiator timed-out command: connection=0xa00000d
    Jan 18 13:54:31 dsp00 last message repeated 5 times
    Jan 18 13:54:31 dsp00  01/18/2006 20:53:59 LOG_WARNING (SCSI: 4-3)
    Excessive retries (100 in 639 sec, 600 total) ALU
    60003BACCBC38000435971F200094BC1  Io Timeout
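
The "Excessive retries" warning carries the retry count, the interval, the running total, and the affected ALU, which are convenient to extract when comparing several extracts. The following sketch assumes the warning occupies one line in messages.dsp (it is wrapped above for display). The field names are this example's interpretation of the message text, not documented DSP terminology.

```python
import re
from typing import NamedTuple, Optional


class RetryReport(NamedTuple):
    retries: int   # retries counted in the interval
    seconds: int   # length of the interval in seconds
    total: int     # running total of retries
    alu: str       # identifier printed after "ALU"


# Pattern inferred from the single sample above; other firmware levels may
# format this message differently.
EXCESSIVE_RETRIES = re.compile(
    r"Excessive retries \((\d+) in (\d+) sec, (\d+) total\)\s+ALU\s+([0-9A-Fa-f]+)"
)


def parse_retry_report(line: str) -> Optional[RetryReport]:
    """Extract the retry figures from a line, or return None if absent."""
    match = EXCESSIVE_RETRIES.search(line)
    if match is None:
        return None
    retries, seconds, total, alu = match.groups()
    return RetryReport(int(retries), int(seconds), int(total), alu)


# The warning from the excerpt above, joined onto one line.
line = (
    "Jan 18 13:54:31 dsp00  01/18/2006 20:53:59 LOG_WARNING (SCSI: 4-3) "
    "Excessive retries (100 in 639 sec, 600 total) ALU "
    "60003BACCBC38000435971F200094BC1  Io Timeout"
)
print(parse_retry_report(line))
```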

The disk drive may also cause a RAID controller failure due to a communication failure along the drive channels, as shown in the following excerpts from the messages.array file:

    Jan 11 21:21:18 LMON[2]: E: RAS: u2d13 port on Loop 1 is experiencing
    intermittent faults. Drive shall be disabled after copy reconstruction
    completes.
    Jan 11 21:21:21 LPCT[1]: N: u2d13 Bypassed on loop 1
    Jan 11 21:21:29 LT09[1]: N: u2d13 Copying drive to standby disk (u1d14)
    started
    Jan 11 21:47:18 LMON[2]: E: RAS: u2d13 port on Loop 2 is experiencing
    intermittent faults. Drive is disabled.

and:

    Feb 14 09:31:52 IPCS[1]: N: u3ctr: Inter-controller communication failed
    Feb 14 09:31:52 TMON[1]: W: u3ctr cannot read from thermal sensor

and the communication failure between the controllers:

    Jan 11 21:48:12 IPCS[1]: N: u2ctr Inter-controller communication failed:
    Receiver offline
    Jan 11 21:48:33 LPCT[1]: N: u2l1 Controller off the loop
    Jan 11 21:48:34 LPCT[1]: N: u2l2 Controller off the loop
    Jan 11 21:48:39 IPCS[1]: N: u2ctr Inter-controller communication failed:
    Receiver offline
    Jan 11 21:48:41 IPCS[1]: N: u2ctr Inter-controller communication failed:
    Receiver offline
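
When reviewing a long messages.array file, it can help to reduce it to the lines that mark the progression shown above, from drive port faults through to a controller dropping off the loop. The sketch below does this with simple substring matching. The marker strings are copied from the excerpts, the labels are this example's own wording, and each message is assumed to occupy one line in the file.

```python
# Substrings copied from the messages.array excerpts above, paired with a
# short label. The labels are illustrative, not StorADE terminology.
STAGES = (
    ("intermittent faults", "drive port faults"),
    ("Bypassed on loop", "drive bypassed"),
    ("Copying drive to standby disk", "copy to standby started"),
    ("Inter-controller communication failed", "inter-controller failure"),
    ("Controller off the loop", "controller off loop"),
)


def classify(line):
    """Return the label of the first stage whose marker is in the line."""
    for marker, label in STAGES:
        if marker in line:
            return label
    return None


def timeline(lines):
    """Return (label, line) pairs for the lines that match a known stage."""
    matched = []
    for line in lines:
        label = classify(line)
        if label is not None:
            matched.append((label, line))
    return matched


# Lines taken from the excerpts above.
sample = [
    "Jan 11 21:21:21 LPCT[1]: N: u2d13 Bypassed on loop 1",
    "Feb 14 09:31:52 TMON[1]: W: u3ctr cannot read from thermal sensor",
    "Jan 11 21:48:12 IPCS[1]: N: u2ctr Inter-controller communication failed: Receiver offline",
    "Jan 11 21:48:33 LPCT[1]: N: u2l1 Controller off the loop",
]
for label, line in timeline(sample):
    print(f"{label:26s} {line}")
```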

4. Workaround

Relief for this issue involves identifying the marginal/failing drive, replacing it using the standard replacement procedure, and then repairing the state of the array. (The sooner this procedure is completed, the smaller the impact on the 6920.)

This general procedure is defined as follows:

1) Identify the marginal/failing drive by reviewing the Events Log in StorADE

2) Remove the marginal/failing disk from the array

3) Verify that all Volumes and VDisks are in an optimal (OK) state by doing the following:

For Volumes:

Web Console -> Configuration Services -> Volumes Tab

or, from the command line:

sscs list volume volume_name

For Virtual disks:

Web Console -> Configuration Services -> Virtual Disks

or, from the command line:

sscs list vdisk vdisk_name

4) Verify the states of the trays are OK:

Configuration Services -> Physical Storage

5) Verify that host access and performance are restored to the affected 6920 volumes. This may involve some recovery activities for filesystems, volume managers, and other upper layer software on the hosts.

6) The replacement drive may now be inserted into any free slot in the tray. This will start a copy-back operation if a spare was available during step 2, or start a reconstruction from available parity.

Note: Having a Solution Extract available for analysis is necessary if the array, volume, virtual disk, or host access is not optimal after the above procedure.
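
When many Volumes and VDisks need to be checked in step 3, the sscs commands can be generated from a list of names instead of being typed one by one. The sketch below only builds the command strings shown in step 3 and does not run them. The object names are hypothetical, the function name is this example's own, and how the sscs session to the array is established is outside its scope.

```python
import shlex


def verification_commands(volumes, vdisks):
    """Build the step 3 sscs commands for the given object names."""
    commands = [["sscs", "list", "volume", name] for name in volumes]
    commands += [["sscs", "list", "vdisk", name] for name in vdisks]
    # shlex.join quotes any name that contains shell-special characters.
    return [shlex.join(command) for command in commands]


# Hypothetical object names, for illustration only.
for command in verification_commands(["vol01", "vol02"], ["vdisk1"]):
    print(command)
```

The output of each command still has to be read by an operator to confirm the optimal (OK) state; this sketch makes no assumption about the output format.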

5. Resolution

There are no further updates planned for this Sun Alert document. If
you need additional assistance regarding this issue, please contact Sun
Services.


This Sun Alert notification is being provided to you on an "AS IS"
basis. This Sun Alert notification may contain information provided by
third parties. The issues described in this Sun Alert notification may
or may not impact your system(s). Sun makes no representations,
warranties, or guarantees as to the information contained herein. ANY
AND ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR
NON-INFRINGEMENT, ARE HEREBY DISCLAIMED. BY ACCESSING THIS DOCUMENT YOU
ACKNOWLEDGE THAT SUN SHALL IN NO EVENT BE LIABLE FOR ANY DIRECT,
INDIRECT, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES THAT ARISE OUT
OF YOUR USE OR FAILURE TO USE THE INFORMATION CONTAINED HEREIN. This
Sun Alert notification contains Sun proprietary and confidential
information. It is being provided to you pursuant to the provisions of
your agreement to purchase services from Sun, or, if you do not have
such an agreement, the Sun.com Terms of Use. This Sun Alert
notification may only be used for the purposes contemplated by these
agreements.

 
Copyright 2000-2008 Sun Microsystems, Inc., 4150 Network Circle, Santa Clara, CA 95054 U.S.A. All rights reserved.

Modification History
31-Mar-2008: no further updates. Resolved.

Previously Published As
102193
Internal Comments

The procedure above goes as far as a customer is allowed to go on most 6920s without service intervention.

Services should observe whether it is necessary to manually enable a failed 6020 tray controller to restore complete functionality.

A dsp reboot may be necessary to completely address the issue.

In general the order of operations to fix the issue is:

- Repair the array by removing the drive and fixing the controller states

- Repair the DSP ("reboot now" may be needed)

- Repair the host connectivity

Internal Contributor/submitter
Michiel.Bijlsma@sun.com

Internal Eng Business Unit Group
NWS (Network Storage)

Internal Eng Responsible Engineer
Michiel.Bijlsma@sun.com

Internal Services Knowledge Engineer
david.mariotto@sun.com

Internal Escalation ID
1-14408862, 1-14202409, 1-13947352, 1-13950521, 1-13559924, 1-14901226, 1-13734116, 1-14913081, 1-15093480, 1-15144772

Internal Sun Alert Kasp Legacy ID
102193

Internal Sun Alert & FAB Admin Info
Critical Category: Availability ==> Pervasive
Significant Change Date: 2006-02-27
Avoidance: Workaround
Responsible Manager: lorraine.dilliway@sun.com
Original Admin Info: [WF 27-Feb-2006, Dave M: review completed, curtis and NWS also says OK for release]
[WF 24-Feb-2006, Dave M: sending for review]
[WF 23-Feb-2006, Dave M: draft created]

Product_uuid
67794720-356d-11d7-8ef2-ce2ac2bc9136|Sun StorageTek 6920 System

Attachments

This solution has no attachment