Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Sun Alert Sure Solution

1000686.1: Sun StorEdge 6920 Array With Drive(s) in Marginal State May Experience I/O Timeout and Loss of Access to Components

Previously Published As: 200898
Product: Sun StorageTek 6920 System
Bug Id: <SUNBUG: 6359654>, <SUNBUG: 6366175>
Date of Workaround Release: 27-FEB-2006
Date of Resolved Release: 31-Mar-2008

A marginal drive in a Sun StorEdge 6920 array can adversely affect the array ... see below:

1. Impact

A marginal drive in a Sun StorEdge 6920 array can adversely affect the array to the point of I/O timeout or loss of access to all its components. The severity of impact depends on the length of time that the drive has been marginal and the use of Data Services (Local Mirror, Snapshot, Remote Replication) employed in the configuration.

2. Contributing Factors

This issue can occur on the following platform:
To determine the release running on the array, do the following:

1. Log in to the "Storage Automated Diagnostic Environment" web console as the "storage" user.
   Sun Storage Automated Diagnostic Environment -> Inventory -> Service Processor
2. Check the "Details" view for the version listed (e.g. 3.0.0.25).

3. Symptoms

The host will experience a number of symptoms that manifest as poor performance and/or hardware communication problems with the array. While the performance issue is subjective, the host's /var/adm/messages file (or error log) may show SCSI messages such as the following from a Solaris host:

   Jan 11 21:23:36 myhost /scsi_vhci/ssd@g600015d000045300000000000000095b (ssd21): Command Timeout on path /pci@5d,600000/SUNW,qlc@1/fp@0,0 (fp0)
   Jan 11 21:23:36 myhost scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci/ssd@g600015d000045300000000000000095b (ssd 21):
   Jan 11 21:23:36 myhost SCSI transport failed: reason 'timeout': retrying command

The 6920 itself may show a variety of symptoms as a result of the marginal drive, such as a status of "degraded" when viewed by Configuration Services (Web Console or sscs(1M)). These states depend upon whether a StorADE Alarm has been triggered as a result of this event, and whether one or more Volumes or VDisks are not in an optimal state.

Alarms may not be triggered because the majority of the drive errors are of type "Log." As a result, customers using notifications through Sun Storage Remote Response (SSRR) or native email through StorADE will not be notified of this situation.
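Because alarms may never fire for this condition, it can help to scan the host log directly for the timeout messages shown earlier in this section. The following is a minimal sketch in Python, not a supported tool: the helper name count_scsi_timeouts and the pattern labels are hypothetical, and the match strings are copied from the sample Solaris messages, so other hosts or driver versions may word them differently.

```python
import re
from collections import Counter

# Substrings copied from the sample Solaris messages in this section.
# Other operating systems or driver versions may word them differently.
TIMEOUT_PATTERNS = {
    "command_timeout": re.compile(r"Command Timeout on path"),
    "transport_timeout": re.compile(r"SCSI transport failed: reason 'timeout'"),
}


def count_scsi_timeouts(lines):
    """Count how many lines match each timeout pattern (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        for label, pattern in TIMEOUT_PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


# Sample lines taken from the host messages quoted in this section.
SAMPLE_LINES = [
    "Jan 11 21:23:36 myhost /scsi_vhci/ssd@g600015d000045300000000000000095b "
    "(ssd21): Command Timeout on path /pci@5d,600000/SUNW,qlc@1/fp@0,0 (fp0)",
    "Jan 11 21:23:36 myhost scsi: [ID 243001 kern.warning] WARNING: "
    "/scsi_vhci/ssd@g600015d000045300000000000000095b (ssd 21):",
    "Jan 11 21:23:36 myhost SCSI transport failed: reason 'timeout': retrying command",
]

if __name__ == "__main__":
    for label, n in sorted(count_scsi_timeouts(SAMPLE_LINES).items()):
        print(f"{label}: {n}")
```

To run it against a live host, pass an open file object for /var/adm/messages (or the host's equivalent error log) instead of SAMPLE_LINES. A steadily growing count is only a hint that the array should be checked, not proof of a marginal drive.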
The bulk of the notifications are logged in StorADE via the Events Log, and can be seen by logging in to the StorADE Web Console:

   StorADE -> Administration Tab -> Events Log Tab
   Solution Extract Directory/Storade/Event.log (See details below for the extract process)

The following are examples of common events seen with this issue:

Event 1 (6020.LogEvent) - Viewed using StorADE Events Log Tab:

   Date                 Event Details  Device   Component    Type
   12/16/2005 18:54:15  Details*       array10  M.disk.u3d8  Log

Viewed in Events.log:

   20051216185415.000006+0000 6020.LogEvent.array_warning M.disk.u3d8 4 device_warning(s) found in logfile /var/adm/messages.array (related to 6020 array10/192.168.0.50) : (TimeZone GMT)Dec 16 18:53:25 array10 ISR1[3]: W: u3d08 SCSI error occurred: Medium Error (sense key = 0x3). Read Retries Exhausted. :: (last event on this subject for next 1 days): Sev:1 Action:FALSE Enc:0x301.5405318.413946 mgmtLevel:C AgentH:local Agg:-1 ECode:6020.LogEvent.array_warning Topic:
   LDec 16 18:53:25 array10 ISR1[3]: W: u3d08 SCSI error occurred: Medium Error (sense key = 0x3). Read Retries Exhausted.
   LDec 16 18:53:40 array10 ISR1[1]: W: u3d08 SCSI error occurred: Medium Error (sense key = 0x3). Mechanical Positioning Error.
   LDec 16 18:53:58 array10 ISR1[3]: W: u3d08 SCSI error occurred: Medium Error (sense key = 0x3). Read Retries Exhausted.
   LDec 16 18:54:08 array10 ISR1[1]: W: u3d08 SCSI error occurred: Medium Error (sense key = 0x3). Mechanical Positioning Error.

Event 2 (6020.LogEvent) - Viewed using StorADE Events Log Tab:

   Date                 Event Details  Device   Component    Type
   12/16/2005 19:09:58  Details*       array10  M.disk.u3d8  Log

Viewed in Events.log:

   20051216190958.000002+0000 6020.LogEvent.array_warning M.disk.u3d8 1 device_warning(s) found in logfile /var/adm/messages.array (related to 6020 array10/192.168.0.50): (TimeZone GMT)Dec 16 18:51:38 array10 ISR1[3]: W: u3d08 SCSI error occurred: Medium Error (sense key = 0x3). Read Retries Exhausted.::Sev:1 Action:FALSE Enc:0x301.5405318.413946 mgmtLevel:C AgentH:local Agg:5 ECode:6020.LogEvent.array_warning Topic:

Event 3 (dsp.LogEvent) - Viewed using StorADE Events Log Tab:

   Date                 Event Details  Device  Component        Type
   12/16/2005 19:20:06  Details*       dsp00   M.FIBRE_CHANNEL  Log

Viewed in Events.log:

   20051216192006.000000+0000 dsp.LogEvent.log_warning M.FIBRE_CHANNEL 2 device_warning(s) found in logfile /var/adm/messages.dsp (related to dsp dsp00/192.168.0.10) : Dec 16 18:56:40 dsp00 12/16/2005 18:00:14 LOG_WARNING(FIBRE_CHANNEL: 2-4) isp1: 102 New Command Timeouts have occurred:: Sev:1 Action:FALSE Enc:210000015d045300 mgmtLevel:C AgentH:local Agg:-1 ECode:dsp.LogEvent.log_warning Topic:
   LDec 16 18:56:40 dsp00 12/16/2005 18:00:14 LOG_WARNING (FIBRE_CHANNEL: 2-4) isp1: 102 New Command Timeouts have occurred

* The "Details" link opens a new page showing the contents after the "Topic:" entry for each event in the log.

Note: The 6020 messages can vary dramatically depending on why the drive is failing or the subsystem load. Reviewing a solution extract from StorADE will allow a detailed review of the subsystem state.
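When many array warnings like those in the events above have accumulated, tallying them per drive can make the marginal drive stand out. The following is a minimal sketch in Python under stated assumptions: the helper name count_drive_errors is hypothetical, and the regular expression assumes the "uNdNN SCSI error occurred: ... (sense key = 0x...)" layout seen in the sample messages.array entries. As the note above says, real 6020 messages can vary widely, so lines in other layouts are simply ignored.

```python
import re
from collections import Counter

# Layout assumed from the sample entries in this section, e.g.
#   "... W: u3d08 SCSI error occurred: Medium Error (sense key = 0x3). ..."
SCSI_ERROR = re.compile(
    r"\b(?P<drive>u\d+d\d+)\s+SCSI error occurred:\s*"
    r"(?P<error>[^(]+?)\s*\(sense key = (?P<key>0x[0-9a-fA-F]+)\)"
)


def count_drive_errors(lines):
    """Tally SCSI errors per (drive, error type); a hypothetical helper."""
    counts = Counter()
    for line in lines:
        match = SCSI_ERROR.search(line)
        if match:
            counts[(match.group("drive"), match.group("error"))] += 1
    return counts


# Sample lines adapted from Event 1 and Event 3 in this section.
SAMPLE_LINES = [
    "Dec 16 18:53:25 array10 ISR1[3]: W: u3d08 SCSI error occurred: "
    "Medium Error (sense key = 0x3). Read Retries Exhausted.",
    "Dec 16 18:53:40 array10 ISR1[1]: W: u3d08 SCSI error occurred: "
    "Medium Error (sense key = 0x3). Mechanical Positioning Error.",
    "Dec 16 18:53:58 array10 ISR1[3]: W: u3d08 SCSI error occurred: "
    "Medium Error (sense key = 0x3). Read Retries Exhausted.",
    "Dec 16 18:54:08 array10 ISR1[1]: W: u3d08 SCSI error occurred: "
    "Medium Error (sense key = 0x3). Mechanical Positioning Error.",
    "Dec 16 18:56:40 dsp00 12/16/2005 18:00:14 LOG_WARNING (FIBRE_CHANNEL: 2-4) "
    "isp1: 102 New Command Timeouts have occurred",
]

if __name__ == "__main__":
    for (drive, error), n in count_drive_errors(SAMPLE_LINES).most_common():
        print(f"{drive}: {n} x {error}")
```

A drive that dominates the tally is a candidate for closer review in the StorADE Events Log. The count alone does not establish that the drive should be replaced.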
To collect a solution extract:

   StorADE -> Administration Tab -> Utilities Tab -> Solution Extract Tab -> Extract

then, to view the extract:

   Solution Extract Tab -> "View Content" (for the date of the extract)

   Storade/Events.log
   Sp/messages/var_adm/messages.dsp
   Sp/messages/var_adm/messages.array

Messages may include the following (from the messages.dsp file):

   Jan 18 13:54:28 dsp00 01/18/2006 20:53:55 LOG_INFO (iSCSI: 3-1) Initiator timed-out command: connection=0xa00000d
   Jan 18 13:54:31 dsp00 01/18/2006 20:53:59 LOG_INFO (iSCSI: 4-1) Initiator timed-out command: connection=0xa00000d
   Jan 18 13:54:31 dsp00 last message repeated 5 times
   Jan 18 13:54:31 dsp00 01/18/2006 20:53:59 LOG_WARNING (SCSI: 4-3) Excessive retries (100 in 639 sec, 600 total) ALU 60003BACCBC38000435971F200094BC1 Io Timeout

The disk drive may also cause a RAID controller failure due to a communication failure along the drive channels, as shown in the following messages.array file:

   Jan 11 21:21:18 LMON[2]: E: RAS: u2d13 port on Loop 1 is experiencing intermittent faults. Drive shall be disabled after copy reconstruction completes.
   Jan 11 21:21:21 LPCT[1]: N: u2d13 Bypassed on loop 1
   Jan 11 21:21:29 LT09[1]: N: u2d13 Copying drive to standby disk (u1d14) started
   Jan 11 21:47:18 LMON[2]: E: RAS: u2d13 port on Loop 2 is experiencing intermittent faults. Drive is disabled.

and:

   Feb 14 09:31:52 IPCS[1]: N: u3ctr: Inter-controller communication failed
   Feb 14 09:31:52 TMON[1]: W: u3ctr cannot read from thermal sensor

and the communication failure between the controllers:

   Jan 11 21:48:12 IPCS[1]: N: u2ctr Inter-controller communication failed: Receiver offline
   Jan 11 21:48:33 LPCT[1]: N: u2l1 Controller off the loop
   Jan 11 21:48:34 LPCT[1]: N: u2l2 Controller off the loop
   Jan 11 21:48:39 IPCS[1]: N: u2ctr Inter-controller communication failed: Receiver offline
   Jan 11 21:48:41 IPCS[1]: N: u2ctr Inter-controller communication failed: Receiver offline

4. Workaround

Relief for this issue involves identifying the marginal/failing drive, replacing it using the standard replacement procedure, and then repairing the state of the array. (The sooner this procedure is completed, the smaller the impact on the 6920.) The general procedure is as follows:

1) Identify the marginal/failing drive by reviewing the Events Log in StorADE.

2) Remove the marginal/failing disk from the array.

3) Verify that all Volumes and VDisks are in an optimal (OK) state by doing the following:

   For Volumes:
      Web Console -> Configuration Services -> Volumes Tab
      sscs list volume volume_name

   For Virtual disks:
      Web Console -> Configuration Services -> Virtual Disks
      sscs list vdisk vdisk_name

4) Verify that the states of the trays are OK:
      Configuration Services -> Physical Storage

5) Verify that host access and performance are restored to the affected 6920 volumes. This may involve some recovery activities for filesystems, volume managers, and other upper layer software on the hosts.

6) The replacement drive may now be inserted into any free slot in the tray. This will start a copy-back operation if a spare was available during step 2, or start a reconstruction from available parity.

Note: A Solution Extract must be available for analysis if the array, volume, virtual disk, or host access is not optimal after the above procedure.

5. Resolution

There are no further updates planned for this Sun Alert document.

Modification History

31-Mar-2008: no further updates. Resolved.

Previously Published As 102193

Internal Comments

We took this as far as we have allowed a customer to access most 6920s without service intervention. Services should observe whether it is necessary to manually enable a failed 6020 tray controller to restore complete functionality. A dsp reboot may be necessary to completely address the issue.
In general the order of operations to fix the issue is:

- Repair the array by removing the drive, and fixing the controller states
- Repair the DSP ("reboot now" may be needed)
- Repair the host connectivity

Internal Contributor/submitter: Michiel.Bijlsma@sun.com
Internal Eng Business Unit Group: NWS (Network Storage)
Internal Eng Responsible Engineer: Michiel.Bijlsma@sun.com
Internal Services Knowledge Engineer: david.mariotto@sun.com
Internal Escalation ID: 1-14408862, 1-14202409, 1-13947352, 1-13950521, 1-13559924, 1-14901226, 1-13734116, 1-14913081, 1-15093480, 1-15144772
Internal Sun Alert Kasp Legacy ID: 102193

Internal Sun Alert & FAB Admin Info

Critical Category: Availability ==> Pervasive
Significant Change Date: 2006-02-27
Avoidance: Workaround
Responsible Manager: lorraine.dilliway@sun.com
Original Admin Info:
[WF 27-Feb-2006, Dave M: review completed, curtis and NWS also says OK for release]
[WF 24-Feb-2006, Dave M: sending for review]
[WF 23-Feb-2006, Dave M: draft created]

Product_uuid: 67794720-356d-11d7-8ef2-ce2ac2bc9136|Sun StorageTek 6920 System

Attachments: This solution has no attachment