Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-77-1300555.1
Update Date:2011-05-02
Keywords:

Solution Type  Sun Alert Sure

Solution  1300555.1 :   Replacement of Drives with Mechanical Positioning Errors May Cause RAID Controllers Reset or Lockdown Unexpectedly  


Related Items
  • Sun Storage Flexline 380 Array
  • Sun Storage 6580 Array
  • Sun Storage 6180 Array
  • Sun Storage 6540 Array
  • Sun Storage 2510 Array
  • Sun Storage 2540 Array
  • Sun Storage 6780 Array
  • Sun Storage 2530 Array
  • Sun Storage 6140 Array
Related Categories
  • GCS>Sun Microsystems>Sun Alert>Criteria Category>Availability
  • GCS>Sun Microsystems>Sun Alert>Release Phase>Resolved




In this Document
  Description
  Likelihood of Occurrence
  Possible Symptoms
  Workaround or Resolution
  Modification History


Applies to:

Sun Storage Flexline 380 Array - Version: Not Applicable to Not Applicable   [Release: N/A to N/A]
Sun Storage 6780 Array - Version: Not Applicable to Not Applicable   [Release: N/A to N/A]
Sun Storage 6580 Array - Version: Not Applicable to Not Applicable   [Release: N/A to N/A]
Sun Storage 2530 Array - Version: Not Applicable to Not Applicable   [Release: N/A to N/A]
Sun Storage 2540 Array - Version: Not Applicable to Not Applicable   [Release: N/A to N/A]
Information in this document applies to any platform.
This issue also applies to Sun Storage 2510, 2530, 2540, 6140, 6180, 6540, 6580, and 6780 Arrays.
______________________

Date of Resolved Release:
02-Mar-2011

Description

Drives that have mechanical positioning errors may cause a RAID controller in one of the products identified above to reboot when the controllers attempt to fail that drive. The drive is marked as failed once the controller completes start-of-day (SOD) processing.

Note: Attempting to manually fail an affected drive may cause a lockdown, and therefore an outage, when the rebooting controller is unable to verify the DACstore on a drive that still reports Optimal to the surviving controller.

Likelihood of Occurrence

This issue can occur on the following systems:
  • Sun StorageTek 6140 Arrays without Array Firmware 07.60.53.10 or later
  • Sun StorageTek 6540 Arrays without Array Firmware 07.60.53.10 or later
  • StorageTek Flexline 380 without Array Firmware 07.60.53.10 or later
  • Sun Storage 6180 Arrays without Array Firmware 07.60.53.10 or later
  • Sun Storage 6580 Arrays without Array Firmware 07.60.53.10 or later
  • Sun Storage 6780 Arrays without Array Firmware 07.60.53.10 or later
  • Sun StorageTek 2510 Arrays without Array Firmware 7.35.55.11 or later
  • Sun StorageTek 2530 Arrays without Array Firmware 7.35.55.11 or later
  • Sun StorageTek 2540 Arrays without Array Firmware 7.35.55.11 or later
This issue only occurs on arrays with one of the following drive models:
  • ST330055SSUN300G
  • ST330055FSUN300G
Note: Arrays with 6.xx firmware are not affected by this issue. With that firmware, only a single controller issues the command to fail the drive, so no race condition exists.
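
The exposure criteria above can be expressed as a small check. The following is a minimal sketch only, assuming the array model, firmware string, and drive model strings have already been collected by some other means. The names `FIXED_FIRMWARE`, `parse_version`, and `is_exposed` are hypothetical and are not part of CAM or SANtricity. Drive models are matched exactly against the two strings listed above, which is an assumption about how the inventory reports them.

```python
# Minimum fixed firmware per array model, copied from the list above.
FIXED_FIRMWARE = {
    "6140": "07.60.53.10",
    "6540": "07.60.53.10",
    "Flexline 380": "07.60.53.10",
    "6180": "07.60.53.10",
    "6580": "07.60.53.10",
    "6780": "07.60.53.10",
    "2510": "7.35.55.11",
    "2530": "7.35.55.11",
    "2540": "7.35.55.11",
}

# Drive models named in this alert.
AFFECTED_DRIVE_MODELS = {"ST330055SSUN300G", "ST330055FSUN300G"}


def parse_version(text):
    """Turn '07.60.53.10' into (7, 60, 53, 10) for numeric comparison."""
    return tuple(int(part) for part in text.split("."))


def is_exposed(array_model, firmware, drive_models):
    """Return True if the array matches the criteria described in this alert."""
    fixed = FIXED_FIRMWARE.get(array_model)
    if fixed is None:
        return False  # array model is not listed in this alert
    version = parse_version(firmware)
    if version[0] == 6:
        return False  # 6.xx firmware is stated to be unaffected
    if version >= parse_version(fixed):
        return False  # already at or beyond the fixed firmware level
    # Exposed only if at least one listed drive model is installed.
    return any(model in AFFECTED_DRIVE_MODELS for model in drive_models)
```

For example, `is_exposed("6140", "07.50.13.10", ["ST330055FSUN300G"])` evaluates to `True`, while the same array at `07.60.53.10` evaluates to `False`.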

Possible Symptoms

The aforementioned disk drives are failing due to a mechanical positioning error, as seen in array event log entries similar to the following:
      B:10/27/10 7:11:30 AM : 4050 : 0/0/0 : 6008 : Internal : Drive : Tray 33, Slot 16 : Stable storage drive unusable
      A:10/27/10 7:10:47 AM : 4051 : 4/15/1 : 100A : Error : Drive : Tray 33, Slot 16 : Drive returned CHECK CONDITION : Mechanical Positioning Error
      A:10/27/10 7:10:49 AM : 4053 : 0/0/0 : 6008 : Internal : Drive : Tray 33, Slot 16 : Stable storage drive unusable
      Sense: 04 HARDWARE ERROR
      ASC/ASCQ: 15/01 MECHANICAL POSITIONING ERROR

These entries indicate a drive mechanical positioning error. The affected drive models have an existing issue in which the heads stay in one position after a drive error log update. This reduces the lubrication at the drive heads and leads to a head crash, which is what the codes above indicate.
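
To locate affected drives in a saved event log, the entries can be filtered on the sense data. The sketch below assumes the colon-separated layout of the excerpt above and assumes that the third field (for example `4/15/1`) is the sense key, ASC, and ASCQ in hexadecimal, which is consistent with the "Sense: 04" and "ASC/ASCQ: 15/01" values shown. The function name `positioning_error_drives` is hypothetical.

```python
import re

# One event per line: controller, timestamp, sequence number,
# sense key/ASC/ASCQ, event code, priority, component, location.
ENTRY = re.compile(
    r"^(?P<ctrl>[AB]):(?P<when>.+?) : (?P<seq>\d+)"
    r" : (?P<sense>[0-9A-Fa-f]+/[0-9A-Fa-f]+/[0-9A-Fa-f]+)"
    r" : (?P<event>[0-9A-Fa-f]{4}) : (?P<priority>\w+) : (?P<component>\w+)"
    r" : Tray (?P<tray>\d+), Slot (?P<slot>\d+)"
)

# Sense key 04 (HARDWARE ERROR), ASC/ASCQ 15/01 (MECHANICAL POSITIONING ERROR).
POSITIONING_ERROR = (0x04, 0x15, 0x01)


def positioning_error_drives(lines):
    """Return sorted (tray, slot) pairs that logged sense 4/15/1."""
    hits = set()
    for raw in lines:
        match = ENTRY.match(raw.strip())
        if match is None:
            continue  # continuation line or unrelated text
        sense = tuple(int(part, 16) for part in match.group("sense").split("/"))
        if sense == POSITIONING_ERROR:
            hits.add((int(match.group("tray")), int(match.group("slot"))))
    return sorted(hits)
```

Applied to the three entries in the excerpt above, this returns `[(33, 16)]`.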

Manually failing a drive that is in an Optimal state, which results in one or more volumes in the array failing, typically leads to one or both controllers resetting and possibly being held in a Lockdown state because of problems accessing the metadata on the array. The underlying cause is a problem accessing and updating the metadata on the disk drive that is reporting the error.

On a 6140, 6540, or Flexline 380, the lockdown state may show as LU, 88, or SD on one controller. On a 6180, 6580, or 6780, the lockdown state may show as a flashing display of OE+ LU+ blank-

Note: Controllers in a lockdown or offline state should be serviced immediately by Oracle support for correction.

Automatic failure of drives due to write failures caused by the aforementioned error can also result in a controller reset, with event log entries similar to the following:
B Sat Jan 01 16:30:54 PST 2011 54527 4/15/1 100A Error Drive Tray.01.Drive.03
Drive returned CHECK CONDITION - Mechanical Positioning Error
B Sat Jan 01 16:30:54 PST 2011 54528 204/15/1 1012 Error Drive Tray.01.Drive.03
Destination driver event - Mechanical Positioning Error
B Sat Jan 01 16:30:54 PST 2011 54529 0/0/0 6008 Notification Drive Tray.01.Drive.03
Stable storage drive unusable due to I/O errors
B Sat Jan 01 16:49:29 PST 2011 54530 0/0/0 100D Error Drive Tray.01.Drive.03
Timeout on drive side of controller
B Sat Jan 01 16:49:40 PST 2011 54531 0/0/0 100D Error Drive Tray.01.Drive.03
Timeout on drive side of controller
B Sat Jan 01 16:49:51 PST 2011 54532 0/0/0 100D Error Drive Tray.01.Drive.03
Timeout on drive side of controller
B Sat Jan 01 16:50:00 PST 2011 54533 201020b/0/0 1012 Error Drive Tray.01.Drive.03
Destination driver event - IO timeout
B Sat Jan 01 16:50:00 PST 2011 54534 0/0/0 201E Notification Controller
Tray.85.Controller.B VDD repair started
B Sat Jan 01 16:50:00 PST 2011 54535 0/0/0 201E Notification Controller
Tray.85.Controller.B VDD repair started
B Sat Jan 01 16:50:00 PST 2011 54536 0/0/0 201E Notification Controller
Tray.85.Controller.B VDD repair started
B Sat Jan 01 16:50:00 PST 2011 54537 0/0/0 2014 Notification Controller
Tray.85.Controller.B VDD logged an error
B Sat Jan 01 16:50:00 PST 2011 54538 0/0/0 201F Notification Controller
Tray.85.Controller.B VDD repair completed
B Sat Jan 01 16:50:00 PST 2011 54539 0/0/0 201F Notification Controller
Tray.85.Controller.B VDD repair completed
B Sat Jan 01 16:50:00 PST 2011 54540 0/0/0 201F Notification Controller
Tray.85.Controller.B VDD repair completed
B Sat Jan 01 16:50:01 PST 2011 54541 0/0/0 2226 Notification Drive
Tray.01.Drive.03 Drive spun down
B Sat Jan 01 16:50:01 PST 2011 54542 0/0/0 226C Failure Drive Tray.01.Drive.03
Drive failure detected
B Sat Jan 01 16:50:01 PST 2011 54543 0/0/0 2215 Notification Drive Tray.01.Drive.03
Drive marked failed
B Sat Jan 01 16:50:01 PST 2011 54544 0/0/0 2217 Notification Drive Tray.01.Drive.03
Piece failed
B Sat Jan 01 16:50:01 PST 2011 54545 0/0/0 2216 Notification Drive Tray.01.Drive.03
Piece taken out of service
B Sat Jan 01 16:50:01 PST 2011 54546 0/0/0 2217 Notification Drive Tray.01.Drive.03
Piece failed
B Sat Jan 01 16:50:02 PST 2011 54547 0/0/0 100D Error Drive Tray.01.Drive.03
Timeout on drive side of controller
B Sat Jan 01 16:51:02 PST 2011 54548 0/0/0 400F Notification Controller
Tray.85.Controller.A Controller reset by its alternate Reboot Reason: REBOOTALT_DBM_HEALTH_CHECK_EVENT

Note: A drive being failed by the system does not usually result in a lockdown or offline controller state. After a power cycle or controller reset, the drives often transition to a state of INCOMPATIBLE.
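
When reviewing a log in the two-line layout of the excerpt above, it can help to confirm that a controller reset followed a mechanical positioning error and to see how long the drive-side timeouts lasted before the reset. The following is a rough sketch under two assumptions: every event occupies exactly two lines (header, then description), and the time zone token can be ignored because all entries share it. The function names are hypothetical.

```python
from datetime import datetime


def parse_stamp(fields):
    """Parse ['Sat', 'Jan', '01', '16:30:54', 'PST', '2011'], ignoring the zone."""
    month, day, clock, year = fields[1], fields[2], fields[3], fields[5]
    return datetime.strptime(f"{month} {day} {clock} {year}", "%b %d %H:%M:%S %Y")


def records(lines):
    """Yield (timestamp, text) for each two-line event in the log."""
    stripped = [line.strip() for line in lines if line.strip()]
    for head, tail in zip(stripped[0::2], stripped[1::2]):
        fields = head.split()
        # fields[0] is the controller letter; the next six are the timestamp.
        yield parse_stamp(fields[1:7]), head + " " + tail


def reset_after_positioning_error(lines):
    """Return the delay between the first positioning error and the
    first later controller reset, or None if that sequence is absent."""
    first_error = None
    for stamp, text in records(lines):
        if first_error is None and "Mechanical Positioning Error" in text:
            first_error = stamp
        elif first_error is not None and "Controller reset by its alternate" in text:
            return stamp - first_error
    return None
```

For the excerpt above, the first positioning error is logged at 16:30:54 and the reset at 16:51:02, so the function returns a delay of 20 minutes and 8 seconds.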

Workaround or Resolution

To work around the described issue, avoid manually failing drives. This prevents the lockdown conditions that require service intervention. To replace a drive under these conditions, use the steps below to avoid the accessibility and availability issues described in the Possible Symptoms section. In general, avoid any user-initiated configuration change until the affected drive is removed from the system.

1. Physically remove ALL Global Hot Spares from the system. Do NOT unassign them. This step is necessary because any operation that removes a drive from the hot spare list through the user interface would provoke the defect.

2. Physically remove and replace the suspect disk(s). The Common Array Manager (CAM) Service Advisor removal and replacement procedures state that the drive fault LEDs should be lit and that the status should be Failed. Ignore this requirement; the drive can be removed once its location is identified.

3. Replace the drive in the management software:

CAM:
Use the disk replacement procedure in CAM Service Advisor, as outlined in “Service Advisor > Portable Virtual Disk Management > Replace a Disk Drive”. CAM allows the pulled drives to be replaced using the same tray/slot into which the new drives were inserted. Go directly to step 1 under “To Replace a Removed Disk Drive” and replace the drive using the same tray/slot as the drive that was just physically replaced. Once the CAM drive replacement procedure has been completed, the volume group rebuild starts.

SANtricity:
Select the Volume Group which contains the replacement drive. Select Volume Group -> Replace Drives. Select the replacement drive and replace it by itself.

4. Re-insert the Global Hot Spare drives. Any hot spare drive that has itself failed should be unresponsive on insertion and should not cause a controller reboot. If such drives were not identified during the initial analysis, have their failure analyzed and replace them as necessary.

This issue is resolved in CAM 6.7.0 with Firmware Patch 145965-02/145966-02/145967-02 or later.

Modification History

02-Mar-2011: Document created

14-Mar-2011: Updated Likelihood of Occurrence section
06-Apr-2011: Updated Workaround section
02-May-2011: Updated Workaround/Resolution section.

If a lockdown condition does occur, evaluate and respond as follows:

For controllers showing:
  5d or Sd on one controller and 88 on the other
or
  5d or Sd on one controller and LU on the other
1. Power off the array.
2. Remove the controller showing 5d or Sd.
3. Power up the array.
4. Connect to the serial port of the booted controller and run lemClearLockdown, then sysReboot.
5. Wait for the controller to display the tray ID.
6. Insert the remaining controller.

For controllers showing:
  LU on one controller and the tray ID (85 or 99) on the other
1. Connect to the serial port of the LU controller and run lemClearLockdown.
2. Use the management interface to place the controller online.
3. The booting controller should then show the same tray ID as the surviving one.
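
The choice between the two recovery procedures above depends only on the pair of display codes. As an illustration, the sketch below maps a pair of codes to one of the two procedures. The function name and the returned labels are hypothetical, 5d and Sd are treated as the same code as in the text above, and any combination not listed in this document is returned as "escalate" rather than guessed at.

```python
# Tray IDs named in the procedure above.
TRAY_IDS = {"85", "99"}


def normalize(code):
    """Upper-case the display code and treat '5d' and 'Sd' as the same value."""
    code = code.strip().upper()
    return "SD" if code in {"5D", "SD"} else code


def recovery_procedure(first_display, second_display):
    """Pick a recovery procedure from the two controller display codes."""
    codes = {normalize(first_display), normalize(second_display)}
    if "SD" in codes and codes & {"88", "LU"}:
        # Power off, pull the 5d/Sd controller, power up, clear lockdown.
        return "power-cycle"
    if "LU" in codes and codes & TRAY_IDS:
        # Clear lockdown on the LU controller, then place it online.
        return "clear-lockdown-online"
    # Combination not covered by this document.
    return "escalate"
```

For example, `recovery_procedure("5d", "88")` selects the power-cycle procedure, and `recovery_procedure("85", "LU")` selects the procedure that clears the lockdown and places the controller online.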
Internal Contributor/Submitter: curtis.decotis@oracle.com
Internal Eng Responsible Engineer: rich.floyd@oracle.com
Internal Services Knowledge Engineer: jeff.folla@oracle.com
Internal Eng Business Unit Group: NWS
Internal Escalation ID: 72785060, 73355762, 73346876, 73452350, 73430212, 73503190
Please send questions to the following email:
sunalertpublication_us@oracle.com
and copy the Responsible Engineer listed above

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.