Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-77-1000681.1
Update Date: 2010-07-22
Keywords:

Solution Type  Sun Alert Sure

Solution  1000681.1 :   Failed Controller Condition May Cause Data Integrity Issues  


Related Items
  • Sun Storage 3510 FC Array
  • Sun Storage 3310 Array
  • Sun Storage 3511 SATA Array
  • Sun Storage 3320 SCSI Array
Related Categories
  • GCS>Sun Microsystems>Sun Alert>Criteria Category>Data Loss
  • GCS>Sun Microsystems>Sun Alert>Release Phase>Resolved

Previously Published As
200893


Product
Sun StorageTek 3310 SCSI Array
Sun StorageTek 3510 FC Array
Sun StorageTek 3320 SCSI Array
Sun StorageTek 3511 SATA Array

Bug Id
<SUNBUG: 6355818>

Date of Workaround Release
13-DEC-2005

Date of Resolved Release
15-JUN-2006

Impact

On a Sun StorEdge 33x0/35xx array, when a failed RAID controller condition exists and the array is power cycled, data integrity issues may occur.


Contributing Factors

This issue can occur on the following platforms:

  • Sun StorEdge 3310 SCSI array without firmware 4.15F (as delivered in patch 113722-15)
  • Sun StorEdge 3320 SCSI array without firmware 4.15G (as delivered in patch 113730-01)
  • Sun StorEdge 3510 FC array without firmware 4.15F (as delivered in patch 113723-15)
  • Sun StorEdge 3511 FC/SATA array without firmware 4.15F (as delivered in patch 113724-09)

Note: This issue can occur with all current firmware revisions available for the Sun StorEdge 33x0/35xx arrays.

This issue can occur when a default primary RAID controller failure condition exists and the array is power cycled during that time, resulting in stale cache data (contained in the failed controller) being written unexpectedly to disk.

Note: The default primary RAID controller is the controller with the higher serial number. To determine it, use the sccli command "show redundancy" (entered at the "sccli>" prompt). The serial number that matters is the one reported by the CLI, not the serial number on the back of the FRU.


Symptoms

Upon power cycling the array, the failed controller comes online and the existing filesystems on the array report fsck(1M) or other data integrity issues.


Workaround

Before resetting or power cycling the array, either discard the failed controller's cache or remove the failed controller from the array.

Note: Always replace the failed controller while the array is powered on (the array can be power cycled with just one controller).

Scenario 1 - Spare controller available:

With the power on and the array operational on one controller, remove the failed controller and install the replacement controller in its place. (Removing the failed controller with the power on, before the array is power cycled, prevents any stale data in the failed controller from being written out.)

Scenario 2 - Spare controller unavailable:

Option 1: This option assumes that you have in-band or out-of-band sccli access to the array.

Unfail the controller using the sccli command "unfail" (entered at the "sccli>" prompt). The failed controller's cache is discarded when the controller is put back online as the secondary controller. If this command fails, follow Option 2.

Option 2: Before resetting or power cycling the array, remove the failed controller. Then remove the battery module from the failed controller for at least 5 seconds and reinsert it; this invalidates the cache on the failed controller. To maintain proper air flow, partially reinsert the controller until it is 1 inch from its fully seated position.

Please refer to Sun documentation at: http://docs.sun.com/app/docs/doc/816-7326-20
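
The choice between the scenarios and options above can be summarized as a small decision function. The following Python sketch is illustrative only: the function name, the parameter names, and the Action labels are hypothetical and are not part of sccli or of any Sun tool. It also assumes that Option 2 is the fallback whenever sccli access is unavailable, which is a reading of the text above rather than something it states outright.

```python
from enum import Enum


class Action(Enum):
    # Labels are hypothetical; the descriptions paraphrase the workaround text.
    REPLACE_CONTROLLER = "Scenario 1: replace the failed controller with the power on"
    UNFAIL = "Scenario 2, Option 1: unfail the controller through sccli"
    PULL_AND_CLEAR_CACHE = (
        "Scenario 2, Option 2: remove the failed controller and pull its "
        "battery module for at least 5 seconds"
    )


def choose_workaround(spare_available, sccli_access, unfail_failed=False):
    """Pick the workaround to apply before the array is reset or power cycled."""
    if spare_available:
        return Action.REPLACE_CONTROLLER
    # Option 1 requires sccli access and is abandoned if the unfail command fails.
    if sccli_access and not unfail_failed:
        return Action.UNFAIL
    # Assumed fallback: no spare, and Option 1 is unavailable or did not work.
    return Action.PULL_AND_CLEAR_CACHE


if __name__ == "__main__":
    print(choose_workaround(spare_available=False, sccli_access=True).value)
```

In every case the selected action is carried out before the array is reset or power cycled, which is the point of the workaround.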

Please refer to the corresponding documents for the required firmware levels to identify failed controllers:

"Sun StorEdge 3000 Family Installation, Operation, and Service Manual" collection at http://www.sun.com/products-n-solutions/hardware/docs/Network_Storage_Solutions/Workgroup/index.html

and the "Sun StorEdge 3000 Family CLI User's Guide" at http://www.sun.com/products-n-solutions/hardware/docs/html/817-4951-14


Resolution

This issue is addressed on the following platforms:

  • Sun StorEdge 3310 SCSI array with firmware 4.15F (as delivered in patch 113722-15 or later)
  • Sun StorEdge 3320 SCSI array with firmware 4.15G (as delivered in patch 113730-01 or later)
  • Sun StorEdge 3510 FC array with firmware 4.15F (as delivered in patch 113723-15 or later)
  • Sun StorEdge 3511 FC/SATA array with firmware 4.15F (as delivered in patch 113724-09 or later)
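
As a rough aid for checking an array against the list above, the following Python sketch compares an installed patch ID with the minimum patch revision for each model. It is a sketch under stated assumptions: the model keys and function names are hypothetical, the patch IDs are taken from the list above, and the comparison assumes that a higher revision suffix of the same patch base counts as "or later". It does not query the array; the installed patch ID has to be supplied by the reader.

```python
# Minimum patch revisions, copied from the Resolution list.
# The short model keys ("3310", ...) are hypothetical labels for this sketch.
REQUIRED_PATCHES = {
    "3310": "113722-15",
    "3320": "113730-01",
    "3510": "113723-15",
    "3511": "113724-09",
}


def parse_patch(patch_id):
    """Split a patch ID such as '113722-15' into (base, revision)."""
    base, _, revision = patch_id.strip().partition("-")
    if not (base.isdigit() and revision.isdigit()):
        raise ValueError(f"not a patch ID: {patch_id!r}")
    return base, int(revision)


def has_fix(model, installed_patch):
    """Return True if installed_patch is at or above the required revision."""
    required_base, required_rev = parse_patch(REQUIRED_PATCHES[model])
    base, rev = parse_patch(installed_patch)
    if base != required_base:
        raise ValueError(f"patch {installed_patch} is not the patch for model {model}")
    # Assumption: a higher revision of the same patch base is "or later".
    return rev >= required_rev


if __name__ == "__main__":
    for patch in ("113723-14", "113723-15"):
        print(patch, has_fix("3510", patch))
```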

Modification History

12-Jan-2006: Updated Contributing Factors and Relief/Workaround

25-Apr-2006: Updated Contributing Factors and Resolution sections

15-Jun-2006: Updated Contributing Factors and Resolution sections

22-Jul-2010: Document republished as originally posted (2006)

References

<SUNPATCH: 113723-15>
<SUNPATCH: 113722-15>
<SUNPATCH: 113730-01>
<SUNPATCH: 113724-09>

Previously Published As
102086 and 200893 (IBIS)
Internal Comments
Patches for these firmware releases were developed across all products
(all arrays impacted by these issues). Therefore, some of the patch READMEs
may not reflect the BugID listed in the SunAlert, but the firmware patch listed
for each product does in fact remedy the issue for the platforms specified.
The following Sun Alerts have information about other known issues for the 3000 series products:

  • 102011 - Sun StorEdge 33x0/3510 Arrays May Report a Higher Incidence of Drive Failures With Firmware 4.1x SMART Feature Enabled
  • 102067 - Sun Cluster 3.x Nodes May Panic Upon Controller Failure/Replacement Within Sun StorEdge 3510/3511 Arrays
  • 102086 - Failed Controller Condition May Cause Data Integrity Issues
  • 102098 - Insufficient Information for Recovery From Double Drive Failure for Sun StorEdge 33x0/35xx Arrays
  • 102126 - Recovery Behavior From Fatal Drive Failure May Lead to Data Integrity Issues
  • 102127 - Performance Degradation Reported in Controller Firmware Releases 4.1x on Sun StorEdge 3310/351x Arrays for All RAID Types and Certain Patterns of I/O
  • 102128 - Data Inconsistencies May Occur When Persistent SCSI Parity Errors are Generated Between the Host and the SE33x0 Array
  • 102129 - Disks May be Marked as Bad Without Explanation After "Drive Failure," "Media Scan Failed" or "Clone Failed" Events

Note: One or more of the above Sun Alerts may require a Sun Spectrum Support
Contract to log in to a SunSolve Online account.

Use best judgement, based on system response and logs, to decide whether to
replace or re-insert the failed controller.

Note: If the I/O Module on the surviving controller has failed, as determined by
issuing the sccli command "show ses" (entered at the "sccli>" prompt), do not
pull the failed controller. Escalate the case to backline support instead.

Internal Eng Responsible Engineer
don.curren@sun.com

Internal Resolution Patches
113723-15, 113722-15, 113730-01, 113724-09


Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.