Sun Cluster 3.x Nodes May Panic Upon Controller Failure/Replacement Within Sun StorEdge 3510/3511 Arrays

Asset ID:	1-77-1001167.1
Update Date:	2011-03-01
Keywords:

Solution Type Sun Alert Sure

Solution 1001167.1 : Sun Cluster 3.x Nodes May Panic Upon Controller Failure/Replacement Within Sun StorEdge 3510/3511 Arrays

Related Items


Sun Storage 3510 FC Array
 Sun Storage 3511 SATA Array

Related Categories


GCS>Sun Microsystems>Sun Alert>Criteria Category>Availability
 GCS>Sun Microsystems>Sun Alert>Release Phase>Resolved

Previously Published As
201561

Product
Sun StorageTek 3510 FC Array
Sun StorageTek 3511 SATA Array

Bug Id
<SUNBUG: 6321239>, <SUNBUG: 6365819>

Date of Workaround Release
01-DEC-2005

Date of Resolved Release
15-JUN-2006

Impact

Upon a controller failure/replacement on a Sun StorEdge 3510/3511 array, all the nodes connected in a Sun Cluster 3.x environment may panic.

Contributing Factors

This issue can occur on the following platforms:

SPARC Platform

Sun StorEdge 3510 FC array with firmware version 4.11I/4.13C (as delivered in patch 113723-10/113723-11) and without firmware 4.15F (as delivered in patch 113723-15)
Sun StorEdge 3511 SATA array with firmware version 4.11I/4.13C (as delivered in patch 113724-04/113724-05) and without firmware 4.15F (as delivered in patch 113724-09)

This issue will only occur in cluster configurations that issue SCSI-2 reservations (for example: 2 node clusters) including:

Sun Cluster 3.x on Solaris 8, Solaris 9 and Solaris 10

when LUN filtering is enabled.
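
The contributing factors above can be combined into a single check. The following is a minimal sketch, assuming the firmware version string and the two configuration flags have already been gathered by other means; the function name `may_be_affected` and its parameters are hypothetical and are not part of any Sun tool.

```python
# Hypothetical helper: names are illustrative, not part of any Sun utility.
# Firmware versions named as affected in this alert.
AFFECTED_FIRMWARE = {"4.11I", "4.13C"}


def may_be_affected(firmware: str, lun_filtering: bool,
                    scsi2_reservations: bool) -> bool:
    """Return True only when all contributing factors from this alert apply.

    firmware            -- controller firmware version string, e.g. "4.13C"
    lun_filtering       -- True if LUN filtering is enabled on the array
    scsi2_reservations  -- True if the cluster issues SCSI-2 reservations
                           (for example, a 2-node Sun Cluster 3.x)
    """
    return (
        firmware.strip().upper() in AFFECTED_FIRMWARE
        and lun_filtering
        and scsi2_reservations
    )


if __name__ == "__main__":
    print(may_be_affected("4.13C", lun_filtering=True, scsi2_reservations=True))
    print(may_be_affected("4.15F", lun_filtering=True, scsi2_reservations=True))
```

The check is deliberately conservative: it flags only the firmware versions this alert lists and makes no claim about other releases.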

Symptoms

Should the described issue occur, all the nodes of the cluster will panic with a Reservation Conflict similar to the following:

    sun50-node:/var/crash/sun50-node #
    panic[cpu19]/thread=2a1001e5d20: Reservation Conflict
    000002a1001e57d0 ssd:ssd_mhd_watch_cb+c4 (3000c318808, 0, 7600000042, 300172f5578, 300172f55a8, 0)
    %l0-3: 000000007829c740 00000300164f3d98 000003000081f2c8 000003000727e320
    %l4-7: 0000030005b9b5e8 0000000000000000 0000000000000002 00000000ff08dffc
    000002a1001e5880 scsi:scsi_watch_request_intr+140 (0, 0, 30015f873c0, 300164f3d98, 0, 300172f5530)
    %l0-3: 000000001034aadc 000003000081f2c8 0000030005b9b5e8 000003002cdc0f30
    %l4-7: 000003000727e320 00000000782a81ac 00000300172f55a8 000003002b73e000
    000002a1001e5950 qlc:qlc_task_thread+698 (300008207f0, 300008207e8, ff00, 300008207f2, 783c3240, 783c3250)
    %l0-3: 000000007829920c 00000000783c3260 00000000783c3270 000000000001ff80
    %l4-7: 0000030000820ac0 000003100b5ef180 00000300008207e8 00000300008207c8
    000002a1001e5a60 qlc:qlc_task_daemon+70 (300008207e8, 300008207c8, 300008207f2, 104640c0, 30000820af8, 30000820afa)
    %l0-3: 0000030000820ae0 00000310002fdb20 0000000000000000 0000000010408000

Note: After the nodes boot following a panic, they will not be able to see the LUNs from the Sun StorEdge 3510/3511 array. Both nodes will show the drives as "<drive not available: reserved>" when using format(1M).

Only after the Sun StorEdge 3510/3511 array is reset and the nodes are rebooted will everything return to normal.
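
To recognize this symptom in saved console or crash output, the panic string and the ssd_mhd_watch_cb stack frame shown above can be matched together. This is a rough sketch only; the function name and the regular expression are assumptions based on the single sample trace in this alert, and other traces may be formatted differently.

```python
import re

# Pattern derived from the sample panic line in this alert (an assumption:
# other Solaris releases may format the line differently).
PANIC_RE = re.compile(
    r"panic\[cpu\d+\]/thread=[0-9a-fA-F]+:\s*Reservation Conflict"
)


def is_reservation_conflict_panic(text: str) -> bool:
    """True if the text has both the panic line and the ssd_mhd_watch_cb frame."""
    return bool(PANIC_RE.search(text)) and "ssd_mhd_watch_cb" in text


if __name__ == "__main__":
    sample = (
        "panic[cpu19]/thread=2a1001e5d20: Reservation Conflict\n"
        "000002a1001e57d0 ssd:ssd_mhd_watch_cb+c4 (3000c318808, 0, ...)\n"
    )
    print(is_reservation_conflict_panic(sample))
```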

The following example shows how the controller firmware's handling of reservations on a nexus (controller, target, LUN) at the LUN level can cause a reservation conflict when LUN filtering is in use.

Example from a "show map" output:

    Ch Tgt LUN   ld/lv   ID-Partition  Assigned  Filter Map
    -------------------------------------------------------------------
    0  40     0   ld0    24A193C9-00   Primary   210000E08B13AC6F {HBA-1}
    0  40     1   ld0    24A193C9-02   Primary   210000E08B13AC6F {HBA-1}  <--
    0  40     1   ld0    24A193C9-02   Primary   210000E08B133FC2 {HBA-2}  <--
    0  40     2   ld0    24A193C9-03   Primary   210000E08B13AC63 {HBA-3}  <--
    0  40     2   ld0    24A193C9-05   Primary   210000E08B133FC4 {HBA-4}  <--

Note that LUN 1 maps the same partition, 24A193C9-02, to two different initiators/HBAs, {HBA-1} and {HBA-2}.

Note that LUN 2 maps two different partitions, 24A193C9-03 and 24A193C9-05, to two different initiators/HBAs, {HBA-3} and {HBA-4}, respectively.

During a controller failure/reset, a reservation on one nexus can assert itself to the other nexus with the "same LUN number".
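
The exposure described above, one LUN number filtered to more than one initiator, can be spotted mechanically in saved "show map" output. The sketch below assumes the column layout matches the example in this alert (channel, target, LUN, ld/lv, ID-partition, assignment, then initiator WWN); the function name `shared_lun_numbers` is hypothetical.

```python
from collections import defaultdict

# Data rows copied from the "show map" example in this alert.
SHOW_MAP = """\
Ch Tgt LUN   ld/lv   ID-Partition  Assigned  Filter Map
-------------------------------------------------------------------
0  40     0   ld0    24A193C9-00   Primary   210000E08B13AC6F {HBA-1}
0  40     1   ld0    24A193C9-02   Primary   210000E08B13AC6F {HBA-1}
0  40     1   ld0    24A193C9-02   Primary   210000E08B133FC2 {HBA-2}
0  40     2   ld0    24A193C9-03   Primary   210000E08B13AC63 {HBA-3}
0  40     2   ld0    24A193C9-05   Primary   210000E08B133FC4 {HBA-4}
"""


def shared_lun_numbers(show_map: str) -> dict:
    """Map (channel, target, lun) -> sorted initiator WWNs, keeping only
    LUN numbers that are filtered to more than one initiator."""
    initiators = defaultdict(set)
    for line in show_map.splitlines():
        fields = line.split()
        # Skip the header, the separator, and anything that is not a data row.
        if len(fields) < 7 or not all(f.isdigit() for f in fields[:3]):
            continue
        ch, tgt, lun = (int(f) for f in fields[:3])
        initiators[(ch, tgt, lun)].add(fields[6])  # column 7: initiator WWN
    return {key: sorted(wwns) for key, wwns in initiators.items()
            if len(wwns) > 1}


if __name__ == "__main__":
    for nexus, wwns in shared_lun_numbers(SHOW_MAP).items():
        print(nexus, wwns)
```

For the example map, this flags LUN 1 and LUN 2 on channel 0, target 40, and leaves LUN 0 alone because only one initiator is filtered to it.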

A few cases have been reported in which partitioning or repartitioning a logical drive caused the reservation panic. While the issue with controller failure/reset is known, root-cause analysis of the partition/repartition issue is still in progress.

Workaround

To work around the described issue, disable LUN filtering and use switch zoning. Instructions for LUN filtering can be found at:

Sun StorEdge 3000 Family CLI 2.x User's Guide at: http://www.sun.com/products-n-solutions/hardware/docs/html/817-4951-14

Sun StorEdge 3000 Family RAID Firmware 4.1x User's Guide at: http://www.sun.com/products-n-solutions/hardware/docs/html/817-3711-14

For switch zoning, consult the switch manufacturer's documentation.

Resolution

This issue is addressed on the following platforms:

Sun StorEdge 3510 FC array with firmware 4.15F (as delivered in patch 113723-15 or later)
Sun StorEdge 3511 SATA array with firmware 4.15F (as delivered in patch 113724-09 or later)
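
Because the fix is delivered in the listed patch revisions "or later", a patch ID can be compared against the minimum revision for its base number. The following is a minimal sketch; the function name `patch_has_fix` is hypothetical, and how the installed patch ID is obtained is left out.

```python
# Minimum patch revisions that deliver firmware 4.15F, per this alert.
REQUIRED_REVISION = {
    "113723": 15,  # Sun StorEdge 3510 FC array
    "113724": 9,   # Sun StorEdge 3511 SATA array
}


def patch_has_fix(patch_id: str) -> bool:
    """Return True if patch_id (e.g. "113723-15") is at or above the
    revision listed in the Resolution section for its base number."""
    base, _, revision = patch_id.strip().partition("-")
    if base not in REQUIRED_REVISION or not revision.isdigit():
        return False  # unknown patch or malformed ID: make no claim
    return int(revision) >= REQUIRED_REVISION[base]


if __name__ == "__main__":
    for pid in ("113723-11", "113723-15", "113724-05", "113724-09"):
        print(pid, patch_has_fix(pid))
```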

Modification History
Date: 12-JAN-2006

Updated Contributing Factors and Relief/Workaround sections

Date: 25-APR-2006

Updated Contributing Factors and Resolution sections

Date: 15-JUN-2006

State: Resolved
Updated Contributing Factors and Resolution sections

References

<SUNPATCH: 113723-15>
<SUNPATCH: 113724-09>

Previously Published As
102067
Internal Comments

The issue associated with BugID 6321239 will be resolved in a firmware release around the 2nd quarter of 2006. The issue associated with BugID 6365819 is currently under investigation.

The following Sun Alerts have information about other known issues for the 3000 series products:

102011 - Sun StorEdge 33x0/3510 Arrays May Report a Higher Incidence of Drive Failures With Firmware 4.1x SMART Feature Enabled

102067 - Sun Cluster 3.x Nodes May Panic Upon Controller Failure/Replacement Within Sun StorEdge 3510/3511 Arrays

102086 - Failed Controller Condition May Cause Data Integrity Issues

102098 - Insufficient Information for Recovery From Double Drive Failure for Sun StorEdge 33x0/35xx Arrays

102126 - Recovery Behavior From Fatal Drive Failure May Lead to Data Integrity Issues

102127 - Performance Degradation Reported in Controller Firmware Releases 4.1x on Sun StorEdge 3310/351x Arrays for All RAID Types and Certain Patterns of I/O

102128 - Data Inconsistencies May Occur When Persistent SCSI Parity Errors are Generated Between the Host and the SE33x0 Array

102129 - Disks May be Marked as Bad Without Explanation After "Drive Failure," "Media Scan Failed" or "Clone Failed" Events

Note: One or more of the above Sun Alerts may require a Sun Spectrum Support Contract to login to a SunSolve Online account.

The issue with the controller failure has been reproduced in the lab. The issue with partition/repartition has NOT been reproduced yet, and root-cause analysis (RCA) is still in progress.

Test results indicate that when a controller fails while SCSI-2 reservations exist, a reservation may be incorrectly set on a nexus. This causes loss of access to the path, and the cluster nodes panic. Upon the controller failure, the host resets the device, along with an implicit fabric logout, which clears the existing Reserve(6) reservations. All testing to date indicates that this is only an issue with SE3510 2-node clusters, due to their use of the Reserve(6) command. The investigation into the root cause is in progress.

This issue was not seen on 3.27R, but this cannot be verified.

Host I/O does not need to be present for this issue to occur. It can occur on a newly rebooted server.

Internal Contributor/submitter
sailesh.thanki@sun.com

Internal Eng Business Unit Group
NWS (Network Storage)

Internal Eng Responsible Engineer
sailesh.thanki@sun.com

Internal Services Knowledge Engineer
jeff.folla@sun.com

Internal Escalation ID
1-11214948, 1-13027714, 1-13069641

Internal Resolution Patches
113723-15, 113724-09

Internal Sun Alert Kasp Legacy ID
102067

Internal Sun Alert & FAB Admin Info
Critical Category: Availability ==> HA-Failure
Significant Change Date: 2005-12-01, 2006-06-15
Avoidance: Patch, Workaround
Responsible Manager: sunil.bali@sun.com
Original Admin Info: [WF 19-Jun-2006, Jeff Folla: Changed Audience from Contract to Free.]

[WF 15-Jun-2006, Jeff Folla: All patches are available. This is now resolved. Sent for re-release.]

[WF 12-Jun-2006, Dave M: FW released this week, updating with 6 other alerts to publish together for FW patch release coordinated]
[WF 25-Apr-2006, Jeff Folla: FW 4.15F patch now available. Sent to publish.]
[WF 14-Apr-2006, Dave M: updated in anticipation of FW 4.15F release, per NWS and PTS engs]
[WF 12-Jan-2006, Dave M: OK to republish]
[WF 03-Jan-2005, Dave M: updating for re-release per Storage group and Executive review]
[WF 01-Dec-2005, Jeff Folla: Sent for release.]

[WF 30-Nov-2005, Jeff Folla: Sent for review.]
Product_uuid
58553d0e-11f4-11d7-9b05-ad24fcfd42fa|Sun StorageTek 3510 FC Array
9fdbb196-73a6-11d8-9e3a-080020a9ed93|Sun StorageTek 3511 SATA Array
