Brocade 3200/3800 Switches May Lose SAN Storage Path Following Link Event During I/O

Asset ID:	1-77-1000094.1
Update Date:	2011-03-01
Keywords:

Solution Type Sun Alert Sure

Solution 1000094.1 : Brocade 3200/3800 Switches May Lose SAN Storage Path Following Link Event During I/O

Related Items


Sun Storage T3+ Array

Related Categories


GCS>Sun Microsystems>Sun Alert>Criteria Category>Availability
 GCS>Sun Microsystems>Sun Alert>Release Phase>Resolved

PreviouslyPublishedAs
200112

Product
Sun StorageTek T3+ Array
SAN Brocade 3800 2 GB 16-Port Switch
SAN Brocade 3200 2 GB 8-Port Switch

Bug Id
<SUNBUG: 6336972>, <SUNBUG: 6350979>

Date of Workaround Release
21-DEC-2005

Date of Resolved Release
08-SEP-2006

Impact

Brocade SilkWorm 3200 and 3800 switches running certain Fabric Operating System (FabOS) versions may experience a loss of access to storage via one or more paths.

Contributing Factors

This issue can occur on the following platforms:

SG-XSWBRO3200 Brocade SilkWorm 3200 switch (8 ports) running FabOS 3.2.0a (as delivered in patch 115360-05) and without FabOS 3.2.1a (as delivered in patch 115360-06)
SG-XSWBRO3800 Brocade SilkWorm 3800 switch (16 ports) running FabOS 3.2.0a (as delivered in patch 115360-05) and without FabOS 3.2.1a (as delivered in patch 115360-06)
Sun StorEdge T3+ array (running firmware 3.2.x) without firmware 3.2.4 (as delivered in patch 116930-05)

Note: This issue only occurs on Brocade SilkWorm switches running FabOS 3.2.0a (patch 115360-05), following a "link event" (an online/offline/online transition or an array/switch reconfiguration) that causes the storage to perform a fabric login (FLOGI) to the SAN. In addition, the issue only occurs if I/O is in progress during the "link event". Under this combination of factors, certain 1 Gb/s F_Port devices (such as the T3+ storage array) can incorrectly overrun the switch frame buffer credit mechanism.

Symptoms

On hosts running the Sun "fp" and "MPxIO" drivers, "PLOGI timeout" messages are logged, and host messages from STMS (Sun StorEdge Traffic Manager Software) report that LUNs are being offlined and that the multipath status of those LUNs is now degraded due to the loss of one or more paths. These messages will be displayed in the array syslog, similar to the following example:

  [date time hostname] fp: [ID 517869 kern.info] NOTICE: fp(1): PLOGI to
    10800 failed state=Timeout, reason=Hardware Error
  [date time hostname] scsi: [ID 243001 kern.warning] WARNING:
    /sbus@54,0/SUNW,qlc@1,30400/fp@0,0 (fcp1):
  [date time hostname]   PLOGI to D_ID=0x10800 failed:State:Timeout,
    Reason:Hardware Error. Giving up
  [date time hostname] fp: [ID 517869 kern.info] NOTICE: fp(1): PLOGI to
    10800 failed state=Timeout, reason=Hardware Error
  [date time hostname] scsi: [ID 243001 kern.warning] WARNING:
    /sbus@54,0/SUNW,qlc@1,30400/fp@0,0 (fcp1):
  [date time hostname]   PLOGI to D_ID=0x10800 failed: State:Timeout,
    Reason:Hardware Error. Giving up
  [date time hostname] scsi: [ID 243001 kern.info]
    /sbus@54,0/SUNW,qlc@1,30400/fp@0,0 (fcp1):
  [date time hostname]   offlining lun=1 (trace=0), target=10800
    (trace=2800101)
  [date time hostname] mpxio: [ID 779286 kern.info]
    /scsi_vhci/ssd@g60020f20000097ab41b6d08700036e53 (ssd33) multipath
    status: degraded, path /sbus@54,0/SUNW,qlc@1,30400/fp@0,0 (fp1) to
    target address: 50020f23000092a6,1 is offline

The above are examples only. On each system, the LUN numbers, target numbers, and device paths will vary. To confirm that this issue is being seen, check that the PLOGI failure state was "Timeout", check the target trace value ("trace=2800101" above), and check the overall sequence of events.
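The first two of these checks can be approximated with a small shell filter. The following is a minimal sketch, not a supported tool: the function name plogi_timeout_offlined is hypothetical, it assumes the log text is supplied on standard input, and it matches only on the message fragments shown in the example above. On Solaris, nawk may need to be substituted for awk.

```shell
#!/bin/sh
# Hypothetical helper: print each target D_ID that logged a PLOGI timeout
# and was later reported as offlined. Reads log text on standard input.
# Assumption: message wording matches the example in "Symptoms".
plogi_timeout_offlined() {
    awk '
        # Remember every D_ID whose PLOGI failed with State:Timeout.
        /PLOGI to D_ID=0x[0-9a-fA-F]+ failed: *State:Timeout/ {
            id = $0
            sub(/.*D_ID=0x/, "", id)
            sub(/ .*/, "", id)
            timed_out[id] = 1
        }
        # Report a target once if it is offlined after such a timeout.
        /offlining lun=[0-9]+/ && /target=/ {
            id = $0
            sub(/.*target=/, "", id)
            sub(/[^0-9a-fA-F].*/, "", id)
            if ((id in timed_out) && !(id in seen)) {
                seen[id] = 1
                print id
            }
        }
    '
}
```

For example, the function could be fed the host system log (commonly /var/adm/messages on Solaris) with "plogi_timeout_offlined < /var/adm/messages". Each value printed is a target that logged a PLOGI timeout and was subsequently offlined. The trace value and the overall sequence of events still need to be reviewed by hand.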

Note: It is important to differentiate configuration issues that prevent port login (and so cause PLOGI errors) from instances where hosts should be able to PLOGI to SAN storage that they have previously used. Certain HBA drivers will not allow port login (PLOGI) from another host, generating misleading PLOGI errors. Certain storage arrays (e.g. SE99x0) that have no configured LUNs available to the host in question will not allow port login (PLOGI), thereby also generating misleading PLOGI errors.

Workaround

To avoid this issue, do not make configuration changes on the array during times when there is I/O to the storage.

Should the described issue occur, wait for any LUN failovers to complete (if applicable) and proceed with the following recommendations:

On the host(s) where the above STMS "offlining lun" and "multipath status: degraded" messages were seen immediately following the "PLOGI" timeout to that device, run the luxadm(1M) command as "root" against the identified World Wide Name (WWN). In the example shown above in "Symptoms", the error occurred on the storage device with a WWN of 50020f23000092a6, so the command to run would be:

    # luxadm -e forcelip 50020f23000092a6
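Where several devices are affected, the WWNs can be pulled from the "multipath status: degraded" messages rather than copied by hand. The sketch below is illustrative only: the function name offline_target_wwns is hypothetical, and the pattern assumes the "target address: <WWN>,<LUN> is offline" wording shown in the "Symptoms" example.

```shell
#!/bin/sh
# Hypothetical helper: list the unique target WWNs that MPxIO reported
# as offline. Reads log text on standard input.
# Assumption: the message ends "target address: <16 hex digits>,<LUN> is offline".
offline_target_wwns() {
    sed -n 's/.*target address: \([0-9a-fA-F]\{16\}\),[0-9a-fA-F][0-9a-fA-F]* is offline.*/\1/p' |
        sort -u
}
```

Each WWN printed is a candidate argument for "luxadm -e forcelip". Confirm it against the PLOGI timeout messages before acting on it.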

Following the above luxadm(1M) command, check for storage connectivity via the same path displayed:

    # luxadm display 50020f23000092a6

This should show the path to storage being either ONLINE or STANDBY.
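The path state check can also be scripted. The following is a rough sketch under stated assumptions: the function name paths_healthy is hypothetical, and it assumes that "luxadm display" prints one line per path whose first field is "State" and whose second field is the state value. Verify this against the actual output on the host before relying on it. On Solaris, nawk may need to be substituted for awk.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if at least one "State" line is present
# and every reported state is ONLINE or STANDBY.
# Reads "luxadm display <WWN>" output on standard input.
# Assumption: per-path lines look like "State   ONLINE".
paths_healthy() {
    awk '
        $1 == "State" {
            found = 1
            if ($2 != "ONLINE" && $2 != "STANDBY") bad = 1
        }
        END {
            if (found && !bad) exit 0
            exit 1
        }
    '
}
```

A possible use is "luxadm display 50020f23000092a6 | paths_healthy", followed by a test of the exit status. A non-zero status means that a path is in some other state or that no state line was recognized.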

If the path being shown is neither ONLINE nor STANDBY, and the "luxadm -e forcelip <WWN>" command resulted in additional PLOGI errors being recorded, then the attempted recovery has not worked and recovery will require rebooting the switch itself. This action will potentially affect the connectivity of other hosts to the storage, so it is imperative to ensure that the issue is not due to incorrect configuration (SAN zoning or LUN mapping) and that any other host on this switch has an alternative path to the storage prior to resetting the affected switch.

For additional information on Brocade FabOS "EOL" versions, please see the Brocade EOL URL at http://www.brocade.com/support/end_of_life.jsp

Resolution

This issue is resolved on the following platforms:

SG-XSWBRO3200 Brocade SilkWorm 3200 switch (8 ports) with FabOS 3.2.1a (as delivered in patch 115360-06) or later
SG-XSWBRO3800 Brocade SilkWorm 3800 switch (16 ports) with FabOS 3.2.1a (as delivered in patch 115360-06) or later
Sun StorEdge T3+ array with firmware 3.2.4 (as delivered in patch 116930-05) or later

Modification History
Date: 08-SEP-2006

08-Sep-2006:

Updated Contributing Factors and Resolution sections
State: Resolved

References

<SUNPATCH: 115360-06>
<SUNPATCH: 116930-05>

Previously Published As
102045

Internal Comments

Please be aware that a similar issue exists for 3900/12000/3250/3850/24000 switches - see SunAlert 102046.

The issue is caused by the storage device overrunning the switch's buffer credit mechanism. It can occur when I/O to/from the array takes place at the same time as the link between the switch and the storage is reinitialized. Due to the timing required to encounter the problem, it is believed to be more likely on 1 Gb/s links/devices. Link initialization can occur if there are bit errors on the link (enough to cause an online/offline transition) or if certain configuration commands are run on the array (volslice, lun perm, hwwn).

Internal Contributor/submitter
brian.austin@sun.com

Internal Eng Business Unit Group
NWS (Network Storage)

Internal Eng Responsible Engineer
brian.austin@sun.com

Internal Services Knowledge Engineer
david.mariotto@sun.com

Internal Escalation ID
1-11860721

Internal Resolution Patches
115360-06, 116930-05

Internal Sun Alert Kasp Legacy ID
102045

Internal Sun Alert & FAB Admin Info
Critical Category: Availability ==> HA-Failure
Significant Change Date: 2005-12-21, 2006-09-08
Avoidance: Patch
Responsible Manager: Karl-Heinz.Wegener@Sun.COM
Original Admin Info: [WF 08-Sep-2006, Dave M: update for patch, resolved, rerelease]
[WF 21-Dec-2005, Dave M: sending for release]
[WF 21-Dec-2005, Dave M: final reviews sent from Brocade, rework, send for final approval]
[WF 16-Nov-2005, Dave M: sending for review]
[WF 15-Nov-2005, Dave M: draft created]

Product_uuid
2a714b10-0a18-11d6-86e2-d56b387d4fbf|Sun StorageTek T3+ Array
5b8e7ce2-11f4-11d7-8cc5-a7ac2a6e4672|SAN Brocade 3800 2 GB 16-Port Switch
5c088938-11f4-11d7-9d82-d45cc101dd0a|SAN Brocade 3200 2 GB 8-Port Switch
