Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Sun Alert Sure
Solution 1000094.1: Brocade 3200/3800 Switches May Lose SAN Storage Path Following Link Event During I/O
Previously Published As 200112

Product
Sun StorageTek T3+ Array
SAN Brocade 3800 2 GB 16-Port Switch
SAN Brocade 3200 2 GB 8-Port Switch

Bug Id <SUNBUG: 6336972>, <SUNBUG: 6350979>
Date of Workaround Release 21-DEC-2005
Date of Resolved Release 08-SEP-2006

Impact
Brocade 3200 and 3800 SilkWorm switches running certain Fabric Operating Systems (FabOS) versions may experience a loss of access to storage via one or more paths.

Contributing Factors
This issue can occur on the following platforms:
Note: This issue only occurs on Brocade SilkWorm switches running FabOS 3.2.0a (patch 115360-05), following a "link event" (online/offline/online transition or array/switch reconfiguration) that causes the storage to perform a fabric login (FLOGI) to the SAN. In addition, the issue only occurs if I/O is in progress during the "link event". Under this combination of factors, it is possible for certain 1GB Fport devices (such as the T3+ storage array) to incorrectly overrun the switch frame buffer credit mechanism.

Symptoms
On hosts running the Sun "fp" and "MPxIO" drivers, "PLOGI timeout" messages and host messages from STMS will report that LUNs are being offlined, and that the paths allowing access to those LUNs are now degraded due to the loss of one (or more) path(s). These messages will be displayed in the array syslog, similar to the following example:

    [date time hostname] fp: [ID 517869 kern.info] NOTICE: fp(1): PLOGI to 10800 failed state=Timeout, reason=Hardware Error
    [date time hostname] scsi: [ID 243001 kern.warning] WARNING: /sbus@54,0/SUNW,qlc@1,30400/fp@0,0 (fcp1):
    [date time hostname] PLOGI to D_ID=0x10800 failed:State:Timeout, Reason:Hardware Error. Giving up
    [date time hostname] fp: [ID 517869 kern.info] NOTICE: fp(1): PLOGI to 10800 failed state=Timeout, reason=Hardware Error
    [date time hostname] scsi: [ID 243001 kern.warning] WARNING: /sbus@54,0/SUNW,qlc@1,30400/fp@0,0 (fcp1):
    [date time hostname] PLOGI to D_ID=0x10800 failed: State:Timeout, Reason:Hardware Error. Giving up
    [date time hostname] scsi: [ID 243001 kern.info] /sbus@54,0/SUNW,qlc@1,30400/fp@0,0 (fcp1):
    [date time hostname] offlining lun=1 (trace=0), target=10800 (trace=2800101)
    [date time hostname] mpxio: [ID 779286 kern.info] /scsi_vhci/ssd@g60020f20000097ab41b6d08700036e53 (ssd33) multipath status: degraded, path /sbus@54,0/SUNW,qlc@1,30400/fp@0,0 (fp1) to target address: 50020f23000092a6,1 is offline

The above are examples only.
On each system, the LUN numbers, target numbers, and device paths will vary. To confirm that this issue is being seen, check the target trace value ("trace=2800101" above), check that the PLOGI failure state was "Timeout", and check the overall sequence of events.

Note: It is important to distinguish configuration issues that prevent port login (and so cause PLOGI errors) from instances where hosts should be able to PLOGI to SAN storage that they have previously used. Certain HBA drivers will not allow port login (PLOGI) from another host, generating misleading PLOGI errors. Certain storage arrays (e.g. SE99x0) that have no configured LUNs available to the host in question will not allow port login (PLOGI), thereby generating misleading PLOGI errors.

Workaround
To avoid this issue, do not make configuration changes on the array while there is I/O to the storage.

Should the described issue occur, wait for any LUN failovers to complete (if applicable) and proceed with the following recommendations:

On the host(s) where the above STMS "offlining lun" and "multipath status: degraded" messages were seen immediately following the "PLOGI" timeout to that device, run the luxadm(1M) command as "root" against the identified World Wide Name (WWN). In the example shown above in "Symptoms", the error occurred on the storage device with a WWN of 50020f23000092a6, so the command to run would be:

    # luxadm -e forcelip 50020f23000092a6

Following the above luxadm(1M) command, check for storage connectivity via the same path displayed:

    # luxadm display 50020f23000092a6

This should show the path to storage as either ONLINE or STANDBY. If the path shown is neither ONLINE nor STANDBY, and the "luxadm -e forcelip <WWN>" command resulted in additional PLOGI errors being recorded, then the attempted recovery has not worked and recovery will require rebooting the switch itself.
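The identification and recovery steps above can be sketched in Python as follows. This is only an illustrative sketch: the regular expressions are modelled on the example messages in "Symptoms" and may need adjusting for real logs, the function names (find_affected_wwns, recovery_commands) are hypothetical, and the script only prints the luxadm(1M) commands quoted in this document rather than running them. It does not check the trace value, which is left for manual review.

```python
import re

# Patterns modelled on the example messages in "Symptoms" (an assumption;
# real message formats may differ between driver versions).
PLOGI_TIMEOUT = re.compile(
    r"PLOGI to (?:D_ID=0x)?([0-9a-fA-F]+) failed:?\s*[Ss]tate[=:]Timeout")
OFFLINING = re.compile(
    r"offlining lun=\d+ \(trace=\w+\), target=([0-9a-fA-F]+) \(trace=\w+\)")
DEGRADED = re.compile(
    r"multipath status: degraded, path \S+ \(\w+\) "
    r"to target address: ([0-9a-fA-F]{16}),\d+ is offline")


def find_affected_wwns(log_text):
    """Return WWNs reported degraded after a PLOGI timeout and LUN offlining."""
    # Collapse whitespace so messages wrapped over several lines still match.
    text = " ".join(log_text.split())

    # First position at which each D_ID reported a PLOGI timeout.
    timeouts = {}
    for m in PLOGI_TIMEOUT.finditer(text):
        timeouts.setdefault(m.group(1).lower(), m.start())

    # Offlining events for a target that timed out earlier in the log.
    offlined = [m.start() for m in OFFLINING.finditer(text)
                if timeouts.get(m.group(1).lower(), len(text)) < m.start()]
    if not offlined:
        return []

    # Degraded-path messages that follow the first such offlining event.
    wwns = []
    for m in DEGRADED.finditer(text):
        wwn = m.group(1).lower()
        if m.start() > min(offlined) and wwn not in wwns:
            wwns.append(wwn)
    return wwns


def recovery_commands(wwn):
    """Build the two luxadm(1M) commands given in the Workaround section."""
    return [f"luxadm -e forcelip {wwn}", f"luxadm display {wwn}"]


SAMPLE = """\
fp: [ID 517869 kern.info] NOTICE: fp(1): PLOGI to 10800 failed state=Timeout, reason=Hardware Error
scsi: [ID 243001 kern.warning] WARNING: /sbus@54,0/SUNW,qlc@1,30400/fp@0,0 (fcp1):
PLOGI to D_ID=0x10800 failed:State:Timeout, Reason:Hardware Error. Giving up
scsi: [ID 243001 kern.info] /sbus@54,0/SUNW,qlc@1,30400/fp@0,0 (fcp1):
offlining lun=1 (trace=0), target=10800 (trace=2800101)
mpxio: [ID 779286 kern.info] /scsi_vhci/ssd@g60020f20000097ab41b6d08700036e53 (ssd33) multipath status: degraded, path /sbus@54,0/SUNW,qlc@1,30400/fp@0,0 (fp1) to target address: 50020f23000092a6,1 is offline
"""

if __name__ == "__main__":
    for wwn in find_affected_wwns(SAMPLE):
        for cmd in recovery_commands(wwn):
            print("# " + cmd)
```

A match from this sketch only indicates that the message sequence resembles the one described here. Configuration causes (SAN zoning, LUN mapping) still need to be ruled out by hand before any recovery action is taken.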
Rebooting the switch will potentially affect the connectivity of other hosts to the storage, so before resetting the affected switch it is imperative to ensure that the issue is not due to incorrect configuration (SAN zoning or LUN mapping) and that every other host on this switch has an alternative path to the storage.

For additional information on Brocade FabOS "EOL" versions, please see the Brocade EOL URL at http://www.brocade.com/support/end_of_life.jsp

Resolution
This issue is resolved on the following platforms:
Modification History
Date: 08-SEP-2006
08-Sep-2006:
References
<SUNPATCH: 115360-06>
<SUNPATCH: 116930-05>

Previously Published As 102045

Internal Comments
Please be aware that a similar issue exists for 3900/12000/3250/3850/24000 switches - see SunAlert 102046.

The issue is caused by the storage device overrunning the switch's buffer credit mechanism. It can occur when I/O to/from the array takes place at the same time as the link between switch and storage is reinitialized. Because of the timing required to encounter the problem, it is believed to be more likely on 1GB links/devices. Link initialization can occur if there are bit errors on the link (enough to cause an online/offline transition) or if certain configuration commands are run on the array (volslice, lun perm, hwwn).

Internal Contributor/submitter brian.austin@sun.com
Internal Eng Business Unit Group NWS (Network Storage)
Internal Eng Responsible Engineer brian.austin@sun.com
Internal Services Knowledge Engineer david.mariotto@sun.com
Internal Escalation ID 1-11860721
Internal Resolution Patches 115360-06, 116930-05
Internal Sun Alert Kasp Legacy ID 102045

Internal Sun Alert & FAB Admin Info
Critical Category: Availability ==> HA-Failure
Significant Change Date: 2005-12-21, 2006-09-08
Avoidance: Patch
Responsible Manager: Karl-Heinz.Wegener@Sun.COM
Original Admin Info:
[WF 08-Sep-2006, Dave M: update for patch, resolved, rerelease]
[WF 21-Dec-2005, Dave M: sending for release]
[WF 21-Dec-2005, Dave M: final reviews sent from Brocade, rework, send for final approval]
[WF 16-Nov-2005, Dave M: sending for review]
[WF 15-Nov-2005, Dave M: draft created]

Product_uuid
2a714b10-0a18-11d6-86e2-d56b387d4fbf|Sun StorageTek T3+ Array
5b8e7ce2-11f4-11d7-8cc5-a7ac2a6e4672|SAN Brocade 3800 2 GB 16-Port Switch
5c088938-11f4-11d7-9d82-d45cc101dd0a|SAN Brocade 3200 2 GB 8-Port Switch

Attachments
This solution has no attachment