Document Audience: INTERNAL
Document ID: I0552-1
Title: SCSI devices (especially in a multi-hosted configuration) may go offline after isp errors.
Copyright Notice: Copyright © 2005 Sun Microsystems, Inc. All Rights Reserved
Update Date: 2004-01-07
---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------
FIELD INFORMATION NOTICE
(For Authorized Distribution by SunService)
FIN #: I0552-1
Synopsis: SCSI devices (especially in a multi-hosted configuration) may go offline after isp errors.
Create Date: Jan/24/00
Keywords:
SCSI devices (especially in a multi-hosted configuration) may go offline after isp errors.
Top FIN/FCO Report: Yes
Products Reference: isp driver bug
Product Category: Storage / Sw Admin;
Product Affected:
Mkt_ID  Platform  Model  Description                     Serial Number
------  --------  -----  -----------                     -------------
Systems Affected
----------------
  -     A11       ALL    Ultra Enterprise 1              -
  -     A12       ALL    Ultra Enterprise 1E             -
  -     A14       ALL    Ultra Enterprise 2              -
  -     E3000     ALL    Ultra Enterprise 3000           -
  -     E3500     ALL    Ultra Enterprise 3500           -
  -     E4000     ALL    Ultra Enterprise 4000           -
  -     E4500     ALL    Ultra Enterprise 4500           -
  -     E5000     ALL    Ultra Enterprise 5000           -
  -     E5500     ALL    Ultra Enterprise 5500           -
  -     E6000     ALL    Ultra Enterprise 6000           -
  -     E6500     ALL    Ultra Enterprise 6500           -
  -     E10000    ALL    Ultra Enterprise 10000          -
(See Corrective Action)
X-Options Affected
------------------
  -     -         ALL    StorEdge UniPack                -
  -     -         ALL    StorEdge MultiPack              -
  -     -         ALL    StorEdge MultiPack2             -
  -     -         ALL    StorEdge A1000                  -
  -     -         ALL    Netra st A1000                  -
  -     -         ALL    StorEdge D1000                  -
  -     -         ALL    Netra st D1000                  -
  -     -         ALL    StorEdge A3500                  -
  -     -         ALL    StorEdge L280 tape library      -
  -     -         ALL    StorEdge L700 tape library      -
  -     -         ALL    StorEdge L1000 tape library     -
  -     -         ALL    StorEdge L1800 tape library     -
  -     -         ALL    StorEdge L3500 tape library     -
  -     -         ALL    StorEdge L11000 tape library    -
Parts Affected:
Part Number  Description                               Model
-----------  -----------                               -----
370-2443-0X  Differential Ultra/Wide SCSI (UDWIS/S)    -
370-1704-0X  Differential Fast/Wide SCSI (DWIS/S)      -
370-1703-0X  Single-Ended Fast/Wide SCSI (SWIS/S)      -
References:
BugId: 4280783
Esc: 523110 522262
FIN: I0547-1
PatchId: 105600-XX (Solaris 2.6)
PatchId: 106924-XX (Solaris 7)
Issue Description:
When a SCSI bus reset is issued under heavy i/o, the isp driver causes the
sd driver to report i/o errors. Any configuration with an isp driver version
that predates the fix for bug 4280783 may be affected.
The configurations most likely to experience this problem are multi-hosted
ones with shared storage devices, e.g. A3x00 (SCSI version only), A1000,
D1000 etc. connected to differential SCSI cards using the isp driver, or
MultiPack, UniPack etc. connected to single-ended SCSI cards using the isp
driver.
However, since a SCSI bus reset can occur as part of the error recovery
controlled by the sd driver, this problem can occur under error conditions
even with SCSI devices connected just to a single HBA which uses the isp
driver.
Third party storage products attached to controllers which use the isp driver
may also be affected.
Running cluster software does not prevent the problem from occurring.
Example 1
---------
Here is an example of the sequence of events in a multi-hosted
configuration with shared SCSI storage devices. In this
configuration, when one node is rebooted, it issues SCSI bus resets
while restarting, and the other (running) node receives those resets.
This is normal.
The following message is logged on the running node when the other
node reboots. There should be one such message per shared SCSI bus.
    unix: Received unexpected SCSI Reset
The i/os are returned with the reset flag set, as reported by the sd driver.
    unix: WARNING: /sbus@a,0/QLGC,isp@2,10000/sd@4,5 (sd229):
    unix: SCSI transport failed: reason 'reset': retrying command
This is also normal.
As a result of the SCSI bus reset, commands are transferred from the
request queue to the response queue with the reset flag set. They are
not sent to the device until the isp driver sends a marker to the
firmware on the HBA. Because this queue-to-queue transfer is memory to
memory, it completes very quickly. (Refer to bug# 4283089 for isp chip
function during reset handling.)
The sd driver then retries the i/o (up to sd_retry_count times).
This can be seen in the following message.
    unix: WARNING: /sbus@a,0/QLGC,isp@2,10000/sd@4,5 (sd229):
    unix: SCSI transport failed: reason 'timeout': retrying command
Due to this bug, under heavy i/o load, all of the sd driver retries can
fail before the marker is accepted by the firmware on the HBA.
The sd driver then fails the i/o ("giving up") and returns an error to the
upper layer. This is identified by the following message.
    unix: WARNING: /sbus@a,0/QLGC,isp@2,10000/sd@4,5 (sd229):
    unix: SCSI transport failed: reason 'timeout': giving up
After this i/o error, the sd driver gives up trying to communicate with that
SCSI device.
[End of Example 1]
If this situation occurs, the system may lose access to one or more
devices on a SCSI bus. This could include devices holding normal
filesystems, the root disk, raw devices used by databases, etc.
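The message sequence in Example 1 can be searched for in a saved system
log. The following Python sketch is illustrative only and is not part of
any Sun tool. It assumes that the "SCSI transport failed" line directly
follows the WARNING line naming the device, as in the example above; the
function names and stage labels are hypothetical.

```python
import re

# Message fragments copied from Example 1. The stage labels on the left
# are invented for this sketch.
RESET_MSG = "Received unexpected SCSI Reset"
REASONS = {
    "retry_reset": "reason 'reset': retrying command",
    "retry_timeout": "reason 'timeout': retrying command",
    "gave_up": "reason 'timeout': giving up",
}

# Matches the WARNING line that names the device, for example:
#   unix: WARNING: /sbus@a,0/QLGC,isp@2,10000/sd@4,5 (sd229):
DEVICE_RE = re.compile(r"WARNING:\s+(\S+)\s+\((sd\d+)\):")


def classify(lines):
    """Return (reset_count, {sd instance: set of stage labels}).

    Only devices whose path contains "isp@" are tracked. Assumes the
    reason line immediately follows its WARNING line.
    """
    resets = 0
    stages_by_device = {}
    current = None  # sd instance named by the preceding WARNING line
    for line in lines:
        if RESET_MSG in line:
            resets += 1
            continue
        match = DEVICE_RE.search(line)
        if match:
            path, instance = match.groups()
            current = instance if "isp@" in path else None
            continue
        if current is not None:
            for stage, fragment in REASONS.items():
                if fragment in line:
                    stages_by_device.setdefault(current, set()).add(stage)
        current = None  # a reason line applies only to the line before it
    return resets, stages_by_device


def gave_up(stages_by_device):
    """List the sd instances for which the driver reported 'giving up'."""
    return sorted(dev for dev, stages in stages_by_device.items()
                  if "gave_up" in stages)
```

Devices returned by gave_up() are the ones that may have gone offline.
Devices that show only the 'reset' retry stage match the behavior the
example describes as normal.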
Implementation:
---
| | MANDATORY (Fully Pro-Active)
---
---
| X | CONTROLLED PRO-ACTIVE (per Sun Geo Plan)
---
---
| | REACTIVE (As Required)
---
Corrective Action:
Enterprise Customers and authorized Field Service Representatives may
avoid the above-mentioned problems by following the recommendations
shown below:
Apply isp patch 105600-15 or greater for Solaris 2.6 or
patch 106924-05 or greater for Solaris 7.
(If this situation occurs before the patch above is applied, then the
system may have to be rebooted to regain control of the affected
devices.)
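The patch-level check above can be sketched in Python as follows. This is
a minimal illustration, not a supported tool. It assumes the installed
patch IDs have already been collected as strings of the form
"NNNNNN-RR"; how they are gathered is outside this sketch, and the
function names are hypothetical.

```python
# Minimum isp patch revisions, taken from the Corrective Action text.
MIN_REVISION = {
    "105600": 15,  # Solaris 2.6
    "106924": 5,   # Solaris 7
}


def parse_patch_id(patch_id):
    """Split 'NNNNNN-RR' into (base, revision); raise ValueError if malformed."""
    base, _, revision = patch_id.strip().partition("-")
    return base, int(revision)


def has_isp_fix(installed_patch_ids):
    """Return True if any installed isp patch meets the minimum revision."""
    for patch_id in installed_patch_ids:
        try:
            base, revision = parse_patch_id(patch_id)
        except ValueError:
            continue  # skip strings that are not patch IDs
        if base in MIN_REVISION and revision >= MIN_REVISION[base]:
            return True
    return False
```

For example, has_isp_fix(["105600-14"]) returns False, while
has_isp_fix(["105600-15"]) and has_isp_fix(["106924-05"]) return True.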
The recommendation is to evaluate customer configurations to determine
whether this change applies. This problem can affect any devices
connected to UDWIS, DWIS, and SWIS cards including MultiPacks, D1000,
A1000, A3x00, and A7000 Sun Storage products, as well as SCSI-attached
OEM storage products, especially (but not only) when connected in
multi-initiator configurations.
It is also strongly recommended that any mission-critical sites implement
this change.
Comments:
--------------------------------------------------------------------------
Implementation Footnote:
i) In case of MANDATORY FINs, Enterprise Services will attempt to
contact all affected customers to recommend implementation of
the FIN.
ii) For CONTROLLED PRO-ACTIVE FINs, Enterprise Services mission critical
support teams will recommend implementation of the FIN (to their
respective accounts), at the convenience of the customer.
iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the
need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network
browser as follows:
SunWeb Access:
--------------
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/
* From there, select the appropriate link to query or browse the FIN and
FCO Homepage collections.
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/
* From there, select the appropriate link to browse the FIN or FCO index.
Supporting Documents:
---------------------
* Supporting documents for FIN/FCOs can be found on Edist. Edist can be
accessed internally at the following URL: http://edist.corp/.
* From there, follow the hyperlink path of "Enterprise Services
Documentation" and click on "FIN & FCO attachments", then choose the
appropriate folder, FIN or FCO. This will display supporting
directories/files for FINs or FCOs.
Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to finfco-manager@Sun.COM
---------------------------------------------------------------------------