Document Audience:	INTERNAL
Document ID:	A0209-1
Title:	Sun Fire 15K & Sun Fire 12K with Crystal+ cards may experience panics in the pcisch driver.
Copyright Notice:	Copyright © 2007 Sun Microsystems, Inc. All Rights Reserved
Update Date:	Fri Jun 13 00:00:00 MDT 2003

----------------------------------------------------------------------------
             - Sun Proprietary/Confidential: Internal Use Only -
----------------------------------------------------------------------------

                             FIELD CHANGE ORDER
            (For Authorized Distribution by Enterprise Services)

FCO #: A0209-1

Status: inactive

Synopsis: Sun Fire 15K & Sun Fire 12K with Crystal+ cards may experience panics in the pcisch driver.

Date: Jun/13/2003

SunAlert: No

Top FIN/FCO Report: No

Products Reference: Sun Fire 15K/12K

Product Category: Server / System Component

Product Affected:

Mkt_ID     Platform     Model    Description          Serial Number
------     --------     -----    -----------          -------------
  -         F15K          -      Sun Fire 15K               -
  -         F12K          -      Sun Fire 12K               -

X-Options Affected

Mkt_ID     Platform     Model    Description                     Serial Number
------     --------     -----    -----------                     -------------
X6727A     F15K/F12K      -      PCI Dual FC Network Adapter+          -

Parts Affected:

Part Number     Description                             Model
-----------     -----------                             -----
375-3030-xx     PCI Dual FC Network Adapter+              -


(SCSI Devices)
Type   Vendor    Model     SerialNumber(Min)   SerialNumber(Max)   Firmware
----   ------    -------   -----------------   -----------------   --------
N/A

References:

  ESC: 537306
  FIN: IO852-1
  BugID: 4699182

Issue Description:

Sun Fire 15K and Sun Fire 12K systems with PCI Dual FC Network Adapter+
(Crystal+) used in the 66MHz slots may experience pcisch driver panics
due to a parity error on the PCI Bus.

Panics in the pcisch driver cover a wide range of possible failures.
In this case, the control status register (CSR) calls out the detection
of bad parity on the PCI bus:

  WARNING: pcisch-19: PCI fault log start:
  PCI SERR
  PCI error occurred on device #0
  dwordmask=0 bytemask=0
  pcisch-19: PCI primary error (0):pcisch-19: PCI secondary error (0):pcisch-19:
       PBM AFAR 0.00000000:WARNING: pcisch19: PCI config space
       CSR=0xc2a0
  pcisch-19: PCI fault log end.

  panic[cpu128]/thread=2a10001fd20: pcisch-19: PCI bus 3 error(s)!

  000002a10001bea0 pcisch:pbm_error_intr+148 (30000b643d8, 2772, 30000b84548, 3,
        30000b643d8, 3)
    %l0-3: 00000300008b9860 0000000000004000 0000000000000000 0000030000b86584
    %l4-7: 00000300009978c8 0000030008d03ea8 0000000000000000 0000030008d03ed0
  000002a10001bf50 unix:current_thread+44 (0, ffffffffffffffff, 0, 300335b3528,
        0, 1044f340)
    %l0-3: 0000000010007450 000002a10001f061 000000000000000e 0000000000000016
    %l4-7: 0000000000010000 00000300339922a8 000000000000000b 000002a10001f910
  000002a10001f9b0 unix:disp_getwork+40 (1044e398, 0, 1044f340, 10457310, 2, 0)
    %l0-3: 000000001010e2d8 0000000010509e00 00000300335bd518 000002a100c37d20
    %l4-7: 000002a100cebd20 0000000002736110 0000000000000000 000002a10001f9c0
  000002a10001fa60 unix:idle+a4 (0, 0, 80, 1044e398, 3000096d980, 0)
    %l0-3: 0000000010043d58 2030205b275d2076 616c20696e646578 000002a10011dd20
    %l4-7: 70636220290a2020 202e22202073703a 20222031205b275d 2076616c20696e64

NOTE:  The stack itself can differ from case to case.  What matters is
       the CSR value (specifically the "detected-parity-error" bit).
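
To illustrate how the CSR value can be read, the following Python sketch
decodes the flag bits of a 16-bit PCI configuration-space status register.
The bit positions are the standard ones from the PCI Local Bus
Specification; the function name and the short flag labels are hypothetical
and are not part of any Sun tool.

```python
# Standard PCI status register flag bits (PCI Local Bus Specification).
# The label strings are illustrative names chosen for this sketch.
PCI_STATUS_BITS = {
    15: "detected-parity-error",
    14: "signaled-system-error",
    13: "received-master-abort",
    12: "received-target-abort",
    11: "signaled-target-abort",
    8: "master-data-parity-error",
    7: "fast-back-to-back-capable",
    5: "66mhz-capable",
}


def decode_pci_status(csr):
    """Return the names of the flag bits set in a 16-bit PCI status value."""
    return [name
            for bit, name in sorted(PCI_STATUS_BITS.items(), reverse=True)
            if csr & (1 << bit)]


# CSR value taken from the fault log above.
for flag in decode_pci_status(0xc2a0):
    print(flag)
```

For the CSR=0xc2a0 value in the fault log, bit 15 (detected parity error)
and bit 14 (signaled system error) are both set, which is the signature
this FCO addresses.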

Although this type of panic can result from a hardware issue on any adapter,
this FCO addresses only those with a PCI Dual FC Network Adapter+.  In
addition, this FCO applies only to failures in the 66 MHz slots (bottom
slots of an hsPCI assembly).

With every other panic of this nature, a hardware replacement has resolved
the case.  However, with one customer, repeated hardware replacements did not
resolve the issue.  The customer's issue has since been replicated on multiple
machines in an engineering environment.  There are some unique factors that
are needed to create this scenario:

  A. To date, this problem has only been seen on 375-3030 (Crystal+)
     cards.
  B. All the panics have been in either slot 0 or slot 2 of the I/O Boat.
     (Slots 0 and 2 are the lower 66 MHz slots.)
  C. Schizo 2.3 seems to bring the problem out with more regularity.
  D. Veritas software (specifically adding mirrors to volumes) seems
     to increase the likelihood of failure.

Please review the Steps for Diagnosis in the Special Considerations
section below before implementing any corrective action.

Parts Affected:

N/A

Implementation:

 ---
|   |   MANDATORY (Fully Pro-Active)
 ---

 ---
|   |   CONTROLLED PRO-ACTIVE (per Sun Geo Plan)
 ---

 ---
| X |   UPON FAILURE
 ---

Replacement Time Estimate:

4.0 hours

Special Considerations:

Steps for Diagnosis
===================

  a) Isolate the offending PCI bus:

 As a reminder, when looking at a Starcat I/O boat, the slots
 are designated:

   |--------------------------|--------------------------|
   | Schizo 1, leaf B (33MHz) | Schizo 0, leaf B (33MHz) |
   |--------------------------|--------------------------|
   | Schizo 1, leaf A (66MHz) | Schizo 0, leaf A (66MHz) |
   |--------------------------|--------------------------|

   OR

   |--------|--------|
   | Slot 3 | Slot 1 |
   |   OR   |   OR   |
   | X.1.1.1| X.1.0.1|
   |--------|--------|
   | Slot 2 | Slot 0 |
   |   OR   |   OR   |
   | X.1.1.0| X.1.0.0|
   |--------|--------|

   NOTE: X = hsPCI number (0-17)

To diagnose the pcisch panic from the above stack, follow these steps:

  Use the /etc/path_to_inst on the domain or the cfgadm/rcfgadm commands
  to isolate the slot.  For example, using the two methods with the panic
  above (pcisch-19):

      # grep pcisch /etc/path_to_inst
      "/pci@3d,600000" 7 "pcisch"
      "/pci@1c,700000" 0 "pcisch"
      "/pci@3c,700000" 4 "pcisch"
  --> "/pci@9d,600000" 19 "pcisch"
      "/pci@9c,600000" 17 "pcisch"
      "/pci@3c,600000" 5 "pcisch"
      "/pci@5d,600000" 11 "pcisch"
      "/pci@7d,600000" 15 "pcisch"


     In this case, instance 19 is "/pci@9d,600000".  To
     translate that into a slot location, convert the 9d to
     binary <10011101>, then split it into fields <100 1110 1>.
     The leading field (100) is expander (hsPCI) 4, the middle
     field (1110) is skipped, and the trailing bit (1) is
     Schizo 1 (the PCI slots on the left).
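
     The binary breakdown above can be sketched in Python as follows.
     The function name is hypothetical.  The mapping of the 600000
     offset to leaf A and the 700000 offset to leaf B is an assumption
     inferred from the sample path_to_inst and rcfgadm outputs in this
     FCO, not a documented rule.

```python
import re


def decode_pcisch_path(path):
    """Decode a path such as '/pci@9d,600000' into (expander, schizo, slot)."""
    m = re.fullmatch(r"/pci@([0-9a-f]+),([0-9a-f]+)", path)
    if m is None:
        raise ValueError("unexpected device path: %r" % path)
    agent = int(m.group(1), 16)
    offset = int(m.group(2), 16)

    expander = agent >> 5   # leading bits, e.g. 100 -> expander 4
    schizo = agent & 1      # trailing bit: 1 = left side, 0 = right side

    # ASSUMPTION (inferred from the sample outputs in this FCO):
    # offset 600000 is leaf A (66 MHz), offset 700000 is leaf B (33 MHz).
    if offset == 0x600000:
        leaf_b = 0
    elif offset == 0x700000:
        leaf_b = 1
    else:
        raise ValueError("unexpected leaf offset in %r" % path)

    slot = 2 * schizo + leaf_b
    return expander, schizo, slot


print(decode_pcisch_path("/pci@9d,600000"))
```

     For "/pci@9d,600000" this yields expander 4, Schizo 1, slot 2,
     which agrees with the slot diagram above.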

     The other option is to leverage the conversion the dynamic
     reconfiguration interface provides:

  # rcfgadm -d a -la | grep pcisch
  pcisch0:e00b1slot1      pci-pci/hp   connected    configured   ok
  pcisch10:e02b1slot3     unknown      connected    unconfigured unknown
  pcisch11:e02b1slot2     pci-pci/hp   connected    configured   ok
  pcisch12:e03b1slot1     pci-pci/hp   connected    configured   ok
  pcisch13:e03b1slot0     pci-pci/hp   connected    configured   ok
  pcisch14:e03b1slot3     unknown      connected    unconfigured unknown
  pcisch15:e03b1slot2     pci-pci/hp   connected    configured   ok
  pcisch16:e04b1slot1     unknown      connected    unconfigured unknown
  pcisch17:e04b1slot0     pci-pci/hp   connected    configured   ok
  pcisch18:e04b1slot3     unknown      connected    unconfigured unknown
     --> pcisch19:e04b1slot2     unknown      empty   unconfigured unknown
  pcisch1:e00b1slot0      unknown      empty   unconfigured unknown
  pcisch20:e08b1slot1     unknown      empty   unconfigured unknown
  pcisch21:e08b1slot0     pci-pci/hp   connected    configured   ok
  pcisch22:e08b1slot3     unknown      empty   unconfigured unknown
  pcisch23:e08b1slot2     unknown      empty   unconfigured unknown
  pcisch2:e00b1slot3      unknown      connected    unconfigured unknown
  pcisch3:e00b1slot2      pci-pci/hp   connected    configured   ok
  pcisch4:e01b1slot1      pci-pci/hp   connected   configured   ok
  pcisch5:e01b1slot0      unknown      empty   unconfigured unknown
  pcisch6:e01b1slot3      unknown      connected    unconfigured unknown
  pcisch7:e01b1slot2      pci-pci/hp   connected    configured   ok
  pcisch8:e02b1slot1      pci-pci/hp   connected    configured   ok
  pcisch9:e02b1slot0      unknown      connected    unconfigured unknown

     In this case, the issue is on expander 4 (e04), I/O board
     (b1), slot 2.
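
     As a rough sketch, the attachment point names in the rcfgadm
     output can also be parsed mechanically.  The helper below is
     hypothetical; it assumes the "pcischN:eXXbYslotZ" naming shown in
     the listing above and works on captured output text.

```python
import re

# Matches attachment points such as "pcisch19:e04b1slot2".
AP_ID = re.compile(r"pcisch(\d+):e(\d+)b(\d+)slot(\d+)")


def locate_instance(rcfgadm_output, instance):
    """Return (expander, board, slot) for a pcisch instance, or None."""
    for line in rcfgadm_output.splitlines():
        m = AP_ID.search(line)
        if m and int(m.group(1)) == instance:
            return int(m.group(2)), int(m.group(3)), int(m.group(4))
    return None


# A few lines copied from the rcfgadm listing above.
sample = """\
pcisch18:e04b1slot3     unknown      connected    unconfigured unknown
pcisch19:e04b1slot2     unknown      empty   unconfigured unknown
pcisch1:e00b1slot0      unknown      empty   unconfigured unknown
"""

print(locate_instance(sample, 19))
```

     Matching on the full instance number avoids confusing pcisch1
     with pcisch19 when grepping by hand.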

  b) Once the offending FRU has been identified, follow FIN IO852-1
     and replace the hsPCI and the cassette called out in the panic.
     Once completed, replace ALL x6727A's within the domain with
     x6768A (Crystal2A), including x6727A's that have not generated
     panics.

     So, for the example above, we would replace the hsPCI in
     slot 4, the cassette in slot 2 (lower left), the x6727A
     with an x6768A and all other x6727A's in this domain.

     It is expected that some customers may wish to take the
     down time and replace all x6727A's in their entire platform
     where applicable.  This action has been approved under this
     FCO.

      EXCEPTION:  If the customer has attached an A3500FC (540-4026 or
      540-4027) to the F12/15K via Crystal+, then x6799A (Amber) must
      be used in place of x6768A (Crystal 2A).

  c) There are some hardware prerequisites you might have to
  contend with:

  - The cables used for the x6768A differ from the cables
    used for the x6727A.  Before performing this FCO,
    verify and replace all required cables or use LC ->
    SC adapters.

    Replace 537-1004  2 Meter SC-SC with 537-1035  2 Meter LC-SC

    Replace 537-1020  5 Meter SC-SC with 537-1033  5 Meter LC-SC

    Replace 537-1004 15 Meter SC-SC with 537-1034 15 Meter LC-SC

    If a custom length SC-SC cable is in use, order 0.4 Meter LC-SC
    cable 537-1036 and SC-SC Female to Female coupler 130-4723.


  d) There are some software prerequisites you might have to
  contend with:

  - Slot 1 DR will be available with SMS1.3.  The current
    target date (always subject to change) for SMS1.3 is
    sometime near the end of January 2003.  Until then,
    the system will have to incur a downtime for
    replacement.

  - If the boot device is on the to-be-replaced hsPCI, it will be
    necessary to have planned the system configuration to
    allow a boot-device DR (i.e. multipathing/mirroring,
    etc).  If you do not have such a capability, the
    domain will have to incur a downtime for
    replacement.

  - Once the x6727A has been replaced with a x6768A, the
    controller number for the disk will change unless you
    follow the procedure below.

  - The customer will need to download the drivers for the x6768A.
    Reference the procedure below.  NOTE: DO NOT FORGET
    TO UPDATE/PATCH THE JUMPSTART IMAGE, IF APPLICABLE.

    At the time of authoring this FCO, the driver requirements
    are as follows, according to:

      Sun StorEdge 2G FC PCI Dual Channel Network Adapter
      Product Notes (Part No. 816-5002-11, June 2002,
      Revision A)

      Before installing the Sun StorEdge 2G FC PCI Dual
      Channel Network Adapter card, the host must have
      both the Solaris 8 update 4 operating environment
      release with the recommended patch cluster and the
      Sun StorEdge 2G FC PCI Dual Channel Network Adapter
      driver.

      Check http://www.sun.com/download/ or
      http://www.sun.com/storage/san for updates.  There
      is one set of packages for the Solaris 8 operating
      environment and another for the Solaris 9 operating
      environment available under the respective links
      for the operating environments. The SUNWsan package
      is interchangeable between the releases.

      Packages:

      SUNWsan
      SUNWcfpl
      SUNWcfplx

      Available at: http://www.sun.com/download/ or
      http://www.sun.com/storage/san

      Patches (NOTE: patches might be uprev'ed.  These
      are the minimum requirements):

                                           Solaris 8   Solaris 9
                                           ---------   ---------
      Sun StorEdge Traffic Manager patch   111412-09   113039-01
      fctl/fp/fcp/usoc driver              111095-10   113040-01
      fcip driver                          111096-04   113041-01
      qlc driver                           111097-10   113042-02
      luxadm/liba5k and libg_fc patch      111413-08   113043-01
      cfgadm fp plug-in library patch      111846-04   113044-01
      SAN Foundation Kit patch             111847-04   111847-04

      Available at: sunsolve.sun.com
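
      Because the revisions above are minimums, a quick comparison
      against the patches installed on the domain can help.  The
      following Python sketch is illustrative only: the function name
      is hypothetical, and the list of installed patch IDs is assumed
      to have been collected separately on the domain.

```python
# Minimum patch revisions copied from the table above.
MIN_PATCHES = {
    "Solaris 8": ["111412-09", "111095-10", "111096-04", "111097-10",
                  "111413-08", "111846-04", "111847-04"],
    "Solaris 9": ["113039-01", "113040-01", "113041-01", "113042-02",
                  "113043-01", "113044-01", "111847-04"],
}


def missing_patches(release, installed):
    """List minimum patches that are absent or below the required revision."""
    have = {}
    for patch in installed:
        base, rev = patch.split("-")
        # Keep the highest installed revision for each patch base number.
        have[base] = max(have.get(base, -1), int(rev))

    missing = []
    for patch in MIN_PATCHES[release]:
        base, rev = patch.split("-")
        if have.get(base, -1) < int(rev):
            missing.append(patch)
    return missing


# Hypothetical installed list, for illustration only.
for patch in missing_patches("Solaris 8",
                             ["111412-10", "111095-09", "111097-10"]):
    print("needs", patch)
```

      A newer revision of the same patch satisfies the minimum; an
      older revision or a missing patch is reported.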

  - There is a known issue with replacing a Crystal+ card when
    the boot device is encapsulated.  If the replacement is not
    done correctly, the device major numbers will be set
    incorrectly, causing panics.  Please reference the procedure
    below, specifically the name_to_major file references.

  - Replacement Procedure:

  Replace Crystal+ cards with Crystal2A on a F15K/12K

==================================================================

  I. Prerequisites

    - Crystal2A drivers, patches and packages.
      [ refer to Sun StorEdge 2G FC PCI Dual Channel
      Network Adapter Product Notes ]

    - Solaris 8 2/02 with current recommended patch
      cluster and SAN patches.


    - Dedicated Solaris 8 2/02 network boot/jumpstart image with
      Crystal2A drivers, packages and patches.

    - A good backup of all filesystems.

==================================================================

  II. Preparation

  1. If you are replacing a controller that contains the
  boot device or a Veritas Volume Manager device in
  rootdg, you will have to create a boot server image
  that contains the Crystal2A drivers. Otherwise you may
  skip this step.


     To do this, first create a Solaris 8 02/02
     JumpStart Boot server.  Then, install the Crystal2A
     drivers, patches and packages into this image and
     copy the /etc/name_to_major file from the domain
     onto the boot image. (This will prevent problems
     with differing major numbers between the domain and
     the boot image).

     Example using a Solaris 8 02/02 boot server image located at
     /jumpstart/5.8_HW202:

   For Packages:
   =============
   cd [ location of packages ]
   pkgadd -R /jumpstart/5.8_HW202/Solaris_8/Tools/Boot -d .

   For Patches:
   ============
   cd [ location of patches ]
   patchadd -C /jumpstart/5.8_HW202/Solaris_8/Tools/Boot ./[patchid]

   For /etc/name_to_major:
   =======================
   cd /jumpstart/5.8_HW202/Solaris_8/Tools/Boot/etc
   cp name_to_major name_to_major.orig
   ftp domain
   ftp> cd /etc
   ftp> get name_to_major
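
   To confirm that the copy removed any major number differences, the
   two files can be compared.  The Python sketch below is a hedged
   illustration: the function names are hypothetical, it assumes the
   usual "driver-name major-number" line format of name_to_major, and
   the sample contents are invented for the example.

```python
def parse_name_to_major(text):
    """Parse name_to_major content into {driver_name: major_number}."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # tolerate comments and blanks
        if not line:
            continue
        name, major = line.split()
        table[name] = int(major)
    return table


def major_mismatches(domain_text, image_text):
    """Return {driver: (domain_major, image_major)} for every difference."""
    domain = parse_name_to_major(domain_text)
    image = parse_name_to_major(image_text)
    return {name: (domain.get(name), image.get(name))
            for name in sorted(set(domain) | set(image))
            if domain.get(name) != image.get(name)}


# Hypothetical sample contents, for illustration only.
domain_copy = "sd 32\nqlc 153\nvxio 270\n"
image_copy = "sd 32\nqlc 271\n"

for driver, majors in major_mismatches(domain_copy, image_copy).items():
    print(driver, majors)
```

   An empty result means the boot image and the domain agree on every
   driver major number.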

  2. Bring the domain down to single user mode

   OK> boot -s

  3. Install patches and packages on the domain.

   Follow normal patchadd and pkgadd procedures.

  4. Verify the controller number[s] for the card being replaced

   # format
   # ls -l /dev/dsk
   # ls -l /dev/es
     ( You may want to save this output for reference. )

==================================================================

  III. Replacing a controller NOT used for the boot device.

  (This example uses c1 as the controller to be changed.)

  1. If Volume Manager is being used, disable it from starting.

 # touch /etc/vx/reconfig.d/state.d/install-db

  2. Reboot domain into single user mode

 # init 0
 OK> boot -s

     Volume manager should not be running at this point.

 # ps -ef | grep vx
     (This should show no volume manager processes.)

  3. Remove the devices associated with the controller to be
     replaced.

 # cd /dev/dsk
 # rm c1*

 # cd /dev/rdsk
 # rm c1*

 # cd /dev/cfg
 # rm c1 [ this entry may or may not exist ]

 # cd /dev/es    ( if applicable )
 # rm ses2 ses3  ( for the ses devices associated with c1 )

  4. Shut down and replace the cards.  Make sure auto-boot? is
     false.

 # init 0
 OK> setenv auto-boot? false

     Shut off the domain from the SC

 setkeyswitch -d [domainid] off

     Replace the card, turn on the domain.

 setkeyswitch -d [domainid] on

     Verify that the new controller is available from OBP.

 OK> probe-scsi-all

     Do a single user reconfiguration boot

 OK> boot -sr

  5. Verify that the devices were created as expected:

 # format
 # ls -l /dev/dsk/c1* /dev/rdsk/c1*
 # ls -lL /dev/dsk/c1* /dev/rdsk/c1*
 # ls -l /dev/es

     If Veritas Volume Manager was NOT used, check that the
     devices can be mounted.  If all looks good continue to
     multiuser.

     # mountall

  6. If Veritas was disabled previously (Step 1), re-enable it and
     reboot.

 # rm /etc/vx/reconfig.d/state.d/install-db
 # init 6

     Verify that Veritas started correctly and all volumes are
     available.

==================================================================

  IV. Replacing a controller that IS used for the boot device.

     (This example uses c0 as the controller to be changed.)

  1. If the boot disk is encapsulated, you must first
     unencapsulate the boot device.

     Reboot and verify that the boot disk has been successfully
     unencapsulated.

  2. If Volume Manager is being used, disable it from starting.

 # touch /etc/vx/reconfig.d/state.d/install-db

  3. Reboot domain into single user mode

 # init 0
 OK> boot -s

     Volume manager should not be running at this point.

 # ps -ef | grep vx
     (This should show no volume manager processes.)

  4. Shut down and replace the cards.  Make sure auto-boot? is
     false.

 # init 0
 OK> setenv auto-boot? false

     Shut off the domain from the SC

 setkeyswitch -d [domainid] off

     Replace the card, turn on the domain.

 setkeyswitch -d [domainid] on

     Verify that the new controller is available from OBP.

 OK> probe-scsi-all

  5. Boot from the Crystal-2A enabled JumpStart Boot server (as
     described under Preparation).  Verify that the devices are
     visible.

 OK> boot net -s  (from Crystal-2A patched JumpStart)

 # format

  6. Mount the boot device's / partition at /a.  Remove the
     previous controller's device nodes.

 # mount /dev/dsk/[boot-device-root-slice] /a
 # rm /a/dev/dsk/c0* /a/dev/rdsk/c0*
 # rm /a/dev/cfg/c0  (this may or may not exist)

 # rm /a/dev/es/ses0 /a/dev/es/ses1
     ( for the ses devices associated with c0 )

  7. Build and verify the new device nodes and reset-all.

 # devfsadm -r /a -p /a/etc/path_to_inst

 # ls -l /a/dev/dsk/c0* /a/dev/rdsk/c0*
 # ls -lL /a/dev/dsk/c0* /a/dev/rdsk/c0*
 # ls -l /a/dev/es

 # umount /a
 # halt

 OK> reset-all

  8. Determine the new boot device path.

     The device path WILL change because the Crystal2A has a
     different FW PROM.

     For example, original boot device:

 /pci@3d,600000/pci@1/SUNW,qlc@4/fp@0,0/disk@w220000203733433b,0:a

     New boot device:

 /pci@3d,600000/SUNW,qlc@1/fp@0,0/disk@w220000203733433b,0:a

 OK> show-disks      (check for new card's device)
 OK> probe-scsi-all  (check that disks are visible)

  Verify the new path of the boot device and use nvunalias and
  nvalias to record it.

 OK> nvunalias [old-boot-device-alias]
 OK> nvalias [device-alias] [device-path]
 OK> setenv boot-device [device-alias]
 OK> setenv diag-device [device-alias]  (if desired)
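
  When recording the new alias, it can help to confirm that only the
  controller portion of the path changed.  The Python sketch below
  compares the final disk component of the two example paths from this
  step; the function name is hypothetical, and the check is an
  illustration based on the example shown here rather than a general
  rule.

```python
import re


def disk_target(device_path):
    """Return the final 'disk@...' component of an OBP device path."""
    m = re.search(r"/(disk@[^/]+)$", device_path)
    if m is None:
        raise ValueError("no disk component in %r" % device_path)
    return m.group(1)


# Example paths copied from this step.
old = "/pci@3d,600000/pci@1/SUNW,qlc@4/fp@0,0/disk@w220000203733433b,0:a"
new = "/pci@3d,600000/SUNW,qlc@1/fp@0,0/disk@w220000203733433b,0:a"

if disk_target(old) == disk_target(new):
    print("same disk target:", disk_target(new))
else:
    print("disk target differs; re-check the new path")
```

  In the example, the WWN, LUN and slice in the disk component are
  identical, and only the path through the adapter differs.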

  9. Boot off the new path into single user mode.  Verify that the
     devices were created as expected.

 OK> boot -s

 # format

     If Veritas Volume Manager was NOT used, check that the
     devices can be mounted.  If all looks good continue to
     multiuser.

 # mountall


  10. If Veritas was disabled previously (Step 2), re-enable it
      and reboot.

 # rm /etc/vx/reconfig.d/state.d/install-db
 # init 6

     Verify that Veritas started correctly and all volumes are
     available.

Corrective Action:

Important! Troubleshoot pcisch driver panics as outlined above and
           in FIN IO852-1 and follow instructions outlined in the
           Special Considerations section.

  A. Replace all 375-3030-xx (Crystal+) cards with 375-3108-xx
     (Crystal-2A) cards in the affected domain.

OR

  B. If the customer has attached an A3500FC (540-4026 or 540-4027)
     to the F12/15K via Crystal+, replace all 375-3030-xx (Crystal+)
     cards with 375-3019-xx (Amber) cards in the affected domain.

Either action will require new drivers to be installed and LC-SC
or LC-LC Fibre Cables.  See Product Note 816-5002 for details:

   http://infoserver.central/data/816/816-5002/pdf/816-5002-11.pdf

Comments:

Billing Type:

 Warranty: Sun will provide parts at no charge under Warranty
           Service. On-Site Labor Rates are based on how the
           system was initially installed.

 Contract: Sun will provide parts at no charge. On-Site Labor Rates
           are based on the type of service contract.

 Non Contract: Sun will provide parts at no charge. Installation by
               Sun is available based on the On-Site Labor Rates
               defined in the Price List.

--------------------------------------------------------------------------

Implementation Footnote:

________________________

i)   In case of Mandatory FCOs, Sun Services will attempt to contact
     all known customers to recommend the part upgrade.

ii)  For controlled proactive swap FCOs, Sun Services mission critical
     support teams will initiate proactive swap efforts for their respective
     accounts, as required.

iii) For Replace upon Failure FCOs, Sun Services partners will implement
     the necessary corrective actions as and when they are required.

--------------------------------------------------------------------------

All released FINs and FCOs can be accessed using your favorite network
browser as follows:

SunWeb Access:
______________

* Access the top level URL of http://sdpsweb.Central/FIN_FCO/

* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.

SunSolve Online Access:
_______________________

* Access the SunSolve Online URL at http://sunsolve.Central/

* From there, select the appropriate link to browse the FIN or FCO index.

Internet Access:
_______________

* Access the top level URL of  https://spe.sun.com

--------------------------------------------------------------------------
General:
________

Send questions or comments to finfco-manager@Sun.COM

---------------------------------------------------------------------------
