Replacing a drive on a Sun Fire[TM] X4500 that has not been explicitly failed by ZFS

Asset ID:	1-72-1011391.1
Update Date:	2011-05-31
Keywords:

Solution Type Problem Resolution Sure

Solution 1011391.1 : Replacing a drive on a Sun Fire[TM] X4500 that has not been explicitly failed by ZFS

Related Items


Sun Fire X4500 Server
 Solaris SPARC Operating System

Related Categories


GCS>Sun Microsystems>Servers>x64 Servers

Previously Published As
215625


Symptoms
In some cases the SMART (Self-Monitoring, Analysis and Reporting Technology) firmware on a Sun Fire[TM] X4500 drive predictively fails the disk, and the fault is reported by fmadm.

ZFS, however, can still report the disk as healthy. The service manual recommends running cfgadm -c unconfigure to replace the drive, but because the drive is still healthy according to ZFS, the command fails with the following error:

root@th12 # cfgadm -c unconfigure sata1/7::dsk/c1t7d0

Unconfigure the device at: /devices/pci@0,0/pci1022,7458@2/pci11ab,11ab@1:7

This operation will suspend activity on the SATA device

Continue (yes/no)  yes

cfgadm: Hardware specific failure: Failed to unconfig device at ap_id: /devices/pci@0,0/pci1022,7458@2/pci11ab,11ab@1:7

Resolution
Because ZFS still considers the drive healthy, the drive must be taken offline in ZFS (zpool offline), unconfigured with cfgadm, physically replaced, configured again with cfgadm, and finally replaced in the pool with zpool replace.
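
The sequence above can be sketched as a small shell function that only prints the commands in order, so the plan can be reviewed before anything is run. The function name print_replace_plan and its arguments are illustrative and not part of the original procedure; the values in the usage note are the ones from this example.

```shell
#!/bin/sh
# Hypothetical helper (not from the original article): print, but do not run,
# the commands for replacing a disk that ZFS still considers healthy.
# Arguments: pool name, disk (c#t#d#), cfgadm SATA port (for example sata1/7).
print_replace_plan() {
    pool=$1
    disk=$2
    ap=$3
    echo "zpool offline ${pool} ${disk}"
    echo "cfgadm -c unconfigure ${ap}::dsk/${disk}"
    echo "# physically replace the drive"
    echo "cfgadm -c configure ${ap}::dsk/${disk}"
    echo "zpool replace ${pool} ${disk} ${disk}"
}
```

For the drive in this example, the call would be: print_replace_plan zpool1 c1t7d0 sata1/7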

1. Prior to replacing the drive, cfgadm -alv shows the following output:

root@th12 # cfgadm -alv

Ap_Id                          Receptacle   Occupant     Condition  Information

When         Type         Busy     Phys_Id

sata0/0::dsk/c0t0d0            connected    configured   ok         Mod: HITACHI HDS7250SASUN500G 0627K7KP8F FRev: K2AOAJ0A SN: KRVN67ZAJ7KP8F

unavailable  disk         n        /devices/pci@0,0/pci1022,7458@1/pci11ab,11ab@1:0

sata0/1::dsk/c0t1d0            connected    configured   ok         Mod: HITACHI HDS7250SASUN500G 0628KB06EF FRev: K2AOAJ0A SN: KRVN65ZAJB06EF

unavailable  disk         n        /devices/pci@0,0/pci1022,7458@1/pci11ab,11ab@1:1

(output omitted for brevity)


sata1/7::dsk/c1t7d0            connected    configured   ok         Mod: HITACHI HDS7250SASUN500G 0628K8RH1D FRev: K2AOAJ0A SN: KRVN63ZAJ8RH1D


unavailable  disk         n        /devices/pci@0,0/pci1022,7458@2/pci11ab,11ab@1:7

2. fmadm and fmdump show the drive as faulty:

root@th12 # fmadm faulty

STATE RESOURCE / UUID

-------- ----------------------------------------------------------------------

degraded hc:///:serial=KRVN63ZAJ8RH1D/component=sata1/7

665c1b1a-7405-6f8a-adc5-be4e32dc9232

-------- ----------------------------------------------------------------------

root@th12 # fmdump

TIME                 UUID                                 SUNW-MSG-ID

Dec 01 00:23:09.5984 665c1b1a-7405-6f8a-adc5-be4e32dc9232 DISK-8000-0X

root@th12 # fmdump -v

TIME                 UUID                                 SUNW-MSG-ID

Dec 01 00:23:09.5984 665c1b1a-7405-6f8a-adc5-be4e32dc9232 DISK-8000-0X

100%  fault.io.disk.predictive-failure

Problem in: hc:///:serial=KRVN63ZAJ8RH1D:part=HITACHI-HDS7250SASUN500G-628K8RH1D:revision=K2AOAJ0A/motherboard=0/hostbridge=0/pcibus=0/pcidev=2/pcifn=0/pcibus=2/pcidev=1/pcifn=0/sata-port=7/disk=0

Affects: hc:///:serial=KRVN63ZAJ8RH1D/component=sata1/7

FRU: hc:///component=HD_ID_45

root@th12 #
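
The UUID reported by fmadm faulty is needed again in step 14. As a minimal sketch, assuming the output layout shown above, it can be extracted with awk rather than copied by hand. The function name fault_uuids is illustrative and is not part of fmadm.

```shell
#!/bin/sh
# Hypothetical helper: read `fmadm faulty` output on stdin and print every
# field that looks like a fault UUID (36 characters made up of lowercase
# hexadecimal digits and dashes), one per line.
fault_uuids() {
    awk '{
        for (i = 1; i <= NF; i++)
            if (length($i) == 36 && $i ~ /^[0-9a-f-]+$/)
                print $i
    }'
}
```

Typical use would be: fmadm faulty | fault_uuids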

3. The format command shows the following:

root@th12 # format

Searching for disks...done

AVAILABLE DISK SELECTIONS:

0. c0t0d0 <ATA-HITACHI HDS7250S-AJ0A-465.76GB>

/pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@0,0

1. c0t1d0 <ATA-HITACHI HDS7250S-AJ0A-465.76GB>

/pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@1,0

(output omitted for brevity)

14. c1t6d0 <ATA-HITACHI HDS7250S-AJ0A-465.76GB>

/pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@6,0


15. c1t7d0 <ATA-HITACHI HDS7250S-AJ0A-465.76GB>

/pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@7,0

Specify disk (enter its number): 15

selecting c1t7d0

[disk formatted]


/dev/dsk/c1t7d0s0 is part of active ZFS pool zpool1. Please see zpool(1M).

FORMAT MENU:

disk       - select a disk

type       - select (define) a disk type

partition  - select (define) a partition table

current    - describe the current disk

format     - format and analyze the disk

fdisk      - run the fdisk program

repair     - repair a defective sector

label      - write label to the disk

analyze    - surface analysis

defect     - defect list management

backup     - search for backup labels

verify     - read and display labels

inquiry    - show vendor, product and revision

volname    - set 8-character volume name

!<cmd>     - execute <cmd>, then return

quit

format> p

PARTITION MENU:

0      - change `0' partition

1      - change `1' partition

2      - change `2' partition

3      - change `3' partition

4      - change `4' partition

5      - change `5' partition

6      - change `6' partition

select - select a predefined table

modify - modify a predefined partition table

name   - name the current table

print  - display the current table

label  - write partition map and label to the disk

!<cmd> - execute <cmd>, then return

quit

partition> p

Current partition table (original):

Total disk sectors available: 976756749 + 16384 (reserved sectors)

Part      Tag    Flag     First Sector         Size         Last Sector

0        usr    wm                34      465.75GB          976756749

1 unassigned    wm                 0           0               0

2 unassigned    wm                 0           0               0

3 unassigned    wm                 0           0               0

4 unassigned    wm                 0           0               0

5 unassigned    wm                 0           0               0

6 unassigned    wm                 0           0               0

8   reserved    wm         976756750        8.00MB          976773133

partition> q

4. The zpool commands show that the pool is healthy and online:

root@th12 # zpool list

NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT

zpool1                 20.8T   1.14M   20.8T     0%  ONLINE     -

root@th12 # zpool status zpool1

pool: zpool1

state: ONLINE

scrub: none requested

config:

NAME        STATE     READ WRITE CKSUM

zpool1      ONLINE       0     0     0

raidz     ONLINE       0     0     0

c0t0d0  ONLINE       0     0     0

c1t0d0  ONLINE       0     0     0

c4t0d0  ONLINE       0     0     0

c6t0d0  ONLINE       0     0     0

c7t0d0  ONLINE       0     0     0

raidz     ONLINE       0     0     0

c0t1d0  ONLINE       0     0     0

c1t1d0  ONLINE       0     0     0

c4t1d0  ONLINE       0     0     0

c5t1d0  ONLINE       0     0     0

c6t1d0  ONLINE       0     0     0

c7t1d0  ONLINE       0     0     0

raidz     ONLINE       0     0     0

c0t2d0  ONLINE       0     0     0

c1t2d0  ONLINE       0     0     0

c4t2d0  ONLINE       0     0     0

c5t2d0  ONLINE       0     0     0

c6t2d0  ONLINE       0     0     0

c7t2d0  ONLINE       0     0     0

raidz     ONLINE       0     0     0

c0t3d0  ONLINE       0     0     0

c1t3d0  ONLINE       0     0     0

c4t3d0  ONLINE       0     0     0

c5t3d0  ONLINE       0     0     0

c6t3d0  ONLINE       0     0     0

c7t3d0  ONLINE       0     0     0

raidz     ONLINE       0     0     0

c0t4d0  ONLINE       0     0     0

c1t4d0  ONLINE       0     0     0

c4t4d0  ONLINE       0     0     0

c6t4d0  ONLINE       0     0     0

c7t4d0  ONLINE       0     0     0

raidz     ONLINE       0     0     0

c0t5d0  ONLINE       0     0     0

c1t5d0  ONLINE       0     0     0

c4t5d0  ONLINE       0     0     0

c5t5d0  ONLINE       0     0     0

c6t5d0  ONLINE       0     0     0

c7t5d0  ONLINE       0     0     0

raidz     ONLINE       0     0     0

c0t6d0  ONLINE       0     0     0

c1t6d0  ONLINE       0     0     0

c4t6d0  ONLINE       0     0     0

c5t6d0  ONLINE       0     0     0

c6t6d0  ONLINE       0     0     0

c7t6d0  ONLINE       0     0     0

raidz     ONLINE       0     0     0

c0t7d0  ONLINE       0     0     0


c1t7d0  ONLINE       0     0     0

c4t7d0  ONLINE       0     0     0

c5t7d0  ONLINE       0     0     0

c6t7d0  ONLINE       0     0     0

c7t7d0  ONLINE       0     0     0

errors: No known data errors

root@th12 #

5. To replace the drive, first take it offline in ZFS:

root@th12 # zpool offline zpool1 c1t7d0

Bringing device c1t7d0 offline


6. The zpool status command shows the following after the drive has been taken offline:

root@th12 # zpool status zpool1

pool: zpool1

state: DEGRADED

status: One or more devices has been taken offline by the adminstrator.

Sufficient replicas exist for the pool to continue functioning in a

degraded state.

action: Online the device using 'zpool online' or replace the device with

'zpool replace'.

scrub: none requested

config:

NAME        STATE     READ WRITE CKSUM


zpool1      DEGRADED     0     0     0

raidz     ONLINE       0     0     0

c0t0d0  ONLINE       0     0     0

c1t0d0  ONLINE       0     0     0

c4t0d0  ONLINE       0     0     0

c6t0d0  ONLINE       0     0     0

c7t0d0  ONLINE       0     0     0

raidz     ONLINE       0     0     0

c0t1d0  ONLINE       0     0     0

c1t1d0  ONLINE       0     0     0

c4t1d0  ONLINE       0     0     0

c5t1d0  ONLINE       0     0     0

c6t1d0  ONLINE       0     0     0

c7t1d0  ONLINE       0     0     0

raidz     ONLINE       0     0     0

c0t2d0  ONLINE       0     0     0

c1t2d0  ONLINE       0     0     0

c4t2d0  ONLINE       0     0     0

c5t2d0  ONLINE       0     0     0

c6t2d0  ONLINE       0     0     0

c7t2d0  ONLINE       0     0     0

raidz     ONLINE       0     0     0

c0t3d0  ONLINE       0     0     0

c1t3d0  ONLINE       0     0     0

c4t3d0  ONLINE       0     0     0

c5t3d0  ONLINE       0     0     0

c6t3d0  ONLINE       0     0     0

c7t3d0  ONLINE       0     0     0

raidz     ONLINE       0     0     0

c0t4d0  ONLINE       0     0     0

c1t4d0  ONLINE       0     0     0

c4t4d0  ONLINE       0     0     0

c6t4d0  ONLINE       0     0     0

c7t4d0  ONLINE       0     0     0

raidz     ONLINE       0     0     0

c0t5d0  ONLINE       0     0     0

c1t5d0  ONLINE       0     0     0

c4t5d0  ONLINE       0     0     0

c5t5d0  ONLINE       0     0     0

c6t5d0  ONLINE       0     0     0

c7t5d0  ONLINE       0     0     0

raidz     ONLINE       0     0     0

c0t6d0  ONLINE       0     0     0

c1t6d0  ONLINE       0     0     0

c4t6d0  ONLINE       0     0     0

c5t6d0  ONLINE       0     0     0

c6t6d0  ONLINE       0     0     0

c7t6d0  ONLINE       0     0     0


raidz     DEGRADED     0     0     0

c0t7d0  ONLINE       0     0     0


c1t7d0  OFFLINE      0     0     0

c4t7d0  ONLINE       0     0     0

c5t7d0  ONLINE       0     0     0

c6t7d0  ONLINE       0     0     0

c7t7d0  ONLINE       0     0     0

errors: No known data errors

root@th12 #
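
Before continuing to the cfgadm step, the device state can also be checked from a script instead of by reading the full listing. The following is a sketch only, assuming the zpool status column layout shown above; the function name vdev_state is illustrative and not part of zpool.

```shell
#!/bin/sh
# Hypothetical helper: read `zpool status` output on stdin and print the
# STATE column (for example ONLINE or OFFLINE) for the named device.
# Prints nothing if the device is not listed.
# Assumption: an awk that accepts -v; on Solaris 10 this may mean using
# nawk or /usr/xpg4/bin/awk instead of /usr/bin/awk.
vdev_state() {
    awk -v dev="$1" '$1 == dev { print $2; exit }'
}
```

Typical use would be: zpool status zpool1 | vdev_state c1t7d0
The result should be OFFLINE before the drive is unconfigured.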

7. Now that the drive has been taken offline in the ZFS pool, it can be dynamically unconfigured from OS control by running the following command:

root@th12 # cfgadm -c unconfigure sata1/7::dsk/c1t7d0

Unconfigure the device at: /devices/pci@0,0/pci1022,7458@2/pci11ab,11ab@1:7

This operation will suspend activity on the SATA device

Continue (yes/no)  yes

root@th12 # Dec  5 14:20:02 th12 sata: NOTICE: /pci@0,0/pci1022,7458@2/pci11ab,11ab@1:

Dec  5 14:20:02 th12  port 7: link lost

Dec  5 14:20:03 th12 sata: WARNING: /pci@0,0/pci1022,7458@2/pci11ab,11ab@1:

Dec  5 14:20:03 th12  SATA device detached at port 7

Dec  5 14:20:29 th12 sata: NOTICE: /pci@0,0/pci1022,7458@2/pci11ab,11ab@1:

Dec  5 14:20:29 th12  port 7: link lost

Dec  5 14:20:29 th12 sata: NOTICE: /pci@0,0/pci1022,7458@2/pci11ab,11ab@1:

Dec  5 14:20:29 th12  port 7: link established

Dec  5 14:21:30 th12 sata: NOTICE: /pci@0,0/pci1022,7458@2/pci11ab,11ab@1:

Dec  5 14:21:30 th12  port 7: device reset

Dec  5 14:21:30 th12 sata: NOTICE: /pci@0,0/pci1022,7458@2/pci11ab,11ab@1:

Dec  5 14:21:30 th12  port 7: device reset

Dec  5 14:21:30 th12 sata: NOTICE: /pci@0,0/pci1022,7458@2/pci11ab,11ab@1:

Dec  5 14:21:30 th12  port 7: link lost

Dec  5 14:21:30 th12 sata: NOTICE: /pci@0,0/pci1022,7458@2/pci11ab,11ab@1:

Dec  5 14:21:30 th12  port 7: link established

Dec  5 14:21:30 th12 sata: WARNING: /pci@0,0/pci1022,7458@2/pci11ab,11ab@1:

Dec  5 14:21:30 th12  SATA device attached at port 7

8. Notice that the drive no longer shows up in the cfgadm output:

root@th12 # cfgadm -al | grep t7


sata0/7::dsk/c0t7d0            disk         connected    configured   ok


sata2/7::dsk/c4t7d0            disk         connected    configured   ok

sata3/7::dsk/c5t7d0            disk         connected    configured   ok

sata4/7::dsk/c6t7d0            disk         connected    configured   ok

sata5/7::dsk/c7t7d0            disk         connected    configured   ok

9. At this point the drive is safe to remove. The drive's blue light should be lit, indicating that it is safe to remove. Physically replace the drive.

10. Once the drive has been physically replaced, you can configure it back into OS control by running the following command:

root@th12 # cfgadm -c configure sata1/7::dsk/c1t7d0

11. Notice that cfgadm now shows the drive again.

root@th12 # cfgadm -al | grep t7

sata0/7::dsk/c0t7d0            disk         connected    configured   ok


sata1/7::dsk/c1t7d0            disk         connected    configured   ok

sata2/7::dsk/c4t7d0            disk         connected    configured   ok

sata3/7::dsk/c5t7d0            disk         connected    configured   ok

sata4/7::dsk/c6t7d0            disk         connected    configured   ok

sata5/7::dsk/c7t7d0            disk         connected    configured   ok

root@th12 #
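
The grep t7 filter used in steps 8 and 11 matches every attachment point whose name contains t7. As a sketch, assuming the five-column cfgadm -al layout shown above, a single attachment point can be checked directly. The function name ap_is_configured is illustrative and not part of cfgadm.

```shell
#!/bin/sh
# Hypothetical helper: read `cfgadm -al` output on stdin and return success
# only if the given Ap_Id is listed with an Occupant state of "configured".
# Columns assumed: Ap_Id, Type, Receptacle, Occupant, Condition.
# Assumption: an awk that accepts -v; on Solaris 10 this may mean using
# nawk or /usr/xpg4/bin/awk instead of /usr/bin/awk.
ap_is_configured() {
    awk -v ap="$1" '
        $1 == ap && $4 == "configured" { found = 1 }
        END { exit !found }
    '
}
```

Typical use would be: cfgadm -al | ap_is_configured sata1/7::dsk/c1t7d0 && echo configured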

12. You can now put the drive back under ZFS control by running the command below, substituting your drive's c#t#d# for c1t7d0:

root@th12 # zpool replace zpool1 c1t7d0 c1t7d0

13. The pool is now healthy again:

root@th12 # zpool list

NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT

zpool1                 20.8T   1.38M   20.8T     0%  ONLINE     -

Notice the resilver message on the scrub line of the zpool status output below:

root@th12 # zpool status zpool1

pool: zpool1

state: ONLINE


scrub: resilver completed with 0 errors on Tue Dec  5 14:22:46 2006

config:

NAME        STATE     READ WRITE CKSUM

zpool1      ONLINE       0     0     0

raidz     ONLINE       0     0     0

c0t0d0  ONLINE       0     0     0

c1t0d0  ONLINE       0     0     0

c4t0d0  ONLINE       0     0     0

c6t0d0  ONLINE       0     0     0

c7t0d0  ONLINE       0     0     0

raidz     ONLINE       0     0     0

c0t1d0  ONLINE       0     0     0

c1t1d0  ONLINE       0     0     0

c4t1d0  ONLINE       0     0     0

c5t1d0  ONLINE       0     0     0

c6t1d0  ONLINE       0     0     0

c7t1d0  ONLINE       0     0     0

raidz     ONLINE       0     0     0

c0t2d0  ONLINE       0     0     0

c1t2d0  ONLINE       0     0     0

c4t2d0  ONLINE       0     0     0

c5t2d0  ONLINE       0     0     0

c6t2d0  ONLINE       0     0     0

c7t2d0  ONLINE       0     0     0

raidz     ONLINE       0     0     0

c0t3d0  ONLINE       0     0     0

c1t3d0  ONLINE       0     0     0

c4t3d0  ONLINE       0     0     0

c5t3d0  ONLINE       0     0     0

c6t3d0  ONLINE       0     0     0

c7t3d0  ONLINE       0     0     0

raidz     ONLINE       0     0     0

c0t4d0  ONLINE       0     0     0

c1t4d0  ONLINE       0     0     0

c4t4d0  ONLINE       0     0     0

c6t4d0  ONLINE       0     0     0

c7t4d0  ONLINE       0     0     0

raidz     ONLINE       0     0     0

c0t5d0  ONLINE       0     0     0

c1t5d0  ONLINE       0     0     0

c4t5d0  ONLINE       0     0     0

c5t5d0  ONLINE       0     0     0

c6t5d0  ONLINE       0     0     0

c7t5d0  ONLINE       0     0     0

raidz     ONLINE       0     0     0

c0t6d0  ONLINE       0     0     0

c1t6d0  ONLINE       0     0     0

c4t6d0  ONLINE       0     0     0

c5t6d0  ONLINE       0     0     0

c6t6d0  ONLINE       0     0     0

c7t6d0  ONLINE       0     0     0


raidz     ONLINE       0     0     0

c0t7d0  ONLINE       0     0     0


c1t7d0  ONLINE       0     0     0

c4t7d0  ONLINE       0     0     0

c5t7d0  ONLINE       0     0     0

c6t7d0  ONLINE       0     0     0

c7t7d0  ONLINE       0     0     0

errors: No known data errors

root@th12 #
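
The resilver result can be checked from a script before the fault is cleared in the next step. This is a minimal sketch that matches the exact scrub line shown above; other ZFS releases may word this line differently, and the function name resilver_done is illustrative.

```shell
#!/bin/sh
# Hypothetical helper: read `zpool status` output on stdin and return success
# only if it contains the "resilver completed with 0 errors" scrub line seen
# in the sample output. Output is discarded; only the exit status is used.
resilver_done() {
    grep 'scrub: resilver completed with 0 errors' >/dev/null
}
```

Typical use would be: zpool status zpool1 | resilver_done && echo "resilver finished cleanly"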

14. Use the fmadm command to repair the status of the drive in the fault management service:

root@th12 # fmadm repair 665c1b1a-7405-6f8a-adc5-be4e32dc9232

Relief/Workaround

Product
Sun Fire X4500 Server
Solaris 10 Operating System for x86 Platforms
Solaris 10 Operating System


x4500, disk, zfs, proactive, replace, thumper, cfgadm, zpool, resilver, scrub
Previously Published As
88150

Change History
Date: 2007-05-10
User Name: 155224
Action: Update Canceled
Comment: *** Restored Published Content *** need to make sure first...
Version: 0
Product_uuid
f4bbfa5f-e6e5-11da-ac3d-080020a9ed93|Sun Fire X4500 Server
ab47c00a-f97b-11d9-89e6-080020a9ed93|Solaris 10 Operating System for x86 Platforms
5005588c-36f3-11d6-9cec-fc96f718e113|Solaris 10 Operating System

Attachments

This solution has no attachment