[OpenIndiana-discuss] Defective zpool after replacing hard discs

Thorsten Heit thorsten.heit@vkb.de
Fri Jul 19 12:46:57 UTC 2013


Hi,

After some more attempts to get this to work, I had a look at the
labels of the old 500GB discs with zdb:


First disc (offline): still doesn't work at the moment, so there is no
label output for it.


Second disc:


root@7iv05-server-1:~# zdb -l /dev/dsk/c3t2d0s0
------------------------------------------
LABEL 0
------------------------------------------
    timestamp: 1374065265 UTC: Wed Jul 17 14:47:45 2013
    version: 34
    name: 'daten'
    state: 0
    txg: 3456874
    pool_guid: 6681551211158838832
    hostid: 4822034
    hostname: '7iv05-server-1'
    top_guid: 11561997106905098192
*** guid: 4104034492690796462
    vdev_children: 1
    vdev_tree:
        type: 'raidz'
        id: 0
        guid: 11561997106905098192
        nparity: 1
        metaslab_array: 27
        metaslab_shift: 33
        ashift: 9
        asize: 1500182740992
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 6842245073227127936
            path: '/dev/dsk/c3t1d0s0'
            devid: 'id1,sd@n600508b1001032323520202020200005/a'
            phys_path: '/pci@0,0/pci8086,25e5@5/pci1166,103@0/pci103c,3211@8/sd@1,0:a'
            whole_disk: 1
            DTL: 8549
            create_txg: 4
            msgid: 'ZFS-8000-QJ'
        children[1]:
            type: 'disk'
            id: 1
***         guid: 4104034492690796462
            path: '/dev/dsk/c3t2d0s0'
            devid: 'id1,sd@n600508b1001032323520202020200006/a'
            phys_path: '/pci@0,0/pci8086,25e5@5/pci1166,103@0/pci103c,3211@8/sd@2,0:a'
            whole_disk: 1
            DTL: 541
            create_txg: 4
        children[2]:
            type: 'disk'
            id: 2
            guid: 12387530649057544248
            path: '/dev/dsk/c3t3d0s0'
            devid: 'id1,sd@n600508b1001032323520202020200007/a'
            phys_path: '/pci@0,0/pci8086,25e5@5/pci1166,103@0/pci103c,3211@8/sd@3,0:a'
            whole_disk: 1
            DTL: 540
            create_txg: 4
------------------------------------------
LABEL 1 - CONFIG MATCHES LABEL 0
------------------------------------------
------------------------------------------
LABEL 2 - CONFIG MATCHES LABEL 0
------------------------------------------
------------------------------------------
LABEL 3 - CONFIG MATCHES LABEL 0
------------------------------------------
root@7iv05-server-1:~#



Third disc:

root@7iv05-server-1:~# zdb -l /dev/dsk/c3t3d0s0
------------------------------------------
LABEL 0
------------------------------------------
    timestamp: 1374073880 UTC: Wed Jul 17 17:11:20 2013
    version: 34
    name: 'daten'
    state: 0
    txg: 3458475
    pool_guid: 6681551211158838832
    hostid: 4822034
    hostname: '7iv05-server-1'
    top_guid: 11561997106905098192
    guid: 12387530649057544248
    vdev_children: 1
    vdev_tree:
        type: 'raidz'
        id: 0
        guid: 11561997106905098192
        nparity: 1
        metaslab_array: 27
        metaslab_shift: 33
        ashift: 9
        asize: 1500182740992
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 6842245073227127936
            path: '/dev/dsk/c3t1d0s0'
            devid: 'id1,sd@n600508b1001032323520202020200005/a'
            phys_path: '/pci@0,0/pci8086,25e5@5/pci1166,103@0/pci103c,3211@8/sd@1,0:a'
            whole_disk: 1
            DTL: 8549
            create_txg: 4
            msgid: 'ZFS-8000-QJ'
        children[1]:
            type: 'disk'
            id: 1
***         guid: 18169089482844988353
            path: '/dev/dsk/c3t2d0s0'
            devid: 'id1,sd@n600508b1001032323520202020200006/a'
            phys_path: '/pci@0,0/pci8086,25e5@5/pci1166,103@0/pci103c,3211@8/sd@2,0:a'
            whole_disk: 1
            DTL: 21
            create_txg: 4
        children[2]:
            type: 'disk'
            id: 2
            guid: 12387530649057544248
            path: '/dev/dsk/c3t3d0s0'
            devid: 'id1,sd@n600508b1001032323520202020200007/a'
            phys_path: '/pci@0,0/pci8086,25e5@5/pci1166,103@0/pci103c,3211@8/sd@3,0:a'
            whole_disk: 1
            DTL: 540
            create_txg: 4
------------------------------------------
LABEL 1 - CONFIG MATCHES LABEL 0
------------------------------------------
------------------------------------------
LABEL 2 - CONFIG MATCHES LABEL 0
------------------------------------------
------------------------------------------
LABEL 3 - CONFIG MATCHES LABEL 0
------------------------------------------
root@7iv05-server-1:~#


Obviously the GUIDs of child #1 don't match. I assume that's because
disc #3's label already contains the new GUID of the replacement for
disc #2, whereas disc #2 itself still carries its original GUID...
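
For the record, the mismatch is easy to see side by side by filtering
the label dumps:

zdb -l /dev/dsk/c3t2d0s0 | grep guid
zdb -l /dev/dsk/c3t3d0s0 | grep guid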


Is there any possibility to, say, manually change this GUID somehow?
For example by reading the relevant data block from the disc, changing
a few bytes and writing it back?
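
My naive plan, assuming the standard label layout (four 256 KiB labels
per vdev: labels 0/1 at offsets 0 and 256 KiB, labels 2/3 in the last
512 KiB), would be to start by dumping a label for inspection:

# dump label 0, i.e. the first 256 KiB of the slice
dd if=/dev/rdsk/c3t2d0s0 of=/tmp/label0.bin bs=1k count=256

Though if I understand the on-disk format correctly, the config is an
XDR-encoded nvlist protected by an embedded checksum, so a plain byte
edit would probably just be rejected, and all four label copies would
have to be rewritten consistently anyway.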


Regards

Thorsten



Thorsten Heit/H5/VKB wrote on 19.07.2013 09:13:16:

> From: Thorsten Heit/H5/VKB
> To: openindiana-discuss@openindiana.org
> Date: 19.07.2013 09:13
> Subject: Defective zpool after replacing hard discs
> 
> Hi,
> 
> I hope someone of you can shed some light on a problem with a zpool
> I now have...
> 
> I have an HP ProLiant server with 3x 160GB discs configured as a
> hardware RAID5 on the built-in SmartArray E200i controller. Another
> three 500GB discs are attached to the same controller, each one as a
> (controller-side) RAID 0, and configured in the OS as a raidz1 pool.
> Recently one of the 500GB discs was marked as offline in the zpool,
> as I could see on the console with "zpool status -v", obviously
> because it had repeatedly thrown errors.
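> 
> (For reference, the pool was originally created more or less like
> this - pool and device names as used throughout this mail, options
> left at their defaults:
> 
>   zpool create daten raidz1 c3t1d0 c3t2d0 c3t3d0
> )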
> 
> To make sure that the disc was really faulty, I rebooted into a Linux
> live CD image I had at hand (Ubuntu 9.10, to be precise) and tried
> a simple "dd if=/dev/... of=/dev/null", which resulted in a read
> error after a short time. Afterwards the hardware RAID controller
> took the disc offline and marked it as faulty - a red warning light
> on the disc cage was turned on.
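> 
> (Spelled out, the read test was essentially the following - the Linux
> device name here is just an assumption, the block size merely speeds
> things up, and plain dd aborts on the first read error, which is what
> makes this a quick health check:
> 
>   dd if=/dev/sdb of=/dev/null bs=1M
> )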
> 
> Because spare parts from HP are exorbitantly expensive and come with
> only 6 months of warranty, I thought I'd better buy three new and
> bigger discs to replace the old 500GB ones, and ended up buying 3x
> WD1003FBYX from Western Digital (enterprise discs, 1TB) that come
> with a 5-year warranty.
> 
> I took the faulty disc offline ("zpool offline daten c3t1d0"),
> replaced it with the newer, bigger one, booted the server, and let
> the resilvering begin via "zpool replace daten c3t1d0". After
> some hours everything seemed to be fine and the zpool was healthy
> again. Then the same procedure with the second disc: offline,
> replacing, rebooting, resilvering, waiting => finished, the pool was
> healthy ("zpool status -v").
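> 
> (The complete sequence per disc, with the new disc going into the
> same slot and therefore keeping the same device name:
> 
>   zpool offline daten c3t1d0
>   # power down, swap the disc, boot again
>   zpool replace daten c3t1d0
>   zpool status -v daten       # watch the resilver progress
> )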
> 
> The problems began when I repeated the procedure with the last disc:
> 
> A few minutes after the resilvering had started, messages appeared on
> the console (also available in the logs under /var/adm) that made me
> feel something bad was going to happen:
> 
> NOTICE:  Smart Array E200i Controller 
>  Fatal drive error, Port: 2I Box: 1 Bay: 5 
> 
> Because it was time to leave work, I let the system continue its
> work. When I came back the next day to check the results, it was
> quite plain to me that something really bad had happened:
> "zpool status -v" listed several million checksum errors on
> c3t1d0 and c3t2d0, the two already (successfully) replaced discs.
> IIRC one of them also had lots of write errors and was taken
> offline, and zpool told me the pool was damaged :-(
> 
> Thinking a reboot might help, I booted the Linux CD again to check
> whether there really is a hardware error on the replaced discs - I
> can't imagine that a new disc (manufactured at the end of May this
> year, according to a label on the disc) that was working fine until
> a reboot is suddenly faulty - and ran the already mentioned
> dd if=/dev/... of=/dev/null. That seemed to work for c3t1d0. Strange.
> One reboot later the hardware RAID controller immediately marked that
> disc as faulty for an unknown reason and took it offline. I powered
> down the machine, removed the disc, put it on another port, but the
> result stayed the same: the controller won't use the disc anymore.
> 
> Now I have the following problem:
> Theoretically the two working 500GB discs still contain the
> original pool data, but I cannot import the pool anymore after I put
> those discs back into the server:
> 
> root@7iv05-server-1:~# zpool import
>   pool: daten 
>     id: 6681551211158838832 
>  state: UNAVAIL 
> status: One or more devices are unavailable.
> action: The pool cannot be imported due to unavailable devices or data. 
> config: 
> 
>         daten       UNAVAIL  insufficient replicas 
>           raidz1-0  UNAVAIL  insufficient replicas 
>             c3t1d0  UNAVAIL  cannot open 
>             c3t2d0  UNAVAIL  corrupted data 
>             c3t3d0  ONLINE
> 
> device details: 
> 
>         c3t1d0    UNAVAIL         cannot open 
>         status: ZFS detected errors on this device. 
>                 The device was missing. 
> 
>         c3t2d0    UNAVAIL         corrupted data 
>         status: ZFS detected errors on this device. 
>                 The device has bad label or disk contents. 
> 
> root@7iv05-server-1:~#
> 
> c3t1d0 is the defective one; the other two should still be working,
> and c3t3d0 is the one that was replaced last. But why does zpool
> report corrupted data?
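> 
> (One way to cross-check what ZFS actually finds on a disc is to dump
> its labels:
> 
>   zdb -l /dev/dsk/c3t2d0s0
> )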
> 
> Similar problems arise when I replace c3t1d0 and c3t2d0 with the new
> discs, boot, and try "zpool import". The first disc isn't activated
> by the controller (see above), I don't know why, and the cpqary3
> driver reports the following message while the system is booting:
> 
> Physical drive failure, Port: 1I Box: 1 Bay: 4 
> Failure Reason............ Not Ready - Bad Sense Code 
> 
> i.e. zpool doesn't see the disc because it is obviously offline; the
> second disc is marked as online, and the third is claimed to contain
> corrupted data. In the meantime I have tried several combinations of
> old and new discs, but after booting at most one of them is marked
> as online; the others are unavailable or otherwise offline.
> 
> Do you know a trick for reactivating the old pool, at least
> temporarily, so that I can back up its data?
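> 
> (In case it helps, the import variants I could still try - all of
> these options exist in "zpool import"; whether any of them works
> depends on how badly the labels are damaged:
> 
>   zpool import -o readonly=on daten   # import without writing anything
>   zpool import -nF daten              # dry-run of a rewind to an older txg
>   zpool import -d /dev/dsk daten      # search a specific device directory
> )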
> 
> Regards
> 
> Thorsten

