[OpenIndiana-discuss] Vibration issue - successful recovery
Sebastian Gabler
sequoiamobil at gmx.net
Tue Nov 5 16:04:48 UTC 2013
Hi,
I just wanted to share an experience I had with an issue that seems
similar to what Clement BRIZARD reported recently.
Monday morning I found our ZFS backup server hanging with a message that
a disk was gone. The machine hung indefinitely in zpool status, and
shutdown failed too. So, enter hard reset.
Next, one pool came back like this:
user@server:~$ pfexec zpool status -v daten
  pool: daten
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Nov 4 08:42:32 2013
        2.53T scanned out of 8.35T at 1.03G/s, 1h36m to go
        148G resilvered, 30.29% done
config:

        NAME          STATE     READ WRITE CKSUM
        daten         DEGRADED     0     0     0
          raidz1-0    DEGRADED     0     0     0
            c1t1d0    ONLINE       0     0     0
            c1t2d0    ONLINE       0     0     0
            c1t3d0    ONLINE       0     0     0
            c1t4d0    ONLINE       0     0     0
            c1t5d0    ONLINE       0     0     0  (resilvering)
            spare-5   DEGRADED     0     0     0
              c1t6d0  DEGRADED     0     0     0  too many errors  (resilvering)
              c3t5d0  ONLINE       0     0     0  (resilvering)
            c1t7d0    ONLINE       0     0     0
          raidz1-1    ONLINE       0     0     0
            c3t3d0    ONLINE       0     0     0
            c3t4d0    ONLINE       0     0     4
            c3t6d0    ONLINE       0     0     0
            c3t7d0    ONLINE       0     0     0
            c3t8d0    ONLINE       0     0     0
            c3t9d0    ONLINE       0     0     0
            c3t10d0   ONLINE       0     0     0
        spares
          c3t5d0      INUSE     currently in use

errors: Permanent errors have been detected in the following files:

        /daten/backup/yadayada/0F70B9DA58BBB33F.avi
        /daten/backup/server/etc/devices/devname_cache
        daten/nfs/esx4-bckp:<0x518>
        daten/nfs/esx4-bckp:<0x551>
The resilver aborted after many errors had accumulated on about 4 disks
in iostat -en, and restarted:
  pool: daten
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Nov 4 10:01:06 2013
        328G scanned out of 8.35T at 825M/s, 2h49m to go
        23.4G resilvered, 3.84% done
config:

        NAME          STATE     READ WRITE CKSUM
        daten         DEGRADED     0     0     0
          raidz1-0    DEGRADED     0     0     0
            c1t1d0    ONLINE       0     0     0
            c1t2d0    ONLINE       0     0     0
            c1t3d0    ONLINE       0     0     0
            c1t4d0    ONLINE       0     0     0
            c1t5d0    ONLINE       0     0     0  (resilvering)
            spare-5   DEGRADED     0     0     0
              c1t6d0  DEGRADED     0     0     0  too many errors  (resilvering)
              c3t5d0  ONLINE       0     0     0  (resilvering)
            c1t7d0    ONLINE       0     0     0
          raidz1-1    ONLINE       0     0     0
            c3t3d0    ONLINE       0     0     0
            c3t4d0    ONLINE       0     0     4
            c3t6d0    ONLINE       0     0     0
            c3t7d0    ONLINE       0     0     0
            c3t8d0    ONLINE       0     0     0
            c3t9d0    ONLINE       0     0     0
            c3t10d0   ONLINE       0     0     0
        spares
          c3t5d0      INUSE     currently in use

errors: No known data errors
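(For reference, the per-device error counts mentioned here come from
iostat's error summary, e.g.

user@server:~$ iostat -en 30

where the s/w, h/w and trn columns count soft, hardware and transport
errors per device; with a cabling or backplane problem it is the trn
column you would expect to see climbing.)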
This happened three more times; by then the CKSUM count was back down to
0, but the error counts in iostat kept growing, so after another machine
hang I powered the machine down, pulled and reseated all the disk trays,
and reseated the internal SAS connectors.
The resilver resumed and finished after the next reboot, and after a
zpool clear everything was back to normal. SMART on the drives that had
been throwing errors in iostat was fine. So, it was basically a
transport problem causing all the havoc.
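For the archives, the commands involved boil down to something like this
(pool and device names as in the status output above; the detach is only
needed if the spare stays INUSE after the clear, and the scrub is just
optional reassurance):

user@server:~$ pfexec zpool clear daten
user@server:~$ pfexec zpool detach daten c3t5d0
user@server:~$ pfexec zpool scrub daten
user@server:~$ pfexec zpool status -v daten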
I reckon that I am set now. Anything obvious I missed? What could I have
done better?
What I am still not 100% clear on is how 3 disks can successfully
resilver at the same time in a single-redundancy vdev.
BR
Sebastian