[OpenIndiana-discuss] Vibration issue - successful recovery

Tue Nov 5 16:04:48 UTC 2013

Hi,

I just wanted to share my experience with an issue that seems 
similar to what Clement BRIZARD reported recently.

On Monday morning I found our ZFS backup server hung, with a message that 
a disk was gone. zpool status never returned, and a clean shutdown 
failed as well, so I had to do a hard reset.

Next, one pool came back like this:

user@server:~$ pfexec zpool status -v daten
   pool: daten
  state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
         continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Nov  4 08:42:32 2013
     2.53T scanned out of 8.35T at 1.03G/s, 1h36m to go
     148G resilvered, 30.29% done
config:

         NAME          STATE     READ WRITE CKSUM
         daten         DEGRADED     0     0     0
           raidz1-0    DEGRADED     0     0     0
             c1t1d0    ONLINE       0     0     0
             c1t2d0    ONLINE       0     0     0
             c1t3d0    ONLINE       0     0     0
             c1t4d0    ONLINE       0     0     0
             c1t5d0    ONLINE       0     0     0  (resilvering)
             spare-5   DEGRADED     0     0     0
               c1t6d0  DEGRADED     0     0     0  too many errors (resilvering)
               c3t5d0  ONLINE       0     0     0  (resilvering)
             c1t7d0    ONLINE       0     0     0
           raidz1-1    ONLINE       0     0     0
             c3t3d0    ONLINE       0     0     0
             c3t4d0    ONLINE       0     0     4
             c3t6d0    ONLINE       0     0     0
             c3t7d0    ONLINE       0     0     0
             c3t8d0    ONLINE       0     0     0
             c3t9d0    ONLINE       0     0     0
             c3t10d0   ONLINE       0     0     0
         spares
           c3t5d0      INUSE     currently in use

errors: Permanent errors have been detected in the following files:

         /daten/backup/yadayada/0F70B9DA58BBB33F.avi
         /daten/backup/server/etc/devices/devname_cache
         daten/nfs/esx4-bckp:<0x518>
         daten/nfs/esx4-bckp:<0x551>

The resilver aborted after iostat -en showed many errors on about 4 disks, 
and then restarted:

  pool: daten
  state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
         continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Nov  4 10:01:06 2013
     328G scanned out of 8.35T at 825M/s, 2h49m to go
     23.4G resilvered, 3.84% done
config:

         NAME          STATE     READ WRITE CKSUM
         daten         DEGRADED     0     0     0
           raidz1-0    DEGRADED     0     0     0
             c1t1d0    ONLINE       0     0     0
             c1t2d0    ONLINE       0     0     0
             c1t3d0    ONLINE       0     0     0
             c1t4d0    ONLINE       0     0     0
             c1t5d0    ONLINE       0     0     0  (resilvering)
             spare-5   DEGRADED     0     0     0
               c1t6d0  DEGRADED     0     0     0  too many errors (resilvering)
               c3t5d0  ONLINE       0     0     0  (resilvering)
             c1t7d0    ONLINE       0     0     0
           raidz1-1    ONLINE       0     0     0
             c3t3d0    ONLINE       0     0     0
             c3t4d0    ONLINE       0     0     4
             c3t6d0    ONLINE       0     0     0
             c3t7d0    ONLINE       0     0     0
             c3t8d0    ONLINE       0     0     0
             c3t9d0    ONLINE       0     0     0
             c3t10d0   ONLINE       0     0     0
         spares
           c3t5d0      INUSE     currently in use

errors: No known data errors
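
For anyone who wants to watch for this kind of state from a script, here is a minimal sketch in Python that scans the config table of a zpool status listing and reports every row that is not clean (state other than ONLINE, a nonzero READ/WRITE/CKSUM count, or a trailing note such as "(resilvering)"). The function name and the regular expression are my own and not part of any ZFS tooling. The sample is an excerpt of the listing above, and the parser assumes each device row is on a single line.

```python
import re

# One device row of the config table: NAME STATE READ WRITE CKSUM [note].
# Counts may be abbreviated by zpool (e.g. "1.2K"), so they are kept as strings.
COUNT = r"(\d[\d.]*[KMGTP]?)"
ROW = re.compile(r"^\s+(\S+)\s+([A-Z]+)\s+" + COUNT + r"\s+" + COUNT + r"\s+" + COUNT + r"\s*(.*)$")

def suspicious_devices(status_text):
    """Return (name, state, read, write, cksum, note) for every row that is not clean."""
    hits = []
    for line in status_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header, spares, scan and status lines do not match
        name, state, rd, wr, ck, note = m.groups()
        note = note.strip()
        if state != "ONLINE" or (rd, wr, ck) != ("0", "0", "0") or note:
            hits.append((name, state, rd, wr, ck, note))
    return hits

# Excerpt of the status output shown above.
sample = """\
         NAME          STATE     READ WRITE CKSUM
         daten         DEGRADED     0     0     0
           raidz1-0    DEGRADED     0     0     0
             c1t5d0    ONLINE       0     0     0  (resilvering)
             spare-5   DEGRADED     0     0     0
               c1t6d0  DEGRADED     0     0     0  too many errors (resilvering)
               c3t5d0  ONLINE       0     0     0  (resilvering)
           raidz1-1    ONLINE       0     0     0
             c3t4d0    ONLINE       0     0     4
             c3t6d0    ONLINE       0     0     0
"""

for row in suspicious_devices(sample):
    print(row)
```

In practice the text would come from running zpool status for the pool and capturing its output; here it is pasted in so the sketch runs anywhere.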

This happened three more times. Meanwhile the CKSUM count had dropped to 0, 
but the error counts in iostat kept growing, so after another machine 
hang I powered down the machine, pulled all disk trays, and 
reseated the internal SAS connectors.

The resilver resumed at the next reboot and finished, and after a zpool 
clear everything was back to normal. SMART data on the drives that had 
been throwing errors in iostat was fine. So it was basically a transport 
problem causing all the havoc.
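
To make the "transport problem" reasoning a bit more concrete, here is a rough Python sketch that picks out devices whose transport error counter is nonzero while the hard error counter is still zero, which is the pattern that made me suspect cabling rather than the disks. It assumes the usual iostat -en column layout (s/w, h/w, trn, tot, device). The function name is made up, and the numbers in the sample are invented for illustration; they are not the values from my server.

```python
def transport_suspects(iostat_text):
    """Return (device, trn) pairs where transport errors > 0 and hard errors == 0."""
    suspects = []
    for line in iostat_text.splitlines():
        fields = line.split()
        # Data rows have exactly four numeric counters followed by the device name.
        if len(fields) != 5 or not all(f.isdigit() for f in fields[:4]):
            continue
        soft, hard, trn, total = (int(f) for f in fields[:4])
        if trn > 0 and hard == 0:
            suspects.append((fields[4], trn))
    return suspects

# Invented sample in the assumed `iostat -en` layout (not real measurements).
sample = """\
  ---- errors ---
  s/w h/w trn tot device
    0   0   0   0 c1t1d0
    0   0  37  37 c1t5d0
    0   0  52  52 c1t6d0
    0   3   1   4 c3t4d0
"""

print(transport_suspects(sample))
```

This is only a heuristic: a drive can fail without ever logging a hard error, so checking SMART on the flagged drives is still worthwhile.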

I reckon that I am set. Did I miss anything obvious? What could I have done 
better?
What I am still not 100% clear on is how 3 disks can resilver successfully 
in a single-redundancy (raidz1) vdev.

BR

Sebastian