[OpenIndiana-discuss] Resilver restarting on second dead drive?

Thu Feb 9 23:21:17 UTC 2012

On Thu, Feb 9, 2012 at 2:13 PM, Roy Sigurd Karlsbakk <roy at karlsbakk.net> wrote:
>> I don't quite understand what happened in your specific case. Let's
>> say you had a setup:
>> raidz2 c1d0 c1d1 c1d2 c1d3 spare c1d4 c1d5
>>
>> Let's say c1d3 failed. Resilver started and d4 took d3's place -
>> you now have a non-degraded raidz2. You then physically swapped out d3
>> for a new drive and did "zpool replace". Until the replace command
>> completes, you still have the fully-functioning zpool of c1d0 c1d1
>> c1d2 c1d4. When another drive, eg. c1d2, fails, I would hope the
>> replace command is cancelled (it's cosmetic - d4 is doing fine instead
>> of d3) and instead the array is resilvered with c1d5 in place of c1d2.
>>
>> Is this what happened (other than the specific disk numbers)?
>
> What happened was this:
>
> Server Urd has four RAIDz2 VDEVs, somewhat non-optimally balanced (because of a few factors, lack of time being the dominant one), so the largest has 12 drives (the others have 7). In this VDEV, c14t19d0 died, and the common spare, c9t7d0, stepped in. I replaced c14t19d0 (zpool offline, cfgadm -c unconfigure ... zpool replace dpool c14t19d0 c14t19d0, zpool detach dpool c9t7d0). So, all was ok, and the resilver was almost done when c14t12d0 died and c9t7d0 took over once more. Now the resilver has been restarted and is still running (with high load on the pool as well).

This is a side comment: you should only have run the "zpool detach
dpool c9t7d0" *after* the pool was done resilvering back onto the new
c14t19d0.
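
As a rough sketch of that ordering (pool and device names are taken
from your message; the function name, the polling loop and the
60-second interval are my own assumptions, and the grep assumes that
"zpool status" prints "resilver in progress" while a resilver runs):

```shell
# Sketch only: replace the failed disk, wait for the resilver to
# finish, and only then detach the hot spare.
# Wrapped in a function so nothing runs until it is called explicitly.
replace_then_detach() {
    pool=$1; disk=$2; spare=$3

    # Resilver onto the new disk sitting in the old disk's slot.
    zpool replace "$pool" "$disk" "$disk" || return 1

    # Assumption: status output contains "resilver in progress"
    # for as long as the resilver is running.
    while zpool status "$pool" | grep -q "resilver in progress"; do
        sleep 60
    done

    # Only now give the hot spare back.
    zpool detach "$pool" "$spare"
}

# Example call (not executed here):
#   replace_then_detach dpool c14t19d0 c9t7d0
```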

> Now, I can somewhat see the argument for resilvering more drives in parallel to save time if the drives fail at the same time, but how often do they really do that? Mostly, a drive will fail rather out of sync with the others. This leads me to think it would be better to let the pool finish resilvering the first failed device and then go on with the second, or perhaps to allow a manual override somewhere.
>
> What are your thoughts?

I agree there is a tradeoff between letting a resilver finish and
attempting to replace the newly-failed drive asap. I would probably
set the threshold at 50% - if a current resilver is >= 50% complete,
let it finish (if possible) before working on the next drive.
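
To make that rule concrete, here is a minimal sketch. resilver_policy
is a name I made up, nothing like it exists in ZFS as far as I know,
and the default of 50 is just the threshold I suggested above:

```shell
# Hypothetical policy helper, not something ZFS implements.
# $1 = percent complete of the running resilver (integer 0-100)
# $2 = threshold in percent (defaults to 50)
resilver_policy() {
    done_pct=$1
    threshold=${2:-50}
    if [ "$done_pct" -ge "$threshold" ]; then
        echo "finish-current"     # let the running resilver complete first
    else
        echo "restart-with-new"   # restart and include the newly failed drive
    fi
}

resilver_policy 80   # prints finish-current
resilver_policy 20   # prints restart-with-new
```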

I think you will get a better explanation on the (still active)
"zfs-discuss at opensolaris.org" mailing list. Someone on that list might
explain the design decision (or whether it was simply an arbitrary choice).

Jan