[OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage

Timothy Coalson tsc5yc at mst.edu
Thu Apr 18 21:03:32 UTC 2013


On Thu, Apr 18, 2013 at 10:24 AM, Sebastian Gabler <sequoiamobil at gmx.net> wrote:

> On 18.04.2013 16:28, openindiana-discuss-request at openindiana.org wrote:
>
>> Date: Thu, 18 Apr 2013 12:17:47 +0000
>> From: "Edward Ned Harvey (openindiana)" <openindiana at nedharvey.com>
>> To: Discussion list for OpenIndiana <openindiana-discuss at openindiana.org>
>> Subject: Re: [OpenIndiana-discuss] Recommendations for fast storage
>>
>> From: Timothy Coalson [mailto:tsc5yc at mst.edu]
>>
>>> Did you also compare the probability of bit errors causing data loss
>>> without a complete pool failure?  2-way mirrors, when one device
>>> completely dies, have no redundancy on that data, and the copy that
>>> remains must be perfect or some data will be lost.
>>
>> I had to think about this comment for a little while to understand what
>> you were saying, but I think I got it.  I'm going to rephrase your question:
>>
>> If one device in a 2-way mirror becomes unavailable, then the remaining
>> device has no redundancy.  So if a bit error is encountered on the (now
>> non-redundant) device, then it's an uncorrectable error.  Question is, did
>> I calculate that probability?
>>
>> Answer is, I think so.  Modelling the probability of drive failure
>> (either complete failure or data loss) is very complex and non-linear.
>> It is also dependent on the specific model of drive in question, and the
>> graphs are typically not available.  So what I did was to start with
>> some MTBDL graphs that I assumed to be typical, and then assume every
>> data-loss event meant complete drive failure.
>>
> The thing is... bit errors can lead to corruption of files, or even to the
> loss of a whole pool, without an additional faulted drive, because bit
> errors do not necessarily lead to a drive error.  The risk of a rebuild
> failing is proportional to the BER of the drives involved, and it scales
> with the amount of data moved, given that you have no further redundancy
> left.  I agree with previous suggestions that scrubbing offers some degree
> of protection against that issue.  It does not do away with the risk of
> bit errors in a situation where all redundancy has been stripped for some
> reason.  For this aspect, a second level of redundancy offers a clear
> benefit.
> AFAIU, that was the valid point of the poster raising the controversy
> about the resilience of a single vdev with multiple redundancy vs.
> multiple vdevs with single redundancy.
> As far as scrubbing is concerned, it is true that it will reduce the risk
> of a bit error surfacing precisely during a rebuild.  However, in cases
> where you deliberately remove redundancy, e.g. when swapping drives for
> larger ones, you will want to have a valid backup, and thus you will not
> have too much WORN data.  In either case it is user-driven: scrub by
> itself is not pro-active, but it gives the user a tool to be proactive
> about WORN data, which are indeed the data primarily prone to bit rot.
>

Yes, that was my point: bit errors when there is no remaining redundancy
are unrecoverable.  Thus, as long as it is likely that only 1 disk per vdev
fails at a time, raid-z2 will survive these bit errors fine, while 2-way
mirrors will lose data.  One of Elling's posts cited by another poster did
take that into account, but the graphs don't load due to URL changes (and
they seem to have the wrong MIME type at the fixed URL; I ended up using
wget):

https://blogs.oracle.com/relling/entry/a_story_of_two_mttdl

He linearized the bit error rate part, also.  Beware the different scales
on the graphs - at any rate, his calculation arrived at roughly 3 orders of
magnitude difference, with raid-z2 better than 2-way mirrors for MTTDL at
equivalent usable space (if I'm reading those correctly).
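
For a back-of-the-envelope feel for those numbers (this is not Elling's
model; the 1e6 hour MTBF, 10 hour resilver, 1e-15 bit error rate, and 1 TB
per disk below are assumptions I picked purely for illustration), a short
Python sketch in the style of the usual MTTDL approximations looks like
this.  It also shows Sebastian's point that the risk of a failed rebuild
scales with the amount of data that has to be read:

    # Rough MTTDL sketch (hours), per vdev.  Every number is an
    # illustrative assumption, not a measurement: 1 TB of data per disk,
    # 1e6 hour drive MTBF, 10 hour resilver, 1e-15 errors per bit read.

    MTBF = 1.0e6        # mean time between complete drive failures, hours
    MTTR = 10.0         # time to resilver after a failure, hours
    BER  = 1.0e-15      # unrecoverable bit error rate, per bit read
    BITS = 1.0e12 * 8   # bits read from each surviving disk during resilver

    # Chance that one surviving disk returns at least one unrecoverable
    # bit while being read end-to-end for a resilver.
    p_ure = 1.0 - (1.0 - BER) ** BITS

    # 2-way mirror: after the first failure there is no redundancy, so
    # data is lost if the other disk dies during the resilver window *or*
    # it returns a single bad bit.
    mttdl_mirror = MTBF / (2.0 * (MTTR / MTBF + p_ure))

    # 6-disk raid-z2: a bit error (or second failure) while single-degraded
    # is still recoverable; loss needs a second whole-disk failure and then
    # either a third failure or a bit error on one of the n-2 survivors.
    n = 6
    p_second = (n - 1) * MTTR / MTBF
    p_after_second = (n - 2) * MTTR / MTBF + (n - 2) * p_ure
    mttdl_z2 = MTBF / (n * p_second * p_after_second)

    print("2-way mirror vdev: MTTDL ~ %.2g hours" % mttdl_mirror)
    print("6-disk raid-z2   : MTTDL ~ %.2g hours" % mttdl_z2)

With those made-up inputs the mirror vdev comes out around 6e7 hours and
the raid-z2 vdev around 1e11, a ratio in the same ballpark as his 3 orders
of magnitude; note that a whole pool of several mirror vdevs would divide
the mirror figure further by the vdev count.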

Scrubbing introduces a question about the meaning of bit error rate - how
different is the uncorrectable bit error rate on newly written data, versus
the bit error rate on data that has been read back in successfully
(additional qualifier: multiple times)?  Regular scrubs can change the
MTTDL dramatically if these probabilities are significantly different,
because the first probability only applies to data written since the most
recent scrub, which can drop a few orders of magnitude from the calculation.
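
To put a number on that, here is a tiny sketch under completely made-up
assumptions (whether already-verified data really fails at a lower rate is
exactly the open question above; the 1e-17 figure and the 5% "written
since last scrub" fraction are inventions for illustration):

    # Effect of a recent scrub on the chance that the surviving mirror
    # disk hits an unrecoverable bit during a resilver.  Both error rates
    # and the 5% fresh-data fraction are made-up illustrative numbers.

    BITS_TOTAL   = 1.0e12 * 8   # bits on the surviving disk
    FRESH_FRAC   = 0.05         # fraction written since the last scrub
    BER_FRESH    = 1.0e-15      # per-bit error rate, never-verified data
    BER_SCRUBBED = 1.0e-17      # assumed per-bit rate, already-verified data

    def p_error(bits, ber):
        """Probability of at least one unrecoverable bit in `bits` reads."""
        return 1.0 - (1.0 - ber) ** bits

    p_no_scrub   = p_error(BITS_TOTAL, BER_FRESH)
    p_with_scrub = 1.0 - ((1.0 - p_error(BITS_TOTAL * FRESH_FRAC, BER_FRESH))
                          * (1.0 - p_error(BITS_TOTAL * (1.0 - FRESH_FRAC),
                                           BER_SCRUBBED)))

    print("never scrubbed: %.1g" % p_no_scrub)      # prints ~0.008
    print("recent scrub  : %.1g" % p_with_scrub)    # prints ~0.0005

The bigger the gap between the two rates, and the smaller the fraction of
data written since the last scrub, the more orders of magnitude this takes
out of the resilver-failure term.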

As for what I said about resilver speed, I had not accounted for the fact
that data reads on a raid-z2 component device would be significantly
shorter than for the same data on 2-way mirrors.  Unless you are using
enormous block sizes, or your data is allocated so linearly that
scrub/resilver reads it sequentially, seek times could become the limiting
factor on platter drives and make raid-z2 take much longer to resilver.  I
fear I was thinking of raid-z2 in terms of raid6.
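
A crude example of that seek-bound case, assuming 128 KiB records, an 8 ms
average seek, 150 MB/s sequential throughput, 1 TB of user data per disk of
usable space, and a fully random read order (all of these are illustrative
guesses; a largely sequential layout would look very different):

    # Why seeks, not bandwidth, can dominate a resilver on platter drives.

    RECORD = 128 * 1024   # bytes per ZFS record (assumed)
    SEEK   = 0.008        # seconds per seek (assumed)
    RATE   = 150e6        # bytes/second sequential (assumed)
    USABLE = 1.0e12       # user data per disk of usable space, bytes

    # 2-way mirror vdev: one disk of usable data; the surviving disk
    # reads each record whole.
    mirror_records = USABLE / RECORD
    mirror_hours = mirror_records * (SEEK + RECORD / RATE) / 3600.0

    # 6-disk raid-z2 vdev: four disks of usable data; every surviving
    # disk pays a seek for every record but reads only ~1/4 of it.
    z2_records = 4 * USABLE / RECORD
    z2_hours = z2_records * (SEEK + (RECORD / 4) / RATE) / 3600.0

    print("mirror  resilver ~ %.0f hours" % mirror_hours)   # ~19
    print("raid-z2 resilver ~ %.0f hours" % z2_hours)       # ~70

The shorter per-disk reads save almost nothing once each of them costs a
full seek, and the raid-z2 vdev has to walk roughly four times as many
records for the same usable space, so its resilver takes several times
longer under these assumptions.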

Tim

