[OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage

Richard Elling richard.elling at richardelling.com
Fri Apr 19 23:05:55 UTC 2013


[catching up... comment below]

On Apr 18, 2013, at 2:03 PM, Timothy Coalson <tsc5yc at mst.edu> wrote:

> On Thu, Apr 18, 2013 at 10:24 AM, Sebastian Gabler <sequoiamobil at gmx.net> wrote:
> 
>> On 18.04.2013 16:28, openindiana-discuss-request at openindiana.org wrote:
>> 
>>> Message: 1
>>> Date: Thu, 18 Apr 2013 12:17:47 +0000
>>> From: "Edward Ned Harvey (openindiana)" <openindiana at nedharvey.com>
>>> To: Discussion list for OpenIndiana
>>>        <openindiana-discuss at openindiana.org>
>>> Subject: Re: [OpenIndiana-discuss] Recommendations for fast storage
>>> Content-Type: text/plain; charset="us-ascii"
>>> 
>>>> From: Timothy Coalson [mailto:tsc5yc at mst.edu]
>>>> 
>>>>> 
>>>>> Did you also compare the probability of bit errors causing data loss
>>>>> without a complete pool failure?  2-way mirrors, when one device
>>>>> completely
>>>>> dies, have no redundancy on that data, and the copy that remains must be
>>>>> perfect or some data will be lost.
>>>> 
>>> I had to think about this comment for a little while to understand what
>>> you were saying, but I think I got it.  I'm going to rephrase your question:
>>> 
>>> If one device in a 2-way mirror becomes unavailable, then the remaining
>>> device has no redundancy.  So if a bit error is encountered on the (now
>>> non-redundant) device, then it's an uncorrectable error.  Question is, did
>>> I calculate that probability?
>>> 
>>> The answer is, I think so.  Modelling the probability of drive failure
>>> (either complete failure or data loss) is very complex and non-linear.
>>> It is also dependent on the specific model of drive in question, and the
>>> graphs are typically not available.  So what I did was to start with some
>>> MTBDL graphs that I assumed to be typical, and then assume every
>>> data-loss event meant complete drive failure.
>>> 
>> The thing is... bit errors can lead to corruption of files, or even to the
>> loss of a whole pool, without an additional faulted drive, because bit
>> errors do not necessarily lead to a drive error. The risk of a rebuild
>> failing is proportional to the BER of the drives involved, and it scales
>> with the amount of data moved, given that you don't have further
>> redundancy left. I agree with previous suggestions that scrubbing offers
>> some degree of protection against that issue. It does not remove the risk
>> of bit errors in a situation where all redundancy has been stripped for
>> some reason. For this aspect, a second level of redundancy offers a clear
>> benefit.
>> AFAIU, that was the valid point of the poster raising the controversy
>> about resilience of a single vdev with multiple redundancy vs. multiple
>> vdevs with single redundancy.
>> As far as scrubbing is concerned, it is true that it will reduce the risk
>> of a bit error surfacing precisely during a rebuild. However, in cases
>> where you deliberately pull redundancy, e.g. for swapping drives with
>> larger ones, you will want to have a valid backup, and thus you will not
>> have too much WORN (write once, read never) data.  In either case, it is
>> user-driven: scrub is not proactive by itself, but it gives the user a
>> tool to be proactive about WORN data, which is indeed the data primarily
>> prone to bit rot.
>> 
> 
> Yes, that was my point: bit errors when there is no remaining redundancy
> are unrecoverable.  Thus, as long as it is likely that only 1 disk per vdev
> fails at a time, raid-z2 will survive these bit errors fine, while 2-way
> mirrors will lose data.  One of Elling's posts cited by another poster did
> take that into account, but the graphs don't load due to URL changes (and
> they seem to have the wrong MIME type with the fixed URL; I ended up using
> wget):
> 
> https://blogs.oracle.com/relling/entry/a_story_of_two_mttdl
> 
> He linearized the bit error rate part, also.  Beware the different scales on
> the graphs - at any rate, his calculation arrived at 3 orders of magnitude
> difference, with raid-z2 better than 2-way mirrors for MTTDL at
> equivalent usable space (if I'm reading those correctly).

Yes. A 2-way mirror is a single-parity protection scheme and raidz2 is a 
double-parity protection scheme.
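
If anyone wants to reproduce the flavor of that comparison, here is a minimal
back-of-the-envelope sketch in Python using the textbook single- and
double-parity MTTDL approximations (no BER term). The MTBF, MTTR, and pool
layouts are illustrative assumptions, not measured values:

# Back-of-the-envelope MTTDL sketch (Python).  Textbook approximations:
#   single parity: MTBF^2 / (N * (G-1) * MTTR)
#   double parity: MTBF^3 / (N * (G-1) * (G-2) * MTTR^2)
# where N = disks in the pool, G = disks per vdev.  Numbers are assumptions.

MTBF = 1.2e6   # hours, assumed enterprise-class drive
MTTR = 168.0   # hours, assumed 1-week resilver

def mttdl_single_parity(n_disks, set_size):
    # tolerates one failure per vdev (2-way mirror, raidz1)
    return MTBF**2 / (n_disks * (set_size - 1) * MTTR)

def mttdl_double_parity(n_disks, set_size):
    # tolerates two failures per vdev (raidz2)
    return MTBF**3 / (n_disks * (set_size - 1) * (set_size - 2) * MTTR**2)

# Hypothetical layouts with ~12 disks of usable space each:
mirrors = mttdl_single_parity(24, 2)   # 12 x 2-way mirror
raidz2  = mttdl_double_parity(16, 8)   # 2 x 8-disk raidz2 (6 data + 2 parity)
print("mirror pool MTTDL: %.2e hours" % mirrors)
print("raidz2 pool MTTDL: %.2e hours" % raidz2)
print("ratio: %.0fx in favor of raidz2" % (raidz2 / mirrors))

Even this crude model, which ignores bit errors entirely, puts the
double-parity layout a couple of orders of magnitude ahead at equivalent
usable space.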

> 
> Scrubbing introduces a question about the meaning of bit error rate - how
> different is the uncorrectable bit error rate on newly written data, versus
> the bit error rate on data that has been read back in successfully
> (additional qualifier: multiple times)?  Regular scrubs can change the
> MTTDL dramatically if these probabilities are significantly different,
> because the first probability only applies to data written since the most
> recent scrub, which can drop a few orders of magnitude from the calculation.

Pragmatically, it doesn't matter, because the drive vendors do not publish the
information. So you'd have to measure it yourself, which is not an easy task
even if you have a sufficiently large population to get statistically significant
results. Drive models are sufficiently different that statistics from one are not
likely to apply to another, so the problem becomes impractical for a typical site
to manage. Not by coincidence, I'm developing a solution for this :-). Contact me
offline if your large site is interested.
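
If you do want to take a crack at it yourself, the crude estimate is just
checksum errors divided by bits read across your population. A sketch of that
arithmetic, with placeholder numbers you would replace with your own scrub
records:

# Crude per-site UBER estimate from scrub history (Python sketch).
# The sample numbers are placeholders; substitute your own fleet's data.

bytes_scrubbed  = 250e12   # total bytes read by scrubs over the period
checksum_errors = 3        # unrecoverable/checksum errors seen in that period

bits_read = bytes_scrubbed * 8
uber_estimate = checksum_errors / bits_read
print("observed UBER ~= %.1e errors per bit read" % uber_estimate)

# With so few error events the confidence interval is enormous -- which is
# the point: one site rarely reads enough bits to pin this down per model.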

> As for what I said about resilver speed, I had not accounted for the fact
> that data reads on a raid-z2 component device would be significantly
> shorter than for the same data on 2-way mirrors.  Depending on whether you
> are using enormous block sizes, or whether your data is allocated extremely
> linearly in the way scrub/resilver reads it, this could be the limiting
> factor on platter drives due to seek times, and make raid-z2 take much
> longer to resilver.  I fear I was thinking of raid-z2 in terms of raid6.

In general, the time-based reliability of modern drives is high enough that the
BER dominates the second failure probability during resilver. For a 1-week 
resilver time the probability that an enterprise-grade disk will fail completely 
(based on MTBF) is something like 0.01%. 
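
To put rough numbers on that (assuming a 1.2M-hour MTBF, a 1e-15 UBER spec,
and a 4 TB drive read end to end during the resilver -- spec-sheet
assumptions, not measurements):

# Second-failure odds during a 1-week resilver (Python sketch).
import math

MTBF_HOURS     = 1.2e6
RESILVER_HOURS = 168.0                 # one week

# whole-drive failure of the surviving disk during the resilver window
p_whole_drive = 1 - math.exp(-RESILVER_HOURS / MTBF_HOURS)

# at least one unrecoverable read while reading the surviving disk end to end
UBER      = 1e-15                      # unrecoverable errors per bit read
DISK_BITS = 4e12 * 8                   # 4 TB drive
p_read_error = 1 - math.exp(-UBER * DISK_BITS)

print("P(whole-drive failure during resilver): %.3f%%" % (100 * p_whole_drive))
print("P(unrecoverable read during resilver):  %.1f%%" % (100 * p_read_error))

That works out to roughly 0.01% for a second whole-drive failure versus a few
percent for an unrecoverable read, which is why the BER term dominates.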
 -- richard

--

Richard.Elling at RichardElling.com
+1-760-896-4422




