[OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage

Mon Apr 22 09:08:39 UTC 2013

On 21.04.2013 06:35, openindiana-discuss-request at openindiana.org wrote:
> ------------------------------
>
> Message: 3
> Date: Sat, 20 Apr 2013 21:13:09 -0700
> From: Richard Elling<richard.elling at richardelling.com>
> To: Discussion list for OpenIndiana
> 	<openindiana-discuss at openindiana.org>
> Subject: Re: [OpenIndiana-discuss] vdev reliability was:
> 	Recommendations for	fast storage
> Message-ID:<0B43E9EA-10FD-41AF-81EF-31644FF4913F at RichardElling.com>
> Content-Type: text/plain; charset=windows-1252
>
> Terminology warning below...
>
> On Apr 18, 2013, at 3:46 AM, Sebastian Gabler<sequoiamobil at gmx.net>  wrote:
>
>> >On 18.04.2013 03:09, openindiana-discuss-request at openindiana.org wrote:
>>> >>Message: 1
>>> >>Date: Wed, 17 Apr 2013 13:21:08 -0600
>>> >>From: Jan Owoc<jsowoc at gmail.com>
>>> >>To: Discussion list for OpenIndiana
>>> >>	<openindiana-discuss at openindiana.org>
>>> >>Subject: Re: [OpenIndiana-discuss] Recommendations for fast storage
>>> >>Message-ID:
>>> >>	<CADCwuEYC14mT5aGKeZ7Pda64H014T07GgtojkPQ5JS4S279X6A at mail.gmail.com>
>>> >>Content-Type: text/plain; charset=UTF-8
>>> >>
>>> >>On Wed, Apr 17, 2013 at 12:57 PM, Timothy Coalson<tsc5yc at mst.edu>   wrote:
>>>>> >>> >On Wed, Apr 17, 2013 at 7:38 AM, Edward Ned Harvey (openindiana) <
>>>>> >>> >openindiana at nedharvey.com> wrote:
>>>>> >>> >
>>>>>>> >>>> >>You also said the raidz2 will offer more protection against failure,
>>>>>>> >>>> >>because you can survive any two disk failures (but no more.)  I would argue
>>>>>>> >>>> >>this is incorrect (I've done the probability analysis before).  Mostly
>>>>>>> >>>> >>because the resilver time in the mirror configuration is 8x to 16x faster
>>>>>>> >>>> >>(there's 1/8 as much data to resilver, and IOPS is limited by a single
>>>>>>> >>>> >>disk, not the "worst" of several disks, which introduces another factor up
>>>>>>> >>>> >>to 2x, increasing the 8x as high as 16x), so the smaller resilver window
>>>>>>> >>>> >>means lower probability of "concurrent" failures on the critical vdev.
>>>>>>> >>>> >>  We're talking about 12 hours versus 1 week, actual result of my machines
>>>>>>> >>>> >>in production.
>>>>>>> >>>> >>
>>>>> >>> >
>>>>> >>> >Did you also compare the probability of bit errors causing data loss
>>>>> >>> >without a complete pool failure?  2-way mirrors, when one device completely
>>>>> >>> >dies, have no redundancy on that data, and the copy that remains must be
>>>>> >>> >perfect or some data will be lost.  On the other hand, raid-z2 will still
>>>>> >>> >have available redundancy, allowing every single block to have a bad read
>>>>> >>> >on any single component disk, without losing data.  I haven't done the math
>>>>> >>> >on this, but I seem to recall some papers claiming that this is the more
>>>>> >>> >likely route to lost data on modern disks, by comparing bit error rate and
>>>>> >>> >capacity.  Of course, a second outright failure puts raid-z2 in a much
>>>>> >>> >worse boat than 2-way mirrors, which is a reason for raid-z3, but this may
>>>>> >>> >already be a less likely case.
>>> >>Richard Elling wrote a blog post about "mean time to data loss" [1]. A
>>> >>few years later he graphed out a few cases for typical values of
>>> >>resilver times [2].
>>> >>
>>> >>[1]https://blogs.oracle.com/relling/entry/a_story_of_two_mttdl
>>> >>[2]http://blog.richardelling.com/2010/02/zfs-data-protection-comparison.html
>>> >>
>>> >>Cheers,
>>> >>Jan
>> >
>> >Notably, Richard's models posted do not include BER. Nevertheless it's an important factor.

> [..] /snip
>
>> > From the back of my mind it will impact reliability in different ways in ZFS:
>> >
>> >- Bit error in metadata (zfs should save us by metadata redundancy)
>> >- Bit error in full stripe data
>> >- Bit error in parity data
> These aren't interesting from a system design perspective. To enhance the model to deal
> with this, we just need to determine what percentage of the overall space contains copied
> data. There is no general answer, but for most systems it will be a small percentage of the
> total, as compared to data. In this respect, the models are worst-case, which is what we want
> to use for design evaluations.
As others have already pointed out, disk-based read errors come into 
focus when you read from an array/vdev that has no redundancy left. 
Indeed, there is no difference between parity and stripe data. There 
is, however, a difference for metadata when it is stored redundantly, 
even in a non-redundant vdev layout. That is the case in ZFS.
>
> NB, traditional RAID systems don't know what is data and what is not data, so they could
> run into uncorrectable errors that are not actually containing data. This becomes more
> important for those systems which use a destructive scrub, as opposed to ZFS's readonly
> scrub. Hence, some studies have shown where scrubbing can propagate errors in non-ZFS
> RAID arrays.
AFAIK, traditional RAID constructs may have two additional issues 
compared to ZFS.
1. As you mention, usually the whole stripe set needs to be rebuilt, 
whereas a resilver only rebuilds active data.
2. The error may or may not be detected by the controller. In ZFS, even 
if a read error passes silently down the whole food chain, there are 
still block-based checksums separating the chaff from the wheat.
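
A minimal sketch of point 2, in Python. It assumes SHA-256 as the checksum (ZFS supports several algorithms) and uses a hypothetical in-memory `BlockStore` that is not part of any ZFS tooling. It only illustrates the idea that the checksum is kept apart from the data it covers, so a read that silently returns wrong bytes is still caught:

```python
import hashlib


class ChecksumError(Exception):
    """Raised when a block's contents do not match its stored checksum."""


class BlockStore:
    """Hypothetical toy store: data and checksums are kept separately."""

    def __init__(self):
        self._blocks = {}     # block id -> raw bytes (the "disk")
        self._checksums = {}  # block id -> checksum held by the "parent"

    def write(self, block_id, data):
        self._blocks[block_id] = bytes(data)
        self._checksums[block_id] = hashlib.sha256(data).digest()

    def corrupt(self, block_id, offset):
        """Flip one bit silently, as a misbehaving disk or link might."""
        raw = bytearray(self._blocks[block_id])
        raw[offset] ^= 0x01
        self._blocks[block_id] = bytes(raw)

    def read(self, block_id):
        data = self._blocks[block_id]
        # Verify on every read; no error report from the device is needed.
        if hashlib.sha256(data).digest() != self._checksums[block_id]:
            raise ChecksumError(f"block {block_id} failed verification")
        return data
```

Detection is all this sketch shows. Whether the block can then be repaired depends on whether a redundant copy is still available.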
>
>> >
>> >AFAIK, a bit error in parity or stripe data can be especially dangerous when it surfaces during resilvering and there is only one layer of redundancy left. OTOH, BER issues scale with vdev size, not with rebuild time. So I think that Tim actually made a valid point about a systematic weakness of 2-way mirrors or raidz1 in vdevs that are large in comparison to the BER rating of their member drives. Consumer drives have a BER of 1:10^14..10^15; enterprise drives start at 1:10^16.
>> >I do not think that zfs will have better resilience against rot of parity data than conventional RAID. At best, block level checksums can help raise an error, so you know at least that something went wrong. But recovery of the data will probably not be possible. So, in my opinion BER is an issue under ZFS as anywhere else.
> Yep, which is why my MTTDL model 2 explicitly (MTTDL[2]) considers this case;-)
I personally have had only a single zpool fail on resilver in 4 years 
of using ZFS. The underlying issue was WD Green drives not being 
compatible with the LSI 1068. (Simplified, that is a kind of BER 
scenario covering disks, link, firmware, etc., because what the user 
sees in zpool status is that the data does not read correctly.) 
However, I cannot give a proper account of what the error scenario 
actually was, because I acted too chaotically myself while trying to 
get the pool back. In the end, I switched to Hitachi drives and 
restored the data from a fresh backup. If I remember correctly, I saw 
both scenarios, individual file corruption and loss of the complete 
pool, across the several resilvering attempts. Now, what should the 
behaviour be when a single block (or a small number of them) comes 
back unreadable? Will I lose the complete pool, or will the damage be 
limited to the affected blocks, or rather the files referencing these 
blocks? (Given that all I will use is zpool.)
>
>> >
>> >Best,
>> >
>> >Sebastian
>> >PS: It occurred to me that WD doesn't publish BER data for some of their drives (at least for all those I searched for while writing this). Does anybody happen to have the full specs for WD drives?
> The trend seems to be that BER data is not shown for laptop drives, which is a large part of
> the HDD market. Presumably, this is because the load/unload failure mode dominates in
> this use case as the drives are not continuously spinning. It is a good idea to use components
> in the environment for which they are designed, so I'm pretty sure you'd never consider using
> a laptop drive for a storage array.
I was looking at RE-4 drives. The issue was probably the BER/UER mixup. 
They have it in the data sheet as "Non-recoverable read errors per bits 
read", which is pretty obvious now. The spec is 1:10^15, for the record.
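
To put such a spec into perspective, here is a rough Python sketch of the chance of hitting at least one non-recoverable read error while reading a given amount of data. It assumes independent errors at a constant per-bit rate, which is a simplification. The function name `p_unrecoverable_read` and the 2 TB capacity are my own assumptions for illustration and are not taken from any data sheet:

```python
import math


def p_unrecoverable_read(bytes_read, errors_per_bit):
    """Probability of at least one unrecoverable read error.

    Assumes independent errors at a constant per-bit rate, which is a
    simplification of how real drives fail.
    """
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, computed via log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(bits * math.log1p(-errors_per_bit))


if __name__ == "__main__":
    capacity = 2e12  # assumed 2 TB drive, read end to end during a resilver
    for rate in (1e-14, 1e-15, 1e-16):
        print(f"1:{1 / rate:.0e}  ->  {p_unrecoverable_read(capacity, rate):.4f}")
```

Under these assumptions, reading 2 TB at 1:10^15 gives roughly a 1.6% chance of at least one unreadable sector, and roughly 15% at 1:10^14. The figure grows with the amount of data read, not with how long the resilver takes.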

BR

Sebastian