[OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage

Sebastian Gabler sequoiamobil at gmx.net
Thu Apr 18 15:24:01 UTC 2013


On 18.04.2013 16:28, openindiana-discuss-request at openindiana.org wrote:
> Message: 1
> Date: Thu, 18 Apr 2013 12:17:47 +0000
> From: "Edward Ned Harvey (openindiana)"<openindiana at nedharvey.com>
> To: Discussion list for OpenIndiana
> 	<openindiana-discuss at openindiana.org>
> Subject: Re: [OpenIndiana-discuss] Recommendations for fast storage
> Message-ID:
> 	<D1B1A95FBDCF7341AC8EB0A97FCCC4773BBF3411 at SN2PRD0410MB372.namprd04.prod.outlook.com>
> 	
> Content-Type: text/plain; charset="us-ascii"
>
>> >From: Timothy Coalson [mailto:tsc5yc at mst.edu]
>> >
>> >Did you also compare the probability of bit errors causing data loss
>> >without a complete pool failure?  2-way mirrors, when one device
>> >completely
>> >dies, have no redundancy on that data, and the copy that remains must be
>> >perfect or some data will be lost.
> I had to think about this comment for a little while to understand what you were saying, but I think I got it.  I'm going to rephrase your question:
>
> If one device in a 2-way mirror becomes unavailable, then the remaining device has no redundancy.  So if a bit error is encountered on the (now non-redundant) device, then it's an uncorrectable error.  Question is, did I calculate that probability?
>
> Answer is, I think so.  Modelling the probability of drive failure (either complete failure or data loss) is very complex and non-linear.  Also dependent on the specific model of drive in question, and the graphs are typically not available.  So what I did was to start with some MTBDL graphs that I assumed to be typical, and then assume every data-loss event meant complete drive failure.
The thing is... bit errors can lead to corruption of files, or even to 
the loss of a whole pool, without an additional faulted drive, because 
bit errors do not necessarily show up as a drive failure. The risk of a 
rebuild failing is proportional to the BER of the drives involved, and 
it scales with the amount of data that has to be read, given that you 
have no further redundancy left. I agree with the earlier suggestions 
that scrubbing offers some degree of protection against that issue. It 
does not do away with the risk of hitting a bit error in a situation 
where all redundancy has been stripped for some reason. For this 
aspect, a second level of redundancy offers a clear benefit.
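
To put a rough number on it (my own back-of-the-envelope sketch, using 
typical datasheet BER figures, not numbers from this thread), the chance 
of hitting at least one unrecoverable bit error while reading a given 
amount of data is roughly:

def p_ure(data_bytes, ber):
    """P(at least one unrecoverable bit error) when reading data_bytes
    from media with the given bit error rate (errors per bit read)."""
    bits = data_bytes * 8
    return 1.0 - (1.0 - ber) ** bits

# Resilvering 2 TB off the surviving half of a mirror:
print(p_ure(2e12, ber=1e-14))   # consumer-class BER:   ~0.15
print(p_ure(2e12, ber=1e-15))   # enterprise-class BER: ~0.016

That is why the amount of data moved matters so much once the last 
level of redundancy is gone.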
AFAIU, that was the valid point made by the poster who raised the 
controversy about the resilience of a single vdev with multiple 
redundancy vs. multiple vdevs with single redundancy.
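
Extending the sketch above (again my own illustrative layout and 
numbers, not anything posted here): once one whole drive has failed, a 
2-way mirror has nothing left to correct a bit error with, whereas a 
raidz2 vdev still has one level of redundancy, so data is only lost if 
two of the surviving drives hit an error in the same stripe.

def p_ure(data_bytes, ber):
    return 1.0 - (1.0 - ber) ** (data_bytes * 8)

def p_loss_mirror(data_bytes, ber):
    # 2-way mirror, one side gone: any URE on the survivor is final.
    return p_ure(data_bytes, ber)

def p_loss_raidz2(bytes_per_drive, n_drives, ber, stripe=4096):
    # raidz2 with one drive dead: loss needs UREs on two of the
    # remaining drives in the same stripe (crude union-bound estimate).
    p_stripe = p_ure(stripe, ber)
    stripes = bytes_per_drive / stripe
    survivors = n_drives - 1
    pairs = survivors * (survivors - 1) / 2
    return stripes * pairs * p_stripe ** 2

print(p_loss_mirror(2e12, 1e-14))      # ~0.15
print(p_loss_raidz2(2e12, 8, 1e-14))   # ~1e-9, orders of magnitude less

The absolute numbers are crude, but the gap between "no redundancy 
left" and "one level left" is the point.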
As far as scrubbing is concerned, it is true that it will reduce the 
risk of a bit error surfacing precisely during a rebuild. However, in 
cases where you deliberately give up redundancy, e.g. for swapping 
drives with larger ones, you will want to have a valid backup, and thus 
you will not have too much WORN (write once, read never) data left 
unread. In either case it is user-driven: scrub by itself is not 
proactive, but it gives the user a tool to be proactive about WORN 
data, which is indeed the data primarily prone to bit rot.
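
As a toy illustration of that point (my own simplification, not a claim 
about what scrub guarantees): a completed scrub has recently re-read 
and, where possible, repaired every block, so at rebuild time only 
errors that appeared since the last scrub are left to bite. Reusing the 
p_ure() sketch from above:

def p_ure(data_bytes, ber):
    return 1.0 - (1.0 - ber) ** (data_bytes * 8)

def p_rebuild_error(pool_bytes, ber, unverified_fraction):
    # unverified_fraction: rough share of data not re-read since the
    # last scrub (fresh writes plus whatever decayed in the meantime).
    return p_ure(pool_bytes * unverified_fraction, ber)

print(p_rebuild_error(2e12, 1e-14, 1.0))    # never scrubbed:    ~0.15
print(p_rebuild_error(2e12, 1e-14, 0.05))   # scrubbed recently: ~0.008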

BR

Sebastian


