[OpenIndiana-discuss] Recommendations for fast storage

Richard Elling richard.elling at richardelling.com
Sun Apr 21 04:36:20 UTC 2013


comment below…

On Apr 18, 2013, at 5:17 AM, Edward Ned Harvey (openindiana) <openindiana at nedharvey.com> wrote:

>> From: Timothy Coalson [mailto:tsc5yc at mst.edu]
>> 
>> Did you also compare the probability of bit errors causing data loss
>> without a complete pool failure?  2-way mirrors, when one device
>> completely
>> dies, have no redundancy on that data, and the copy that remains must be
>> perfect or some data will be lost.  
> 
> I had to think about this comment for a little while to understand what you were saying, but I think I got it.  I'm going to rephrase your question:
> 
> If one device in a 2-way mirror becomes unavailable, then the remaining device has no redundancy.  So if a bit error is encountered on the (now non-redundant) device, then it's an uncorrectable error.  Question is, did I calculate that probability?
> 
> Answer is, I think so.  Modeling the probability of drive failure (either complete failure or data loss) is very complex and non-linear.  It also depends on the specific model of drive in question, and the graphs are typically not available.  So what I did was start with some MTBDL graphs that I assumed to be typical, and then assume every data-loss event meant complete drive failure.  Already I'm simplifying the model beyond reality, but the simplification focuses on the worst case and treats every bit error as complete drive failure.  This is why I say "I think so" in answer to your question.
> 
> Then, I didn't want to embark on a mathematician's journey of derivatives and integrals over some non-linear failure rate graphs, so I linearized.  I forget the details now (it was 4-6 years ago), but I would likely have seen that drives were unlikely to fail in the first 2 years, about 50% likely to fail after 3 years, and nearly certain to fail after 5 years, so I would likely have modeled that as a failure probability increasing linearly up to an assumed 100% failure rate at 4 years.

This technique shows a good appreciation of the expected lifetime of components.
Some of the more sophisticated models use a Weibull distribution, and this works 
particularly well for computing devices. The problem for designers is that the vendors
do not publish the Weibull model parameters. You need some time in the field to collect
them, so the approach is impractical at system design time.
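
As an illustration only (not from the thread), here is a minimal Python sketch comparing
the linearized model described above with a Weibull CDF; the shape and scale parameters
are invented placeholders, since real values would have to be measured in the field.

    import math

    def linear_failure_prob(age_years, wearout_years=4.0):
        # Linearized model from the quoted text: failure probability ramps
        # from 0 at deployment to 1.0 at the assumed wear-out age.
        return min(age_years / wearout_years, 1.0)

    def weibull_failure_prob(age_years, shape=1.5, scale=4.0):
        # Weibull CDF: F(t) = 1 - exp(-(t/scale)**shape).
        # shape and scale are made-up placeholder parameters.
        return 1.0 - math.exp(-((age_years / scale) ** shape))

    for t in (1, 2, 3, 4, 5):
        print("year %d: linear=%.2f weibull=%.2f"
              % (t, linear_failure_prob(t), weibull_failure_prob(t)))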

At the end of the day, we have two practical choices:
	1. Prepare for planned obsolescence and replacement of devices when the 
	   expected lifetime is reached. The best proxy for HDD expected lifetime
	   is the warranty period, and you'll often notice that enterprise drives carry a longer
	   warranty than consumer drives -- you tend to get what you pay for.

	2. Measure your environment very carefully and take proactive action when the
	   system begins to show signs of age-related wear-out. This is a good idea
	   in all cases, but the techniques are not widely adopted… yet. (A toy sketch of
	   both approaches follows below.)
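
A toy Python sketch of both choices (mine, not anything standard; the thresholds and
field names are assumptions) might look like this:

    from dataclasses import dataclass

    @dataclass
    class Drive:
        name: str
        age_years: float          # time in service
        warranty_years: float     # vendor warranty, used as a lifetime proxy
        reallocated_sectors: int  # e.g. SMART attribute 5

    def review(drives, realloc_threshold=10):
        # Choice 1: plan replacement once the warranty-period proxy is exceeded.
        # Choice 2: watch for wear-out, approximated here by reallocated sectors.
        for d in drives:
            if d.age_years >= d.warranty_years:
                print("%s: past warranty (%.1f y) -- plan replacement"
                      % (d.name, d.age_years))
            elif d.reallocated_sectors > realloc_threshold:
                print("%s: %d reallocated sectors -- watch closely"
                      % (d.name, d.reallocated_sectors))

    review([Drive("c0t0d0", 4.2, 3.0, 0), Drive("c0t1d0", 1.5, 5.0, 42)])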

> Yes, this modeling introduces inaccuracy, but that inaccuracy is in the noise.  Maybe in the first 2 years, I'm 25% off in my estimates to the positive, and after 4 years I'm 25% off in the negative, or something like that.  But when the results show 10^-17 probability for one configuration and 10^-19 probability for a different configuration, then the 25% error is irrelevant.  It's easy to see which configuration is more probable to fail, and it's also easy to see they're both well within acceptable limits for most purposes (especially if you have good backups.)
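
A quick sanity check of that arithmetic (my sketch, reusing the example figures from the
quoted paragraph):

    # Even a 25% modeling error doesn't change which configuration wins
    # when the estimates differ by two orders of magnitude.
    p_config_a = 1e-17                 # example figure from the text
    p_config_b = 1e-19                 # example figure from the text
    worst_case_b = p_config_b * 1.25   # overestimate the better config by 25%
    best_case_a = p_config_a * 0.75    # underestimate the worse config by 25%
    print(worst_case_b < best_case_a)  # True: config B is still clearly better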

For reliability estimates, that level of error is not bad. There are lots of other
environmental and historical factors that impact real life. As an analogy, for humans,
early death tends to be dominated by accidents rather than chronic health conditions.
For example, children tend to die in automobile accidents, while octogenarians tend 
to die from heart attacks, organ failure, or cancer -- different failure modes as a function
of age.
 -- richard

>> Also, as for time to resilver, I'm guessing that depends largely on where
>> bottlenecks are (it has to read effectively all of the remaining disks in
>> the vdev either way, but can do so in parallel, so ideally it could be the
>> same speed), 
> 
> No.  The big factors for resilver time are (a) the number of operations that need to be performed, and (b) the number of operations per second.
> 
> If you have one big vdev making up a pool, then the number of operations to be performed is equal to the number of objects in the pool.  The number of operations per second is limited by the worst-case random seek time of any device in the pool.  If you have an all-SSD pool, that's roughly the performance of a single disk.  If you have an all-HDD pool, then as the number of devices in the vdev increases, you approach 50% of the IOPS of a single device.
> 
> If your pool is broken down into a bunch of smaller vdevs -- let's say N mirrors that are all 2-way -- then the number of operations to resilver the degraded mirror is 1/N of the total objects in the pool, and the number of operations per second is equal to the performance of a single disk.  So the resilver time for the big raidz vdev is 2N times longer than the resilver time for the mirror.
> 
> As you mentioned, other activity in the pool can further reduce the number of operations per second.  If you have N mirrors, then the probability of the other activity affecting the degraded mirror is 1/N.  Whereas, with a single big vdev, you guessed it, all other activity is guaranteed to affect the resilvering vdev.
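
To make the quoted reasoning concrete, here is a back-of-the-envelope Python sketch
(mine; the object count and IOPS figures are arbitrary examples):

    objects_in_pool = 100_000_000   # total objects (blocks) in the pool -- arbitrary
    disk_iops = 150                 # random IOPS of a single HDD -- arbitrary
    n_mirrors = 10                  # pool built from N 2-way mirrors

    # One big raidz vdev: every object must be touched, and the vdev as a
    # whole approaches ~50% of a single disk's random IOPS.
    raidz_seconds = objects_in_pool / (0.5 * disk_iops)

    # N mirrors: only ~1/N of the objects live on the degraded mirror, and
    # the resilver runs at full single-disk speed.
    mirror_seconds = (objects_in_pool / n_mirrors) / disk_iops

    print("raidz resilver:  %8.1f hours" % (raidz_seconds / 3600))
    print("mirror resilver: %8.1f hours" % (mirror_seconds / 3600))
    print("ratio: %.0fx (expected 2N = %d)"
          % (raidz_seconds / mirror_seconds, 2 * n_mirrors))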

-- 

ZFS storage and performance consulting at http://www.RichardElling.com
