[OpenIndiana-discuss] Recommendations for fast storage
Edward Ned Harvey (openindiana)
openindiana at nedharvey.com
Thu Apr 18 12:17:47 UTC 2013
> From: Timothy Coalson [mailto:tsc5yc at mst.edu]
>
> Did you also compare the probability of bit errors causing data loss
> without a complete pool failure? 2-way mirrors, when one device
> completely
> dies, have no redundancy on that data, and the copy that remains must be
> perfect or some data will be lost.
I had to think about this comment for a little while to understand what you were saying, but I think I got it. I'm going to rephrase your question:
If one device in a 2-way mirror becomes unavailable, then the remaining device has no redundancy. So if a bit error is encountered on the (now non-redundant) device, then it's an uncorrectable error. Question is, did I calculate that probability?
Answer is, I think so. Modeling the probability of drive failure (either complete failure or data loss) is very complex and non-linear. It also depends on the specific model of drive in question, and the relevant graphs are typically not available. So what I did was to start with some MTBDL graphs that I assumed to be typical, and then assume every data-loss event meant complete drive failure. Already I'm simplifying the model beyond reality, but the simplification focuses on the worst case, and treats every bit error as a complete drive failure. This is why I say "I think so" in answer to your question.
Then, I didn't want to embark on a mathematician's journey of derivatives and integrals over some non-linear failure-rate graphs, so I linearized. I forget the exact numbers now (it was 4-6 years ago), but I would likely have seen that drives were unlikely to fail in the first 2 years, about 50% likely to fail after 3 years, and nearly certain to fail after 5 years, so I would likely have modeled that as a linearly increasing probability of failure, reaching an assumed 100% failure rate at 4 years.
Yes, this modeling introduces inaccuracy, but that inaccuracy is in the noise. Maybe in the first 2 years my estimates are 25% too high, and after 4 years 25% too low, or something like that. But when the results show a 10^-17 probability of loss for one configuration and 10^-19 for a different configuration, a 25% error is irrelevant. It's easy to see which configuration is more likely to fail, and it's also easy to see that both are well within acceptable limits for most purposes (especially if you have good backups).
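(For what it's worth, here's a minimal Python sketch of the kind of linearized model I'm describing. All of the numbers in it, the 4-year wear-out age, the resilver windows, the drive counts, are made-up assumptions for illustration, not the figures from my original calculation.)

def annual_failure_prob(age_years, wearout_years=4.0):
    # Linearized model: failure probability ramps from 0 at age 0
    # to 1.0 at the assumed wear-out age (4 years here).
    return min(age_years / wearout_years, 1.0)

def prob_loss_during_resilver(age_years, resilver_days, surviving_drives):
    # Probability that at least one of the surviving drives also fails
    # during the resilver window, treating every bit error as a
    # whole-drive failure (the worst-case simplification above).
    p_one = (annual_failure_prob(age_years + resilver_days / 365.0)
             - annual_failure_prob(age_years))
    return 1.0 - (1.0 - p_one) ** surviving_drives

# Degraded 2-way mirror: one surviving drive, short resilver window.
print(prob_loss_during_resilver(3.0, resilver_days=1, surviving_drives=1))

# Degraded wide raidz1 (8 drives): seven survivors, longer resilver window.
print(prob_loss_during_resilver(3.0, resilver_days=8, surviving_drives=7))

Crude as it is, the point is just that the same simplification lets you compare configurations on a consistent basis; the absolute numbers matter less when one layout comes out orders of magnitude better than the other.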
> Also, as for time to resilver, I'm guessing that depends largely on where
> bottlenecks are (it has to read effectively all of the remaining disks in
> the vdev either way, but can do so in parallel, so ideally it could be the
> same speed),
No. The big factors for resilver time are (a) the number of operations that need to be performed, and (b) the number of operations per second.
If you have one big vdev making up a pool, then the number of operations to be performed is equal to the number of objects in the pool. The number of operations per second is limited by the worst-case random seek time of any device in the pool. If you have an all-SSD pool, the rate is roughly equal to the performance of a single disk. If you have an all-HDD pool, then as the number of devices in your vdev increases, you approach 50% of the IOPS of a single device.
If your pool is instead broken down into a bunch of smaller vdevs, say N 2-way mirrors, then the number of operations to resilver the degraded mirror is 1/N of the total objects in the pool, and the number of operations per second is equal to the performance of a single disk. So the resilver time for the single big raidz vdev is 2N times longer than the resilver time for the mirror.
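To put rough numbers on that 2N factor, here's a back-of-the-envelope sketch (the object count, the per-disk IOPS figure, and N are assumptions for illustration only):

total_objects = 100_000_000   # objects in the whole pool (assumed)
disk_iops     = 150           # random IOPS of a single HDD (assumed)
N             = 6             # number of 2-way mirrors, or drives in the raidz

# One big raidz vdev: every object in the pool must be touched, and the
# effective rate approaches ~50% of a single disk's IOPS as the vdev widens.
raidz_seconds  = total_objects / (0.5 * disk_iops)

# Pool of N 2-way mirrors: only ~1/N of the objects live on the degraded
# mirror, and the resilver runs at the full IOPS of the surviving disk.
mirror_seconds = (total_objects / N) / disk_iops

print(raidz_seconds / mirror_seconds)   # 2 * N, i.e. 12 when N = 6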
As you mentioned, other activity in the pool can further reduce the number of operations per second. If you have N mirrors, then the probability of the other activity affecting the degraded mirror is 1/N. Whereas, with a single big vdev, you guessed it, all other activity is guaranteed to affect the resilvering vdev.