[OpenIndiana-discuss] Recommendations for fast storage
Jay Heyl
jay at frelled.us
Wed Apr 17 19:25:36 UTC 2013
On Wed, Apr 17, 2013 at 5:38 AM, Edward Ned Harvey (openindiana) <openindiana at nedharvey.com> wrote:
> > From: Sašo Kiselkov [mailto:skiselkov.ml at gmail.com]
> >
> > Raid-Z indeed does stripe data across all
> > leaf vdevs (minus parity) and does so by splitting the logical block up
> > into equally sized portions.
>
> Jay, there you have it. You asked why use mirrors, and you said you would
> use raidz2 or raidz3 unless cpu overhead is too much. I recommended using
> mirrors and avoiding raidzN, and here is the answer why.
>
> If you have 16 disks arranged in 8x mirrors, versus 10 disks in raidz2
> which stripes across 8 disks plus 2 parity disks, then the serial write of
> each configuration is about the same; that is, 8x the sustained write speed
> of a single device. But if you have two or more parallel sequential read
> threads, then the sequential read speed of the mirrors will be 16x while
> the raidz2 is only 8x. The mirror configuration can do 8x random write
> while the raidz2 is only 1x. And the mirror can do 16x random read while
> the raidz2 is only 1x.
>
It (finally) occurs to me that not all mirrors are created equal. I've been
assuming, and probably ignoring hints to the contrary, that what was being
compared here was a raid-z2 configuration with a 2-way mirror composed of
two 8-disk vdevs. I now realize you're talking about 8 separate 2-disk
mirrors organized into a pool. "mirror x1 y1 mirror x2 y2 mirror x3 y3..."
I also realize that almost every discussion I've seen online concerning
mirrors proposes organizing the drives in the way I was thinking about it
(which is probably why I was thinking that way). I suppose this is
something different that zfs brings to the table when compared to more
conventional hardware raid.
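
To make the layout concrete, here is a minimal sketch that spells out the "mirror x1 y1 mirror x2 y2 ..." arrangement as a `zpool create` command line. It only builds the string and does not run anything. The pool name `tank` and the device names `x1`..`x8` / `y1`..`y8` are placeholders I made up for illustration, not devices from this thread.

```python
def striped_mirror_command(pool, pairs):
    """Build a `zpool create` command line with one 2-way mirror vdev per pair."""
    args = ["zpool", "create", pool]
    for first, second in pairs:
        # Each "mirror a b" group becomes its own top-level vdev;
        # the pool then spreads data across all of those vdevs.
        args += ["mirror", first, second]
    return " ".join(args)


# Hypothetical device names: x1..x8, each mirrored against y1..y8.
pairs = [(f"x{i}", f"y{i}") for i in range(1, 9)]
command = striped_mirror_command("tank", pairs)
print(command)
```

The point of the sketch is that the 16 disks end up as eight independent 2-disk vdevs in one pool, rather than one big mirror of two 8-disk halves.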
>
> In the case you care about the least, they're equal. In the case you care
> about most, the mirror configuration is 16x faster.
>
> You also said the raidz2 will offer more protection against failure,
> because you can survive any two disk failures (but no more.) I would argue
> this is incorrect (I've done the probability analysis before). Mostly
> because the resilver time in the mirror configuration is 8x to 16x faster
> (there's 1/8 as much data to resilver, and IOPS is limited by a single
> disk, not the "worst" of several disks, which introduces another factor up
> to 2x, increasing the 8x as high as 16x), so the smaller resilver window
> means lower probability of "concurrent" failures on the critical vdev.
> We're talking about 12 hours versus 1 week, actual result of my machines
> in production. Also, while it's possible to fault the pool with only 2
> failures in the mirror configuration, the probability is against that
> happening. The first disk failure probability is 1/16 for each disk ...
> And then if you have a 2nd concurrent failure, there's a 14/15 probability
> that it occurs on a separately independent (safe) mirror. The 3rd
> concurrent failure 12/14 chance of being safe. The 4th concurrent failure
> 10/13 chance of being safe. Etc. The mirror configuration can probably
> withstand a higher number of failures, and also the resilver window for
> each failure is smaller. When you look at the total probability of pool
> failure, they were both like 10^-17 or something like that. In other
> words, we're splitting hairs but as long as we are, we might as well point
> out that they're both about the same.
>
This also starts to make a lot more sense. Confused the hell out of me the
first three times I read it. I'm going to have to ponder this a bit more as
my thinking has been heavily influenced by the more conventional mirror
arrangement.
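
To check my reading of the quoted 14/15, 12/14, 10/13 chain, here is a small sketch of the arithmetic. It assumes failures hit the remaining working disks uniformly at random and all count as "concurrent"; it deliberately ignores resilver time, which the quoted message argues is the bigger factor. The function names are my own.

```python
from fractions import Fraction


def mirror_survival(pairs, failures):
    """Chance that `failures` concurrent failures leave every 2-way mirror
    with at least one working disk, given `pairs` mirrors in the pool."""
    disks = 2 * pairs
    p = Fraction(1)
    for i in range(failures):
        working = disks - i       # disks still running before this failure
        safe = disks - 2 * i      # working disks whose partner is also alive
        if safe <= 0:
            return Fraction(0)    # every mirror is already degraded
        p *= Fraction(safe, working)
    return p


def raidz2_survival(failures):
    """A single raidz2 vdev survives any two failures and no more."""
    return Fraction(1) if failures <= 2 else Fraction(0)


for k in range(1, 5):
    print(k, mirror_survival(8, k), raidz2_survival(k))
```

Under those assumptions the 8x mirror pool survives two concurrent failures with probability 14/15, three with 4/5, and four with 8/13, while the raidz2 survives two with certainty and three never.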