[OpenIndiana-discuss] Recommendations for fast storage
Sašo Kiselkov
skiselkov.ml at gmail.com
Tue Apr 16 23:55:58 UTC 2013
On 04/17/2013 12:08 AM, Richard Elling wrote:
> clarification below...
>
> On Apr 16, 2013, at 2:44 PM, Sašo Kiselkov <skiselkov.ml at gmail.com> wrote:
>
>> On 04/16/2013 11:37 PM, Timothy Coalson wrote:
>>> On Tue, Apr 16, 2013 at 4:29 PM, Sašo Kiselkov <skiselkov.ml at gmail.com>wrote:
>>>
>>>> If you are IOPS constrained, then yes, raid-zn will be slower, simply
>>>> because any read needs to hit all data drives in the stripe. This is
>>>> even worse on writes if the raidz has bad geometry (number of data
>>>> drives isn't a power of 2).
>>>>
>>>
>>> Off topic slightly, but I have always wondered about this - what exactly
>>> causes geometries with a non-power-of-2 number of data drives (plus
>>> parity) to be slower, and by how much? I tested for this effect with some
>>> consumer drives, comparing 8+2 and 10+2, and didn't see much of a penalty
>>> (though the only random test I did was read; our workload is highly
>>> sequential, so it wasn't important).
>
> This makes sense, even for more random workloads.
>
>>
>> Because a non-power-of-2 number of drives causes a read-modify-write
>> sequence on (almost) every write. HDDs are block devices and they can
>> only ever write in increments of their sector size (512 bytes, or
>> nowadays often 4096 bytes). Using your example above, if you divide a 128k
>> block by 8, you get 8x 16k updates - all nicely aligned on 512-byte
>> boundaries, so your drives can write that in one go. If you divide by
>> 10, you get an ugly 12.8k, which means that if your drives are of the
>> 512-byte sector variety, they write 25 full 512-byte sectors and then,
>> for the last partial sector, they first need to fetch the sector from the
>> platter, modify it in memory and then write it out again.
>
> This is true for RAID-5/6, but it is not true for ZFS or raidz. Though it has been
> a few years, I did a bunch of tests and found no correlation between the number
> of disks in the set (within boundaries as described in the man page) and random
> performance for raidz. This is not the case for RAID-5/6 where pathologically
> bad performance is easy to create if you know the number of disks and stripe width.
> -- richard
You are right, and I think I already know where I went wrong, though
I'll need to check vdev_raidz_map_alloc to confirm. If memory serves me
right, raidz actually splits the I/O up so that each stripe component is
simply length-aligned and padded out to a full sector; otherwise the
zio_vdev_child_io() call would trip the block-alignment assertion in
zio_create():
zio_create(zio_t *pio, spa_t *spa, ...)
{
	...
	ASSERT(P2PHASE(size, SPA_MINBLOCKSIZE) == 0);
	...
}
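Just to illustrate the idea (a simplified sketch of my own, not the actual
vdev_raidz_map_alloc() logic, assuming 512-byte sectors): splitting a 128k
block across 10 data columns just hands the leftover sectors to the first
few columns, so every child I/O is still a whole number of sectors:

/*
 * Simplified sketch (not the actual ZFS code): split a 128k block
 * across 10 data columns, rounding each column to whole sectors the
 * way raidz keeps every child I/O aligned to SPA_MINBLOCKSIZE.
 */
#include <stdio.h>
#include <stdint.h>

#define	SECTOR	512	/* assumed 512-byte sectors (ashift=9) */

int
main(void)
{
	uint64_t psize = 128 * 1024;		/* logical block size */
	int dcols = 10;				/* data columns, e.g. 10+2 */
	uint64_t sectors = psize / SECTOR;	/* 256 sectors */
	uint64_t base = sectors / dcols;	/* 25 sectors per column */
	int extra = (int)(sectors % dcols);	/* 6 columns get one more */
	int c;

	for (c = 0; c < dcols; c++) {
		uint64_t csize = (base + (c < extra ? 1 : 0)) * SECTOR;
		printf("col %2d: %llu bytes\n", c, (unsigned long long)csize);
	}
	return (0);
}

So the first 6 columns carry 26 sectors (13312 bytes) and the remaining 4
carry 25 (12800 bytes); nothing ever straddles a sector boundary, hence no
read-modify-write on the drives.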
I was probably misremembering the power-of-2 rule from a discussion
about 4k sector drives. There the amount of wasted space can be
significant, especially on small-block data, e.g. the default 8k
volblocksize not being able to scale beyond 2 data drives + parity.
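To put rough numbers on that (my own arithmetic, not measurements): with 4k
sectors an 8k block is only two data sectors, so no matter how wide the raidz
is, each such block is stored as 2 data sectors plus one sector per parity
level. On, say, an 8+2 raidz2 that works out to 2 data + 2 parity sectors per
block, i.e. at least 100% overhead instead of the ~25% the 8+2 geometry
nominally suggests.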
Cheers,
--
Saso