[OpenIndiana-discuss] Recommendations for fast storage
Sašo Kiselkov
skiselkov.ml at gmail.com
Tue Apr 16 23:55:58 UTC 2013
On 04/17/2013 12:08 AM, Richard Elling wrote:
> clarification below...
>
> On Apr 16, 2013, at 2:44 PM, Sašo Kiselkov <skiselkov.ml at gmail.com> wrote:
>
>> On 04/16/2013 11:37 PM, Timothy Coalson wrote:
>>> On Tue, Apr 16, 2013 at 4:29 PM, Sašo Kiselkov <skiselkov.ml at gmail.com>wrote:
>>>
>>>> If you are IOPS constrained, then yes, raid-zn will be slower, simply
>>>> because any read needs to hit all data drives in the stripe. This is
>>>> even worse on writes if the raidz has bad geometry (number of data
>>>> drives isn't a power of 2).
>>>>
>>>
>>> Off topic slightly, but I have always wondered about this - what exactly
>>> causes geometries with a non-power-of-2 number of data drives (plus
>>> parity) to be slower, and by how much? I tested for this effect with some
>>> consumer drives, comparing 8+2 and 10+2, and didn't see much of a penalty
>>> (though the only random test I did was read; our workload is highly
>>> sequential, so it wasn't important).
>
> This makes sense, even for more random workloads.
>
>>
>> Because a non-power-of-2 number of drives causes a read-modify-write
>> sequence on (almost) every write. HDDs are block devices and they can
>> only ever write in increments of their sector size (512 bytes, or
>> nowadays often 4096 bytes). Using your example above, if you divide a 128k
>> block by 8, you get 8x 16k updates - all nicely aligned on 512-byte
>> boundaries, so your drives can write that in one go. If you divide by
>> 10, you get an ugly 12.8k, which means that if your drives are of the
>> 512-byte sector variety, they write 25 full 512-byte sectors and then,
>> for the last partial sector, they first need to fetch the sector from the
>> platter, modify it in memory and then write it out again.
>
> This is true for RAID-5/6, but it is not true for ZFS or raidz. Though it has been
> a few years, I did a bunch of tests and found no correlation between the number
> of disks in the set (within boundaries as described in the man page) and random
> performance for raidz. This is not the case for RAID-5/6 where pathologically
> bad performance is easy to create if you know the number of disks and stripe width.
> -- richard
You are right, and I think I already know where I went wrong, though
I'll need to check vdev_raidz_map_alloc to confirm. If memory serves me
right, raidz actually splits the I/O up so that each stripe component is
simply length-aligned and padded out to a full sector; otherwise the
zio_vdev_child_io() call would trip the block-alignment assertion in
zio_create():
zio_create(zio_t *pio, spa_t *spa, ...)
{
	...
	ASSERT(P2PHASE(size, SPA_MINBLOCKSIZE) == 0);
	...
}
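Just to illustrate the idea (a simplified sketch of my own, not the actual
vdev_raidz_map_alloc() logic, assuming 512-byte sectors): splitting a 128k
block across 10 data columns just hands the leftover sectors to the first
few columns, so every child I/O is still a whole number of sectors:

/*
 * Simplified sketch (not the actual ZFS code): split a 128k block
 * across 10 data columns, rounding each column to whole sectors the
 * way raidz keeps every child I/O aligned to SPA_MINBLOCKSIZE.
 */
#include <stdio.h>
#include <stdint.h>

#define	SECTOR	512	/* assumed 512-byte sectors (ashift=9) */

int
main(void)
{
	uint64_t psize = 128 * 1024;		/* logical block size */
	int dcols = 10;				/* data columns, e.g. 10+2 */
	uint64_t sectors = psize / SECTOR;	/* 256 sectors */
	uint64_t base = sectors / dcols;	/* 25 sectors per column */
	int extra = (int)(sectors % dcols);	/* 6 columns get one more */
	int c;

	for (c = 0; c < dcols; c++) {
		uint64_t csize = (base + (c < extra ? 1 : 0)) * SECTOR;
		printf("col %2d: %llu bytes\n", c, (unsigned long long)csize);
	}
	return (0);
}

So the first 6 columns carry 26 sectors (13312 bytes) and the remaining 4
carry 25 (12800 bytes); nothing ever straddles a sector boundary, hence no
read-modify-write on the drives.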
I was probably misremembering the power-of-2 rule from a discussion
about 4k sector drives. There the amount of wasted space can be
significant, especially on small-block data, e.g. the default 8k
volblocksize not being able to scale beyond 2 data drives + parity.
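To put rough numbers on that (my own arithmetic, not measurements): with 4k
sectors an 8k block is only two data sectors, so no matter how wide the raidz
is, each such block is stored as 2 data sectors plus one sector per parity
level. On, say, an 8+2 raidz2 that works out to 2 data + 2 parity sectors per
block, i.e. at least 100% overhead instead of the ~25% the 8+2 geometry
nominally suggests.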
Cheers,
--
Saso