[OpenIndiana-discuss] Recommendations for fast storage

Timothy Coalson tsc5yc at mst.edu
Tue Apr 16 22:05:43 UTC 2013


On Tue, Apr 16, 2013 at 4:44 PM, Sašo Kiselkov <skiselkov.ml at gmail.com> wrote:

> On 04/16/2013 11:37 PM, Timothy Coalson wrote:
> > On Tue, Apr 16, 2013 at 4:29 PM, Sašo Kiselkov <skiselkov.ml at gmail.com> wrote:
> >
> >> If you are IOPS constrained, then yes, raid-zn will be slower, simply
> >> because any read needs to hit all data drives in the stripe. This is
> >> even worse on writes if the raidz has bad geometry (number of data
> >> drives isn't a power of 2).
> >>
> >
> > Off topic slightly, but I have always wondered at this - what exactly
> > causes geometries with a non-power-of-2 number of data drives plus
> > parity to be slower, and by how much?  I tested for this effect with
> > some consumer drives, comparing 8+2 and 10+2, and didn't see much of
> > a penalty (though the only random test I did was read; our workload
> > is highly sequential, so it wasn't important).
>
> Because a non-power-of-2 number of data drives causes a
> read-modify-write sequence on (almost) every write. HDDs are block
> devices and they can only ever write in increments of their sector
> size (512 bytes, or nowadays often 4096 bytes). Using your example
> above: divide a 128k block by 8 and you get 8x 16k updates - all
> nicely aligned on 512-byte boundaries, so your drives can write that
> in one go. Divide it by 10 and you get an ugly 12.8k per drive, which
> means that if your drives are of the 512-byte-sector variety, they
> write 25 full 512-byte sectors and then, for the last partial-sector
> write, they first need to fetch the sector from the platter, modify
> it in memory and then write it out again.
>
> I said "almost" every write is affected, but this largely depends on
> your workload. If your writes are large async writes, then this RMW
> cycle only happens at the end of the transaction commit (simplifying
> a bit, but you get the idea), so the overhead is pretty small.
> However, if you are doing many small updates in different locations
> (e.g. writing the ZIL), this can significantly amplify the load.
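
Just so I'm sure I follow the arithmetic above, here it is as a small
Python sketch.  This assumes, as you describe, that the block really
is split at byte granularity across the data drives; the 128k, 8-wide
and 10-wide figures are from your example, the rest is just my own
illustration:

    # Per-drive split if a 128 KiB record were divided evenly across
    # the data drives at byte granularity, as described above.
    RECORDSIZE = 128 * 1024   # bytes
    SECTOR = 512              # bytes (512-byte-sector drive)

    for data_drives in (8, 10):
        per_drive = RECORDSIZE / float(data_drives)
        full_sectors, remainder = divmod(per_drive, SECTOR)
        print("%2d data drives: %8.1f bytes/drive = %2d full sectors + %5.1f bytes"
              % (data_drives, per_drive, full_sectors, remainder))

    #  8 data drives:  16384.0 bytes/drive = 32 full sectors +   0.0 bytes
    # 10 data drives:  13107.2 bytes/drive = 25 full sectors + 307.2 bytes

If that is really how the writes are issued, the 10-wide case does end
up with a sub-sector tail on every drive; my question below is whether
ZFS ever issues such a write at all.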


Okay, I get the carryover of a partial stripe causing problems, that
makes sense, and it at least has implications for space efficiency
given that ZFS mainly uses power-of-2 block sizes.  However, I was not
under the impression that ZFS ever uses partial sectors; rather, that
it uses fewer devices in the final stripe, i.e., the block would be
split 10+2, 10+2, ..., 6+2.  If what you say is true, I'm not sure how
ZFS both manages to address halfway through a sector (if it must keep
that old partial sector, it must be used somewhere, yes?) and yet has
problems with changing sector sizes (the infamous ashift).  Are you
perhaps thinking of block-device-style software RAID, where you need
to ensure that even non-useful bits have correct parity computed?
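
To spell out what I mean by "fewer devices in the final stripe", here
is a quick sketch of the whole-sector allocation I believe raidz does
for a 128k block on a 10+2 raidz2 with 512-byte sectors (ashift=9).
The numbers are my own back-of-the-envelope, not anything pulled out
of zdb:

    # Back-of-the-envelope: cut the record into ashift-sized sectors
    # first, then deal them out in rows of at most DATA_DRIVES sectors,
    # each row carrying its own parity sectors.  No partial sectors.
    RECORDSIZE = 128 * 1024   # bytes
    SECTOR = 512              # bytes, ashift=9
    DATA_DRIVES = 10
    PARITY = 2

    data_sectors = RECORDSIZE // SECTOR                   # 256 sectors
    full_rows, tail = divmod(data_sectors, DATA_DRIVES)   # 25 rows, 6 over

    print("%d rows of %d+%d" % (full_rows, DATA_DRIVES, PARITY))
    if tail:
        print("plus a final row of %d+%d" % (tail, PARITY))
    # -> 25 rows of 10+2, plus a final row of 6+2

Every drive then only ever sees whole sectors, which is why I'd expect
the penalty of an odd geometry to show up as extra parity and padding
sectors (space) rather than as read-modify-write (time).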

Tim

