[OpenIndiana-discuss] ZFS; what the manuals don't say ...

Wed Oct 24 19:58:32 UTC 2012

On Wed, Oct 24, 2012 at 6:17 AM, Robin Axelsson <
gu99roax at student.chalmers.se> wrote:
>
> It would be interesting to know how you convert a raidz2 stripe to say a
> raidz3 stripe. Let's say that I'm on a raidz2 pool and want to add an extra
> parity drive by converting it to a raidz3 pool.  I'm imagining that would
> be like creating a raidz1 pool on top of the leaf vdevs that constitutes
> the raidz2 pool and the new leaf vdev which results in an additional parity
> drive. It doesn't sound too difficult to do that. Actually, this way you
> could even get raidz4 or raidz5 pools. Question is though, how things would
> pan out performance wise, I would imagine that a 55 drive raidz25 pool is
> really taxing on the CPU.
>

Multiple parity is more complicated than that.  An additional XOR device
(a la traditional RAID 4) computed over the existing data and parity would
add no independent information (XOR over the data plus its XOR parity is
zero everywhere), so it couldn't reconstruct your data after an additional
failure.  Look at "computing parity" in
http://en.wikipedia.org/wiki/Raid_6#RAID_6 .  While in theory the scheme
can extend to more than 3 parity blocks, it is unclear whether more than 3
will offer any serious additional benefits (using multiple raidz2 vdevs can
give you better IOPS than larger raidz3 vdevs, with little change in raw
space efficiency).  There are also combinatorial implications to multiple
bit errors in a single data chunk with high parity levels, but that is
somewhat unlikely.
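
To make the parity point concrete, here is a toy sketch in Python.  It is
not the raidz code; it assumes the P/Q scheme from the Wikipedia article
(P is plain XOR, Q is weighted by powers of 2 in GF(2^8) with the polynomial
0x11d), and every function name in it is made up for illustration.  It
shows that a second XOR over data plus P is all zeros, while P and Q
together can rebuild two lost data blocks.

```python
from functools import reduce

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8+x^4+x^3+x^2+1 (0x11d)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return result

def gf_pow(a, n):
    """Raise a to the n-th power in GF(2^8) by repeated multiplication."""
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result

def gf_inv(a):
    """Inverse of a nonzero byte: a^254, since a^255 == 1 in GF(2^8)."""
    return gf_pow(a, 254)

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*blocks))

def q_parity(blocks):
    """Second parity: XOR of 2^i * D_i, with the product taken in GF(2^8)."""
    out = bytearray(len(blocks[0]))
    for i, block in enumerate(blocks):
        coeff = gf_pow(2, i)
        for j, byte in enumerate(block):
            out[j] ^= gf_mul(coeff, byte)
    return bytes(out)

def recover_two(blocks, x, y, p, q):
    """Rebuild data blocks x and y (given as None in blocks) from P and Q."""
    zero = bytes(len(p))
    known = [b if b is not None else zero for b in blocks]
    pxy = xor_blocks(known + [p])              # equals D_x ^ D_y
    qxy = xor_blocks([q_parity(known), q])     # equals 2^x*D_x ^ 2^y*D_y
    gx, gy = gf_pow(2, x), gf_pow(2, y)
    inv = gf_inv(gx ^ gy)
    dx = bytes(gf_mul(inv, qb ^ gf_mul(gy, pb)) for pb, qb in zip(pxy, qxy))
    dy = bytes(a ^ b for a, b in zip(pxy, dx))
    return dx, dy

data = [b"ZFS!", b"raid", b"zpoo", b"l..."]
p = xor_blocks(data)
q = q_parity(data)

# A second plain XOR over data + P carries no information at all.
print(xor_blocks(data + [p]))

# P and Q together survive the loss of two data blocks.
print(recover_two([None, data[1], None, data[3]], 0, 2, p, q))
```

A third parity needs yet another independent set of coefficients, which is
why each extra level costs more CPU than one more XOR pass.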

> Going from raidz3 to raidz2 or from raidz2 to raidz1 sounds like a
> no-brainer; you just remove one drive from the pool and force zpool to
> accept the new state as "normal".
>

A degraded raidz2 vdev has to compute the missing block from parity on
nearly every read; that is not the normal state of a raidz1.  Changing the
parity level, either up or down, has similar complications in the on-disk
structure.
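
As a rough illustration of the difference, here is a toy model in Python.
It is not the ZFS read path: it uses a single XOR parity and invented names
(read_data, columns) only to show that a read touching the missing disk
must be rebuilt from the survivors, whereas a healthy vdev just returns the
stored block.

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def read_data(columns, parity, index):
    """Read data column `index`; None marks a column on the missing disk.

    Returns (block, reconstructed) so the caller can see the extra work.
    """
    if columns[index] is not None:
        return columns[index], False
    survivors = [c for c in columns if c is not None]
    return xor_blocks(survivors + [parity]), True

data = [b"aaaa", b"bbbb", b"cccc"]
p = xor_blocks(data)

healthy = list(data)
degraded = [data[0], None, data[2]]   # the disk holding column 1 is gone

print(read_data(healthy, p, 1))    # direct read
print(read_data(degraded, p, 1))   # same bytes, but rebuilt from parity
```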

> But expanding a raidz pool with additional storage while preserving the
> parity structure sounds a little bit trickier. I don't think I have that
> knowledge to write a bpr rewriter although I'm reading Solaris Internals
> right now ;)

Unless raidz* did something radically different than raid5/6 (as in, not
having the parity blocks necessarily next to each other in the data chunk,
and having their positions recorded in the data chunk itself), the position
of the parity and data blocks would change.  The "always consistent on
disk" approach of ZFS adds additional problems to this, which probably make
it impossible to rewrite the re-parity'ed chunk over the old chunk, meaning
it has to find some free space every time it wants to update a chunk to the
new parity level.

>> What you describe here is known as unionfs in Linux, among others.
>> I think there were RFEs or otherwise expressed desires to make that
>> in Solaris and later illumos (I did campaign for that sometime ago),
>> but AFAIK this was not yet done by anyone.
>>
> YES, UnionFS-like functionality is what I was talking about. It seems
> like it has been abandoned in favor of AuFS in the Linux and the BSD world.
> It seems to have functions that are a little overkill to use with zfs, such
> as copy-on-write. Perhaps a more simplistic implementation of it would be
> more suitable for zfs.
>

You could create zfs filesystems for subfolders in your "dataset" from the
separate pools, and give them mountpoints that put them into the same
directory.  You would have to balance the data allocation between the pools
manually, though.
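
A minimal sketch of that layout, written as a small Python helper that only
prints the commands rather than running them.  The pool names (tank1,
tank2), the subfolder names and the /export/data directory are all
hypothetical; the only real piece is "zfs create -o mountpoint=...".

```python
import shlex

def zfs_create_commands(layout, base_dir):
    """Return one `zfs create` command line per subfolder.

    layout maps a subfolder name to the pool that should hold it.
    """
    commands = []
    for folder, pool in sorted(layout.items()):
        mountpoint = f"{base_dir}/{folder}"
        argv = ["zfs", "create", "-o", f"mountpoint={mountpoint}",
                f"{pool}/{folder}"]
        commands.append(" ".join(shlex.quote(arg) for arg in argv))
    return commands

# Hypothetical split of one directory tree across two pools.
layout = {"music": "tank1", "photos": "tank1", "video": "tank2"}
for line in zfs_create_commands(layout, "/export/data"):
    print(line)
```

Everything then shows up under the one directory, but which pool each
subfolder lives on is still your decision, as is moving things around when
one pool fills up.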

> Perhaps a similar functionality can be established through an abstraction
> layer behind network shares.
>
> In Windows this functionality is called 'disk pooling', btw.

In ZFS, disk pooling is done by "creating a zpool", emphasis on singular.
Do you actually expect a large portion of your disks to go offline
suddenly?  I don't see a good way to handle this (good meaning no files go
missing under the expected error conditions) that gets you more than 50% of
your raw storage capacity (mirrors across the boundary of what you expect
to go down together).  I doubt I would like the outcome of having some
software make arbitrary decisions about which real filesystem to put each
file on, and then having one filesystem fail.  So if you really expect
this, you may be happier keeping the two pools separate and deciding where
to put stuff yourself (if you are expecting a set of disks to fail, I
expect you would have some idea as to which ones it would be, for instance
an external enclosure).

If, on the other hand, you don't expect your hardware to drop an entire set
of disks for no good reason, making them into one large storage pool and
putting your filesystem in it will share your data transparently across all
disks without needing to set anything else up.

Tim