[OpenIndiana-discuss] ZFS; what the manuals don't say ...

Sun Oct 28 12:10:02 UTC 2012

On 2012-10-24 21:58, Timothy Coalson wrote:
> On Wed, Oct 24, 2012 at 6:17 AM, Robin Axelsson<
> gu99roax at student.chalmers.se>  wrote:
>> It would be interesting to know how you convert a raidz2 stripe to say a
>> raidz3 stripe. Let's say that I'm on a raidz2 pool and want to add an extra
>> parity drive by converting it to a raidz3 pool.  I'm imagining that would
>> be like creating a raidz1 pool on top of the leaf vdevs that constitute
>> the raidz2 pool and the new leaf vdev which results in an additional parity
>> drive. It doesn't sound too difficult to do that. Actually, this way you
>> could even get raidz4 or raidz5 pools. Question is though, how things would
>> pan out performance wise, I would imagine that a 55 drive raidz25 pool is
>> really taxing on the CPU.
>>
> Multiple parity is more complicated than that, an additional xor device (a
> la traditional raid4) would end up with zeros everywhere, and couldn't
> reconstruct your data from an additional failure.  Look at "computing
> parity" in http://en.wikipedia.org/wiki/Raid_6#RAID_6 .  While in theory it
> can extend to more than 3 parity blocks, it is unclear whether more than 3
> will offer any serious additional benefits (using multiple raidz2 vdevs can
> give you better IOPS than larger raidz3 vdevs, with little change in raw
> space efficiency).  There are also combinatorial implications to multiple
> bit errors in a single data chunk with high parity levels, but that is
> somewhat unlikely.

XOR you say? I didn't know that raidz used xor for parity. I thought 
they used some kind of Reed-Solomon implementation à la PAR2 at the 
block level to achieve "RAID-like" functionality. Nothing I could find 
in the documentation states that the RAID functionality is implemented 
like traditional hardware RAID. If xor is the case, then I'm curious as 
to how they managed to pull off a raidz3 implementation with three-disk 
redundancy.

Maybe a good read of the zpool source code would help clarify things...
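
As an illustration of the "computing parity" scheme from the Wikipedia
article cited above, here is a minimal Python sketch of the textbook RAID 6
construction: P is a plain XOR, Q is a weighted sum in GF(2^8). It is not
taken from the ZFS source; the function names and the sample bytes are made
up for the example, and it handles one byte per data disk only. It also
shows why a second plain-XOR parity device would hold nothing but zeros.

```python
# Textbook RAID 6 style P/Q parity over GF(2^8), polynomial 0x11d.
# Illustrative only: names and sample data are assumptions, not ZFS code.

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) (shift-and-add with reduction)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return result

def gf_pow(a, n):
    """Raise a to the n-th power in GF(2^8)."""
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result

def gf_inv(a):
    """Multiplicative inverse: a^254 == a^-1 for any non-zero byte."""
    return gf_pow(a, 254)

def parity(data):
    """Return (P, Q) for a list of data bytes, one byte per data disk."""
    p = 0
    q = 0
    for i, d in enumerate(data):
        p ^= d                         # P: plain XOR
        q ^= gf_mul(gf_pow(2, i), d)   # Q: XOR of 2^i * D_i
    return p, q

def naive_second_parity(data, p):
    """XOR of all data bytes and P: what an extra XOR device would store."""
    extra = p
    for d in data:
        extra ^= d
    return extra

def recover_two(data, x, y, p, q):
    """Rebuild data[x] and data[y] from the surviving bytes plus P and Q."""
    a, b = p, q
    for i, d in enumerate(data):
        if i in (x, y):
            continue                   # lost disks contribute nothing
        a ^= d
        b ^= gf_mul(gf_pow(2, i), d)
    # Now a = Dx ^ Dy and b = 2^x*Dx ^ 2^y*Dy; solve for Dx, then Dy.
    gx, gy = gf_pow(2, x), gf_pow(2, y)
    dx = gf_mul(b ^ gf_mul(gy, a), gf_inv(gx ^ gy))
    dy = a ^ dx
    return dx, dy

data = [0x12, 0x34, 0x56, 0x78]
p, q = parity(data)
print("extra XOR device would hold:", naive_second_parity(data, p))

damaged = list(data)
damaged[1] = damaged[3] = None         # pretend two data disks are gone
print("recovered:", recover_two(damaged, 1, 3, p, q))
```

The second equation (Q) is what makes two unknowns solvable; a second XOR
would just repeat the first equation.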

>
>> Going from raidz3 to raidz2 or from raidz2 to raidz1 sounds like a
>> no-brainer; you just remove one drive from the pool and force zpool to
>> accept the new state as "normal".
>>
> A degraded raidz2 vdev has to compute the missing block from parity on
> nearly every read, this is not the normal state of raidz1.  Changing the
> parity level, either up or down, has similar complications in the on-disk
> structure.
>
>> But expanding a raidz pool with additional storage while preserving the
>> parity structure sounds a little bit trickier. I don't think I have the
>> knowledge to write a bp rewriter although I'm reading Solaris Internals
>> right now ;)
>
> Unless raidz* did something radically different than raid5/6 (as in, not
> having the parity blocks necessarily next to each other in the data chunk,
> and having their positions recorded in the data chunk itself), the position
> of the parity and data blocks would change.  The "always consistent on
> disk" approach of ZFS adds additional problems to this, which probably make
> it impossible to rewrite the re-parity'ed chunk over the old chunk, meaning
> it has to find some free space every time it wants to update a chunk to the
> new parity level.
>
>
>>> What you describe here is known as unionfs in Linux, among others.
>>> I think there were RFEs or otherwise expressed desires to make that
>>> in Solaris and later illumos (I did campaign for that sometime ago),
>>> but AFAIK this was not yet done by anyone.
>>>
>> YES, UnionFS-like functionality is what I was talking about. It seems
>> like it has been abandoned in favor of AuFS in the Linux and BSD world.
>> It seems to have functions that are a little overkill to use with zfs, such
>> as copy-on-write. Perhaps a more simplistic implementation of it would be
>> more suitable for zfs.
>>
> You could create zfs filesystems for subfolders in your "dataset" from the
> separate pools, and give them mountpoints that put them into the same
> directory.  You would have to balance the data allocation between the pools
> manually, though.

I know that works, but I was talking about having files stored at 
different (hardware) locations and yet being in the same ... folder. I 
guess you are using MacOS :)

>
>> Perhaps a similar functionality can be established through an abstraction
>> layer behind network shares.
>>
>> In Windows this functionality is called 'disk pooling', btw.
>
> In ZFS, disk pooling is done by "creating a zpool", emphasis on singular.
>   Do you actually expect a large portion of your disks to go offline
> suddenly?  I don't see a good way to handle this (good meaning there are no
> missing files under the expected error conditions) that gets you more than
> 50% of your raw storage capacity (mirrors across the boundary of what you
> expect to go down together).  I doubt I would like the outcome of having
> some software make arbitrary decisions of what real filesystem to put each
> file on, and then having one filesystem fail, so if you really expect this,
> you may be happier keeping the two pools separate and deciding where to put
> stuff yourself (since if you are expecting a set of disks to fail, I expect
> you would have some idea as to which ones it would be, for instance an
> external enclosure).
>
> If, on the other hand, you don't expect your hardware to drop an entire set
> of disks for no good reason, making them into one large storage pool and
> putting your filesystem in it will share your data transparently across all
> disks without needing to set anything else up.
>
> Tim

It seems that ZFS is good at protecting data, but when things do go 
south it seems to handle the situation rather badly. The more hard 
drives a storage pool uses, the higher the likelihood that something 
goes wrong.
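
To put a rough number on that, here is a small Python sketch. It assumes
drives fail independently and uses a made-up 3% per-drive failure
probability over some period; both are assumptions for illustration, not
measured figures. It estimates only the chance that at least one drive
fails, not the chance of data loss, which depends on the redundancy level.

```python
def p_any_failure(n_drives, p_single):
    """Probability that at least one of n independent drives fails."""
    return 1.0 - (1.0 - p_single) ** n_drives

# 0.03 is an illustrative per-drive failure probability, not a real statistic.
for n in (1, 4, 12, 24):
    print(n, "drives:", round(p_any_failure(n, 0.03), 3))
```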

While I agree that it is not reasonable to expect all files to still be 
accessible if a large portion of the disks go offline, it would at 
least be great if whatever happens to be on the remaining drives were 
still accessible.

One way to achieve something along those lines would be to create 
some kind of separation in the file system so that, say, two vdev 
configurations are technically independent but together constitute a 
common unified storage location. It would be like cells in a ship; even 
if a few cells break and take in water, the ship won't sink because the 
other cells are intact.
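
The compartment idea can be sketched in a few lines of Python, purely as an
illustration and not as anything ZFS or illumos provides: a hypothetical
union_listdir() merges the listings of several independent mount points and
simply skips any that has gone offline. The paths in the comment are made up.

```python
import os

def union_listdir(branches):
    """Merge directory listings from several branches into one view.

    Returns a dict mapping each file name to its real path. A branch that
    cannot be read (for example, its pool is offline) is skipped, so the
    files on the remaining branches stay visible. On a name clash, the
    first branch in the list wins.
    """
    merged = {}
    for branch in branches:
        try:
            names = os.listdir(branch)
        except OSError:
            continue  # this "cell" is flooded; keep the others afloat
        for name in names:
            merged.setdefault(name, os.path.join(branch, name))
    return merged

# Hypothetical usage with two separately mounted pools:
# view = union_listdir(["/pool1/share", "/pool2/share"])
```

This covers only listing; deciding which branch receives a new file is the
part that would need a real policy.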
