[OpenIndiana-discuss] ZFS; what the manuals don't say ...

Mon Oct 29 05:09:12 UTC 2012

On Oct 28, 2012, at 5:10 AM, Robin Axelsson <gu99roax at student.chalmers.se> wrote:
> On 2012-10-24 21:58, Timothy Coalson wrote:
>> On Wed, Oct 24, 2012 at 6:17 AM, Robin Axelsson <gu99roax at student.chalmers.se> wrote:
>>> It would be interesting to know how you convert a raidz2 stripe to say a
>>> raidz3 stripe. Let's say that I'm on a raidz2 pool and want to add an extra
>>> parity drive by converting it to a raidz3 pool.  I'm imagining that would
>>> be like creating a raidz1 pool on top of the leaf vdevs that constitute
>>> the raidz2 pool and the new leaf vdev which results in an additional parity
>>> drive. It doesn't sound too difficult to do that. Actually, this way you
>>> could even get raidz4 or raidz5 pools. Question is though, how things would
>>> pan out performance wise, I would imagine that a 55 drive raidz25 pool is
>>> really taxing on the CPU.
>>> 
>> Multiple parity is more complicated than that: an additional xor device (a
>> la traditional raid4) would end up with zeros everywhere, and couldn't
>> reconstruct your data from an additional failure.  Look at "computing
>> parity" in http://en.wikipedia.org/wiki/Raid_6#RAID_6 .  While in theory it
>> can extend to more than 3 parity blocks, it is unclear whether more than 3
>> will offer any serious additional benefits (using multiple raidz2 vdevs can
>> give you better IOPS than larger raidz3 vdevs, with little change in raw
>> space efficiency).  There are also combinatorial implications to multiple
>> bit errors in a single data chunk with high parity levels, but that is
>> somewhat unlikely.
> 
> XOR you say? I didn't know that raidz used xor for parity. I thought they used some kind of a Reed-Solomon implementation à la PAR2 on the block level to achieve "RAID like" functionality. It never was stated from what I could read in the documentation that the raid functionality was implemented like traditional hardware RAID. If xor is the case then I'm curious as to how they managed to pull off a raidz3 implementation with three disk redundancy.

The first parity is XOR (also a Reed-Solomon syndrome). The 2nd and 3rd 
parity are other syndromes.
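
To make that concrete, here is a minimal Python sketch of the first two
syndromes for one byte per data disk. It is not the raidz code. It assumes
the conventional RAID-6 field, GF(2^8) with polynomial 0x11d and generator 2,
and the function names (gf_mul, syndromes, recover_two) are made up for
illustration. P is a plain XOR; Q weights disk i by 2**i in the field, which
is what lets two missing disks be solved for. It also shows why a second plain
XOR column would add nothing: the XOR of all data plus P is always zero.

```python
# Illustrative sketch only, not the raidz implementation.
# Assumption: GF(2^8) with polynomial 0x11d and generator 2 (usual RAID-6 choice).

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return result

def gf_pow(a, n):
    """Raise a to the n-th power by repeated multiplication."""
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result

def gf_inv(a):
    """Inverse of a nonzero element: a**254, because a**255 == 1."""
    return gf_pow(a, 254)

def syndromes(data):
    """Return (P, Q). P is the XOR of all bytes; Q weights disk i by 2**i."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(gf_pow(2, i), d)
    return p, q

def recover_two(data, x, y, p, q):
    """Rebuild the bytes of lost disks x and y from the survivors plus P and Q.

    The entries data[x] and data[y] are ignored, so they may be None.
    """
    pxy, qxy = p, q
    for i, d in enumerate(data):
        if i not in (x, y):
            pxy ^= d                       # leaves dx ^ dy
            qxy ^= gf_mul(gf_pow(2, i), d)  # leaves 2**x * dx ^ 2**y * dy
    gx, gy = gf_pow(2, x), gf_pow(2, y)
    # qxy ^ gy*pxy == (gx ^ gy) * dx, so divide by (gx ^ gy)
    dx = gf_mul(qxy ^ gf_mul(gy, pxy), gf_inv(gx ^ gy))
    dy = pxy ^ dx
    return dx, dy

def xor_all(values):
    """XOR a sequence of bytes together."""
    acc = 0
    for v in values:
        acc ^= v
    return acc

if __name__ == "__main__":
    data = [0x12, 0x34, 0x56, 0x78]
    p, q = syndromes(data)
    print("second XOR column would be:", xor_all(data + [p]))
    print("recovered:", recover_two([0x12, None, 0x56, None], 1, 3, p, q))
```

A third parity column follows the same pattern with a different weighting
per disk, giving a third independent equation.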

Also, minor nit: there is no such thing as hardware RAID, there is only 
software RAID.
 -- richard

> 
> Maybe a good read into the zpool source code would help clarify things...
> 
>> 
>>> Going from raidz3 to raidz2 or from raidz2 to raidz1 sounds like a
>>> no-brainer; you just remove one drive from the pool and force zpool to
>>> accept the new state as "normal".
>>> 
>> A degraded raidz2 vdev has to compute the missing block from parity on
>> nearly every read; this is not the normal state of raidz1.  Changing the
>> parity level, either up or down, has similar complications in the on-disk
>> structure.
>> 
>>> But expanding a raidz pool with additional storage while preserving the
>>> parity structure sounds a little bit trickier. I don't think I have that
>>> knowledge to write a bp rewriter although I'm reading Solaris Internals
>>> right now ;)
>> 
>> Unless raidz* did something radically different than raid5/6 (as in, not
>> having the parity blocks necessarily next to each other in the data chunk,
>> and having their positions recorded in the data chunk itself), the position
>> of the parity and data blocks would change.  The "always consistent on
>> disk" approach of ZFS adds additional problems to this, which probably make
>> it impossible to rewrite the re-parity'ed chunk over the old chunk, meaning
>> it has to find some free space every time it wants to update a chunk to the
>> new parity level.
>> 
>> 
>>>> What you describe here is known as unionfs in Linux, among others.
>>>> I think there were RFEs or otherwise expressed desires to make that
>>>> in Solaris and later illumos (I did campaign for that sometime ago),
>>>> but AFAIK this was not yet done by anyone.
>>>> 
>>> YES, UnionFS-like functionality is what I was talking about. It seems
>>> like it has been abandoned in favor of AuFS in the Linux and the BSD world.
>>> It seems to have functions that are a little overkill to use with zfs, such
>>> as copy-on-write. Perhaps a more simplistic implementation of it would be
>>> more suitable for zfs.
>>> 
>> You could create zfs filesystems for subfolders in your "dataset" from the
>> separate pools, and give them mountpoints that put them into the same
>> directory.  You would have to balance the data allocation between the pools
>> manually, though.
> 
> I know that works, but I was talking about having files stored at different (hardware) locations and yet being in the same ... folder. I guess you are using MacOS :)
> 
>> 
>>> Perhaps a similar functionality can be established through an abstraction
>>> layer behind network shares.
>>> 
>>> In Windows this functionality is called 'disk pooling', btw.
>> 
>> In ZFS, disk pooling is done by "creating a zpool", emphasis on singular.
>>  Do you actually expect a large portion of your disks to go offline
>> suddenly?  I don't see a good way to handle this (good meaning there are no
>> missing files under the expected error conditions) that gets you more than
>> 50% of your raw storage capacity (mirrors across the boundary of what you
>> expect to go down together).  I doubt I would like the outcome of having
>> some software make arbitrary decisions of what real filesystem to put each
>> file on, and then having one filesystem fail, so if you really expect this,
>> you may be happier keeping the two pools separate and deciding where to put
>> stuff yourself (since if you are expecting a set of disks to fail, I expect
>> you would have some idea as to which ones it would be, for instance an
>> external enclosure).
>> 
>> If, on the other hand, you don't expect your hardware to drop an entire set
>> of disks for no good reason, making them into one large storage pool and
>> putting your filesystem in it will share your data transparently across all
>> disks without needing to set anything else up.
>> 
>> Tim
> It seems that ZFS is good at protecting data but when things do happen to go south then ZFS seems to be pretty bad at handling the situation.

Eh? This comment makes no sense.

> The more hard drives that are used in a storage pool the higher the likelihood will be that something goes wrong.

yep, more stuff means more stuff to break.
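
A back-of-the-envelope sketch of that in Python, under loud assumptions:
drive failures are treated as independent, and the 3% per-drive failure
probability is a made-up figure for illustration, not a measurement. The
function names are hypothetical. The first function gives the chance that at
least one of n drives fails; the second gives the chance that a single vdev
loses more drives than its parity level can cover.

```python
from math import comb

def p_any_failure(n_drives, p_single):
    """Probability that at least one of n independent drives fails."""
    return 1.0 - (1.0 - p_single) ** n_drives

def p_vdev_loss(n_drives, parity, p_single):
    """Probability that more than `parity` of n independent drives fail."""
    return sum(
        comb(n_drives, k) * p_single**k * (1.0 - p_single) ** (n_drives - k)
        for k in range(parity + 1, n_drives + 1)
    )

if __name__ == "__main__":
    p = 0.03  # assumed per-drive failure probability, illustration only
    for n in (6, 12, 55):
        print(n, "drives, any failure:", round(p_any_failure(n, p), 4))
    for parity in (1, 2, 3):
        print("12-drive vdev, parity", parity, "lost:", p_vdev_loss(12, parity, p))
```

Real drives in one chassis do not fail independently (shared power,
controller, batch), so treat the output as an optimistic estimate at best.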

> 
> While I agree that it is not reasonable to expect that all files will still be accessible if a large portion of the disks go offline, it would at least be great if whatever happens to be on the remaining drives were still accessible.
> 
> One way to achieve something along that direction would be to create some kind of a separation in the file system so that, say, two vdev configurations are technically independent but together constitute a common unified storage location. It would be like cells in a ship; even if a few cells break and take in water, the ship won't sink because the other cells are intact.

We call that RAID :-)
 -- richard

-- 

ZFS storage and performance consulting at http://www.RichardElling.com