[OpenIndiana-discuss] ZFS; what the manuals don't say ...

Jim Klimov jimklimov at cos.ru
Thu Oct 25 10:21:04 UTC 2012

2012-10-24 15:17, Robin Axelsson wrote:
> On 2012-10-23 20:06, Jim Klimov wrote:
>> 2012-10-23 19:53, Robin Axelsson wrote:
> But if I do send/receive to the same pool I will need to have enough
> free space in it to fit at least two copies of the dataset I want to
> reallocate.

Likewise with "reallocation" of files - though the unit of required
space would be smaller.

> It seems like what zfs is missing here is a good defrag tool.

This was discussed several times, with the outcome being that with
ZFS's data allocation policies, there is no one good defrag policy.
The two most commonly proposed goals are storing the current "live"
copy of a file contiguously (as opposed to its history of released
blocks only referenced in snapshots) vs. storing pool blocks in
ascending creation-TXG order (to arguably speed up scrubs and
resilvers, which can lose a noticeable portion of their performance
to random IO).

Most ordinary users think that the first goal is the right one -
however, if you add clones and dedup into the equation, it might never
be possible to retain their benefits AND store all files contiguously.

Also, as with other matters of moving blocks around in the allocation
areas transparently to other layers of the system (that is, on a
live system that actively does I/O while you defrag data), there are
some other problems that I'm not very qualified to speculate about,
and that are deemed solvable by the generic BPR (Block Pointer
Rewrite).

Still, I do think that many of the problems postponed until BPR
arrives can be solved with different methods and under certain
limitations (such as off-line mangling of data on the pool), which
might still be acceptable for some use-cases.

All-in-all, the main intended usage of ZFS is on relatively powerful
enterprise-class machines, where much of the needed data is cached
on SSD or in huge RAM, so random HDD IO lags become less relevant.
This situation is most noticeable with deduplication, which in the
ZFS implementation requires vast resources just to function.
With market prices going down over time, it is more likely to see
home-NAS boxes tomorrow similarly spec'ed to enterprise servers of
today, than to see the core software fundamentally revised and
rearchitected for boxes of yesterday. After all, even in the
open-source world, developers need to eat and feed their families,
so commercial applicability does matter and does influence the
engineering designs and trade-offs.

> It would be interesting to know how you convert a raidz2 stripe to say a
> raidz3 stripe. Let's say that I'm on a raidz2 pool and want to add an
> extra parity drive by converting it to a raidz3 pool.  I'm imagining
> that would be like creating a raidz1 pool on top of the leaf vdevs that
> constitutes the raidz2 pool and the new leaf vdev which results in an
> additional parity drive. It doesn't sound too difficult to do that.
> Actually, this way you could even get raidz4 or raidz5 pools. Question
> is though, how things would pan out performance wise, I would imagine
> that a 55 drive raidz25 pool is really taxing on the CPU.
> Going from raidz3 to raidz2 or from raidz2 to raidz1 sounds like a
> no-brainer; you just remove one drive from the pool and force zpool to
> accept the new state as "normal".
> But expanding a raidz pool with additional storage while preserving the
> parity structure sounds a little bit trickier. I don't think I have that
> knowledge to write a bpr rewriter although I'm reading Solaris Internals
> right now ;)

Read also the ZFS On-Disk Specification (the one I saw is somewhat
outdated, being from 2006, but most concepts and data structures
are the foundation - expected to remain in place and be expanded upon).

In short, if I got that all right, the leaf components of a top-level
VDEV are striped upon creation and declared as an allocation area with
its ID and monotonically increasing offsets of sectors (and further
subdivided into a couple hundred metaslabs to reduce seeking). For
example, on a 5-disk array the offsets of the pooled sectors might
look like this:

   0  1  2  3  4
   5  6  7  8  9

(For the purposes of offset numbering, sector size is 512 bytes - even
on 4KB-sectored disks; I am not sure how that's processed in the
address math - likely the ashift value helps pick the specific disk's
number).

Then when a piece of data is saved by the OS (kernel metadata or
userland userdata), it is logically combined into a block, processed
for storage (compressed, etc.) and, depending on redundancy level, some
sectors with parity data are prepended to each set of data sectors.
For a raidz2 of 5 disks each such set has 2 parity sectors and up to
3 data sectors. The example below shows three consecutive blocks, with
parity sectors labelled P, p, b and data sectors labelled D, d, k
respectively:

   P0 P1 D0 D1 D2
   P2 P3 D3 D4 p0
   p1 d0 b0 b1 k0
   k1 k2 ...

ZFS allocates only as many sectors as are needed to store the
redundancy and data for the block, so the data (and holes left after
removal of data) are intermixed in a not very predictable way - unlike
traditional full-stripe RAID5/6, where every stripe has the same
shape. Still, this does allow recovery from the loss of N drives in a
raidzN set while minimizing read-modify-write problems of traditional
RAID.
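
The allocation rule described above can be sketched in a few lines.
This is only a model of the picture in this mail, not of the real
allocator: it assumes that every set of up to (ndisks - nparity) data
sectors gets nparity parity sectors prepended, that blocks are packed
back to back, and it ignores any padding or alignment ZFS may apply.
The names raidz_sectors() and layout() are hypothetical.

```python
import math


def raidz_sectors(data_sectors, ndisks, nparity):
    """Return (parity, total) sector counts for one block, ignoring padding."""
    data_per_set = ndisks - nparity
    sets = math.ceil(data_sectors / data_per_set)
    parity = sets * nparity
    return parity, parity + data_sectors


def layout(blocks, ndisks, nparity):
    """Pack blocks back to back and return the result as rows of labels.

    blocks is a list of (parity_label, data_label, data_sectors) tuples.
    """
    data_per_set = ndisks - nparity
    cells = []
    for plabel, dlabel, ndata in blocks:
        p = d = 0
        remaining = ndata
        while remaining > 0:
            # Parity sectors are prepended to each set of data sectors.
            for _ in range(nparity):
                cells.append(f"{plabel}{p}")
                p += 1
            take = min(data_per_set, remaining)
            for _ in range(take):
                cells.append(f"{dlabel}{d}")
                d += 1
            remaining -= take
    return [cells[i:i + ndisks] for i in range(0, len(cells), ndisks)]


if __name__ == "__main__":
    # Three blocks of 5, 1 and 3 data sectors on a 5-disk raidz2.
    for row in layout([("P", "D", 5), ("p", "d", 1), ("b", "k", 3)], 5, 2):
        print(" ".join(row))
```

Fed with blocks of 5, 1 and 3 data sectors, this reproduces the rows
of the raidz2 example, which also shows why the 1-sector block still
costs 3 sectors on disk.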

Also, for each allocated block there is a separate metadata block
which references it and contains the checksum of the block's data.
Matching the data (and if needed, permutations of parity and data)
against the checksum makes it possible to determine that a block was
read in without errors - and if there were errors, to possibly fix
them.
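
As an illustration of "matching permutations of parity and data to the
checksum", here is a deliberately simplified sketch. It assumes a
single XOR parity sector (closer to raidz1 than raidz2) and SHA-256 as
the checksum; both are my choices for the example, and the function
names are hypothetical. The idea is only that the checksum kept in the
referencing metadata tells us which reconstruction attempt is right:

```python
import hashlib
from functools import reduce


def xor(sectors):
    """Byte-wise XOR of equally sized sectors."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*sectors))


def checksum(data_sectors):
    """Checksum of the block's data, as the parent metadata would store it."""
    return hashlib.sha256(b"".join(data_sectors)).digest()


def read_block(data_sectors, parity, expected):
    """Return verified data sectors, repairing one bad sector if possible."""
    if checksum(data_sectors) == expected:
        return list(data_sectors)
    # Assume each data sector in turn is the bad one, rebuild it from the
    # others plus parity, and keep the first candidate the checksum accepts.
    for i in range(len(data_sectors)):
        others = [s for j, s in enumerate(data_sectors) if j != i]
        candidate = list(data_sectors)
        candidate[i] = xor(others + [parity])
        if checksum(candidate) == expected:
            return candidate
    raise IOError("block is damaged beyond what single parity can repair")


if __name__ == "__main__":
    data = [b"AAAA", b"BBBB", b"CCCC"]
    parity, expected = xor(data), checksum(data)
    damaged = [data[0], b"BxBB", data[2]]  # silent corruption on one disk
    print(read_block(damaged, parity, expected) == data)
```

Note that the disk never reported an error here; only the checksum
mismatch reveals the damage, and only the checksum confirms the fix.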

So, returning to your questions: again, the generic solution seems
to somehow be the BPR, which would reallocate blocks in a different
layout and in a manner transparent to other layers of the stack.

Here are some possible complications:
* By adding or removing disks in a raidzN set (regardless of changes
   in redundancy) you effectively change the offset numbering, whose
   validity relies on the number and size of component disks - if the
   offset math breaks, you can no longer address and locate the blocks
   saved on this disk set (there is no simple rule to change like
   "divisible by N").
* By changing the redundancy level (raidz2 to raidz3 or back, for
   example), you'd need to rewrite all blocks to include or exclude
   the extra parity sectors.
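
Both complications can be shown with toy arithmetic. The sketch below
assumes the same simplified round-robin offset numbering and "parity
prepended to each set of data sectors" model used earlier in this
mail; the helper names are invented for the example and none of this
is the real ZFS math:

```python
import math


def locate(offset, ndisks):
    """Simplified striping: offset -> (row, disk)."""
    return divmod(offset, ndisks)


def moved(offsets, old_ndisks, new_ndisks):
    """Offsets whose (row, disk) location changes with the disk count."""
    return [o for o in offsets
            if locate(o, old_ndisks) != locate(o, new_ndisks)]


def parity_sectors(data_sectors, ndisks, nparity):
    """Parity sectors needed for one block, ignoring padding."""
    return nparity * math.ceil(data_sectors / (ndisks - nparity))


if __name__ == "__main__":
    # Adding a sixth disk: every offset past the first row now resolves
    # to a different place, so existing block addresses become invalid.
    print(moved(range(10), 5, 6))
    # raidz2 -> raidz3 on the same 5 disks: a 5-sector block needs a
    # different amount of parity, so the block has to be rewritten.
    print(parity_sectors(5, 5, 2), parity_sectors(5, 5, 3))
```

In this model only the very first row keeps its addresses when a disk
is added, and the parity footprint of the same block more than doubles
when going from raidz2 to raidz3 on 5 disks.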

A far-fetched "custom" solution to address this problem might be
to expand the concept of replacement of faulted disks: here we'd
virtually add a TLVDEV (top-level VDEV) with new settings (raidzN,
number of disks) to the pool and reuse some of the older disks. The
old TLVDEV would be locked for writes, and its known free sectors, or
perhaps ranges like metaslabs, would become those available on the new
TLVDEV.

Likely, the combined TLVDEV should inherit the GUID of the old one,
so that the DVA "naming" of the blocks would remain coherent in the
pool, and so that the emptied old TLVDEV can ultimately be released
and forgotten.

This way the writes incoming to this combined TLVDEV would be saved
in the new layout, while reads can be satisfied by both layouts
with appropriate mapping added in the zfs layers.

Still, to move the blocks from the old TLVDEV layout onto the new one,
as well as to move away extra blocks if the new TLVDEV is short on
free space (or to clear up whole metaslabs), some active mechanism
would be needed - and many people would point their fingers at BPR
again ;) (Though this could probably be a specialized scrub/resilver
algorithm).
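
To show what I mean by a combined TLVDEV, here is a toy model. It is
purely hypothetical: the class and method names are invented, the two
layouts are plain dictionaries keyed by a block address, and nothing
like this exists in ZFS as far as I know. It only demonstrates the
three behaviours discussed: writes go to the new layout, reads are
satisfied from either, and some active mover drains the old one.

```python
class CombinedVdev:
    """Toy model: a write-locked old layout plus a new layout, one ID."""

    def __init__(self, old_blocks):
        self.old = dict(old_blocks)  # address -> data, locked for writes
        self.new = {}                # address -> data, takes all writes

    def write(self, address, data):
        # Incoming writes are always saved in the new layout.
        self.new[address] = data

    def read(self, address):
        # Reads can be satisfied by both layouts; the new one wins.
        if address in self.new:
            return self.new[address]
        return self.old[address]

    def migrate(self, address):
        """Move one block from the old layout into the new one."""
        if address in self.old:
            data = self.old.pop(address)
            self.new.setdefault(address, data)

    def old_is_empty(self):
        # Once this is true the old layout could be released.
        return not self.old


if __name__ == "__main__":
    vdev = CombinedVdev({0: b"old-a", 1: b"old-b"})
    vdev.write(2, b"new-c")
    for address in (0, 1):
        vdev.migrate(address)
    print(vdev.old_is_empty(), vdev.read(1))
```

The interesting part in a real implementation would of course be the
mover itself and keeping block addresses valid while it runs, which
this model simply assumes away.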

I don't want to discourage the enthusiasts by saying something is
impossible, nor to postpone the many long-awaited features until BPR
appears (and AFAIK nobody is working on it now). But it is fair to
have you know the possible problems in advance - technical and
otherwise ;) I'd love to see someone stir up the community and
complete the grand project ;)

//Jim Klimov
