[OpenIndiana-discuss] ZFS; what the manuals don't say ...

Sun Oct 28 12:42:48 UTC 2012

On 2012-10-25 12:21, Jim Klimov wrote:
> 2012-10-24 15:17, Robin Axelsson wrote:
>> On 2012-10-23 20:06, Jim Klimov wrote:
>>> 2012-10-23 19:53, Robin Axelsson wrote:
> ...
>> But if I do send/receive to the same pool I will need to have enough
>> free space in it to fit at least two copies of the dataset I want to
>> reallocate.
>
> Likewise with "reallocation" of files - though the unit of required
> space would be smaller.
>
>
>> It seems like what ZFS is missing here is a good defrag tool.
>
> This was discussed several times, with the outcome being that with
> ZFS's data allocation policies, there is no one good defrag policy.
> The two most popular options are about storing the current "live"
> copy of a file contiguously (as opposed to its history of released
> blocks only referenced in snapshots) vs. storing pool blocks in
> ascending creation-TXG order (to arguably speed up scrubs and
> resilvers, which can consume a noticeable portion of performance
> doing random IO).
>
> Most ordinary users think that the first goal is good - however,
> if you add clones and dedup into the equation, it might never be
> possible to retain their benefits AND store all files contiguously.
>
> Also, as with other matters of moving blocks around in the allocation
> areas transparently to other layers of the system (that is, on a
> live system that actively does I/O while you defrag data), there are
> some other problems that I'm not very qualified to speculate about,
> which are deemed solvable by a generic BPR (block pointer rewrite).
>
> Still, I do think that many of the problems postponed until BPR
> arrives can be solved by other methods, with limitations (such as
> off-line mangling of data on the pool) that might still be
> acceptable for some use cases.
>
> All-in-all, the main intended usage of ZFS is on relatively powerful
> enterprise-class machines, where much of the needed data is cached
> on SSD or in huge RAM, so random HDD IO lags become less relevant.
> This situation is most noticeable with deduplication, which in the ZFS
> implementation requires vast resources just to function.
> With market prices going down over time, it is more likely to see
> home-NAS boxes tomorrow similarly spec'ed to enterprise servers of
> today, than to see the core software fundamentally revised and
> rearchitected for boxes of yesterday. After all, even in the
> open-source world, developers need to eat and feed their families,
> so commercial applicability does matter and does influence the
> engineering designs and trade-offs.
>
>

Actually I have refrained from using dedup and compression as I somehow 
feel that the risks and penalties that come with them outweigh the 
benefits, at least for my purposes.

It should also be noted that in a lot of defrag software out there you 
have the option to choose different policies according to which you can 
reorganize the data on a disk. A tool that continuously monitors disk 
activity could give information as to what policy would be optimal for a 
particular disk configuration.

As I understand it, fragmentation occurs when you remove files and rewrite 
files to a storage pool a large enough number of times. So if I have a 
fragmented storage pool and copy all files to a new empty storage pool, 
the new storage pool would not be fragmented. So I take it what you are 
saying is that the way the files are organized in the new storage pool 
is sometimes *worse* than the way they are organized in the old 
fragmented storage pool?

>
>> It would be interesting to know how you convert a raidz2 stripe to, say,
>> a raidz3 stripe. Let's say that I'm on a raidz2 pool and want to add an
>> extra parity drive by converting it to a raidz3 pool. I'm imagining
>> that would be like creating a raidz1 pool on top of the leaf vdevs that
>> constitute the raidz2 pool and the new leaf vdev, which results in an
>> additional parity drive. It doesn't sound too difficult to do that.
>> Actually, this way you could even get raidz4 or raidz5 pools. The
>> question is, though, how things would pan out performance-wise; I would
>> imagine that a 55 drive raidz25 pool is really taxing on the CPU.
>>
>> Going from raidz3 to raidz2 or from raidz2 to raidz1 sounds like a
>> no-brainer; you just remove one drive from the pool and force zpool to
>> accept the new state as "normal".
>>
>> But expanding a raidz pool with additional storage while preserving the
>> parity structure sounds a little bit trickier. I don't think I have the
>> knowledge to write a block pointer rewriter, although I'm reading
>> Solaris Internals right now ;)
>
> Read also the ZFS On-Disk Specification (the one I saw is somewhat
> outdated, being from 2006, but most concepts and data structures
> are the foundation - expected to remain in place and be expanded upon).
>
> In short, if I got that all right, the leaf components of a top-level
> VDEV are striped upon creation and declared as an allocation area with
> its ID and monotonic offsets of sectors (and further subdivided into
> a couple hundred metaslabs to reduce seeking). For example, on a 5-disk
> array the offsets of the pooled sectors might look like this:
>
>   0  1  2  3  4
>   5  6  7  8  9
> ...
>
> (For the purposes of offset numbering, sector size is 512 bytes - even
> on 4 KB-sectored disks; I am not sure how that's processed in the
> address math - likely the ashift value helps pick the specific disk's
> number).
>
> Then when a piece of data is saved by the OS (kernel metadata or
> userland userdata), this is logically combined into a block, processed
> for storage (compressed, etc.) and depending on redundancy level some
> sectors with parity data are prepended to each set of data sectors.
> For a raidz2 of 5 disks you'd have 2 parity sectors (P, p, b) and up to
> 3 data sectors (D, d, k) as in the example below:
>
>   P0 P1 D0 D1 D2
>   P2 P3 D3 D4 p0
>   p1 d0 b0 b1 k0
>   k1 k2 ...
>
> ZFS allocates only as many sectors as are needed to store the
> redundancy and data for the block, so the data (and holes after
> removal of data) are intermixed less predictably than they would
> be with traditional full-stripe RAID5/6. Still, this
> does allow recovery from the loss of N drives in a raidzN set
> while minimizing read-modify-write problems of traditional RAID.
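The sector diagram above can be reproduced with a short sketch. It follows the simplified picture in this mail (parity sectors prepended to each group of up to 3 data sectors, everything packed back to back), not the actual raidz allocator, and the `layout` function and its label tuples are made up for illustration:

```python
def layout(blocks, ndisks=5, nparity=2):
    """Pack blocks as drawn above: for every group of up to
    (ndisks - nparity) data sectors, prepend nparity parity sectors.

    blocks is a list of (parity_label, data_label, n_data_sectors).
    Returns rows of sector labels, one row per pass over the disks.
    """
    per_group = ndisks - nparity
    sectors = []
    for plabel, dlabel, ndata in blocks:
        pi = di = 0
        while di < ndata:
            for _ in range(nparity):
                sectors.append(f"{plabel}{pi}")
                pi += 1
            for _ in range(min(per_group, ndata - di)):
                sectors.append(f"{dlabel}{di}")
                di += 1
    return [sectors[i:i + ndisks] for i in range(0, len(sectors), ndisks)]


if __name__ == "__main__":
    # Three blocks of 5, 1 and 3 data sectors, labelled as in the diagram.
    for row in layout([("P", "D", 5), ("p", "d", 1), ("b", "k", 3)]):
        print(" ".join(row))
```

Note how the one-sector block (d0) still costs two parity sectors, which is why small blocks are relatively expensive on raidz.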
>
> Also for each allocated block there is a separate metadata block
> which references it and contains the checksum of the block's data.
> Matching the data (and if needed, permutations of parity and
> data) against the checksum makes it possible to determine that a
> block was read in without errors - and if there were errors, to
> possibly fix them.
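To illustrate the "try permutations until the checksum matches" idea, here is a rough sketch under simplifying assumptions: a single XOR parity sector (raidz1-like, whereas raidz2/3 need more elaborate parity math), SHA-256 as the checksum, and hypothetical helper names throughout:

```python
import hashlib
from functools import reduce


def xor(a, b):
    """Byte-wise XOR of two equally sized sectors."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity_of(sectors):
    """Single XOR parity over all data sectors."""
    return reduce(xor, sectors)


def read_block(sectors, parity, checksum):
    """Return verified block data, rebuilding one bad sector if needed.

    checksum is the SHA-256 digest kept in the parent metadata block.
    """
    def matches(candidate):
        return hashlib.sha256(b"".join(candidate)).digest() == checksum

    if matches(sectors):
        return b"".join(sectors)
    # We do not know which sector is bad, so assume each one in turn,
    # rebuild it from parity plus the others, and re-check the checksum.
    for i in range(len(sectors)):
        others = sectors[:i] + sectors[i + 1:]
        rebuilt = reduce(xor, others, parity)
        candidate = sectors[:i] + [rebuilt] + sectors[i + 1:]
        if matches(candidate):
            return b"".join(candidate)
    raise OSError("block unrecoverable: no combination matches the checksum")
```

The point of the design is that the checksum lives outside the block it protects, so a silently corrupted sector cannot vouch for itself.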
>
> So, returning to your questions: again, the generic solution seems
> to somehow be the BPR, which would reallocate blocks in a different
> layout and in a manner transparent to other layers of the stack.
>
> Here are some possible complications:
> * By adding or removing disks in a raidzN set (regardless of changes
>   in redundancy) you effectively change the offset numbering, whose
>   validity relies on the number and size of component disks - if the
>   offset math breaks, you can no longer address and locate the blocks
>   saved on this disk set (there is no simple rule to change like
>   "divisible by N").
> * By changing the redundancy level (raidz2 to raidz3 or back, for
>   example), you'd need to rewrite all blocks to include or exclude
>   the extra parity sectors.
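The first complication can be shown with a toy model. Assuming the same simplified round-robin numbering as in the 5-disk table earlier (real raidz address math is more involved), this hypothetical `unmoved` helper lists the offsets that would still resolve to the same row and disk after the disk count changes:

```python
def placement(offset, ndisks):
    """(row, disk) under the simplified round-robin numbering."""
    return divmod(offset, ndisks)


def unmoved(noffsets, old_disks, new_disks):
    """Offsets whose (row, disk) is identical under both disk counts."""
    return [
        off for off in range(noffsets)
        if placement(off, old_disks) == placement(off, new_disks)
    ]


if __name__ == "__main__":
    # Going from 5 to 6 disks, only the first row keeps its addresses.
    print(unmoved(30, 5, 6))
```

In this model everything past the first row would have to be physically moved or remapped, which is the job people expect BPR to do.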
>
> A far-fetched "custom" solution to address this problem might be
> to expand the concept of replacement of faulted disks: here we'd
> virtually add a TLVDEV with new settings (raidzN, number of disks)
> to the pool and reuse some of the older disks. The old TLVDEV would
> be locked for writes, and its known free sectors, or perhaps ranges
> like metaslabs, would become those available on the new TLVDEV.
>
> Likely, the combined TLVDEV should inherit the GUID of the old one,
> so that the DVA "naming" of the blocks would remain coherent in the
> pool, and so that the emptied old TLVDEV can ultimately be released
> and forgotten.
>
> This way the writes incoming to this combined TLVDEV would be saved
> in the new layout, while reads can be satisfied by both layouts
> with appropriate mapping added in the zfs layers.
>
> Still, to move the blocks from the old TLVDEV layout onto the new one,
> as well as to move away extra blocks if the new TLVDEV is short on
> free space (or to clear up whole metaslabs), some active mechanism would
> be needed - and many people would point their fingers at BPR again ;)
> (Though this could probably be a specialized scrub/resilver algo).
>
> I don't want to discourage the enthusiasts by saying something is
> impossible, nor to postpone the many long-awaited features until BPR
> appears (and AFAIK nobody is now working on it). But it is fair to
> let you know the possible problems in advance - technical and
> otherwise ;) I'd love to see someone stir up the community and
> complete the grand project ;)

I'll read the ZFS On-Disk Specification when I have finished reading 
Solaris Internals. Then we can talk further about this...

>
> HTH,
> //Jim Klimov