[OpenIndiana-discuss] ZFS; what the manuals don't say ...

Wed Oct 24 11:17:22 UTC 2012

On 2012-10-23 20:06, Jim Klimov wrote:
> 2012-10-23 19:53, Robin Axelsson wrote:
>> That sounds like a good point, unless you first scan for hard links and
>> avoid touching the files and their hard links in the shell script, I 
>> guess.
>
> I guess the idea about reading into memory and writing back into the 
> same file (or "cat $SRC > /var/tmp/$SRC && cat /var/tmp/$SRC > $SRC"
> to be on the safer side) should take care of hardlinks, since the
> inode would stay the same. You should of course ensure that nobody
> uses the file in question (i.e. databases are down, etc). You can
> also keep track of "rebalanced" inode numbers to avoid processing
> hardlinked files more than once.
>
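A minimal sketch of the in-place rewrite described above, assuming a POSIX shell and that all files live in one dataset (inode numbers are only unique within a filesystem). The function name rewrite_in_place and the variables SEEN and TMPDIR_RB are made up for this example. It is an illustration rather than a tested rebalancing tool, and nothing else may have the files open while it runs.

```shell
#!/bin/sh
# Sketch: rewrite files in place so their blocks get reallocated,
# visiting each inode only once so hardlinked files are not processed twice.
# SEEN and TMPDIR_RB are hypothetical names chosen for this example.

SEEN=$(mktemp)    # list of inode numbers already handled

rewrite_in_place() {
    src=$1
    stage=${TMPDIR_RB:-/var/tmp}    # staging area; needs room for the largest file
    # The first field of "ls -i" output is the inode number.
    inode=$(ls -i -- "$src" | awk '{print $1}')
    if grep -qx -- "$inode" "$SEEN"; then
        return 0    # another hardlink to this inode was already rewritten
    fi
    tmp=$(mktemp "$stage/rebalance.XXXXXX") || return 1
    # Copy out, then write back into the SAME file: "cat > file" truncates
    # and refills the existing inode instead of creating a new one.
    if cat -- "$src" > "$tmp" && cat -- "$tmp" > "$src"; then
        echo "$inode" >> "$SEEN"
        rm -f -- "$tmp"
    else
        echo "rewrite failed for $src; staged copy kept at $tmp" >&2
        return 1
    fi
}
```

The function would be called once per regular file, for example from a loop over the output of find with -xdev and -type f. Blocks that are still referenced by snapshots are not freed by the rewrite, so space use grows until those snapshots are destroyed.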
> ZFS send/recv should also take care of these things, and with
> sufficient space in the pool to ensure "even" writes (i.e. just
> after expansion with new VDEVs) it can be done within the pool if
> you don't have a spare one. Then you can ensure all needed "local"
> dataset properties are transferred, remove the old dataset and
> rename the new copy to its name (likewise for hierarchies of
> datasets).
>
But if I do send/receive to the same pool I will need to have enough 
free space in it to fit at least two copies of the dataset I want to 
reallocate.
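For reference, the send/receive sequence within one pool could look roughly like the sketch below. It only prints the commands rather than running them, so the plan can be reviewed first. The function name plan_rebalance, the snapshot name "rebalance" and the ".new"/".old" suffixes are assumptions made for this example, and the pool needs enough free space for the second copy until the old dataset is destroyed.

```shell
#!/bin/sh
# Sketch: print (not run) the commands for copying a dataset within its own pool.
# "rebalance", ".new" and ".old" are hypothetical names for this example.

plan_rebalance() {
    ds=$1                  # e.g. mypool/data
    snap="$ds@rebalance"
    cat <<EOF
zfs snapshot -r $snap
zfs send -R $snap | zfs receive -u $ds.new
zfs get -r -s local all $ds
zfs get -r -s local all $ds.new
zfs rename $ds $ds.old
zfs rename $ds.new $ds
zfs destroy -r $ds.old
EOF
}
```

Running plan_rebalance mypool/data prints seven lines. The two "zfs get -s local" lines are there to compare locally set properties by eye before the old copy is renamed and destroyed. The -u on receive keeps the copy unmounted, since a replication stream carries the mountpoint property along with it.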
>>
>> But I heard that a pool that is almost full has some performance
>> issues, especially when you try to delete files from that pool. But
>> maybe this becomes a non-issue once the pool is expanded by another 
>> vdev.
>
> This issue may remain - basically, when a pool is nearly full (YMMV,
> empirically over 80-90% for pools with many write-delete cycles,
> but there were reports of even 60% full being a problem), its block
> allocation may look like good cheese with many tiny holes. Walking
> the free space to find a hole big enough to write a new block takes
> time, hence the slowdown. When you expand the pool with a new vdev,
> the old full cheesy one does not go away, and writes that the ZFS
> pipeline intended to put there would still lag (and may now time out and
> may get to another vdev, as someone else mentioned in this thread).
>
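A small sketch for keeping an eye on the fill level mentioned above. It assumes input in the form "name<TAB>NN%" per line, which is what "zpool list -H -o name,capacity" prints. The function name warn_full_pools is made up for this example, and the default of 80 only mirrors the empirical figure quoted above; it is not a hard limit.

```shell
#!/bin/sh
# Sketch: flag pools at or above a fill threshold.
# Assumed input: one "name<TAB>NN%" line per pool on stdin.

warn_full_pools() {
    limit=${1:-80}
    awk -v limit="$limit" '{
        pct = $2
        sub(/%$/, "", pct)    # strip the trailing percent sign
        if (pct + 0 >= limit + 0)
            printf "%s is %d%% full\n", $1, pct
    }'
}

# Usage on a live system:
#   zpool list -H -o name,capacity | warn_full_pools 80
```

This reports the capacity of the pool as a whole. After a pool has been expanded with a new vdev, the older vdevs can still be much fuller than the pool-wide figure suggests.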

It seems like what zfs is missing here is a good defrag tool.

>
> To answer your other letters,
>
> > But if I have two raidz3 vdevs, is there any way to create an
> > isolation/separation between them so that if one of them fails, only
> > the data that is stored within that vdev will be lost and all data
> > that happen to be stored in the other can be recovered? And yet let
> > them both be accessible from the same path?
> >
> > The only thing that needs to be sorted out is where the files should go
> > when you write to that path and avoid splitting such that one half of
> > the file goes to one vdev and another goes to the other vdev. Maybe
> > there is some disk or i/o scheduler that can handle such operations?
>
> You can't do that. A pool is one whole: you also can't remove vdevs
> from it, and you can't change or reduce a raidzN group's redundancy
> (maybe that will change after the long-awaited BPR, the block-pointer
> rewriter, is implemented by some kind samaritan). As soon as a pool
> is set up or expanded, all writes are striped across all components,
> and none of the top-level components may have failed if the pool is
> to be imported and used.
>
It would be interesting to know how you would convert a raidz2 stripe to,
say, a raidz3 stripe. Let's say that I'm on a raidz2 pool and want to add
an extra parity drive by converting it to a raidz3 pool. I'm imagining
that would be like creating a raidz1 pool on top of the leaf vdevs that
constitute the raidz2 pool plus the new leaf vdev, which results in an
additional parity drive. It doesn't sound too difficult to do that.
Actually, this way you could even get raidz4 or raidz5 pools. The question
is how things would pan out performance-wise, though; I would imagine
that a 55 drive raidz25 pool is really taxing on the CPU.

Going from raidz3 to raidz2 or from raidz2 to raidz1 sounds like a 
no-brainer; you just remove one drive from the pool and force zpool to 
accept the new state as "normal".

But expanding a raidz pool with additional storage while preserving the
parity structure sounds a little bit trickier. I don't think I have the
knowledge to write a block-pointer rewriter, although I'm reading Solaris
Internals right now ;)

> > I can't see how a dataset can span over several zpools as you usually
> > create it with mypool/datasetname (in the case of a file system
> > dataset). But I can see several datasets in one pool though (e.g.
> > mypool/dataset1, mypool/dataset2 ...). So the relationship I see is
> > pool *onto* dataset.
>
> It can't. A dataset is contained in one pool. Many datasets can
> be contained in one pool and share the free space, dedup table and
> maybe some other resources. Datasets contained in different pools
> are unrelated.
>
> > But if I have two separate pools with separate names, say mypool1 and
> > mypool2 I could create a zfs file system dataset with the same name in
> > each of these pools and then give these two datasets the same
> > "mountpoint" property couldn't I? Then they would be forced to be
> > mounted to the same path.
>
> One at a time - yes. Both at once (in a useful manner) - no.
> If the mountpoint is not empty, zfs refuses to mount the dataset.
> Even if you force it to (using an overlay mount, -O), the last mounted
> dataset's filesystem will be all you'd see.
>
> You can however mount other datasets into logical "subdirectories"
> of the dataset you need to "expand", but those subs must be empty
> or nonexistent in your currently existing "parent" dataset. Also
> the new "children" are separate filesystems, so it is your quest
> to move data into them if you need to free up the existing dataset,
> and in particular remember that inodes of different filesystems
> are unrelated, so hardlinks will break for those files that would
> be forced to split from one inode in the source filesystem to
> several inodes (i.e. some pathnames in the source FS and some in
> the child) - like for any other FS boundary crossings.
>
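Before moving a subtree into a new child dataset, it can help to list the files that have more than one hardlink, since those are the ones that may end up split across the filesystem boundary. A minimal sketch using only POSIX find options; the function name list_multilinked is made up for this example.

```shell
#!/bin/sh
# Sketch: list regular files under a directory that have more than one name.
# Output is "inode path", sorted by inode so links to the same file sit together.

list_multilinked() {
    dir=$1
    # -xdev keeps find on one filesystem; -links +1 selects files whose
    # link count is greater than one.
    find "$dir" -xdev -type f -links +1 -exec ls -i {} + | sort -n
}
```

If an inode shows up with fewer names in this listing than its link count, the remaining names live outside the subtree, and moving the subtree to another dataset would turn that one file into separate copies.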
>> * Can several datasets be mounted to the same mount point, i.e. can 
>> multiple "file system"-datasets be mounted so that they (the root of 
>> them) are all accessed from exactly the same (POSIX) path and 
>> subdirectories with coinciding names will be merged? The purpose of 
>> this would be to seamlessly expand storage capacity this way just 
>> like when adding vdevs to a pool.
>
> What you describe here is known as unionfs in Linux, among others.
> I think there were RFEs or otherwise expressed desires to make that
> in Solaris and later illumos (I did campaign for that some time ago),
> but AFAIK this was not yet done by anyone.
>
YES, UnionFS-like functionality is what I was talking about. It seems
like it has been abandoned in favor of AuFS in the Linux and BSD
worlds. It seems to have functions that are a little overkill to use with
zfs, such as copy-on-write. Perhaps a simpler implementation of
it would be more suitable for zfs.

Perhaps a similar functionality can be established through an 
abstraction layer behind network shares.

In Windows this functionality is called 'disk pooling', btw.
>> * If that's the case how will the data be distributed/allocated over 
>> the datasets if I copy a data file to that path?
> N/A.
>
>
> HTH,
> //Jim
>
> .
>