[OpenIndiana-discuss] ZFS; what the manuals don't say ...

Tue Oct 23 18:06:21 UTC 2012

2012-10-23 19:53, Robin Axelsson wrote:
> That sounds like a good point, unless you first scan for hard links and
> avoid touching the files and their hard links in the shell script, I guess.

I guess the idea about reading into memory and writing back into the
same file (or "cat $SRC > /var/tmp/$SRC && cat /var/tmp/$SRC > $SRC"
to be on the safer side) should take care of hardlinks: the file is
overwritten in place rather than replaced, so the inode stays the
same. You should of course ensure that nobody uses the file in
question (e.g. databases are down, etc). You can also keep track of
"rebalanced" inode numbers to avoid processing hardlinked files more
than once.
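
A rough sketch of the above, in case it helps. The function name
rebalance_tree and the scratch directory argument are made up for
illustration; it assumes that nothing else has the files open, that
the scratch directory has room for the largest file, and that file
names contain no newlines. If the write-back fails, the scratch copy
is left in place so the data can be recovered by hand.

```shell
#!/bin/sh
# Sketch only: rewrite every regular file under a directory in place,
# so the inode (and any hard links to it) stays the same.
# rebalance_tree is a hypothetical helper, not an existing tool.
rebalance_tree() {
    dir=$1
    scratch=${2:-/var/tmp}
    list=$(mktemp) || return 1    # file names to process
    seen=$(mktemp) || return 1    # inode numbers already rewritten
    tmp=$(mktemp "$scratch/rebalance.XXXXXX") || return 1

    # -xdev keeps find from descending into other mounted filesystems
    find "$dir" -xdev -type f -print > "$list"

    while IFS= read -r f; do
        inode=$(ls -i "$f" | awk '{print $1}')
        # A hard link to an already rewritten inode: skip it
        if grep -qx "$inode" "$seen"; then
            continue
        fi
        echo "$inode" >> "$seen"
        # Copy out, then write back into the same file
        if cat "$f" > "$tmp" && cat "$tmp" > "$f"; then
            echo "rewrote $f (inode $inode)"
        else
            echo "FAILED on $f; copy kept in $tmp" >&2
            rm -f "$list" "$seen"
            return 1
        fi
    done < "$list"

    rm -f "$list" "$seen" "$tmp"
}
```

Usage would be something like "rebalance_tree /export/data /var/tmp".
Note that any snapshots of the dataset still reference the old blocks,
so those are not freed by the rewrite.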

ZFS send/recv should also take care of these things, and with
sufficient space in the pool to ensure "even" writes (e.g. just
after expansion with new VDEVs) it can be done within the pool if
you don't have a spare one. Then you can ensure all needed "local"
dataset properties are transferred, remove the old dataset and
rename the new copy to its name (likewise for hierarchies of
datasets).
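
The send/recv route could look roughly like the sequence below. To
keep it harmless, this sketch only prints the commands for review
instead of running them. The function name print_rebalance_plan, the
snapshot name "rebalance" and the ".new" suffix are my own
placeholders; it assumes the pool has room for a second copy of the
data.

```shell
#!/bin/sh
# Sketch only: print (do not run) a send/recv rebalance plan for one
# dataset hierarchy. print_rebalance_plan is a hypothetical helper.
print_rebalance_plan() {
    ds=$1
    snap="$ds@rebalance"    # placeholder snapshot name
    new="$ds.new"           # placeholder name for the fresh copy
    cat <<EOF
# 1. Snapshot the hierarchy and replicate it inside the same pool
zfs snapshot -r $snap
zfs send -R $snap | zfs receive -u $new
# 2. Compare locally set properties before destroying anything
zfs get -r -s local all $ds $new
# 3. Replace the old hierarchy with the new copy
zfs destroy -r $ds
zfs rename $new $ds
# 4. Clean up the helper snapshot and mount the result
zfs destroy -r $snap
zfs mount -a
EOF
}
```

For example, "print_rebalance_plan tank/data" prints the plan for a
made-up dataset tank/data. Step 3 is destructive, so step 2 deserves
a careful look first.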

>
> But I heard that a pool that is almost full have some performance
> issues, especially when you try to delete files from that pool. But
> maybe this becomes a non-issue once the pool is expanded by another vdev.

This issue may remain - basically, when a pool is nearly full (YMMV,
empirically over 80-90% for pools with many write-delete cycles,
but there were reports of even 60% full being a problem), its block
allocation may look like good cheese with many tiny holes. Walking
the free space to find a hole big enough to write a new block takes
time, hence the slowdown. When you expand the pool with a new vdev,
the old full cheesy one does not go away, and writes that the ZFS
pipeline intended to put there would still lag (and may now time out
and go to another vdev, as someone else mentioned in this thread).
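
To keep an eye on this, something like the filter below could flag
pools above a chosen fill level. It is only a sketch: the function
name flag_full_pools is made up, and the 80% default is just the
empirical figure mentioned above, not a hard limit. It expects
"name capacity" pairs on stdin, as printed by
"zpool list -H -o name,capacity".

```shell
#!/bin/sh
# Sketch only: read "poolname capacity%" lines on stdin and report
# pools at or above a threshold (default 80, an empirical guess).
# flag_full_pools is a hypothetical helper.
flag_full_pools() {
    limit=${1:-80}
    awk -v limit="$limit" '{
        cap = $2
        sub(/%/, "", cap)           # strip the percent sign
        if (cap + 0 >= limit)
            print $1 " is " $2 " full"
    }'
}
```

Usage: zpool list -H -o name,capacity | flag_full_pools 80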

To answer your other letters,

 > But if I have two raidz3 vdevs, is there any way to create an
 > isolation/separation between them so that if one of them fails, only the
 > data that is stored within that vdev will be lost and all data that
 > happen to be stored in the other can be recovered? And yet let them both
 > be accessible from the same path?
 >
 > The only thing that needs to be sorted out is where the files should go
 > when you write to that path and avoid splitting such that one half if
 > the file goes to one vdev and another goes to the other vdev. Maybe
 > there is some disk or i/o scheduler that can handle such operations?

You can't do that. A pool is one whole (you also can't remove vdevs
from it, and you can't change or reduce raidzN groups' redundancy;
maybe that will change after the long-awaited BPR = block-pointer
rewriter is implemented by some kind samaritan). As soon as it
is set up or expanded, all writes are striped across all components,
and no top-level component may be failed if you want to import the
pool and use it.

 > I can't see how a dataset can span over several zpools as you usually
 > create it with mypool/datasetname (in the case of a file system
 > dataset). But I can see several datasets in one pool though (e.g.
 > mypool/dataset1, mypool/dataset2 ...). So the relationship I see is pool
 > *onto* dataset.

It can't. A dataset is contained in one pool. Many datasets can
be contained in one pool and share the free space, dedup table and
maybe some other resources. Datasets contained in different pools
are unrelated.

 > But if I have two separate pools with separate names, say mypool1 and
 > mypool2 I could create a zfs file system dataset with the same name in
 > each of these pools and then give these two datasets the same
 > "mountpoint" property couldn't I? Then they would be forced to be
 > mounted to the same path.

One at a time - yes. Both at once (in a useful manner) - no.
If the mountpoint is not empty, zfs refuses to mount the dataset.
Even if you force it to (using an overlay mount, -O), the last mounted
dataset's filesystem will be all you'd see.

You can however mount other datasets into logical "subdirectories"
of the dataset you need to "expand", but those subs must be empty
or nonexistent in your currently existing "parent" dataset. Also
the new "children" are separate filesystems, so it is your quest
to move data into them if you need to free up the existing dataset,
and in particular remember that inodes of different filesystems
are unrelated, so hardlinks will break for those files that would
be forced to split from one inode in the source filesystem to
several inodes (i.e. some pathnames in the source FS and some in
the child) - like for any other FS boundary crossings.
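
Before moving a subtree into a child dataset, it may help to list
the files that have more than one link, so you know which ones could
end up split. A minimal sketch (the function name list_hardlinked is
made up; sorting by inode number puts names that share an inode next
to each other):

```shell
#!/bin/sh
# Sketch only: list regular files with more than one hard link under
# a directory, as "inode path" lines sorted by inode number.
# list_hardlinked is a hypothetical helper.
list_hardlinked() {
    dir=$1
    # -links +1 matches files whose link count is greater than one;
    # -xdev stays within the one filesystem
    find "$dir" -xdev -type f -links +1 -exec ls -i {} + | sort -n
}
```

Usage: list_hardlinked /export/data/subdir
Names that share an inode but would land on different sides of the
new filesystem boundary are the ones whose hardlinks will break.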

> * Can several datasets be mounted to the same mount point, i.e. can multiple "file system"-datasets be mounted so that they (the root of them) are all accessed from exactly the same (POSIX) path and subdirectories with coinciding names will be merged? The purpose of this would be to seamlessly expand storage capacity this way just like when adding vdevs to a pool.

What you describe here is known as unionfs in Linux, among others.
I think there were RFEs or otherwise expressed desires to make that
in Solaris and later illumos (I did campaign for that some time ago),
but AFAIK this was not yet done by anyone.

> * If that's the case how will the data be distributed/allocated over the datasets if I copy a data file to that path?
N/A.

HTH,
//Jim