[OpenIndiana-discuss] ZFS; what the manuals don't say ...

Doug Hughes doug at will.to
Tue Oct 23 15:23:20 UTC 2012


On 10/23/2012 11:08 AM, Robin Axelsson wrote:
> On 2012-10-23 15:41, Doug Hughes wrote:
>> On 10/23/2012 8:29 AM, Robin Axelsson wrote:
>>> Hi,
>>> I've been using zfs for a while, but some questions have remained 
>>> unanswered even after reading the documentation, so I thought I 
>>> would ask them here.
>>>
>>> I have learned that zfs pools can be expanded by adding vdevs. 
>>> Say that you have created a raidz3 pool named "mypool" with the 
>>> command
>>> # zpool create mypool raidz3 disk1 disk2 disk3 ... disk8
>>>
>>> you can expand the capacity by adding vdevs to it through the command
>>>
>>> # zpool add mypool raidz3 disk9 disk10 ... disk16
>>>
>>> The vdev that is added doesn't need to have the same raid/mirror 
>>> configuration or disk geometry, if I understand correctly. It will 
>>> merely be dynamically concatenated with the old storage pool. The 
>>> documentation says that it will be "striped", but it is not so clear 
>>> what that means if data is already stored in the old vdevs of the pool.
>>>
>>> Unanswered questions:
>>>
>>> * What determines _where_ the data will be stored on such a pool? 
>>> Will it fill up the old vdev(s) before moving on to the new one, or 
>>> will the data be distributed evenly?
>>> * If the old pool is almost full, an even distribution will be 
>>> impossible unless zpool rearranges/relocates existing data. Is that 
>>> what happens when a vdev is added?
>>> * Can the individual vdevs be read independently/separately? If say 
>>> the newly added vdev faults, will the entire pool be unreadable or 
>>> will I still be able to access the old data? What if I took a 
>>> snapshot before adding the new vdev?
>>>
>>> * Can several datasets be mounted to the same mount point, i.e. can 
>>> multiple "file system" datasets be mounted so that their roots are 
>>> all accessed from exactly the same (POSIX) path and subdirectories 
>>> with coinciding names are merged? The purpose of this would be to 
>>> seamlessly expand storage capacity, just like when adding vdevs to 
>>> a pool.
>>> * If that's the case, how will the data be distributed/allocated 
>>> over the datasets if I copy a data file to that path?
>>>
>>> Kind regards
>>> Robin.
>>
>> *) Yes, you can dynamically add more disks and zfs will just start 
>> using them.
>> *) zfs stripes across all vdevs as evenly as it can (see the zpool 
>> list sketch after this list).
>> *) As your old vdev gets full, zfs will only allocate blocks to the 
>> newer, less full vdev.
>> *) Since it's a stripe across vdevs (and they should all be raidz2 or 
>> better!), if one vdev fails, your filesystem will be unavailable. The 
>> vdevs are not independent unless you put them in separate pools.
>> *) You cannot have overlapping/mixed filesystems at exactly the same 
>> place; however, it is perfectly possible to have e.g. /export be on 
>> rootpool, /export/mystuff on zpool1 and /export/mystuff/morestuff be 
>> on zpool2, as sketched below.
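>>
>> A quick way to watch how data spreads across the vdevs (a sketch, 
>> reusing the "mypool" name from the commands above):
>>
>> # zpool list -v mypool          <- ALLOC/FREE broken out per vdev
>> # zpool iostat -v mypool 5      <- per-vdev I/O every 5 seconds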
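>>
>> And a minimal sketch of that nested-mountpoint layout (the dataset 
>> names are made up):
>>
>> # zfs create -o mountpoint=/export rootpool/export
>> # zfs create -o mountpoint=/export/mystuff zpool1/mystuff
>> # zfs create -o mountpoint=/export/mystuff/morestuff zpool2/morestuff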
>>
>> The unasked question is "If I wanted the vdevs to be equally 
>> balanced, could I?". The answer is a qualified yes. What you would 
>> need to do is reopen every single file, buffer it to memory, then 
>> write every block out again. We did this operation once. It means 
>> that all vdevs will have roughly the same block allocation when you 
>> are done.
>>
>>
> Do you happen to know how that's done in OI? Otherwise I would have to 
> move each file one by one to a location outside the dataset and then 
> move it back, or zfs send the dataset to another pool of at least 
> equal size and then zfs receive it back into the expanded pool.
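>
> Roughly, the send/receive route would be something like (a sketch; 
> "otherpool" and the snapshot name are just placeholders):
>
> # zfs snapshot -r mypool@evac
> # zfs send -R mypool@evac | zfs receive -F otherpool/mypool
>
> and then the same in the other direction once the pool has been 
> expanded.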

You don't have to move it; you just have to open each file, read it 
into memory, seek back to the beginning, and write it out again. 
Rewriting those blocks will take care of it since ZFS is copy-on-write. 
You will need to be wary of your snapshots during this process: since 
all files will be rewritten, you'll double your space consumption.

(basically a perl, python, or other similar script could do this)
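
An untested python sketch of that rewrite loop (assumptions: dedup is 
off, nothing else is writing the files, and there is enough free space 
for the copy-on-write rewrites; symlinks and special files are skipped):

#!/usr/bin/env python
# Rewrite every regular file under the given directory in place.
# Because ZFS is copy-on-write, each rewritten block gets reallocated,
# which spreads the data across all vdevs in the pool.
import os
import sys

CHUNK = 1 << 20  # rewrite in 1 MiB chunks to bound memory use

def rewrite(path):
    with open(path, 'r+b') as f:
        offset = 0
        while True:
            f.seek(offset)
            data = f.read(CHUNK)
            if not data:
                break
            f.seek(offset)
            f.write(data)  # same bytes, but COW allocates new blocks
            offset += len(data)

for root, dirs, files in os.walk(sys.argv[1]):
    for name in files:
        path = os.path.join(root, name)
        # plain files only; leave symlinks and anything special alone
        if os.path.isfile(path) and not os.path.islink(path):
            rewrite(path)

Run it once per filesystem, e.g. "python rewrite.py /tank/data", and 
keep an eye on free space and your snapshots as it goes.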



