[OpenIndiana-discuss] Inefficient zvol space usage on 4k drives

Dan Vatca dan.vatca at gmail.com
Thu Aug 8 13:30:25 UTC 2013


Hi,

I ran some tests a few months ago, similar to the ones you intend to make, and the space inefficiency clearly shows. The test data is not very "academic", but it proves the point you are discussing.
The data also shows that the current refreservation calculations are not accurate in the case of RAIDZ, as they do not take the physical block size (ashift) into account. For mirrors the behaviour is different.
I consider this a bug: there are conditions under which a "thick provisioned" volume will need to "refer" more space than the calculated refreservation - so it is not really "thick" after all. As far as I understand, the refreservation calculation is intended to be an upper bound on referred space.
One interesting additional point is that the same space inefficiency also shows up with ashift=9, and the inefficiency is greater when volblocksize is smaller than the physical block size (which is somewhat expected).
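
To make the RAIDZ case concrete, here is a rough sketch (Python, written from my reading of vdev_raidz_asize() in the ZFS source - treat it as an approximation, not the authoritative code) of how the space allocated for a single volblocksize-sized block grows with ashift:

    def raidz_asize(psize, ashift, ndisks, nparity):
        # Approximate on-disk allocation for one block of psize bytes on a
        # RAIDZ vdev; based on my reading of vdev_raidz_asize(), so the
        # exact kernel code may differ in details.
        sect = 1 << ashift
        data = -(-psize // sect)                            # ceil(psize / sect)
        parity = nparity * -(-data // (ndisks - nparity))   # parity sectors
        total = data + parity
        # RAIDZ rounds every allocation up to a multiple of (nparity + 1)
        # so that leftover gaps stay usable.
        total = -(-total // (nparity + 1)) * (nparity + 1)
        return total * sect

    # 8K volblocksize on a 6-disk RAIDZ1:
    print(raidz_asize(8192, 9, 6, 1))    # 10240 -> ~1.25x of 8K
    print(raidz_asize(8192, 12, 6, 1))   # 16384 -> 2x of 8K

If the refreservation is derived from volsize and parity without this per-block, ashift-dependent rounding, it will undershoot the space such a volume can actually refer to - which matches what my numbers show.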

PS: Sorry for the Excel format, but it was the best way to capture the data (and it is what I have at hand). If you need a different format, I will convert it for you.

On 8 Aug 2013, at 04:21, Steve Gonczi <gonczi at comcast.net> wrote:

> Hi Jim, 
> 
> This looks to me more like a rounding-up problem, especially looking at the 
> bug report quoted. The waste factor increases as the block size goes 
> down. It looks like it roughly fits the ratio of a block's nominal size to 
> its minimal on-disk footprint. 
> 
> For example, compressed blocks are variable size. 
> If a block compresses to some small but non-zero size, it would take 
> up the size of the smallest on-disk allocation unit. For an 8K block, 
> the smallest non-zero allocation could be 4K (vs. 512 bytes). 
> 
> A similar thing would happen with small files that take up less than a single 
> block's worth of bytes. ZFS alters the block size for these to closely match 
> the actual bytes stored. 
> A one-byte file would take up merely a single sector. For small files, 
> a 512-byte vs. 4K minimum allocation can make a big difference. 
> 
> If most of the blocks are compressed, or there are a lot of small files, 
> the 8K-vs-512-byte or 8K-vs-4K ratio pretty much predicts the doubling of 
> the on-disk footprint seen at an 8K block size. 
> 
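To put rough numbers on the rounding described above (a back-of-the-envelope only; the 700-byte and one-byte sizes are made-up illustrations, not measurements):

    def allocated(physical_bytes, ashift):
        # Smallest whole-sector allocation for a block whose compressed
        # (or actual) size is physical_bytes.
        sect = 1 << ashift
        return -(-physical_bytes // sect) * sect   # round up to sectors

    # An 8K block that compresses to 700 bytes:
    print(allocated(700, 9))     # 1024 with 512-byte sectors
    print(allocated(700, 12))    # 4096 with 4K sectors -> 4x the footprint

    # A one-byte file:
    print(allocated(1, 9), allocated(1, 12))   # 512 vs. 4096
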
> I do not see how the sector size could cause a similarly significant 
> increase in the on-disk footprint by making metadata storage inefficient. 
> 
> I presume when you are talking about metadata, you mean 
> the interior nodes (level > 0) of files. 
> 
> If a file is <= 3 blocks in size, it will not have any interior nodes. 
> Otherwise, the nodes are allocated one page at a time, as many as needed. 
> Metadata pages currently contain 128 block pointer structs (128 * 128 bytes == 16K). 
> This interior-node page size is independent of the file system's user-changeable 
> block size. 
> I do not believe that these pages are variable in size. 
> So a rough guesstimate would be: one 16K metadata page for every 128 blocks 
> in the file. 
> (Technically, there could be multiple levels of interior node pages, but the 128x 
> fanout is so aggressive that you can neglect those for an order-of-magnitude 
> rough guess.) 
> 
> 
> On average, metadata takes up less than 2% of the space needed by the user payload. 
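
For reference, the arithmetic behind that estimate (one 16K indirect page per 128 data blocks; this ignores deeper indirect levels, metadata compression and extra metadata copies, so treat it as an order-of-magnitude check only):

    BLKPTR = 128            # bytes per block pointer
    FANOUT = 128            # pointers per 16K indirect block
    IND = BLKPTR * FANOUT   # 16384 bytes of metadata per 128 data blocks

    for bs in (8 * 1024, 128 * 1024):
        print(f"{bs // 1024}K blocks: ~{100 * IND / (FANOUT * bs):.2f}% overhead")
    # 8K blocks:   ~1.56% -> consistent with the "less than 2%" figure
    # 128K blocks: ~0.10%
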
> 
> I am planning to play with 4K sectors and repeat the experiment mentioned; 
> I am curious what the performance and space-usage implications are when 
> file size and compression are taken into consideration. 
> 
> 
> Steve 
> 
> ----- Original Message -----
> Yes, I've had similar results on my rig and complained some time ago... 
> yet the ZFS world is moving toward ashift=12 as the default 
> (and it may ultimately be inevitable). I think the main problem is that 
> small userdata blocks involve a larger portion of metadata, which may 
> come in small blocks that don't fully cover a sector (supposedly these 
> should be aggregated into up-to-16K clusters, but evidently they are not 
> always). 
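
The same per-sector rounding applies to those small metadata blocks; for example, one that compresses down to 512 bytes (an illustrative size, not a measurement) fills its sector exactly at ashift=9 but occupies a whole 4K sector at ashift=12:

    def sector_roundup(nbytes, ashift):
        # Physical size of a small (metadata) block rounded up to whole sectors.
        sect = 1 << ashift
        return -(-nbytes // sect) * sect

    print(sector_roundup(512, 9))    # 512
    print(sector_roundup(512, 12))   # 4096 -> 8x more space for that block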
> 


