[OpenIndiana-discuss] Fwd: poor zfs compression ratio

Edward Ned Harvey openindiana at nedharvey.com
Wed Nov 2 20:25:47 UTC 2011


> From: Dan Swartzendruber [mailto:dswartz at druber.com]
> 
> Not to nitpick, but dedup isn't really compression in one significant
> respect.  e.g. you can have 3 copies of the same data chunk and it is only
> stored as one (effectively a compression ratio of 4:1), even if the data
> in question is uncompressible (due to already being compressed.)

Try this:
for f in a b c d ; do dd if=/dev/urandom of=$f bs=1k count=1 ; done
Now you have four files containing random data (uncompressible.)
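
If you want to sanity-check that those files really are uncompressible,
something like this should show it (a.gz should come out no smaller than
a, since gzip adds a small header to data it can't shrink):
gzip -c a > a.gz
ls -l a a.gz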

for ((i=0; i<100; i++)) ; do cat a b c d >> final ; done
Now you've taken uncompressible data, and repeated it a bunch of times.  The
result is compressible.
gzip final
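To see how much (if anything) that actually saved, gzip should be able to
report the ratio itself:
gzip -l final.gz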

I cannot say specifically whether gzip will handle this compression, because
I didn't bother actually running those commands on my system.  It all
depends on whether my 1k blocksize is larger or smaller than the scope of
the compression tables.  But I can say in principle it's compressible.  Some
algorithms (LZW, for example) use a lookup table, and if repeated
uncompressible patterns are detected, then the whole block of uncompressible
data gets stored in the table, and only a table index needs to be stored
in the compressed data stream.

lookup table... repeated data... just store the data once (or a small number
of times) and reference it multiple times...  Sound familiar?  Like what
they do in the DDT?
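
For what it's worth, on a dedup-enabled pool you can watch the DDT doing
exactly that ("tank" below is just a placeholder pool name):
zpool get dedupratio tank
zdb -DD tank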

If you look at how DEFLATE works (the algorithm behind zlib and gzip),
one of the techniques is duplicate string elimination.  De-duplication, one
might say.
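
That "scope of the compression tables" point is the real limit, though:
DEFLATE's back-references only reach about 32k into the past, so if the
repeats are spaced farther apart than that, gzip stops seeing them while
block-level dedup still does.  A variation on the earlier experiment
should show it (untested, same caveat as before):
dd if=/dev/urandom of=blk bs=64k count=1
for ((i=0; i<16; i++)) ; do cat blk >> big ; done
gzip -c big > big.gz
ls -l big big.gz
big.gz should end up barely smaller than big, even though the file is
sixteen copies of the same 64k block, while the same data on a
dedup-enabled dataset would collapse to roughly one copy.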

And so on.



