[OpenIndiana-discuss] RAM based devices as ZIL

Edward Ned Harvey (openindiana) openindiana at nedharvey.com
Fri Sep 20 12:34:21 UTC 2013


> From: Steve Gonczi [mailto:gonczi at comcast.net]
> 
> For a fast (high ingest rate) system, 4 GB may not be enough.
> 
> If your RAM device space is not sufficient to hold all the in-flight ZIL blocks,
> it will fail over, and the ZIL will just redirect to your main data pool.
> 
> This is hard to notice unless you have an idea of how much
> data should be flowing to your pool as you monitor it with
> zpool iostat.  Then you may notice the extra data being written to your
> data pool.
> 
> The calculation of how much ZIL space you need is not straightforward,
> because blocks are generally freed in a delayed manner.
> 
> In other words, it is possible that some ZIL blocks are no longer needed
> because the transactions they represent have already committed, but the blocks
> have not made it back to "available" status because of the conservative
> nature of the freed-block recycling algorithm.
> 
> Rule of thumb: 3 to 5 TXGs' worth of ingest, depending on who you ask.
> 
> Dedup and compression make SLOG sizing harder, because the ZIL is neither
> compressed nor deduped.  I would say if you dedup and/or compress,
> all bets are off.

The part that's missing from the above is a discussion of the size of the ZIL vs. the speed of the pool (and the nature of your writes).

If your pool, for example, is made of a single disk (or a single mirror), then the maximum speed you'll ever be able to write is about 1 Gbit/sec, or 1 GB in 8 seconds.  Even if you have a pool of hundreds of disks, each SSD you might consider using as a ZIL device has approximately the same limitation: roughly 1 GB in 8 seconds.  Maybe it's faster; call it 1 GB in 5 seconds, conveniently the same as the TXG flush interval.  So even with a 4x factor worked in, as Steve mentioned above, you're still only going to use about 4 GB on the ZIL device.  (But as he said, writing a lot of highly compressible sync writes might be a factor.)
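To put rough numbers on Steve's rule of thumb and the device speed together, here is a back-of-the-envelope sketch; the ingest rate, TXG interval, and retained-TXG count are illustrative assumptions, not measured values:

    # Worst-case SLOG occupancy = ingest rate x TXG interval x TXGs retained.
    # Every figure below is an assumption; substitute your own measurements.
    ingest_rate = 200e6     # bytes/sec the log device can absorb (~1 GB / 5 s)
    txg_interval = 5        # seconds between TXG flushes
    txgs_retained = 4       # 3-5 TXGs' worth, per the rule of thumb above

    slog_bytes = ingest_rate * txg_interval * txgs_retained
    print("Worst-case SLOG occupancy: %.1f GB" % (slog_bytes / 1e9))
    # -> Worst-case SLOG occupancy: 4.0 GB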

But even those numbers are unrealistic to attain.  I forget the name of the parameter, but there's an evil tuning parameter that says sync mode writes above a certain size will not go to the log device; they will immediately go into the next TXG and trigger an immediate flush.  This means the only sync mode writes that *actually* hit the log device are small, and your aggregate throughput to the device can never allow you to write anywhere near 4 GB to it.
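A sketch of why that cutoff keeps you far below 4 GB (the 32 KB cutoff and the IOPS figure are assumptions; the real tunable name and its default vary by platform and release):

    # Only sync writes at or below the cutoff ever land on the log device,
    # so the slog fill rate is bounded by small-write IOPS, not bandwidth.
    cutoff = 32 * 1024      # assumed bypass threshold for sync writes, bytes
    sync_iops = 10000       # assumed sustained small sync writes/sec
    window = 5              # seconds, one TXG interval

    max_slog_bytes = cutoff * sync_iops * window
    print("Max written to slog per TXG: %.2f GB" % (max_slog_bytes / 1e9))
    # -> Max written to slog per TXG: 1.64 GB, well short of the 4 GB bound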

Basically, this holds true unless your workload is specifically designed to violate it.  If you have a multithreaded infinite loop of really small, random, sync mode writes that are highly compressible, then you might hit the limit.
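Such a workload would look something like the sketch below; the pool path, thread count, and write sizes are all hypothetical:

    # Hypothetical stress test: many threads doing tiny synchronous writes
    # in a tight loop.  Zero-filled buffers are also highly compressible.
    import os
    import threading

    def hammer(path, size=4096, count=100000):
        # O_DSYNC makes every write synchronous, so each one hits the ZIL.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC)
        buf = b"\0" * size
        try:
            for _ in range(count):
                os.write(fd, buf)
        finally:
            os.close(fd)

    threads = [threading.Thread(target=hammer, args=("/tank/stress%d" % i,))
               for i in range(16)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()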

If your workload is serving NFS or iSCSI ... which are among the most demanding sync write services you can provide ... the bottleneck is going to be your network.  Even with InfiniBand or 10 Gbit Ethernet, most of the actual workload is going to be larger blocks, and it will be extremely tough to generate a sufficient volume of small sync writes to add up to more than 4 GB within the speed of the devices.
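The arithmetic bears that out (the payload rate and the small-sync-write share are assumptions for illustration):

    # 10 GbE moves at most ~1.25 GB/s of payload.  If only a modest share
    # of that is small sync writes, one 5-second TXG window sees:
    wire_rate = 1.25e9          # bytes/sec, 10 GbE theoretical ceiling
    small_sync_share = 0.10     # assumed fraction that is small sync writes
    txg_window = 5              # seconds

    print("%.2f GB" % (wire_rate * small_sync_share * txg_window / 1e9))
    # -> 0.62 GB per TXG window, nowhere near 4 GB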


