[OpenIndiana-discuss] What happens when a ZIL drive dies?

Mon Jun 4 16:03:46 UTC 2012

On Jun 4, 2012, at 8:48 AM, Jan Owoc wrote:
> On Mon, Jun 4, 2012 at 9:24 AM, Nick Hall <darknovanick at gmail.com> wrote:
>> I'm considering buying a separate SSD drive for my ZIL as I do quite a bit
>> over NFS and would like the latency to improve. But first I'm trying to
>> understand exactly how the ZIL works and what happens in case of a problem.
>> I'll list my understanding here, and I'm hoping someone can correct me if
>> I'm understanding this incorrectly:
> 
> The ZIL fixes latency with synchronous writes.

I believe you mean, "the slog can improve latency for synchronous writes."

> Do you have a workload
> that you can benchmark with the ZIL disabled to determine if it's
> indeed the ZIL that's slowing you down?
> http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_the_ZIL_.28Don.27t.29
> 
> (just remember to re-enable the ZIL after you are done benchmarking,
> if you care about data integrity)
> 
> 
>> - If the ZIL drive were to die while the system were running, I'm assuming
>> no data would be lost? In order for this to work, the system would need to
>> cache everything in the ZIL in RAM, so if the ZIL were to die, it would
>> write the transactions that were on the ZIL from RAM to the main pool
>> drives. Applications would not notice anything from their perspective. Is
>> this what happens?
> 
> The ZIL is sort of like a journal. Your application issues a "sync"
> and ZFS isn't supposed to return from the sync until the data actually
> makes it onto disk. With platters rotating etc., this can take tens of
> milliseconds. A ZIL on NVRAM (or an SSD) would allow this sync'ed data
> to hit the fast-write device, and the system call to return
> immediately. The data will also, as you'd read, make it to the disk in
> 5-30 seconds. Yes, a copy stays in the RAM, and it's this copy that is
> normally written (and not a copy re-read from the ZIL).

It is more appropriate to say the ZIL is like a database redo log. The term
"journal" is overloaded wrt file systems and can be confusing. Since the
core of ZFS is a transactional object store, the redo log notion fits better.
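
To make the redo log notion concrete, here is a toy model in Python. It is
only a sketch: the class and method names are hypothetical, and it is not how
ZFS is implemented. In particular, committing straight to the pool once the
log device is gone is a simplification of the toy. It shows three things from
the discussion above: the periodic commit is written from the copy in RAM,
the log is only read back after a crash, and data is lost only when RAM and
the log are both gone.

```python
class ToyPool:
    """Toy model of a transactional store with a redo log (not real ZFS)."""

    def __init__(self):
        self.pool = {}       # data committed to the main pool disks
        self.pending = {}    # in-RAM copy of writes not yet committed
        self.log = []        # redo log records on the separate log device
        self.log_ok = True   # False once the log device has failed

    def sync_write(self, key, value):
        # Keep the write in RAM and record it in the log before returning.
        self.pending[key] = value
        if self.log_ok:
            self.log.append((key, value))
        else:
            # Simplification of this toy: with no log device, commit to
            # the pool before acknowledging the write.
            self.commit()

    def commit(self):
        # Periodic transaction commit: written from RAM, not from the log.
        self.pool.update(self.pending)
        self.pending.clear()
        self.log.clear()     # committed records are no longer needed

    def log_device_dies(self):
        # The log records are gone, but the RAM copy is untouched.
        self.log_ok = False
        self.log = []

    def crash_and_import(self):
        # RAM is lost; replay whatever survives in the log, in order.
        self.pending = {}
        for key, value in self.log:
            self.pool[key] = value
        self.log = []
```

If the log device dies while the system is running, the next commit() still
writes everything from RAM. If the system crashes with a healthy log,
crash_and_import() replays it. Only a crash combined with a dead log loses
the writes since the last commit, and the pool is left as of that commit.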

>> - So far, assuming I'm understanding this correctly, none of the above
>> scenarios involve any data loss. The scenario I can think of that would
>> involve data loss is if there's a power failure and the ZIL drive dies at the
>> same time. It seems likely that this scenario would be caused by
>> a catastrophic hardware failure, and the main system drives would also die,
>> but let's pretend that only the ZIL drive is affected. So any transactions
>> stored in the ZIL are lost. I'm thinking that the system would boot up,
>> note that the ZIL drive is dead and switch the ZIL back to the main pool
>> drives, and the last 5-30 seconds of writes would be lost forever. But
>> would the system be in a consistent state, that is, things would be the
>> same as if you went back in time 30 seconds before the system died and just
>> pulled the plug? So there's no corruption, just the loss of those seconds
>> of data?
> 
> I don't have first-hand experience with this case, so maybe someone
> can correct me if I'm wrong.
> 
> The data on the main pool is always consistent in that a certain
> operation either made it to the disk or it didn't. However, if your
> application depends on the fact that writes make it out to disk in a
> specific order (that's why it's sync'ing, right?), then it's the ZIL
> that would contain a log/journal of what should have been written to
> the disk and in what order. If you lose this, your file system remains
> consistent, but some writes may have made it out to the disk before
> others.

Correct.

>> My use is for a home
>> server -- I would like higher NFS write performance, but not by making it
>> more likely I have corrupted or majorly lost data, but for my use, if I
>> only lost the last few seconds of writes and things were in a consistent
>> state, it would be of little consequence.
> 
> You need to first find out if your writes are synchronous or not,
> otherwise you are wasting your time (and money) getting a separate log
> device. It's mostly databases that require that file operations happen
> in a specific order - for a home file server, you might not see any
> benefit to a separate log device. Next, make sure you get an SSD with
> fast sequential writes - many SSDs focus on random read speed (that's
> what a desktop user wants to see).

Two tools to help with understanding your client's workload: nfssvrtop and zilstat.
nfssvrtop is probably the first to use in this case. Both are now in my GitHub repo :-)
	https://github.com/richardelling/tools
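
For a rough feel of what synchronous writes cost from the client side, a
small Python sketch like the one below can also help. It is an illustration
only: time_writes and its parameters are hypothetical names, and it measures
the latency of writes followed by fsync as seen by one process. It does not
tell you whether your real workload issues synchronous writes; that is what
the tools above are for.

```python
import os
import time


def time_writes(path, count=100, size=4096, sync=True):
    """Return per-write latencies in seconds for `count` writes of `size` bytes."""
    buf = b"\0" * size
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, buf)
            if sync:
                # Ask the OS to push this write to stable storage
                # before we stop the clock.
                os.fsync(fd)
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    return latencies
```

Run it once with sync=True and once with sync=False against a file on the
NFS mount and compare the averages. A large gap suggests synchronous writes
are where the latency goes.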

NB, the ZIL is always present; a separate log device (slog) is often
inappropriately called a ZIL.

 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422