[OpenIndiana-discuss] What happens when a ZIL drive dies?

Mon Jun 4 15:56:55 UTC 2012

On Jun 4, 2012, at 8:24 AM, Nick Hall wrote:

> I'm considering buying a separate SSD drive for my ZIL as I do quite a bit
> over NFS and would like the latency to improve. But first I'm trying to
> understand exactly how the ZIL works and what happens in case of a problem.
> I'll list my understanding here, and I'm hoping someone can correct me if
> I'm understanding this incorrectly:
> 
> - In normal operation, the ZIL drive would just be written to but never
> read from.
> 
> - In the case of a power failure, the ZIL will probably contain 5-10
> seconds (maybe up to 30 seconds) worth of writes that didn't make it onto
> the main hard drives. The next time the system boots, ZFS will use what's
> in the ZIL to bring the main hard drives up to date.
> 
> - I'm running ZFS version 28 -- in this version, if the ZIL drive were to
> die while the system were running, the system would switch back to using
> the main pool hard drives to store the ZIL, just as it currently does since
> I have no separate ZIL drive right now.

The above is conceptually correct. On systems running zpool version 28, the
default txg commit interval is most likely 5 seconds.
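
As a rough back-of-the-envelope sketch, the amount of synchronous write data
sitting uncommitted in the ZIL is about the sync write rate times the txg commit
interval. The 100 MB/s rate below is an assumed figure for illustration, not a
measurement, and the function name is made up for this example. A slow commit
can leave more than one interval's worth outstanding, so treat the result as an
estimate rather than a hard bound.

```python
def zil_outstanding_bytes(sync_write_rate_bytes_per_s, txg_interval_s=5.0):
    """Rough estimate of uncommitted sync-write data held in the ZIL.

    Assumes a steady write rate and that each txg commits on schedule.
    """
    return sync_write_rate_bytes_per_s * txg_interval_s


# Assumed example: an NFS client pushing 100 MB/s of synchronous writes
# against the 5-second default txg commit interval.
rate = 100 * 10**6
print(zil_outstanding_bytes(rate) / 10**6, "MB")
```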

> So, now I have a couple of questions:
> 
> - If the ZIL drive were to die while the system were running, I'm assuming
> no data would be lost? In order for this to work, the system would need to
> cache everything in the ZIL in RAM, so if the ZIL were to die, it would
> write the transactions that were on the ZIL from RAM to the main pool
> drives. Applications would not notice anything from their perspective. Is
> this what happens?

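For data, there is nothing in the ZIL that is not also in the ARC, so nothing is
lost if the slog device dies while the system is running. The pending writes
still go to the pool from memory at the next txg commit.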

> - So far, assuming I'm understanding this correctly, none of the above
> scenarios involve any data loss. The scenario I can think of that would
> involve data loss is if there's a power failure and the ZIL drive fails at
> the same time. It seems likely that this scenario would be caused by
> a catastrophic hardware failure, and the main system drives would also die,
> but let's pretend that only the ZIL drive is affected. So any transactions
> stored in the ZIL are lost. I'm thinking that the system would boot up,
> note that the ZIL drive is dead and switch the ZIL back to the main pool
> drives, and the last 5-30 seconds of writes would be lost forever. But
> would the system be in a consistent state, that is, things would be the
> same as if you went back in time 30 seconds before the system died and just
> pulled the plug? So there's no corruption, just the loss of those seconds
> of data?

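In general, it takes a double failure to lose data: the slog device and
something else, such as power, at the same time. Even then the pool remains
consistent as of the last committed txg; only the synchronous writes that were
still in the ZIL are lost.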
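
The three cases discussed so far (crash only, slog failure only, and both at
once) can be sketched with a toy model. This is not ZFS code: the class, its
method names, and the dictionaries standing in for the pool, the ARC, and the
log are all invented for illustration, and it ignores everything except which
writes survive.

```python
class ToyPool:
    """Toy model of a pool with a separate log device (slog). Not real ZFS code."""

    def __init__(self):
        self.pool = {}      # data committed to the main pool disks
        self.ram = {}       # uncommitted writes in memory (stand-in for the ARC)
        self.slog = []      # log records on the slog; None once the device is dead
        self.pool_log = []  # fallback log kept on the main pool disks

    def sync_write(self, key, value):
        """Accept a synchronous write: keep it in RAM and log it."""
        self.ram[key] = value
        log = self.slog if self.slog is not None else self.pool_log
        log.append((key, value))

    def txg_commit(self):
        """Commit pending writes from RAM to the pool; the log is then obsolete."""
        self.pool.update(self.ram)
        self.ram.clear()
        self.pool_log.clear()
        if self.slog is not None:
            self.slog.clear()

    def slog_dies(self):
        """The slog fails: its records are gone, later writes log to the pool."""
        self.slog = None

    def crash_and_reboot(self):
        """Power loss: RAM is lost, then whatever log survives is replayed."""
        self.ram.clear()
        surviving = self.pool_log if self.slog is None else self.slog
        for key, value in surviving:
            self.pool[key] = value
        surviving.clear()


def run_scenarios():
    p = ToyPool()
    p.sync_write("a", 1)
    p.crash_and_reboot()             # crash only: the log is replayed
    after_crash = dict(p.pool)

    p.sync_write("b", 2)
    p.slog_dies()                    # slog dies while running: RAM still has "b"
    p.txg_commit()
    after_slog_death = dict(p.pool)

    q = ToyPool()
    q.sync_write("old", 0)
    q.txg_commit()
    q.sync_write("new", 1)
    q.slog_dies()                    # double failure: slog and power together
    q.crash_and_reboot()
    after_double = dict(q.pool)
    return after_crash, after_slog_death, after_double


print(run_scenarios())
```

In the model, only the double failure drops a write ("new"), and the pool still
holds everything from the last commit ("old").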

> - Are there any other scenarios I'm not thinking of, specifically any other
> scenarios that would cause corruption or loss of data? My use is for a home
> server -- I would like higher NFS write performance, but not by making it
> more likely I have corrupted or majorly lost data, but for my use, if I
> only lost the last few seconds of writes and things were in a consistent
> state, it would be of little consequence. I understand that for a
> commercial server that would be a huge issue, though, as banking transactions
> lost or something would be a major problem. Thanks.

For NFS workloads, the ZIL is what provides the synchronous write semantics
between the NFS server and client. The best way to get better performance is to
have the client run in async mode when possible (Solaris clients do this
automatically, and have for a very long time; Linux, not so much).

The risk is that the server unexpectedly reboots and the synchronous writes from
the client are lost. In that case, the client thinks data is written, but it is not. The 
server is happy either way... it is the client that is sad.
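
To make the sync versus async distinction concrete, here is a minimal sketch
that runs against a local file rather than an NFS mount. The function name and
the block count are arbitrary choices for the example. With sync=True every
write waits for os.fsync() to return, which is the kind of request that goes
through the ZIL on a ZFS-backed server; with sync=False the writes are only
buffered and can be lost if the machine reboots before they are flushed.

```python
import os
import tempfile
import time


def timed_writes(path, count, sync, block=b"x" * 4096):
    """Write `count` blocks to `path`; if `sync`, force each one to stable storage."""
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(count):
            f.write(block)
            if sync:
                f.flush()             # push Python's buffer down to the OS
                os.fsync(f.fileno())  # wait until the OS reports the data is stable
    return time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "testfile")
        print("async-style:", timed_writes(path, 200, sync=False), "seconds")
        print("sync-style: ", timed_writes(path, 200, sync=True), "seconds")
```

The timings depend entirely on the storage underneath, so no expected numbers
are given here.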

There are some failure modes that can impact these systems that might not be
expected. For example, if the slog device responds very slowly to writes, the
delay ripples through to the performance perceived by the client. This happens
more often with consumer-grade flash SSDs than with enterprise-grade flash SSDs.
 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422