[OpenIndiana-discuss] What happens when a ZIL drive dies?

Tue Jun 5 18:08:15 UTC 2012

On Tue, Jun 5, 2012 at 11:32 AM, Nick Hall <darknovanick at gmail.com> wrote:
> On Mon, Jun 4, 2012 at 10:48 AM, Jan Owoc <jsowoc at gmail.com> wrote:
>>
>> The data on the main pool is always consistent in that a certain
>> operation either made it to the disk or it didn't. However, if your
>> application depends on the fact that writes make it out to disk in a
>> specific order (that's why it's sync'ing, right?), then it's the ZIL
>> that would contain a log/journal of what should have been written to
>> the disk and in what order. If you lose this, your file system remains
>> consistent, but some writes may have made it out to the disk before
>> others.
>
> I'm just wondering, for my own personal knowledge and for anyone else who
> finds this thread later, for some clarification on the above quote. So, if
> I'm understanding this correctly, are you saying that, say I have an
> application and it writes to file A, then it writes to file B, then it
> writes to file C, then finally calls fsync, that there could be a case
> where if the computer crashed and at the same time the SLOG got fried
> (after files A B and C were written to, but before the sync was finished),
> then upon restart, the write to file B may have taken effect on the pool
> but the write to file A wouldn't be on there? Or am I misunderstanding?
> Usually when I think of journals I would think it would roll back the
> change to file B because it doesn't have a record in the journal to
> indicate that the sync was successful. I understand the possibility of
> losing the last few seconds of writes in this scenario -- I'm just trying
> to wrap my head around the possibility of losing *part* of the last few
> seconds of data, and the much worse implications this has. Thanks,

Maybe I can clarify my own quote :-).

Many filesystems attempt to maintain internal consistency. Either a
file is there, or it isn't. The file size needs to match what is
actually on disk. The free space needs to match what isn't actually
used, and so on. Regardless of whether or not you have the ZIL enabled,
and whether you have a SLOG or not, the filesystem will remain
internally consistent.

Some applications may depend on the files (or portions thereof) making
it out to disk in a specific order. The example you gave is perfect.
Let's say file "A" needs to exist before a change in file "B" happens.
A properly written program would write out file "A", fsync, wait for
fsync to complete, then change file "B". A properly written filesystem
will wait for the data from "A" to hit physical, permanent storage
before allowing the change to "B". Waiting for that can be slow on
mechanical storage, so you put the ZIL on a separate, faster SLOG.
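
For anyone finding this thread later, here is a minimal Python sketch of
the "write A, fsync, then change B" pattern described above. The helper
names (write_durably, update_in_order) and the file contents are made up
for illustration; the only real calls are the standard library's
flush() and os.fsync().

```python
import os
import tempfile


def write_durably(path, data):
    """Write data to path and return only after fsync has completed."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's userspace buffer to the OS
        os.fsync(f.fileno())  # block until the OS reports the data as stable


def update_in_order(dir_path):
    """Hypothetical example: B must never be changed before A is durable."""
    path_a = os.path.join(dir_path, "A")
    path_b = os.path.join(dir_path, "B")
    write_durably(path_a, b"contents of A\n")          # step 1: A, then wait
    write_durably(path_b, b"B, which depends on A\n")  # step 2: only now B
    return path_a, path_b


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(update_in_order(d))
```

The point is only the ordering: the second write is not issued until
the first fsync has returned.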

The SLOG will contain information like "hey, this file 'A' needs to
make it out to sector 135425, then this file 'B' needs to be updated".
The disks themselves will be written out when and how convenient, but
if a crash were to occur, the ZIL will be replayed (either from the
SLOG or from the disk) and file 'A' *will* make it out to disk if file
'B' does. A damaged SLOG could mean that file 'B' was changed on disk,
but file 'A' never made it out.
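
To make the replay idea concrete, here is a toy model in Python. It is
NOT how ZFS is implemented: the ToyPool class and all of its methods are
invented for this example, and the "disk" is just a dict. It only shows
that synchronous writes are acknowledged once they are in the log, that
the log is replayed in order after a crash, and that acknowledged writes
go missing if the log is lost along with the crash. It does not try to
model writes reaching the disk out of order.

```python
class ToyPool:
    """Toy stand-in for a pool with an intent log (all names hypothetical)."""

    def __init__(self):
        self.on_disk = {}     # state already committed to the main pool
        self.in_memory = []   # writes waiting for the next bulk commit
        self.intent_log = []  # ordered records of acknowledged sync writes

    def sync_write(self, name, data):
        """Record the write in the log, then acknowledge it to the caller."""
        self.in_memory.append((name, data))
        self.intent_log.append((name, data))
        return True

    def commit(self):
        """Bulk-write pending data to the pool; the log is then obsolete."""
        for name, data in self.in_memory:
            self.on_disk[name] = data
        self.in_memory.clear()
        self.intent_log.clear()

    def crash_and_import(self, log_survived):
        """Lose everything in memory; replay the log if it is still there."""
        self.in_memory.clear()
        if log_survived:
            for name, data in self.intent_log:  # replay in original order
                self.on_disk[name] = data
        self.intent_log.clear()
        return dict(self.on_disk)


def demo(log_survived):
    pool = ToyPool()
    pool.sync_write("A", "v1")
    pool.commit()                            # A=v1 is safely in the pool
    pool.sync_write("A", "v2")               # acknowledged, not yet committed
    pool.sync_write("B", "depends on A v2")  # acknowledged, not yet committed
    return pool.crash_and_import(log_survived)
```

In this model, demo(True) brings back both acknowledged writes, while
demo(False) leaves only the older A=v1, even though the application was
told that A=v2 and B were safe.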

Something else to keep in mind with NFS is that you could have the
remote system crashing and rebooting without your local system
knowing. Applications have files open on the remote system and assume
the remote file system to be in a certain state... and it isn't if the
SLOG fails. Some people recommend mirroring SLOGs.

Jan