[OpenIndiana-discuss] OI 151_a7 crashing when trying to import ZFS pool

Tue Jul 9 11:22:34 UTC 2013

On 2013-07-08 22:58, CJ Keist wrote:
> Thank you all for the replies.
>    I tried OmniOS and Oracle Solaris 11.1 but both were not able to
> import the data pool. So I have reinstalled OI 151a7 and after importing
> the data and having it crash, I booted up in single user mode.  At this
> point I was able to initiate zpool scrub data and it looks to be
> running!!  I will wait and see if the scrub can finish and then try to
> remount everything. See attached pic.

That screenshot seems disturbing: such a large pool shows only one
device. Is it on hardware RAID, which hides all the disks, and with
them any redundancy and repair options, from ZFS? In that case the
data error may be anywhere in that RAID's implementation (e.g. when
you did a force-reboot, some critical data was not flushed to disks
at all, or worse - in the wrong order - for example uberblock updates
came before the other metadata updates, and the latter never made it).

I think that for the scrub you imported the pool read-write, so it
would be too late to try rolling back a few transactions into an older
but possibly more consistent state of the pool (or did you already do
that while successfully importing?)
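
For reference, a rewind attempt on a pool that is NOT currently
imported could look roughly like the sketch below. Assumptions on my
side: the pool is named "data" (as in your scrub command), and the
run() and rewind_check() helpers are just illustrative names of mine.
By default the script only prints the commands; set RUN=1 to really
execute them.

```shell
#!/bin/sh
# Sketch only: prints the commands unless RUN=1 is set in the environment.
POOL="${POOL:-data}"    # assumed pool name, taken from the scrub command

# Hypothetical helper: echo a command, execute it only when RUN=1.
run() {
    echo "+ $*"
    if [ "${RUN:-0}" = 1 ]; then "$@"; fi
}

# Hypothetical helper: check for, then attempt, a read-only rewind import.
rewind_check() {
    # -F -n: report whether discarding the last few transactions would make
    # the pool importable, without actually performing the recovery.
    run zpool import -F -n "$1" || return 1
    # Read-only import, so nothing new gets written while you look around.
    run zpool import -o readonly=on -F "$1" || return 1
    # List any files or metadata objects with known errors.
    run zpool status -v "$1"
}

rewind_check "$POOL"
```

The -n pass is there so you can see what the recovery would do before
anything is discarded; the read-only import keeps the on-disk state
as it is while you copy data off.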

If the pool just "gave up" and after a few panics began to import, at
least far enough for the kernel to accept it, it is possible (just from
my experience, shooting ideas into the sky here) that some deferred ops
were recorded on the pool, and it finally worked through them. For
example, I had a series of panicky reboots when deleting lots of data
on a deduped pool on a machine with low RAM (8 GB) - enumerating the
DDT needed a lot more than that, the kernel couldn't swap, BAM! It took
about two weeks of resetting it every 3-4 hours for the box to get
itself straight...

For the developers here to provide more targeted ideas and/or work out
a fix, it would sure be helpful if you could provide a stack trace
of the kernel panic - to see where it goes wrong (probably some data
on disk tripped an assertion, such as an unexpected zero/nonzero value).
For this you could boot with kmdb loaded (preferably on a serial
console, as the traces are quite long and scroll off the 25-line
screen), so that when the problem occurs the messages are printed but
the machine doesn't reboot automatically. Actually, with a serial
console you might care a bit less about kmdb - if you can copy-paste
the trace quickly enough before it is overwritten by BIOS POST messages.
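
Since you can already get to a single-user shell, one way to have kmdb
loaded without editing the boot menu is "mdb -K" on the console. A
rough sketch follows; enter_kmdb() is just an illustrative name of
mine, and the debugger commands in the comments are the ones I would
try first, not a complete recipe. By default the script only prints
the command; set RUN=1 to really drop into the debugger.

```shell
#!/bin/sh
# Sketch only: run on the system console (serial preferred), not over ssh,
# because the debugger stops the whole machine and talks to the console.

# Hypothetical helper: print the command, run it only when RUN=1.
enter_kmdb() {
    echo "+ mdb -K"
    if [ "${RUN:-0}" = 1 ]; then mdb -K; fi
}

enter_kmdb

# At the kmdb prompt:
#   :c         resume the system, then retry the import that panics
# When the panic hits, the box should stop in kmdb instead of rebooting:
#   ::status   summary, including the panic message
#   $C         stack backtrace of the panicking thread
#   ::msgbuf   recent kernel messages
# Alternative: append -k to the kernel$ line in the GRUB menu entry to have
# kmdb loaded from boot.
```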

HTH,
//Jim