[OpenIndiana-discuss] OI 151_a7 crashing when trying to import ZFS pool

CJ Keist cj.keist at colostate.edu
Tue Jul 9 14:18:00 UTC 2013


Jim,
    Yes, the MegaRAID 9260-8i is presenting the OS with a single-disk 
pool.  I had to stop the scrub, as it was going to take over three days 
to complete; I figured I could be well into my restores from backup in 
that amount of time.
    I was wondering what the cause would have been: the OS or the RAID 
controller?  The system was completely hung, so I had no choice but to 
do a forced reboot.  I'm leaning towards the RAID controller as the 
primary culprit.
    I'm not finding the core dump file anywhere.  Where does OI store 
kernel crash data?  If it's not too big, I could send it to the list.
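
Assuming the default dumpadm setup, the places I know to check are 
roughly:

   dumpadm                  # shows the dump device and the savecore
                            # directory, usually /var/crash/<hostname>
   savecore -vd             # try to pull any pending dump off the
                            # dump device
   ls /var/crash/*          # compressed dumps show up as vmdump.N;
                            # "savecore -f vmdump.N ." expands one
                            # into unix.N / vmcore.N for mdb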



On 7/9/13 5:22 AM, Jim Klimov wrote:
> On 2013-07-08 22:58, CJ Keist wrote:
>> Thank you all for the replies.
>>    I tried OmniOS and Oracle Solaris 11.1, but neither was able to
>> import the data pool.  So I have reinstalled OI 151a7, and after
>> importing the pool and having it crash, I booted up in single-user
>> mode.  At this point I was able to initiate "zpool scrub data" and it
>> looks to be running!!  I will wait and see whether the scrub can
>> finish and then try to remount everything.  See attached pic.
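>>
>> A rough note on how I'm watching it (just polling the pool status):
>>
>>    zpool status -v data    # reports "scrub in progress", the
>>                            # percentage done and an estimated
>>                            # time to completion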
>
> That screenshot seems disturbing: with such a large pool you only have
> one device.  Is it on hardware RAID, which hides all the disks (and
> with them any redundancy and repair options) from ZFS?  In that case,
> the data error may be anywhere in that RAID's implementation (e.g.
> when you did the forced reboot, some critical data was not flushed to
> the disks at all, or worse, was flushed in the wrong order; for
> example, uberblock updates landed before the other metadata updates,
> and the latter never made it).
>
> I think that for the scrub you mounted the pool read-write, so it
> would be too late to try rolling back a few transactions to an older
> but possibly more consistent state of the pool (or did you already do
> that while successfully importing?)
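>
> Roughly what I mean (assuming the pool gets exported again first;
> the rewind flags do exist in current zpool, but whether they help
> depends on how much the controller reordered):
>
>    zpool import -o readonly=on data   # import without writing
>                                       # anything new to the pool
>    zpool import -Fn data              # dry run: report whether a
>                                       # rewind could make the pool
>                                       # importable and what is lost
>    zpool import -F data               # actually discard the last
>                                       # few transactions and import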
>
> If the pool just "gave up" and after a few panics began to import at
> least far enough that the kernel accepts it, it is possible (just from
> my experience, shooting ideas into the sky here) that some deferred ops
> were recorded on the pool, and it finally unrolled them.  For example,
> I had a series of panicky reboots while deleting lots of data on a
> deduped pool on a machine with little RAM (8 GB): enumerating the DDT
> consumed a lot more than that, the kernel couldn't swap, BAM!  It took
> about two weeks of resetting the box every 3-4 hours for it to get
> itself straight...
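>
> (If dedup might be a factor on your pool too, and zdb can open it,
> something like
>
>    zdb -DD data
>
> prints the DDT entry counts and per-entry on-disk/in-core sizes,
> which gives a ballpark of the RAM the table wants.)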
>
> For the developers here to provide more targeted ideas and/or a
> solution, it would certainly help if you could provide a stack trace
> of the kernel panic, to see where it goes wrong (probably some data
> on disk did not match an assertion, e.g. an unexpected zero/nonzero
> value).  For this you could boot into kmdb (preferably on a serial
> console; the traces are quite long and scroll off the 25-line screen),
> so that when the problem occurs the messages are printed but the
> machine doesn't reboot automatically.  Actually, with a serial console
> you might care a bit less about kmdb, if you can copy-paste the trace
> quickly enough before it is overwritten by BIOS POST messages.
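>
> A rough sketch of both (assuming GRUB legacy and ttya as the serial
> port): at the GRUB menu, append -k and the console property to the
> existing kernel$ line, roughly
>
>    kernel$ /platform/i86pc/kernel/$ISADIR/unix -k -B $ZFS-BOOTFS,console=ttya
>
> and if savecore did manage to capture a dump, the same trace can be
> pulled from it offline:
>
>    mdb unix.0 vmcore.0
>      ::status      (panic string and dump summary)
>      ::msgbuf      (console messages leading up to the panic)
>      $C            (stack backtrace of the panicking thread)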
>
> HTH,
> //Jim
>
>
> _______________________________________________
> OpenIndiana-discuss mailing list
> OpenIndiana-discuss at openindiana.org
> http://openindiana.org/mailman/listinfo/openindiana-discuss

-- 
C. J. Keist                     Email: cj.keist at colostate.edu
Systems Group Manager           Solaris 10 OS (SAI)
Engineering Network Services    Phone: 970-491-0630
College of Engineering, CSU     Fax:   970-491-5569
Ft. Collins, CO 80523-1301

All I want is a chance to prove 'Money can't buy happiness'


