[OpenIndiana-discuss] Zpool crashes system on reboot and import

CJ Keist cj.keist at colostate.edu
Wed Dec 11 17:13:13 UTC 2013


Here is /var/adm/messages at the time of the crash, if this helps:

Dec 10 17:02:27 projects2 unix: [ID 836849 kern.notice]
Dec 10 17:02:27 projects2 ^Mpanic[cpu3]/thread=ffffff03e85997c0:
Dec 10 17:02:27 projects2 genunix: [ID 335743 kern.notice] BAD TRAP: type=e (#pf Page fault) rp=ffffff001803c340 addr=20 occurred in module "zfs" due to a NULL pointer dereference
Dec 10 17:02:27 projects2 unix: [ID 100000 kern.notice]
Dec 10 17:02:27 projects2 unix: [ID 839527 kern.notice] zpool:
Dec 10 17:02:27 projects2 unix: [ID 753105 kern.notice] #pf Page fault
Dec 10 17:02:27 projects2 unix: [ID 532287 kern.notice] Bad kernel fault at addr=0x20
Dec 10 17:02:27 projects2 unix: [ID 243837 kern.notice] pid=718, pc=0xfffffffff7a220e8, sp=0xffffff001803c438, eflags=0x10213
Dec 10 17:02:27 projects2 unix: [ID 211416 kern.notice] cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f8<xmme,fxsr,pge,mce,pae,pse,de>
Dec 10 17:02:27 projects2 unix: [ID 624947 kern.notice] cr2: 20
Dec 10 17:02:27 projects2 unix: [ID 625075 kern.notice] cr3: 337a37000
Dec 10 17:02:27 projects2 unix: [ID 625715 kern.notice] cr8: c
Dec 10 17:02:27 projects2 unix: [ID 100000 kern.notice]
Dec 10 17:02:27 projects2 unix: [ID 592667 kern.notice]         rdi: ffffff03f66e8058 rsi:                0 rdx:                0
Dec 10 17:02:27 projects2 unix: [ID 592667 kern.notice]         rcx:       f7728503  r8:          88d7af8  r9: ffffff001803c4a0
Dec 10 17:02:27 projects2 unix: [ID 592667 kern.notice]         rax:              7 rbx:                0 rbp: ffffff001803c480
Dec 10 17:02:27 projects2 unix: [ID 592667 kern.notice]         r10:              7 r11:                0 r12: ffffff03f66e8058
Dec 10 17:02:27 projects2 unix: [ID 592667 kern.notice]         r13: ffffff03f66e8058 r14: ffffff001803c5c0 r15: ffffff001803c600
Dec 10 17:02:27 projects2 unix: [ID 592667 kern.notice]         fsb:              0 gsb: ffffff03e1067040  ds:               4b
Dec 10 17:02:27 projects2 unix: [ID 592667 kern.notice]          es:             4b  fs:                0  gs:              1c3
Dec 10 17:02:27 projects2 unix: [ID 592667 kern.notice]         trp:              e err:                0 rip: fffffffff7a220e8
Dec 10 17:02:27 projects2 unix: [ID 592667 kern.notice]          cs:             30 rfl:            10213 rsp: ffffff001803c438
Dec 10 17:02:27 projects2 unix: [ID 266532 kern.notice]          ss:             38
Dec 10 17:02:27 projects2 unix: [ID 100000 kern.notice]
Dec 10 17:02:27 projects2 genunix: [ID 655072 kern.notice] ffffff001803c210 unix:die+dd ()
Dec 10 17:02:27 projects2 genunix: [ID 655072 kern.notice] ffffff001803c330 unix:trap+17db ()
Dec 10 17:02:27 projects2 genunix: [ID 655072 kern.notice] ffffff001803c340 unix:cmntrap+e6 ()
Dec 10 17:02:27 projects2 genunix: [ID 655072 kern.notice] ffffff001803c480 zfs:zap_leaf_lookup_closest+40 ()
Dec 10 17:02:27 projects2 genunix: [ID 655072 kern.notice] ffffff001803c510 zfs:fzap_cursor_retrieve+c9 ()
Dec 10 17:02:27 projects2 genunix: [ID 655072 kern.notice] ffffff001803c5a0 zfs:zap_cursor_retrieve+17d ()
Dec 10 17:02:27 projects2 genunix: [ID 655072 kern.notice] ffffff001803c780 zfs:zfs_purgedir+4c ()
Dec 10 17:02:27 projects2 genunix: [ID 655072 kern.notice] ffffff001803c7d0 zfs:zfs_rmnode+50 ()
Dec 10 17:02:27 projects2 genunix: [ID 655072 kern.notice] ffffff001803c810 zfs:zfs_zinactive+b5 ()
Dec 10 17:02:27 projects2 genunix: [ID 655072 kern.notice] ffffff001803c860 zfs:zfs_inactive+11a ()
Dec 10 17:02:27 projects2 genunix: [ID 655072 kern.notice] ffffff001803c8b0 genunix:fop_inactive+af ()
Dec 10 17:02:27 projects2 genunix: [ID 655072 kern.notice] ffffff001803c8d0 genunix:vn_rele+5f ()
Dec 10 17:02:27 projects2 genunix: [ID 655072 kern.notice] ffffff001803cac0 zfs:zfs_unlinked_drain+af ()
Dec 10 17:02:27 projects2 genunix: [ID 655072 kern.notice] ffffff001803caf0 zfs:zfsvfs_setup+102 ()
Dec 10 17:02:27 projects2 genunix: [ID 655072 kern.notice] ffffff001803cb50 zfs:zfs_domount+17c ()
Dec 10 17:02:27 projects2 genunix: [ID 655072 kern.notice] ffffff001803cc70 zfs:zfs_mount+1cd ()
Dec 10 17:02:27 projects2 genunix: [ID 655072 kern.notice] ffffff001803cca0 genunix:fsop_mount+21 ()
Dec 10 17:02:27 projects2 genunix: [ID 655072 kern.notice] ffffff001803ce00 genunix:domount+b0e ()
Dec 10 17:02:27 projects2 genunix: [ID 655072 kern.notice] ffffff001803ce80 genunix:mount+121 ()
Dec 10 17:02:27 projects2 genunix: [ID 655072 kern.notice] ffffff001803cec0 genunix:syscall_ap+8c ()
Dec 10 17:02:27 projects2 genunix: [ID 655072 kern.notice] ffffff001803cf10 unix:brand_sys_sysenter+1c9 ()
Dec 10 17:02:27 projects2 unix: [ID 100000 kern.notice]
Dec 10 17:02:27 projects2 genunix: [ID 672855 kern.notice] syncing file systems...
Dec 10 17:02:27 projects2 genunix: [ID 904073 kern.notice]  done
Dec 10 17:02:28 projects2 genunix: [ID 111219 kern.notice] dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
Dec 10 17:02:52 projects2 genunix: [ID 100000 kern.notice]
Dec 10 17:02:52 projects2 genunix: [ID 665016 kern.notice] ^M100% done: 225419 pages dumped,
Dec 10 17:02:52 projects2 genunix: [ID 851671 kern.notice] dump succeeded
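
For reference, a minimal sketch of pulling that stack trace and register state out of the saved crash dump with mdb. The /var/crash/projects2 directory and the .0 suffix are assumptions, not taken from the log:

  # Save the dump off the dump device (rpool/dump per the log above) if
  # savecore has not already run; the target directory is an assumption.
  savecore -v /var/crash/projects2

  # If the dump was written as a compressed vmdump.0, expand it into
  # unix.0 and vmcore.0 first:
  savecore -vf /var/crash/projects2/vmdump.0

  # Open the dump:
  cd /var/crash/projects2
  mdb unix.0 vmcore.0

At the mdb prompt, ::status prints the panic summary, ::panicinfo the registers at panic time, ::stack the panicking thread's stack, ::msgbuf the console messages leading up to the panic, and $q quits.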



On 12/11/13, 6:21 AM, jimklimov at cos.ru wrote:
> It might help if you could run mdb over the kernel crash dump files so developers would at least have a stack trace of what went bad. Maybe they would have more specific questions on variable values, etc., and would post those - but the general debugging steps (see the Wiki) come first anyway.
>
> For now you can also try to walk your pool with zdb -bscvL or similar to check for inconsistencies - i.e. whether the box crashed/rebooted with I/Os written out of order (labels or uberblocks updated before the data they point to was committed), which can happen if the disks/caches lied.
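
A rough sketch of such a zdb pass, assuming the pool is exported at the time (hence -e); the uberblock/label dumps and the device path below are only illustrative:

  # Walk the pool metadata without importing it; -e reads the exported
  # pool's labels directly, -L skips the slow space-leak check.
  zdb -e -bcsvL data

  # Dump the active uberblock and the on-disk labels of one member disk
  # to see which txg each copy believes is current (example device):
  zdb -e -u data
  zdb -l /dev/rdsk/c3t50014EE25D929FBCd0s0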
>
> Then you might have luck rolling back a few txgs on import, and you can model whether this helps with zdb as well (it would then start from an older txg number and skip the possibly corrupted last few sync cycles).
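
For reference, a hedged sketch of modelling that rollback; -F and -n are the documented zpool import recovery options, and the txg number is a placeholder (pick a real one from an uberblock listing):

  # Dry run: report whether discarding the last few transactions would
  # make the pool importable, without changing anything on disk.
  zpool import -nF data

  # If the dry run looks sane, the same command without -n performs the
  # actual rollback:
  # zpool import -F data

  # To model an older txg with zdb first (12345 is a placeholder txg):
  zdb -e -t 12345 -bcsvL data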
>
> Hth, Jim
>
>
> Typos courtesy of my Samsung Mobile
>
> -------- Original message --------
> From: CJ Keist <cj.keist at colostate.edu>
> Date: 2013.12.11  5:31  (GMT+01:00)
> To: Discussion list for OpenIndiana <openindiana-discuss at openindiana.org>
> Subject: [OpenIndiana-discuss] Zpool crashes system on reboot and import
>
> All,
>       Some time back we had an issue where I lost an entire zpool file system
> due to a possibly bad RAID controller card.  At that time I was strongly
> encouraged to get a RAID card that supported JBOD and let ZFS
> control all the disks.  Well, I did that, and unfortunately today I lost an
> entire zpool that was configured with multiple raidz2 vdevs. See below:
>
> root at projects2:~# zpool status data
>     pool: data
>    state: ONLINE
>     scan: scrub in progress since Tue Dec 10 18:11:19 2013
>       211G scanned out of 30.2T at 1/s, (scan is slow, no estimated time)
>       0 repaired, 0.68% done
> config:
>
>           NAME                       STATE     READ WRITE CKSUM
>           data                       ONLINE       0     0     0
>             raidz2-0                 ONLINE       0     0     0
>               c3t50014EE25D929FBCd0  ONLINE       0     0     0
>               c3t50014EE2B2E8E02Ed0  ONLINE       0     0     0
>               c3t50014EE25C346397d0  ONLINE       0     0     0
>               c3t50014EE206EB0DDDd0  ONLINE       0     0     0
>               c3t50014EE25D932FC7d0  ONLINE       0     0     0
>               c3t50014EE25C341621d0  ONLINE       0     0     0
>               c3t50014EE206DE835Ed0  ONLINE       0     0     0
>               c3t50014EE2083D20DAd0  ONLINE       0     0     0
>               c3t50014EE2083D842Ed0  ONLINE       0     0     0
>             raidz2-1                 ONLINE       0     0     0
>               c3t50014EE2B2E8D8CCd0  ONLINE       0     0     0
>               c3t50014EE2B18BE3A4d0  ONLINE       0     0     0
>               c3t50014EE25C339C05d0  ONLINE       0     0     0
>               c3t50014EE25D9307DAd0  ONLINE       0     0     0
>               c3t50014EE2B2E7E5E8d0  ONLINE       0     0     0
>               c3t50014EE206EB20ABd0  ONLINE       0     0     0
>               c3t50014EE2B2E56CFAd0  ONLINE       0     0     0
>               c3t50014EE25D92FC0Ad0  ONLINE       0     0     0
>               c3t50014EE25C42CFDBd0  ONLINE       0     0     0
>             raidz2-2                 ONLINE       0     0     0
>               c3t50014EE25D933003d0  ONLINE       0     0     0
>               c3t50014EE2B2E89EF3d0  ONLINE       0     0     0
>               c3t50014EE2B2E8DC9Cd0  ONLINE       0     0     0
>               c3t50014EE25C35933Ed0  ONLINE       0     0     0
>               c3t50014EE2B1968F65d0  ONLINE       0     0     0
>               c3t50014EE2083D6987d0  ONLINE       0     0     0
>               c3t50014EE2083DDCACd0  ONLINE       0     0     0
>               c3t50014EE25C42C384d0  ONLINE       0     0     0
>               c3t50014EE206F2A389d0  ONLINE       0     0     0
>             raidz2-3                 ONLINE       0     0     0
>               c3t50014EE2B1967C56d0  ONLINE       0     0     0
>               c3t50014EE2083E1931d0  ONLINE       0     0     0
>               c3t50014EE2B1895807d0  ONLINE       0     0     0
>               c3t50014EE25D9333E7d0  ONLINE       0     0     0
>               c3t50014EE2B196397Ad0  ONLINE       0     0     0
>               c3t50014EE25D930567d0  ONLINE       0     0     0
>               c3t50014EE2B19D4F5Ad0  ONLINE       0     0     0
>               c3t50014EE25D930525d0  ONLINE       0     0     0
>               c3t50014EE2083DDCFAd0  ONLINE       0     0     0
>             raidz2-4                 ONLINE       0     0     0
>               c3t50014EE20721B2BBd0  ONLINE       0     0     0
>               c3t50014EE2B2E8DC6Ad0  ONLINE       0     0     0
>               c3t50014EE25C40CF9Fd0  ONLINE       0     0     0
>               c3t50014EE25D24BC9Fd0  ONLINE       0     0     0
>               c3t50014EE2B2E8DFDAd0  ONLINE       0     0     0
>               c3t50014EE25C33BF64d0  ONLINE       0     0     0
>               c3t50014EE25D9328C4d0  ONLINE       0     0     0
>               c3t50014EE25C401FBFd0  ONLINE       0     0     0
>               c3t50014EE2B1899AC5d0  ONLINE       0     0     0
>
> errors: No known data errors
>
> The system crashed, and when rebooted it would just core dump and
> reboot again.  After booting into single-user mode I found the zpool that was
> crashing the system, exported it, and was able to bring the system
> back up. When I tried to import that pool, it would again crash the system.
> I finally found that I could import the pool without crashing the system
> if I imported it read-only:
>
> zpool import -o readonly=on data
>
> That is the output shown above, from the pool imported read-only.
> Looking for any advice on ways to save this pool? As you can see, zpool
> reports no errors with the pool.
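
One hedged option while the data is reachable read-only, before attempting any repair, is to copy it off to another pool. The destination pool 'backup', the dataset/snapshot names, and the mountpoints below are placeholders, not names from this system:

  # Existing snapshots can be replicated; new snapshots cannot be created
  # on a read-only pool.
  zfs list -t snapshot -r data
  zfs send -R data/somefs@lastsnap | zfs receive -d backup

  # Datasets with no snapshots can be copied from their read-only mounts:
  rsync -aHAX /data/somefs/ /backup/somefs/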
>
> Running OI 151a8 i86pc i386 i86pc Solaris
>

-- 
C. J. Keist                     Email: cj.keist at colostate.edu
Systems Group Manager           Solaris 10 OS (SAI)
Engineering Network Services    Phone: 970-491-0630
College of Engineering, CSU     Fax:   970-491-5569
Ft. Collins, CO 80523-1301

All I want is a chance to prove 'Money can't buy happiness'


