[OpenIndiana-discuss] Openindiana ZFS server crashes and reboots

Bentley, Dain DBentley at nas.edu
Fri Oct 12 18:09:38 UTC 2012


I see the checksum errors with 'zpool status -v', but the drives are good as far as I know. 'fmadm faulty' reports nothing:

 root@Atlas:/home/user# zpool status
  pool: rpool
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c3t1d0s0  ONLINE       0     0     0

errors: No known data errors

  pool: volume0
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 50.5K in 0h14m with 0 errors on Fri Oct 12 09:37:27 2012
config:

        NAME        STATE     READ WRITE CKSUM
        volume0     ONLINE       0     0     0
          mirror-0  ONLINE       0     0     2
            c3t0d0  ONLINE       0     0     3
            c3t3d0  ONLINE       0     0     4


root@Atlas:/home/user# fmadm faulty
root@Atlas:/home/user#



Funny thing is this only happens during big copies, when I'm moving an image from one source to another over the network.
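For what it's worth, I suppose one way to re-check is to clear the counters and scrub again after the next big copy (volume0 being the mirror above), roughly:

  zpool clear volume0          # reset the READ/WRITE/CKSUM counters
  zpool scrub volume0          # re-read and re-verify every block in the pool
  zpool status -v volume0      # see whether new checksum errors have shown up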


-----Original Message-----
From: Udo Grabowski (IMK) [mailto:udo.grabowski at kit.edu] 
Sent: Friday, October 12, 2012 2:00 PM
To: Discussion list for OpenIndiana
Subject: Re: [OpenIndiana-discuss] Openindiana ZFS server crashes and reboots

On 10/12/12 07:34 PM, Bentley, Dain wrote:
> Hello Udo, thanks for the reply.  Here is the text from fmdump -eV.
> Is there anything I should be looking for?

So BOTH disks spit out ZFS checksum errors like a machine gun. This can either be a controller/cable problem or a memory problem (you don't have ECC memory with an ECC-supporting processor, e.g. a Xeon? Otherwise those errors would be reported by 'fmdump -e').
Or it is some problem in the OS: those checksum problems haunt me on my home workstation (also no ECC) when scrubbing the rpool mirror, although the disks and cables are fine and no errors occur when not scrubbing. But the sheer number of errors here does not look like that symptom.
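To test the memory theory, I would first check whether FMA has logged anything besides the ZFS ereports; the exact class names vary by platform, so take the patterns below only as a rough sketch:

  fmdump -e | grep -v fs.zfs        # anything left over points away from the disks
  fmdump -e -c 'ereport.cpu.*'      # CPU/memory ereports, if the platform reports them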

'zpool status -v' should show some degradation of the mirror with the exact checksum counts, and also whether files or metadata are affected; 'fmadm faulty' lists components retired due to errors. If you have data corruption, that could cause reboots on occasion.
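For example (pool name taken from your output, the -a flag from memory, so check the man page):

  zpool status -v volume0      # per-vdev counters plus paths of any corrupted files
  fmadm faulty -a              # -a should also list faults that were already repaired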

Hard to say how to hunt this down, other than checking cables and memory seating, looking for newer HBA or disk/BIOS firmware, and torturing the memory with an advanced memory tester.
The fact that the machine reboots makes me doubt that this is a pure ZFS/disk problem; I have never seen that. The machine eventually stalls when those errors hammer it too hard, but that would not cause a reboot. Memory is always the best bet, but it could also be a motherboard problem.
Maybe the vendor has a hardware checking tool on the CD in the box or on its website?
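And if the box is really panicking rather than just power-cycling, a crash dump would tell more. Assuming the default dump setup (savecore directory /var/crash/<hostname>), something along these lines should show the panic reason:

  dumpadm                  # confirm crash dumps are enabled and where savecore writes them
  cd /var/crash/Atlas      # default savecore directory is /var/crash/<hostname>
  savecore -vf vmdump.0    # expand a compressed dump into unix.0 / vmcore.0
  mdb unix.0 vmcore.0      # ::status and ::msgbuf then show the panic message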

>
> TIME                           CLASS
> Oct 11 2012 06:06:59.370787527 ereport.fs.zfs.data nvlist version: 0
>          class = ereport.fs.zfs.data
>          ena = 0x9ebd5c95f9f00401
>          detector = (embedded nvlist)
>          nvlist version: 0
>                  version = 0x0
>                  scheme = zfs
>                  pool = 0x6a8f5a381b7e2f2c
>          (end detector)
>
>          pool = volume0
>          pool_guid = 0x6a8f5a381b7e2f2c
>          pool_context = 0
>          pool_failmode = wait
>          zio_err = 50
>          zio_objset = 0x28
>          zio_object = 0x1
>          zio_level = 0
>          zio_blkid = 0x6047da
>          __ttl = 0x1
>          __tod = 0x50769a43 0x1619c4c7
>
> Oct 11 2012 06:06:59.370787942 ereport.fs.zfs.checksum nvlist version: 0
>          class = ereport.fs.zfs.checksum
>          ena = 0x9ebd5c95f9f00401
>          detector = (embedded nvlist)
>          nvlist version: 0
>                  version = 0x0
>                  scheme = zfs
>                  pool = 0x6a8f5a381b7e2f2c
>                  vdev = 0xdd2eef656bfc1db5
>          (end detector)
>
>          pool = volume0
>          pool_guid = 0x6a8f5a381b7e2f2c
>          pool_context = 0
>          pool_failmode = wait
>          vdev_guid = 0xdd2eef656bfc1db5
>          vdev_type = disk
>          vdev_path = /dev/dsk/c3t0d0s0
>          vdev_devid = id1,sd at SATA_____HDS725050KLA360_______KRVN03ZAG1JY4D/a
>          parent_guid = 0xfd55a23a93069d20
>          parent_type = mirror
>          zio_err = 50
>          zio_offset = 0x14893b600
>          zio_size = 0x4600
>          zio_objset = 0x28
>          zio_object = 0x1
>          zio_level = 0
>          zio_blkid = 0x6047da
>          cksum_expected = 0x521c0ebafd7 0x2be41b8abfd1b3 0xfccf5694dcdbaa49 0x3e24575fd2f4aa4d
>          cksum_actual = 0x521c2ebafd7 0x2be436a2bfd1b3 0xfcd00e26f8dbaa49 0x4161c187aaf4aa4d
>          cksum_algorithm = fletcher4
>          __ttl = 0x1
>          __tod = 0x50769a43 0x1619c666
>
> Oct 11 2012 06:06:59.370788126 ereport.fs.zfs.checksum nvlist version: 0
>          class = ereport.fs.zfs.checksum
>          ena = 0x9ebd5c95f9f00401
>          detector = (embedded nvlist)
>          nvlist version: 0
>                  version = 0x0
>                  scheme = zfs
>                  pool = 0x6a8f5a381b7e2f2c
>                  vdev = 0x377ede8d0fb06f7e
>          (end detector)
>
>          pool = volume0
>          pool_guid = 0x6a8f5a381b7e2f2c
>          pool_context = 0
>          pool_failmode = wait
>          vdev_guid = 0x377ede8d0fb06f7e
>          vdev_type = disk
>          vdev_path = /dev/dsk/c3t3d0s0
>          vdev_devid = id1,sd at SATA_____HDS725050KLA360_______KRVN03ZAG39LKD/a
>          parent_guid = 0xfd55a23a93069d20
>          parent_type = mirror
>          zio_err = 50
>          zio_offset = 0x14893b600
>          zio_size = 0x4600
>          zio_objset = 0x28
>          zio_object = 0x1
>          zio_level = 0
>          zio_blkid = 0x6047da
>          cksum_expected = 0x521c0ebafd7 0x2be41b8abfd1b3 0xfccf5694dcdbaa49 0x3e24575fd2f4aa4d
>          cksum_actual = 0x521c2ebafd7 0x2be436a2bfd1b3 0xfcd00e26f8dbaa49 0x4161c187aaf4aa4d
>          cksum_algorithm = fletcher4
>          __ttl = 0x1
>          __tod = 0x50769a43 0x1619c71e
>
> ... (1000 more of them in rapid succession)...
>
> Oct 12 2012 07:54:53.123217977 ereport.fs.zfs.checksum nvlist version: 0
>          class = ereport.fs.zfs.checksum
>          ena = 0xe644965bba100401
>          detector = (embedded nvlist)
>          nvlist version: 0
>                  version = 0x0
>                  scheme = zfs
>                  pool = 0xda1ed4eddd886ca2
>                  vdev = 0x76d6d5ecc9007061
>          (end detector)
>
>          pool = volume0
>          pool_guid = 0xda1ed4eddd886ca2
>          pool_context = 0
>          pool_failmode = wait
>          vdev_guid = 0x76d6d5ecc9007061
>          vdev_type = disk
>          vdev_path = /dev/dsk/c3t0d0s0
>          vdev_devid = id1,sd at SATA_____HDS725050KLA360_______KRVN03ZAG1JY4D/a
>          parent_guid = 0xb85bb36665652fa6
>          parent_type = mirror
>          zio_err = 50
>          zio_offset = 0x6414e7e00
>          zio_size = 0xa400
>          zio_objset = 0x28
>          zio_object = 0x1
>          zio_level = 0
>          zio_blkid = 0x1c3121
>          cksum_expected = 0xd682bb3691e 0x110e9a533de6720 0x6bb562c60c3d27a8 0xd15dd859803c65fd
>          cksum_actual = 0xd682db3691e 0x110e9df4bde6720 0x6bb8ae9ba83d27a8 0xf14a4b22583c65fd
>          cksum_algorithm = fletcher4
>          bad_ranges = 0x2fd0 0x2fd8
>          bad_ranges_min_gap = 0x8
>          bad_range_sets = 0x1
>          bad_range_clears = 0x0
>          bad_set_bits = 0x0 0x0 0x0 0x2 0x0 0x0 0x0 0x0
>          bad_cleared_bits = 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
>          __ttl = 0x1
>          __tod = 0x5078050d 0x7582839
>
> Oct 12 2012 09:31:27.252886984 ereport.fs.zfs.checksum nvlist version: 0
>          class = ereport.fs.zfs.checksum
>          ena = 0x3a9567025fb00801
>          detector = (embedded nvlist)
>          nvlist version: 0
>                  version = 0x0
>                  scheme = zfs
>                  pool = 0xda1ed4eddd886ca2
>                  vdev = 0x4dd27d7d2cff5683
>          (end detector)
>
>          pool = volume0
>          pool_guid = 0xda1ed4eddd886ca2
>          pool_context = 0
>          pool_failmode = wait
>          vdev_guid = 0x4dd27d7d2cff5683
>          vdev_type = disk
>          vdev_path = /dev/dsk/c3t3d0s0
>          vdev_devid = id1,sd at SATA_____HDS725050KLA360_______KRVN03ZAG39LKD/a
>          parent_guid = 0xb85bb36665652fa6
>          parent_type = mirror
>          zio_err = 50
>          zio_offset = 0x4f8cf3c00
>          zio_size = 0xca00
>          zio_objset = 0x28
>          zio_object = 0x1
>          zio_level = 0
>          zio_blkid = 0x110e12
>          cksum_expected = 0x109aa00f800f 0x1aae29387c03843 0x289b0f3d871cdb32 0xb91c890d879e9fc4
>          cksum_actual = 0x109aa20f800f 0x1aae2d39fc03843 0x289f125e231cdb32 0xe3fb48d05f9e9fc4
>          cksum_algorithm = fletcher4
>          bad_ranges = 0x49d0 0x49d8
>          bad_ranges_min_gap = 0x8
>          bad_range_sets = 0x1
>          bad_range_clears = 0x0
>          bad_set_bits = 0x0 0x0 0x0 0x2 0x0 0x0 0x0 0x0
>          bad_cleared_bits = 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
>          __ttl = 0x1
>          __tod = 0x50781baf 0xf12bfc8
>


-- 
Dr.Udo Grabowski    Inst.f.Meteorology a.Climate Research IMK-ASF-SAT
www-imk.fzk.de/asf/sat/grabowski/ www.imk-asf.kit.edu/english/sat.php
KIT - Karlsruhe Institute of Technology            http://www.kit.edu
Postfach 3640,76021 Karlsruhe,Germany  T:(+49)721 608-26026 F:-926026



