[OpenIndiana-discuss] Huge ZFS root pool slowdown - diagnose root cause?
Jim Klimov
jimklimov at cos.ru
Mon Dec 17 09:21:16 UTC 2018
On December 13, 2018 9:14:46 PM UTC, Lou Picciano <LouPicciano at comcast.net> wrote:
>The plot thickens, I’m afraid. Since last post, I’ve replaced the
>drive, and throughput remains molasses-in-January slow…
>period indicated below is more than 24 hours:
>
> scan: resilver in progress since Wed Dec 12 15:13:11 2018
>26.9G scanned out of 1.36T at 345K/s, (scan is slow, no estimated time)
> 26.9G resilvered, 1.93% done
>config:
>
> NAME              STATE     READ WRITE CKSUM
> rpool             DEGRADED     0     0     0
>   mirror-0        DEGRADED     0     0     0
>     replacing-0   DEGRADED     0     0     0
>       c2t0d0s0/old  OFFLINE    0     0     0
>       c2t0d0s0    ONLINE       0     0     0
>     c2t1d0s0      ONLINE       0     0     0
>
>I have added boot blocks during resilvering, but have used the bootadm
>install-bootloader approach. It reports it’s done this (not GRUB; aren’t
>we on the Forth-based boot loader now?):
>
>"/usr/sbin/installboot -F -m -f //boot/pmbr //boot/gptzfsboot
>/dev/rdsk/c2t0d0s0”
>"/usr/sbin/installboot -F -m -f //boot/pmbr //boot/gptzfsboot
>/dev/rdsk/c2t1d0s0"
>
>Will double-check this when resilvering completes. Could be a long time…
>Machine has not been rebooted yet at all.
>
>Bob, yes: This is a 4K sector drive. Bad? What’s the impact?
>
>Slowness at boot: Yes, immediately. Well before scrub or any other
>process had had a chance to grab hold.
>
>What’s next? Could it be as simple as a cable? These cables haven’t
>been perturbed in… a long time.
>
>Can’t do anything safely now until this resilvering is completed,
>correct?
>
>Wow. A mess.
>
>Lou Picciano
>
>> On Dec 11, 2018, at 7:56 PM, jason matthews <jason at broken.net> wrote:
>>
>>
>> On 12/11/18 10:14 AM, John D Groenveld wrote:
>>> And when its replaced, I believe the OP will need to installboot(1M)
>>> the new drive.
>>> Correct me if I'm wrong, but Illumos ZFS doesn't magically put
>>> the boot code with zpool replace.
>>
>> man installgrub
>>
>> installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/<whatever>
>>
>>
>> should take like one second and can be done before, after, or during the resilver.
>_______________________________________________
>openindiana-discuss mailing list
>openindiana-discuss at openindiana.org
>https://openindiana.org/mailman/listinfo/openindiana-discuss
To me it also looks like very fragmented I/O (e.g. the pool was near full at some point and wrote many scattered bits), so when the scrub tries to read data from a TXG, or read a large file like the miniroot, it has to do a lot of seeks for small reads (remember: about 200 ops/s max for spinning rust), leading to high busy-ness and low bandwidth.
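A quick way to see that pattern is to watch the pool disks with iostat while things are slow, e.g.:

  iostat -xn 5
  # near-100 %b (busy) with only a few hundred KB/s in kr/s suggests the disk is
  # seek-bound on small scattered reads rather than short on raw bandwidth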
To an extent, drive slowness (hardware issues) can be evaluated with a large sequential dd read over a long stroke of the disk - that should run in the tens of MB/s range (numbers depend on hardware age and other factors), while highly random I/O on my pools was often under 1-5 MB/s when I had reason to worry.
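For example (a rough sketch; pick the raw device you want to test and ideally run it while the pool is otherwise quiet):

  time dd if=/dev/rdsk/c2t1d0s0 of=/dev/null bs=1024k count=1024
  # reads 1 GB off the raw slice; divide by the elapsed seconds to get MB/s.
  # Healthy spinning rust should manage tens of MB/s sequentially; if even this
  # crawls, suspect the drive itself or the cable/controller path.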
Also, look at the number of datasets, particularly auto-snapshots. If some tools read those in (e.g. a zfs-auto-snapshot or znapzend service doing it in a loop to manage the snaps), that can also be a lot of continuous I/O. The metadata can be a lot to read (especially if it does not all fit in ARC or L2ARC) and parse; my systems were quite busy with roughly 100k snapshots (about 1000 datasets with 100 snaps each), taking a couple of minutes just to read the list of snapshots, and a comparable while to run 'beadm list' before andyf's recent fixes.
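A quick way to gauge the scale (and how long just listing the metadata takes) is something like:

  time zfs list -H -o name -t filesystem,volume | wc -l
  time zfs list -H -o name -t snapshot | wc -l
  # if the snapshot listing alone takes minutes, tools that walk snapshots in a
  # loop will keep the pool busy more or less continuously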
The DTrace Toolkit, IIRC, has some scripts to build histograms of I/O sizes, so you can rule out or confirm lots of scattered small I/O as the problem. If that is the case, and you can point a finger at read-time fragmentation, you might have to migrate the data away from this pool onto a new one, without parallelizing the copy routines, so the on-disk bits are laid out sequentially for the same files and/or TXG commits. This would also let you migrate the pool to 4K disks, if that is part of the problem today.
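If the toolkit is not at hand, a one-liner along these lines (using the io provider) gives roughly the same picture; run it for a minute during the slow period and press Ctrl-C:

  dtrace -n 'io:::start { @[args[1]->dev_statname] = quantize(args[0]->b_bcount); }'
  # a distribution dominated by 4-16K reads while the disk sits near 100% busy
  # points at fragmentation and seeks rather than a bandwidth limit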
Also, a terabyte-range root pool is a bit too much. Given its criticality for the system and some of its limitations (device-ID change hell, caching, etc.), consider moving data that is not the root FS away to another pool, even if in partitions on the same disk, or better yet on other disks. This would keep rpool's scope smaller and its changes less frequent, leaving the system better able to boot and then cope with issues on non-critical pools after the full OS is up. Maybe my write-ups on split-root setups and on wrapping zpool imports as SMF services can help here, or at least give some more ideas and details to look at.
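As a rough sketch only (pool and dataset names below are made up; adapt to what you actually have), splitting non-root data out could look like:

  zpool create datapool c2t2d0                      # a spare disk or partition
  zfs snapshot -r rpool/export@migrate
  zfs send -R rpool/export@migrate | zfs receive -u datapool/export
  # verify the copy, adjust mountpoints on datapool, then destroy the old copy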
Good luck,
Jim Klimov
--
Typos courtesy of K-9 Mail on my Android