[OpenIndiana-discuss] Huge ZFS root pool slowdown - diagnose root cause?

Lou Picciano LouPicciano at comcast.net
Mon Dec 10 16:10:38 UTC 2018


Really need some feedback from The Experts here…

We have a root pool which has started to run very slowly…

Evidence? 
- originally, the only indication was what seemed to be nearly-continuous drive controller traffic (the pool is nowhere near full…)
- a scrub of the pool has taken about 5 days to get through only a few hundred GB of this 2TB pool (the good news, however, is that no errors have been found. Can this be trusted?)
  (at that rate, it would take at least another week to finish the scrub…)

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c2t0d0s0  ONLINE       0     0     0
            c2t1d0s0  ONLINE       0     0     0

Boot process has become excruciatingly slow, and worrisome. Immediately after the ‘Loading OS’ message, we get hundreds of messages like:

	disk0: Read 8 sector(s) from <really_long address?> to 0xffffe000 (0x8000): 0x1

Is this evidence of erroneous attempts to read boot blocks/loader on disk0?

Given the machine BIOS identification of drives, I don't know that I can be absolutely certain disk0 refers to ‘one’ disk - or is the entire rpool seen as disk0 once the OS is loading?

The machine does eventually boot, however - it takes about 20 minutes! Recent Hipster updates (2018-11-27) have been applied. The system otherwise runs quite well. Most client data is on datapool; those clients remain oblivious. (To be honest, they were oblivious before this…(!) )

$ iostat -D 1 5
   backup       datapool       rpool          sd1       
rps wps util  rps wps util  rps wps util  rps wps util  
  2   3  1.3    8 131  9.3   87  89 99.6   44  45 73.0  
  0   0  0.0    0 645 18.3  105   0 99.9   52   0 71.2  
  0   0  0.0    0  16  0.1   98 107 100.0   52  54 91.0  
  0   0  0.0    0   0  0.0   31 373 100.0   13 190 95.0  
  0   0  0.0    0   0  0.0   54  14 100.0   24  12 48.8
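
For a more per-device view, I've also been watching the extended stats - average service time (asvc_t) and %b in particular - on the theory that a dying drive will show service times wildly worse than its mirror partner:

$ iostat -xn 1 5

(That's my assumption, anyway: that one side of the rpool mirror will stand out in asvc_t / %b while the other looks normal.)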

Some specific questions:

1) How can I definitively diagnose which of the pool disks is the bad one? Seems obvious, but is it?
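
   (My rough plan - sketched here in case it's the wrong approach - is to compare the per-device error counters and FMA ereports for the two rpool drives:

   # iostat -En c2t0d0 c2t1d0
   # fmdump -e
   # fmadm faulty

   i.e. soft/hard/transport error counts per device, the one-line ereport log, and any already-diagnosed faults. If smartmontools happens to be installed I'd also run smartctl -a against each raw device, though I'm not sure of the right -d option for this controller.)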

2) Is this a matter of corrupted boot blocks on one drive, being compensated for by ‘good blocks’ on the other?
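
   (Related: if this does turn out to be stale or corrupt boot bits rather than failing hardware, my understanding is that current Hipster can rewrite the loader to both halves of the mirror with:

   # bootadm install-bootloader -P rpool

   Please correct me if that's the wrong incantation.)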

3) These are SATA disks; I am about to try the ‘hot swap’ in situ approach. Is it safe to do this with questionable boot blocks? The plan:

# zpool offline rpool c2t0d0s0
# cfgadm -c unconfigure sata0/0
 — swap — 
# cfgadm -c configure sata0/0
# zpool online rpool c2t0d0s0
# zpool replace rpool c2t0d0s0
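
…followed by, once the resilver completes (and assuming I've first given the new disk an s0 slice matching the old layout, since the pool lives on slices):

# zpool status -v rpool
# bootadm install-bootloader -P rpool

i.e. confirm the resilver finished cleanly, then put boot blocks on the replacement so it can actually boot the machine. Does that sequence look sane?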

Tks for any insights,

Lou Picciano

