[OpenIndiana-discuss] Huge ZFS root pool slowdown - diagnose root cause?
Lou Picciano
LouPicciano at comcast.net
Tue Dec 11 16:16:47 UTC 2018
John, Jason,
Many thanks for your brainstorming on this…
> On Dec 10, 2018, at 6:19 PM, John D Groenveld <groenveld at acm.org> wrote:
>
> In message <4AB4A1DD-5A90-4F9A-B26E-9A71028A02C0 at comcast.net>, Lou Picciano writes:
>> Is this evidence of erroneous attempts to read boot blocks/loader on disk0?
>>
>> Given the machine BIOS identification of drives, dunno that I can be absolutely certain disk0 is referring to one disk - or is the entire rpool seen as disk0 once the OS is loading?
>
> Does iostat(1M) -E report errors?
Absolutely none. In fact, having called precisely that command before, I was thrown by the ‘Errors: 0’ everywhere…
sd1 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS5C302 Revision: A180 Serial No: ML0221F302X0MD
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
sd2 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS5C302 Revision: A580 Serial No: ML0220F30HWBSD
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
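(Aside, for the archive: since iostat insists everything is clean while the console is full of read errors, the FMA telemetry is another place those errors should surface if the driver is seeing them. Standard illumos commands, nothing exotic:)
$ pfexec fmadm faulty    # any diagnosed faults?
$ pfexec fmdump -e       # one line per ereport; look for ereport.io.* against the rpool disks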
> Have you tried interrogating the drives via smartctl?
> <URL:https://wiki.openindiana.org/oi/Adding+SMART+disk+monitoring+as+a+SMF+service>
I have now (finally!) managed to get perhaps the key bit of reporting from smartctl - does this seem adequately diagnostic?:
(I am fully satisfied to replace the drive; I just want to be sure I’ve run to ground any potential root causes.)
$ pfexec smartctl -a -d sat,12 /dev/rdsk/c2t0d0s0 | grep Raw_Read
1 Raw_Read_Error_Rate 0x000b 094 094 016 Pre-fail Always - 1376259
$ pfexec smartctl -a -d sat,12 /dev/rdsk/c2t1d0s0 | grep Raw_Read
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
Above seems consistent with all the read errors I see at boot.
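(For anyone chasing something similar later: the next attributes I'd pull off that same drive, using the same sat,12 addressing, are the sector counters - a nonzero pending/reallocated count would settle the question:)
$ pfexec smartctl -A -d sat,12 /dev/rdsk/c2t0d0s0 | egrep 'Reallocated_Sector|Current_Pending|Offline_Uncorrect'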
> Happy hunting,
> John
> groenveld at acm.org
>
> On Dec 10, 2018, at 6:36 PM, jason matthews <jason at broken.net> wrote:
>
> Have you tried taking a look to see if the drives are accumulating errors?
>
Other iostat fun I’d already tried (not very helpful!):
$ iostat -ien
---- errors ---
s/w h/w trn tot device
...
0 0 0 0 rpool
0 0 0 0 c2t0d0
0 0 0 0 c2t1d0
...
>
> if so, pull the bad drive.
>
> What happens if you go into the boot manager and manually select a boot disk? If the problem is with a single drive, then the other drive should boot normally, right? Try booting from both drives, selecting each one manually.
That’s also interesting. With the hundreds of read errors at boot-up, the boot manager is never even (visibly) presented. I guess I could try this again after booting from a USB image...
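(If I do get that far, rewriting the boot bits on just the suspect disk would be a cheap first experiment. Assuming the usual OpenIndiana loader layout - untested on this box:)
$ pfexec installboot -m /boot/pmbr /boot/gptzfsboot /dev/rdsk/c2t0d0s0    # reinstall pmbr + gptzfsboot on c2t0d0 only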
>
> you can speed up the scrub with:
>
> echo zfs_scrub_delay/W0x0 |mdb -kw
>
> echo zfs_scan_min_time_ms/W0x0 |mdb -kw
Good commands for reference - I was unaware of these! But even with the scrub canceled for the moment, I’m still seeing virtually continuous drive controller traffic.
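(For the record, canceling a scrub and reading those tunables back - standard commands:)
$ pfexec zpool scrub -s rpool              # cancel the in-progress scrub
$ zpool status rpool | grep scan:          # should show "scrub canceled on ..."
$ echo zfs_scrub_delay/D | pfexec mdb -k   # read the current value (no -w: read-only)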
You also wanted to see:
$ iostat -nMxC 5
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 962.3 0.0 11.3 15.7 0.2 16.3 0.2 5 23 c2
0.0 398.4 0.0 4.3 7.1 0.1 17.9 0.2 83 6 c2t0d0
0.0 415.2 0.0 4.2 8.6 0.1 20.6 0.2 87 9 c2t1d0
0.0 40.2 0.0 0.7 0.0 0.0 0.0 0.4 0 2 c2t2d0
0.0 40.4 0.0 0.7 0.0 0.0 0.0 1.1 0 4 c2t3d0
0.0 34.4 0.0 0.7 0.0 0.0 0.0 0.3 0 1 c2t4d0
0.0 33.6 0.0 0.7 0.0 0.0 0.0 0.3 0 1 c2t5d0
Again, I assume the symmetry in findings between t0 and t1 is due to their mirrored status… but it doesn’t seem to help differentiate the offending device. (For comparison, t2-t5 are the data pool.) There is essentially zero ‘user’ activity on either the data or root pools...
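(Next thing I may try: a DTrace one-liner to attribute that continuous traffic to a process - though if it’s ZFS itself, most of it will show up under sched. Assuming sd1/sd2 from the iostat -E output above map to c2t0d0/c2t1d0:)
$ pfexec dtrace -n 'io:::start { @[execname, args[1]->dev_statname] = count(); }'
(Ctrl-C prints I/O counts per process, per device.)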