[OpenIndiana-discuss] The kiss of death

Reginald Beardsley pulaskite at yahoo.com
Fri Apr 23 11:49:51 UTC 2021


 Hmmm,

That suggests I was perhaps too optimistic about using disks larger than 2 TB, despite having successfully installed 2020.10 on a single-slice 5 TB disk.

I've got S11.4 installed and running again.

Reg

     On Friday, April 23, 2021, 01:26:37 AM CDT, Toomas Soome <tsoome at me.com> wrote:  
 
 


On 23. Apr 2021, at 01:57, Reginald Beardsley via openindiana-discuss <openindiana-discuss at openindiana.org> wrote:

What do those mean? I have seen them numerous times even though the system booted.



On Thursday, April 22, 2021, 05:46:30 PM CDT, Nelson H. F. Beebe <beebe at math.utah.edu> wrote:


 ZFS: i/o error - all block copies unavailable



An I/O error means what it says: we encountered an error while doing I/O, in this case while attempting to read the disk. The I/O error may come from BIOS INT13, see
usr/src/boot/sys/boot/i386/libi386/biosdisk.c:
function bd_io(), message template:
printf("%s%d: Read %d sector(s) from %lld to %p " ...);
You should see the disk name as disk0: etc.
Or, in the UEFI case, usr/src/boot/sys/boot/efi/libefi/efipart.c:
function efipart_readwrite(), message template:
printf("%s: rw=%d, blk=%ju size=%ju status=%lu\n" ...)
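Just for illustration, a rough sketch of that kind of reporting path (the names firmware_read() and disk_read() and the simulated failure are made up here, not the actual bd_io()/efipart_readwrite() code):

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <stdint.h>
#include <stddef.h>

/*
 * Stand-in for the firmware read (BIOS INT13 in bd_io(), or the UEFI
 * block I/O protocol in efipart_readwrite()).  Here it simply fails
 * past an artificial limit so the error path can be exercised.
 */
static int
firmware_read(int unit, uint64_t lba, void *buf, int nblks)
{
	(void)unit;
	if (lba >= 1000)
		return (1);		/* simulated device error */
	memset(buf, 0, (size_t)nblks * 512);
	return (0);
}

/*
 * Wrapper that reports the failing unit and LBA, in the spirit of the
 * printf() templates quoted above, and hands a plain EIO upward.
 */
static int
disk_read(int unit, uint64_t lba, void *buf, int nblks)
{
	if (firmware_read(unit, lba, buf, nblks) != 0) {
		printf("disk%d: Read %d sector(s) from %llu failed\n",
		    unit, nblks, (unsigned long long)lba);
		return (EIO);
	}
	return (0);
}

int
main(void)
{
	char sector[512];

	(void) disk_read(0, 2000, sector, 1);	/* triggers the report */
	return (0);
}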

With the ZFS reader, we can also get an I/O error because of a failing checksum check. That is, we issued a read command to the disk and did not get an error back from it, but the data checksum does not match.
A checksum error may happen in these scenarios:
1. The data actually is corrupt (bitrot, partial write, ...).
2. The disk read did not return an error, but read the wrong data or did not actually return any data at all. This is often the case when we hit the 2 TB barrier and the BIOS INT13 implementation is buggy (see the sketch after this list). We see this with large disks: the setup used to boot fine, but as the rpool fills up, at some point in time (after a pkg update) the vital boot files (kernel or boot_archive) get written past the 2 TB "line". The next boot then attempts to read the kernel or boot_archive and gets an error.
3. A bug in the loader ZFS reader code. If for some reason the zfs reader code instructs the disk layer to read the wrong blocks, the disk I/O is most likely OK, but the logical data is not. I cannot exclude this cause, but it is a very unlikely case.
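To put a number on that 2 TB "line" from scenario 2: with 512-byte sectors, a 32-bit LBA can address at most 2^32 * 512 bytes = 2 TiB, and firmware that truncates a 64-bit LBA to 32 bits reads the wrong sector without reporting any error. A small illustrative program (not loader code):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	uint64_t sector = 512;			/* bytes per sector */
	uint64_t limit = 1ULL << 32;		/* max sectors with a 32-bit LBA */

	/* 2^32 * 512 bytes = 2 TiB: the "line" mentioned above. */
	printf("32-bit LBA limit: %llu sectors = %llu bytes\n",
	    (unsigned long long)limit,
	    (unsigned long long)(limit * sector));

	/* A boot_archive block written just past the line ... */
	uint64_t lba = limit + 100;
	/* ... is silently fetched from here after a 32-bit truncation: */
	uint32_t truncated = (uint32_t)lba;

	printf("requested LBA %llu, buggy firmware reads LBA %u\n",
	    (unsigned long long)lba, truncated);
	return (0);
}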
When we do get an I/O error and the pool has redundancy (mirror, raidz, or zfs set copies=..), we attempt to read an alternate copy of the file system block; if all reads fail, we get the second half of this message (all block copies unavailable).
Unfortunately, the current ZFS reader code reports a generic EIO from its internal stack, so this error message is a very generic one. I do plan to fix this with the decryption code addition.
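Putting the checksum check and the copy fallback together, this is roughly the shape of the logic (the helper names are made up, not the actual loader functions; the stubs just simulate the silent wrong-data case from scenario 2), and you can see how every failure collapses into the same generic EIO:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <stddef.h>

#define	MAX_COPIES	3	/* a ZFS block pointer carries up to 3 DVAs */

/*
 * Stubs standing in for the real reader: the "disk read" always
 * succeeds, but the data never matches the expected checksum.
 */
static int
copy_read(int copy, void *buf, size_t size)
{
	(void)copy;
	memset(buf, 0xab, size);	/* pretend the read "worked" */
	return (0);
}

static int
checksum_ok(const void *buf, size_t size)
{
	(void)buf; (void)size;
	return (0);			/* pretend the checksum never matches */
}

/* Try every available copy; only report EIO once all of them fail. */
static int
zfs_read_block(void *buf, size_t size, int ncopies)
{
	for (int c = 0; c < ncopies && c < MAX_COPIES; c++) {
		if (copy_read(c, buf, size) != 0)
			continue;	/* real I/O error, try the next copy */
		if (checksum_ok(buf, size))
			return (0);	/* good copy found */
		/* read succeeded but the data is wrong: also try the next copy */
	}
	printf("ZFS: i/o error - all block copies unavailable\n");
	return (EIO);			/* generic EIO, no detail survives */
}

int
main(void)
{
	char buf[4096];

	(void) zfs_read_block(buf, sizeof (buf), 2);	/* e.g. a mirrored pool */
	return (0);
}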


 ZFS: can't read MOS of pool rpool



This means we got I/O errors while opening the pool for reading: we found a pool label (there are 4 copies of the pool label on disk), but we got an error while attempting to read the MOS (which is sort of like the superblock in UFS), and without the MOS we cannot continue reading this pool.
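For orientation, a small sketch of where those 4 label copies live on a vdev (offsets simplified here, ignoring the alignment the real code does); the MOS itself sits in the space in between and is found through the uberblock stored in the label, so the labels can be perfectly readable while the MOS is not:

#include <stdio.h>
#include <stdint.h>

#define	VDEV_LABEL_SIZE	(256ULL * 1024)	/* each ZFS label is 256 KiB */

int
main(void)
{
	uint64_t devsize = 5ULL * 1000 * 1000 * 1000 * 1000;	/* e.g. 5 TB */
	uint64_t label_off[4] = {
		0,				/* L0, front of device */
		VDEV_LABEL_SIZE,		/* L1, front of device */
		devsize - 2 * VDEV_LABEL_SIZE,	/* L2, end of device */
		devsize - VDEV_LABEL_SIZE	/* L3, end of device */
	};

	for (int i = 0; i < 4; i++)
		printf("label %d at byte offset %llu\n", i,
		    (unsigned long long)label_off[i]);
	return (0);
}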
If this is a virtualized system, the error appears after a reboot, and the disk is not over 2 TB, one possible reason is the VM manager failing to write down virtual disk blocks. This is the case when the VM manager does not implement the cache flush for disks and, for example, crashes. I have seen this myself with VMware Fusion.
rgds,
toomas
  

