[OpenIndiana-discuss] General ZFS questions (Michelle Knight)

Thu Jan 13 14:31:03 UTC 2011

I must agree that I'm not entirely satisfied with the error handling of 
ZFS. I have a raidz2 storage pool consisting of eight 1.5TB hard drives.

While I've never had any checksum errors explicitly reported by the 
'zpool status' command, most scrubs have resulted in a certain number 
of kilobytes of data being repaired on one or two drives.

I would be grateful if someone could clarify whether Michelle Knight is 
referring to this or to the numbers in the CKSUM column.
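
To keep the two things apart, it may help to record the per-device 
counters after every scrub. Below is a minimal Python sketch that pulls 
the READ/WRITE/CKSUM columns out of saved 'zpool status' text. The 
SAMPLE text is made up for illustration (pool and device names are 
hypothetical), and the function names are my own. It assumes the usual 
"NAME STATE READ WRITE CKSUM" table layout and keeps the counters as 
strings rather than guessing how large values are formatted.

```python
# Hypothetical, hand-written sample in the style of 'zpool status' output.
SAMPLE = """\
  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            c7t0d0  ONLINE       0     0     0
            c7t1d0  ONLINE       0     0     3

errors: No known data errors
"""

HEADER = ["NAME", "STATE", "READ", "WRITE", "CKSUM"]


def parse_error_counters(status_text):
    """Return {name: (read, write, cksum)} from the config table."""
    counters = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == HEADER:
            in_table = True           # table rows start after the header
        elif in_table and not fields:
            in_table = False          # a blank line ends the table
        elif in_table and len(fields) >= 5:
            name, _state, read, write, cksum = fields[:5]
            counters[name] = (read, write, cksum)
    return counters


def devices_with_cksum_errors(counters):
    """Return the names whose CKSUM counter is not '0', sorted."""
    return sorted(n for n, (_r, _w, ck) in counters.items() if ck != "0")


if __name__ == "__main__":
    print(devices_with_cksum_errors(parse_error_counters(SAMPLE)))
```

Saving the result after each scrub would show whether the CKSUM 
counters ever move, independently of whatever amount of repaired data 
the scrub summary reports.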

Last fall I was affected by a number of freezes when accessing the 
pool over the network. More thorough tests revealed that it was just the 
storage pool that froze and not the entire system. 'zpool status' 
reported no errors, nor did 'iostat -En'. I didn't know what to do until 
someone pointed out that there may be a performance bug in my build 
(OpenSolaris b134) that has been fixed in later builds. The problems 
grew worse and became unbearable, so after a lot of hesitation I 
decided to upgrade the system to b148. Apart from a blacked-out screen 
and a messed-up keyboard layout, the rest of the system seemed to be 
working properly.

After the upgrade I ran some tests again, and 'iostat -En' suddenly 
reported errors on one drive. I replaced it and did a resilver. Towards 
the end of the resilvering process I noticed that a resilver was also 
taking place on another drive, and 'iostat -En' indicated some errors on 
that drive as well. However, the scrub I did afterwards detected no 
errors and no repairs were done. The storage pool has worked fine since 
then and I no longer suffer from any freezes.

The troubleshooting procedure would have been a lot easier if there were 
some monitoring routines that reported when a drive takes longer than 
usual to respond, and some SMART diagnostics to run on each individual 
drive whenever there is a suspicion that a drive is faulty.
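
As a rough idea of what I mean by the first kind of check, here is a 
minimal Python sketch that flags a drive whose service time is far above 
that of its peers in the same pool. It is not an existing tool. The 
function name, the thresholds and the sample numbers are all my own 
assumptions, and the input is assumed to be one average service time in 
milliseconds per device, collected by whatever means (for example from 
the asvc_t column of 'iostat -xn').

```python
from statistics import median


def slow_devices(service_times_ms, factor=3.0, floor_ms=20.0):
    """Return devices that respond much more slowly than their peers.

    A device is flagged when its service time is above floor_ms and
    more than `factor` times the median of all the other devices.
    Both thresholds are arbitrary assumptions, not tuned values.
    """
    slow = []
    for dev, value in service_times_ms.items():
        others = [v for d, v in service_times_ms.items() if d != dev]
        if not others:
            continue  # nothing to compare a single device against
        baseline = median(others)
        if value > floor_ms and value > factor * baseline:
            slow.append(dev)
    return sorted(slow)


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    sample = {"c7t0d0": 8.1, "c7t1d0": 7.9, "c7t2d0": 95.0, "c7t3d0": 8.4}
    print(slow_devices(sample))
```

Comparing each drive against the median of the others, rather than 
against a fixed limit alone, means a pool that is busy as a whole (during 
a scrub, say) does not flag every drive at once.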

On 2011-01-13 13:58, Edward Ned Harvey wrote:
>> From: Gary Gendel [mailto:gary at genashor.com]
>>
>> Though I generally agree with the advice that you received, I take an
>> exception to the statement that checksums "always" indicate hardware or
>> driver failure.  As a long-time raidz user, I can attest to the fact
>> that, occasionally, changes to the zfs code have wreaked havoc with
>> raidz.  In one build I tried an even number of disks works but an odd
>> number produces checksum errors.
>>
>> That said...  Have you run scrubs back to back?  If so, do you always
>> get checksum errors and do they occur in different places? On a similar
>> problem such as yours, I discovered that the scrub was introducing
>> checksum errors.  I tracked it down to elevated temperatures during the
>> scrub, which really pounds at the disks.  Better cooling on the disk
>> tower resolved the problem.
> While it's true that no software in the world is perfect and therefore there
> must exist bugs in ZFS and therefore I overstated things in saying cksum errors
> are always caused by hardware...
>
> The example you gave serves the opposite purpose.  If the root cause of your
> problem was elevated temperatures, that's a hardware problem.