[oi-dev] SSD-based pools

Jim Klimov jimklimov at cos.ru
Wed Oct 1 10:25:13 UTC 2014


On 30 September 2014 at 22:39:00 CEST, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
>On Tue, 30 Sep 2014, Schweiss, Chip wrote:
>
>> 
>> 
>> On Tue, Sep 30, 2014 at 11:52 AM, Bob Friesenhahn
>> <bfriesen at simple.dallas.tx.us> wrote:
>>
>>             Presumably because the checksum is wrong.
>> If by turning off 'sync' it is meant that the zil is disabled, then
>> that has nothing to do with zfs checksums being wrong.  If drive
>> cache flush is disabled for async transaction groups, then nothing
>> but problems can result (e.g. failure to import the pool at all).
>> 
>> 
>> I doubt the pool would ever not be importable.  Data loss, sure.  ZFS
>> will be rolled back to the last completed TXG.  Like I said before, on
>> this pool data loss is not an issue as long as we know it's lost.
>> Losing the entire pool because of a power failure is not an issue.
>> All the processing pipelines using the pool at the time would have
>> lost power too and would be
>
>Obviously it does happen since there have been reports to the zfs 
>mailing list over the years about pools which were completely lost 
>because the drive firmware did not honor cache flush requests.
>
>There are only so many (20?) TXG records which are discoverable.
>
>This has nothing to do with zfs tunables though.  Zfs always issues 
>drive cache flush requests for each TXG since otherwise the pool would 
>be insane.
>
>Bob

> There are only so many (20?) TXG records which are discoverable.

There is a 128KB ring buffer of uberblocks (the 'roots' of the ZFS block tree) in each vdev label: 32 entries on 4KB-sectored drives, 128 on 512-byte-sectored ones, since an uberblock slot is the larger of 1KB or the sector size. However, there is no guarantee that all of an older tree is still consistent: blocks freed in that older TXG may have been reused by newer transactions which the rollback now discards. Statistically, the older the TXG you roll back to, the more likely such corruption becomes.
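If you want to see what you actually have to roll back to, the surviving uberblocks (and their TXG numbers) can be listed right off a leaf device, and an import can be asked to rewind. A rough sketch, assuming an illumos-era zdb/zpool; the pool and device names here are only placeholders:

    # dump the vdev labels plus the uberblock ring of one leaf device;
    # each uberblock entry shows its txg and timestamp
    zdb -ul /dev/rdsk/c0t0d0s0

    # dry-run a recovery import: report roughly what a rewind would
    # discard, without actually modifying the pool
    zpool import -F -n tank

    # if the loss looks acceptable, do the rewind import for real
    zpool import -F tank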

Note that metadata is normally written with 2 or 3 copies (besides the raidzN/mirror redundancy), so it is more likely to remain intact (at least one consistent ditto copy survives), while user data is typically single-copy and is therefore more likely to be damaged.
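If some datasets carry data you would rather protect the same way, the ditto-block mechanism can be turned on for user data too, at the cost of the extra space; a small sketch, the dataset name being just an example:

    # keep two copies of each user-data block in this dataset,
    # on top of whatever raidzN/mirror redundancy the pool has
    zfs set copies=2 tank/important

    # check it took effect; note it only applies to newly written blocks
    zfs get copies tank/important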

Regarding other fatal faults, I've had a raidz2 system built from consumer hardware where, through whatever voodoo, previously fine blocks were turning into unrecoverable garbage over time. That is, either all or sufficiently many of the 6 disks (a 4+2 set) were corrupted at the same offsets. Maybe some signal noise was misinterpreted by all 6 drive firmwares as write commands. Maybe partitioning the disks so that the zfs slices start at different sector offsets would help in such a case - if that one of many random guesses at the cause is even correct...

In the same manner, that pool had errors not only in named files, but also in 'metadata:<0x...>' items, which might or might not have been a problem. In my experience I have had unimportable pools - but mostly on virtual systems that lied about cache flushing and/or reordered IOs during a kernel or power failure.
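As a sanity check on the host side, it is also worth confirming that the OS itself isn't suppressing the flushes; a hedged illumos-style sketch:

    # make sure nobody disabled cache flushes in /etc/system
    grep zfs_nocacheflush /etc/system

    # read the live kernel value; 0 means flush requests are being issued
    echo "zfs_nocacheflush/D" | mdb -k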

More often there were time-consuming operations (like large deletions on, or of, a deduped dataset) that were 'backgrounded' while the pool was live, but became foreground prerequisites for importing the pool after a reboot. These could take days to complete - and many reboots, if the operation required more RAM than the system had...
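On pools with the async_destroy feature enabled, the leftover work from such a big destroy can at least be watched instead of guessed at; a sketch, the pool name again just a placeholder:

    # space still waiting to be reclaimed by a backgrounded destroy;
    # this should shrink toward 0 as the pool catches up
    zpool get freeing tank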

HTH,
Jim
--
Typos courtesy of K-9 Mail on my Samsung Android



