[OpenIndiana-discuss] Zfs stability "Scrubs"

Sat Oct 13 09:55:36 UTC 2012

A few more comments:

2012-10-13 11:56, Roel_D wrote:
> Thank you all for the good answers!
>
> So if i put it all together :
> 1. ZFS is, in mirror and RAID configs, the best currently available option for reliable data

Yes, though even ZFS is not a replacement for backups, because
data loss can have causes outside ZFS's control, including
admin errors, datacenter fires, code bugs and so on.

> 2. Without scrubs data is checked on every read for integrity

With normal reads, this check only takes place for the one
semi-randomly chosen copy of the block. If this copy is not
valid, other copies are consulted.
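
To illustrate the read path described above, here is a toy sketch in
Python. It is not ZFS code: the names (read_block, checksum) and the
use of SHA-256 are assumptions made for the example. It only shows
that a normal read verifies one randomly picked copy and touches the
others only if that one fails its checksum.

```python
import hashlib
import random


def checksum(data: bytes) -> str:
    # Stand-in for the block checksum kept in the parent block pointer.
    return hashlib.sha256(data).hexdigest()


def read_block(copies, expected, rng=random):
    """Return (data, repaired) after verifying copies in random order.

    copies   -- list of bytes objects, one per mirror side / ditto copy
    expected -- checksum the block is supposed to have
    Copies that were never tried are NOT verified by this read.
    """
    order = list(range(len(copies)))
    rng.shuffle(order)          # "semi-randomly chosen copy"
    bad = []
    for idx in order:
        if checksum(copies[idx]) == expected:
            good = copies[idx]
            for b in bad:       # rewrite the copies that failed (self-heal)
                copies[b] = good
            return good, len(bad)
        bad.append(idx)
    raise IOError("no valid copy of the block")
```

Note that if the first copy tried is good, a corrupted second copy
goes unnoticed by this read.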

> 3. Unread data will not be checked for integrity
> 4. Scrubs will solve point 3.

Yes, because a scrub forces a read and checksum verification
of all copies, including data that nothing has read for a while.
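
As a contrast with a normal read, a scrub can be pictured roughly
like the toy Python sketch below. Again, this is not how ZFS is
implemented; the pool layout (a dict of block id to checksum plus
copies) and the function names are made up for the example.

```python
import hashlib


def checksum(data: bytes) -> str:
    # Stand-in for the real block checksum.
    return hashlib.sha256(data).hexdigest()


def scrub(pool):
    """Verify every copy of every block; repair from a good copy if any.

    pool -- dict: block_id -> (expected_checksum, [copy, copy, ...])
    Returns counters of copies checked, repaired and unrecoverable.
    """
    stats = {"checked": 0, "repaired": 0, "unrecoverable": 0}
    for block_id, (expected, copies) in pool.items():
        # Find any copy that still matches the expected checksum.
        good = next((c for c in copies if checksum(c) == expected), None)
        for i, data in enumerate(copies):
            stats["checked"] += 1
            if checksum(data) == expected:
                continue
            if good is None:
                stats["unrecoverable"] += 1   # no redundancy left
            else:
                copies[i] = good              # overwrite the bad copy
                stats["repaired"] += 1
    return stats
```

On a real system the equivalent is started with "zpool scrub" on the
pool, and the results show up in "zpool status".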

> 5. Real servers with good hardware (HCL), ECC memory and servergrade harddisks have a very low chance of dataloss/corruption when used with ZFS.

Put otherwise, cheaper hardware tends to cause problems
of various kinds which the hardware itself can neither
detect nor fix, so corrupted data is passed on to ZFS,
which faithfully saves the trash to disk. Few programs do
verify-on-write to test the saved results...

> 6. Large modern drives with large storage like any > 750 GB hd have a higher chance for corruption

The bit-error rates have stayed about the same for disks of
the past decade, being roughly one bad bit per 10 TB of I/O.
With disk sizes and overall throughput growing, the chance of
hitting an error on a particular large disk increases.
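
A back-of-the-envelope way to see this, as a Python sketch. It takes
the figure above to mean one bad bit per 10 terabytes read (the unit
is my assumption) and treats bit errors as independent, which real
disks do not strictly obey. The function name is made up.

```python
import math


def p_at_least_one_error(bytes_read: float,
                         bytes_per_error: float = 10e12) -> float:
    """Probability of hitting at least one bad bit while reading.

    bytes_per_error -- assumed average amount of I/O per bad bit
                       (10 TB here, an assumption, not a vendor spec)
    Uses 1 - (1 - p)^n with p the per-bit error probability and
    n the number of bits read, computed in a numerically stable way.
    """
    p_bit = 1.0 / (bytes_per_error * 8)
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-p_bit))


for size_tb in (0.75, 1, 3):
    p = p_at_least_one_error(size_tb * 1e12)
    print(f"{size_tb} TB read once: {p:.1%} chance of an error")
```

Under these assumptions, one full read of a 1 TB disk has a bit
under a 10% chance of hitting an error, and a 3 TB disk roughly 26%.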

> 7. Real SAS and SCSi drives offer the best option for reliable data
> 8. So called near-line SAS drives can give problems when combined with ZFS because they haven't been tested very long

There are also some architectural issues and lessons learned,
like "don't use SATA disks with SAS expanders", while direct
attachment of SATA disks to individual HBA ports works without
problems (e.g. Sun Thumpers are built like this, with six
eight-port HBAs on board to drive the 48 disks in the box).

> 9. Checking your logs for hardware messages should be a daily job

Better yet, some monitoring system (nagios, zabbix, whatever)
should check these logs, so you have one dashboard for all your
computers with a big green light on it, meaning no problems
detected anywhere. You can worry if the light goes not-green ;)
You should also test the system manually with drills, to verify
that the monitoring itself catches problems correctly, though
that can be a non-daily routine.
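
One probe such a monitoring system might run, sketched in Python.
It follows the usual Nagios plugin convention of exit codes (0 OK,
2 CRITICAL, 3 UNKNOWN) and relies on "zpool status -x" printing
"all pools are healthy" when nothing is wrong. The function names
and the decision to treat any other output as critical are my
assumptions; it checks pool health only and does not replace
watching the hardware logs.

```python
import subprocess

OK, CRITICAL, UNKNOWN = 0, 2, 3


def classify(zpool_output: str):
    """Map the text of 'zpool status -x' to (exit_code, message)."""
    text = zpool_output.strip()
    if text == "all pools are healthy":
        return OK, "OK: all pools are healthy"
    if not text:
        return UNKNOWN, "UNKNOWN: no output from zpool"
    # Anything else is reported as a problem, quoting the first line.
    first = text.splitlines()[0].strip()
    return CRITICAL, f"CRITICAL: {first}"


def main() -> int:
    """Run the check once and return the plugin exit code."""
    try:
        result = subprocess.run(["zpool", "status", "-x"],
                                capture_output=True, text=True,
                                timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        print(f"UNKNOWN: cannot run zpool: {exc}")
        return UNKNOWN
    code, message = classify(result.stdout)
    print(message)
    return code
```

A wrapper script would end with sys.exit(main()). For a drill, feed
classify() the output saved from a deliberately degraded test pool
and see whether the dashboard light actually changes.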

//Jim