[OpenIndiana-discuss] Zfs stability "Scrubs"

Richard Elling richard.elling at richardelling.com
Tue Oct 16 00:02:28 UTC 2012


On Oct 15, 2012, at 3:00 PM, heinrich.vanriel at gmail.com wrote:

> Most of my storage background is with EMC CX and VNX, which are used in a vast number of datacenters.
> They run a process called sniffer that runs in the background and requests a read of all blocks on each disk individually for a specific LUN. If there is an unrecoverable read error, the process requests a Background Verify (BV) to check for data consistency. The unit will also conduct a proactive copy to a hot spare, I believe once data has been verified, from the disk where the error(s) were seen.
> 
> A BV is also requested when there is a LUN failover, enclosure path failure or a storage processor failure.
> 
> 
> My point is that most high-end storage units have some form of data verification process that is active all the time.

Don't assume BV is data verification. On most midrange systems these scrubbers just
check whether disks report errors. While this should catch most media errors, it does not
catch phantom writes or other corruption in the datapath. On systems with SATA disks,
there is no way to add any additional checksums to the sector, so they are SOL if there
is data corruption that does not also cause a disk failure. For SAS or FC disks, some
vendors use larger sectors and include per-sector checksums that can help catch
some phantom writes or datapath corruption.

There is some interesting research that shows how scrubs for RAID-5 systems can
contaminate otherwise good data. The reason is that if a RAID-5 parity mismatch
occurs and the disks themselves do not fail, how do you know where the data
corruption is? In those cases, scrubs are evil. ZFS does not suffer from this problem because
the checksums are stored in the parent's metadata.
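
To make the parity-mismatch point concrete, here is a toy sketch in Python. It is
not how any real RAID-5 firmware or ZFS is implemented: the four-byte blocks, the
XOR helper and the parent_checksums list (a stand-in for checksums held in parent
metadata) are all assumptions made for illustration.

```python
import hashlib
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together (RAID-5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def checksum(block):
    return hashlib.sha256(block).hexdigest()


# A stripe of three data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Checksums kept apart from the blocks they describe (hypothetical stand-in
# for checksums stored in the parent's metadata).
parent_checksums = [checksum(b) for b in data]

# Silent corruption: disk 1 returns wrong bytes but reports no error.
on_disk = list(data)
on_disk[1] = b"BxBB"

# A parity-only scrub sees that something is wrong, but not which member.
parity_mismatch = xor_blocks(on_disk) != parity

# If the scrub "repairs" the mismatch by recomputing parity from the data,
# the parity now agrees with the bad block and the good data is lost.
contaminated_parity = xor_blocks(on_disk)
rebuilt_from_contaminated = xor_blocks([on_disk[0], on_disk[2], contaminated_parity])

# A checksum-guided scrub finds the block that fails its checksum, rebuilds
# it from the other members plus the original parity, and verifies the result.
bad = [i for i, b in enumerate(on_disk) if checksum(b) != parent_checksums[i]]
repaired = list(on_disk)
for i in bad:
    others = [b for j, b in enumerate(on_disk) if j != i]
    candidate = xor_blocks(others + [parity])
    if checksum(candidate) == parent_checksums[i]:
        repaired[i] = candidate
```

In this sketch the parity-only path can only report that the stripe is inconsistent,
while the checksum path identifies block 1 as the bad member and restores it.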

> In my opinion, whether to scrub should depend on the importance of the data, and the frequency on the type of raidz, change rates and disk type used.
> 
> Perhaps in future ZFS will have the ability to limit resource allocation when scrubbing, as can be set with BV. Rebuild priority can also be set.

Throttling exists today, but most people don't consider mdb a suitable method for "setting" :-(
Scrub priority is already the lowest priority; I don't see much need to increase it.
 -- richard

> Also some high-end controllers have "port" verify for each disk (media read) that runs periodically when using their integrated RAID. Since in the world of ZFS it is recommended to use JBOD, I see it as more than just the filesystem. I have never deployed a system containing mission-critical data using filesystem RAID protection other than with ZFS, since there is no protection in them and I would much rather bank on the controller.
> 
> 
> 
> my few cents on scrubs. 
> 
> 
> 
> Thanks
> 
> 
> 
> 
> 
> From: Jim Klimov
> Sent: October 13, 2012 9:02
> To: Discussion list for OpenIndiana
> Subject: Re: [OpenIndiana-discuss] Zfs stability "Scrubs"
> 
> 
> 2012-10-13 7:26, Michael Stapleton wrote:
>> The VAST majority of data centers are not storing data in storage that
>> does checksums to verify data, that is just the reality. Regular backups
>> and site replication rule.
> 
> And this actually concerns me... we help maintain some deployments
> built by customers including professional arrays like Sun Storagetek
> 6140 serving a few LUNs to directly attached servers (so it happens).
> 
> The arrays are black boxes to us - we don't know if they use
> something block-checksummed similar to ZFS inside, or can only
> protect against whole-disk failures, when a device just stops
> responding?
> 
> We still have little idea in which config the data in a ZFS pool
> would be safer, and which should give more performance:
> * if we use the array with its internal RAID6, and the client
>   computer makes a pool over the single LUN
> * a couple of RAID6 array boxes in a mirror provided by arrays'
>   firmware (independently of client computers, who see a MPxIO
>   target LUN), and the computer makes a pool over the single
>   multi-pathed LUN
> * a couple of RAID6 array boxes in a mirror provided by ZFS
>   (two independent LUNs mirrored by computer)
> * serve LUNs from each disk in JBOD manner from the one or two
>   arrays, and have ZFS construct pools over that.
> 
> Having expensive hardware RAIDs (anyway available on customer's
> site) serving as JBODs is kind of overkill - any well-built JBOD
> costing a fraction of this array could suffice. But regarding
> data integrity known to be provided by ZFS and unknown to be
> really provided by black-box appliances, downgrading the arrays
> to JBODs might be better. Who knows?.. (We don't, advice welcome).
> 
> 
> 
> There are several more things to think about:
> 
> 1) Redundant configs without knowledge of which side of the mirror
>    is good, or what permutation of RAID blocks yields the correct
>    answer, are basically useless, and they can propagate errors by
>    overwriting a good copy of the data with a corrupted one,
>    without knowing which is which.
> 
>    For example, take a root mirror. You find that your OS can't
>    boot. You can try to split the mirror into two separate disks,
>    fsck each of them and if one is still correct, recreate the
>    mirror using it as base (first half). Even if both disks give
>    some errors, these might be in different parts of the data, so
>    you have a chance of reconstructing the data using these two
>    halves and/or backups. However, if your simplistic RAID just
>    copies data from disk1 to disk2 in case of any discrepancies
>    and unclean shutdowns, you're roughly 50% likely to corrupt a
>    good disk2 with bad data from disk1.
> 
>    This setup assumed that bit-rot never occurred or was too rare,
>    bus/RAM errors never happened or were ruled out by CRC/ECC,
>    and instead disks died altogether, instantly becoming bricks
>    (which could be quite true in the old days, and can still be
>    probable with expensive enterprise hardware). Basically, this
>    assumed that data written from a process was the same data that
>    hit the disk platters and the same data that was returned upon
>    reads (unless an IO error/deviceMissing were reported) - in that
>    case old RAIDs could indeed propagate assumed-good data onto
>    replacement disk(s) during reconstruction of the array.
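
The root-mirror example above can be sketched in Python as follows. This is only
an illustration under stated assumptions: the block contents, the naive_resync and
checksum_resync helpers and the expected checksum list are all made up here and do
not correspond to any particular RAID implementation.

```python
import hashlib


def checksum(block):
    return hashlib.sha256(block).hexdigest()


def naive_resync(disk1, disk2):
    """Simplistic mirror: on any discrepancy, copy disk1 over disk2."""
    return list(disk1), list(disk1)


def checksum_resync(disk1, disk2, expected):
    """Per block, keep whichever side matches the independently stored checksum."""
    healed = []
    for a, b, want in zip(disk1, disk2, expected):
        if checksum(a) == want:
            healed.append(a)
        elif checksum(b) == want:
            healed.append(b)
        else:
            healed.append(None)  # both copies bad: this block needs a backup
    return healed, list(healed)


# What was originally written, and the checksums recorded at that time.
good = [b"boot", b"kernel", b"config"]
expected = [checksum(b) for b in good]

# Each half of the mirror has silently gone bad in a different place.
disk1 = [b"boot", b"kerXel", b"config"]
disk2 = [b"boot", b"kernel", b"conZig"]

naive1, naive2 = naive_resync(disk1, disk2)
healed1, healed2 = checksum_resync(disk1, disk2, expected)
```

Here the naive resync overwrites the good "kernel" block on disk2 with the bad one
from disk1, while the checksum-guided resync recovers every block from whichever
half still holds a good copy.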
> 
> 2) Backups and replicas without means to verify them (checksums
>    or at least three-way comparisons at some level) are also
>    tainted, because you don't really know if what you read from
>    them ever matches what you wrote to them (perhaps several years
>    ago, counting from the moment the data was written onto RAID
>    originally).
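
As a rough sketch of the two verification approaches mentioned in point 2 (a
checksum recorded at write time, or a three-way comparison), assuming small
in-memory byte strings in place of real backup media; the function names and
sample data are hypothetical:

```python
import hashlib
from collections import Counter


def sha256(data):
    return hashlib.sha256(data).hexdigest()


def verify_with_checksum(copy, recorded_digest):
    """True if the copy still matches the digest recorded at write time."""
    return sha256(copy) == recorded_digest


def majority_vote(copies):
    """Three-way comparison: return the value at least two copies agree on, else None."""
    value, count = Counter(copies).most_common(1)[0]
    return value if count >= 2 else None


original = b"payroll-2012"
recorded = sha256(original)       # stored separately when the data was written

rotted_backup = b"payroll-2O12"   # one character silently changed on the backup

backup_ok = verify_with_checksum(rotted_backup, recorded)
voted = majority_vote([original, rotted_backup, original])
no_quorum = majority_vote([b"a", b"b", b"c"])
```

With only two copies and no checksum, a mismatch tells you that something is wrong
but not which copy to trust; the recorded digest or the third copy is what breaks
the tie.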
> 
> My few cents,
> //Jim
> 
> _______________________________________________
> OpenIndiana-discuss mailing list
> OpenIndiana-discuss at openindiana.org
> http://openindiana.org/mailman/listinfo/openindiana-discuss

--

Richard.Elling at RichardElling.com
+1-760-896-4422




