[OpenIndiana-discuss] Checksum errors on high-volume / high-speed writes

Richard Elling richard.elling at richardelling.com
Sat Aug 17 22:36:46 UTC 2013


Hi Willem,

On Aug 14, 2013, at 10:49 AM, wim at vandenberge.us wrote:

> Good morning,
> Last week we put three identical oi_151a7 systems into pre-production. Each
> system has 240 drives in 9-drive RAIDZ1 vdevs (I'm aware of the potential DR
> issues with this configuration and I'm ok with them in this case). The drives
> are Seagate Enterprise nearline SAS, 7200RPM. The servers are all identical
> Supermicro servers with dual 4C Xeons, maxed out memory and LSI 9200-8E HBAs.
> Intel MLC SSD for boot and cache, Intel SLC for ZIL. All of these are components
> we have used many times before without issue.
> 
> While loading up the systems with data we started to see low numbers of checksum
> errors across all drives. The first time we saw it, we pulled the drives and
> low-level tested them: no errors. Scrub finds no issues. iostat -EXN shows no
> hard, soft or transport errors. iostat -xnz shows no anomalous drives.

Also check "fmdump -eV" for details on the checksum mismatches. This analysis can
be tedious, but it is necessary to root-cause the issue.
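
One way to take some of the tedium out of it is to tally the checksum ereports
per device, so you can see whether the errors really are spread evenly or
cluster behind one HBA, expander, or enclosure. The following is a minimal
sketch, not a definitive tool. It assumes that each ereport in the "fmdump -eV"
text carries a "class = ..." line followed later by a "vdev_path = ..." line.
The function name and the sample text (device paths included) are made up for
illustration; check the field names against your own output first.

```python
from collections import Counter

# Assumed ereport class for ZFS checksum mismatches.
CHECKSUM_CLASS = "class = ereport.fs.zfs.checksum"
PATH_PREFIX = "vdev_path = "


def count_checksum_ereports(text):
    """Count checksum ereports per vdev_path in saved 'fmdump -eV' text."""
    counts = Counter()
    in_checksum = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("class = "):
            # A new ereport begins; remember whether it is a checksum ereport.
            in_checksum = stripped == CHECKSUM_CLASS
        elif in_checksum and stripped.startswith(PATH_PREFIX):
            counts[stripped[len(PATH_PREFIX):]] += 1
            in_checksum = False  # count each ereport only once
    return counts


# Invented sample, trimmed to the two fields the parser relies on.
SAMPLE = """\
nvlist version: 0
    class = ereport.fs.zfs.checksum
    vdev_path = /dev/dsk/c1t0d0s0
nvlist version: 0
    class = ereport.fs.zfs.io
    vdev_path = /dev/dsk/c1t0d0s0
nvlist version: 0
    class = ereport.fs.zfs.checksum
    vdev_path = /dev/dsk/c2t0d0s0
nvlist version: 0
    class = ereport.fs.zfs.checksum
    vdev_path = /dev/dsk/c1t0d0s0
"""

for path, n in count_checksum_ereports(SAMPLE).most_common():
    print(n, path)
```

To use it on a real system, save the output of "fmdump -eV" to a file and pass
the file contents to count_checksum_ereports(). An even spread across all
drives points away from individual disks and toward something shared.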

> 
> During the load test we're pushing between 14 and 16Gb/sec to each system and
> the CPU load average does go up significantly (about 8) but that is to be
> expected with a RAIDZ1 volume this big and this busy.
> 
> I don't want to put the systems into production until I figure out if I have a
> problem or not. Thoughts / ideas?

The usual suspects in these cases are:
	+ Power supplies
	+ HBAs
	+ memory (ECC really helps for large systems)


 -- richard

-- 

ZFS storage and performance consulting at http://www.RichardElling.com