[OpenIndiana-discuss] Checksum errors on high-volume / high-speed writes
Richard Elling
richard.elling at richardelling.com
Sat Aug 17 22:36:46 UTC 2013
Hi Willem,
On Aug 14, 2013, at 10:49 AM, wim at vandenberge.us wrote:
> Good morning,
> Last week we put three identical oi_151a7 systems into pre-production. Each
> system has 240 drives in 9-drive RAIDZ1 vdevs (I'm aware of the potential DR
> issues with this configuration and I'm ok with them in this case). The drives
> are Seagate Enterprise nearline SAS, 7200RPM. The servers are all identical
> Supermicro servers with dual 4C Xeons, maxed out memory and LSI 9200-8E HBAs.
> Intel MLC SSD for boot and cache, Intel SLC for ZIL. All of these are components
> we have used many times before without issue.
>
> While loading up the systems with data we started to see low numbers of checksum
> errors across all drives. The first time we saw it we pulled the drives and low
> level tested them, no errors. Scrub finds no issues. iostat -EXN shows no hard,
> soft or transport errors. iostat -xnz shows no anomalous drives.
Also check "fmdump -eV" for details on the checksum mismatches. This analysis
can be tedious, but it is necessary to root-cause the issue.
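
To take some of the tedium out of it, the ereports can be tallied per device.
What follows is only a rough sketch, and it rests on assumptions: that the
verbose output prints each ereport as indented "name = value" lines, that the
"class" line comes before the "vdev_path" line within a record, and that
checksum events carry the class ereport.fs.zfs.checksum. The function name and
the file name in the usage note are made up for illustration.

```python
import re
import sys
from collections import Counter

# Assumed layout: each field of an ereport appears as "name = value".
FIELD = re.compile(r"^\s*(\w+) = (.*)$")


def count_checksum_ereports(text):
    """Count checksum ereports per vdev_path in saved 'fmdump -eV' text."""
    counts = Counter()
    in_checksum = False
    for line in text.splitlines():
        match = FIELD.match(line)
        if not match:
            continue
        key, value = match.group(1), match.group(2).strip()
        if key == "class":
            # A new record starts here; remember whether it is a checksum event.
            in_checksum = (value == "ereport.fs.zfs.checksum")
        elif key == "vdev_path" and in_checksum:
            counts[value] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Expects the path of a file holding saved "fmdump -eV" output.
    with open(sys.argv[1]) as handle:
        tally = count_checksum_ereports(handle.read())
    for path, n in tally.most_common():
        print(f"{n:6d}  {path}")
```

Save the output first (fmdump -eV > ereports.txt), then run the script on that
file. If the counts cluster on devices behind one HBA, cable or expander, that
points at the path rather than the disks; an even spread across all drives
points more toward something shared, such as memory or power.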
>
> During the load test we're pushing between 14 and 16Gb/sec to each system and
> the CPU load average does go up significantly (about 8) but that is to be
> expected with a RAIDZ1 volume this big and this busy.
>
> I don't want to put the systems into production until I figure out if I have a
> problem or not. Thoughts / ideas?
The usual suspects in these cases are:
+ Power supplies
+ HBAs
+ Memory (ECC really helps for large systems)
-- richard
--
ZFS storage and performance consulting at http://www.RichardElling.com