[OpenIndiana-discuss] How do I verify that fmd is actually able to detect and log ECC errors?

Udo Grabowski (IMK) udo.grabowski at kit.edu
Tue Mar 16 20:14:33 UTC 2021



On 16.03.21 20:08, Reginald Beardsley via openindiana-discuss wrote:
> I suspect memory errors on my Sol 10 u8 system, but there are no memory errors reported by "fmdump -eV".  All the errors and events are zfs related.
> 
> Initial symptom is starting a scrub on a freshly booted system will complete properly, but the same operation after the system has been up for a few days will cause a kernel panic. Immediately after a reboot a scrub will complete normally.  This behavior suggests bit fade to me.
> 
> This has been very consistent  for the last few months.  The system is an HP Z400 which is 10 years old and generally has run 24x7.  It was certified by Sun for Solaris 10 which is why I bought it and uses unbuffered, unregistered ECC DDR3 DIMMs.  Since my initial purchase I have bought three more Z400s.
> 
> Recently the system became unstable to the point I have not been able to complete a "zfs send -R" to a 12 TB WD USB drive.  My last attempt using a Hipster LiveImage died after ~25 hours.
> 
> My Hipster 2017.10 system shows some events which appear to be ECC related, but I'm not able to interpret them.  I've attached a file with the last such event. 
 >....

Clearly a memory error: second DIMM on the first DRAM channel of CPU 0
(wherever that is physically located is another question...).
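
To see at a glance whether fmd is logging memory ereports at all, a
small tally of the ereport classes can help. The following is only a
sketch: it assumes that 'fmdump -e' prints one event per line with the
class name in the last column, and it treats any class with a name
component starting with "mem" as memory-related, which is a heuristic of
mine. The sample text and the class names in it are made up for
illustration; the classes on your system may differ.

```python
from collections import Counter


def count_ereport_classes(fmdump_text):
    """Count ereport classes in text captured from `fmdump -e`.

    Assumption: one event per line, class name in the last column.
    """
    counts = Counter()
    for line in fmdump_text.splitlines():
        fields = line.split()
        # Skip blank lines and the header line (no "ereport." class).
        if not fields or not fields[-1].startswith("ereport."):
            continue
        counts[fields[-1]] += 1
    return counts


def memory_classes(counts):
    """Keep classes with a dotted component starting with 'mem'.

    This is a heuristic, not an official list of memory ereports.
    """
    return {
        cls: n
        for cls, n in counts.items()
        if any(part.startswith("mem") for part in cls.split("."))
    }


# Made-up sample; real timestamps and class names will differ.
SAMPLE = """TIME                 CLASS
Mar 14 03:12:45.1234 ereport.fs.zfs.checksum
Mar 15 11:02:10.5678 ereport.cpu.intel.quickpath.mem_ce
Mar 16 18:40:01.9012 ereport.cpu.intel.quickpath.mem_ce
"""

if __name__ == "__main__":
    for cls, n in memory_classes(count_ereport_classes(SAMPLE)).items():
        print(n, cls)
```

To use it on real data, save the output with 'fmdump -e > ereports.txt'
and pass the file contents to count_ereport_classes(). If nothing
memory-related shows up, 'fmdump -eV' on the individual events is still
the place to look for details.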

If I recall some of your last mails correctly, you had varying boot
times, which indicates that the BIOS's internal memtest routine stumbled
across that module. Once it is sufficiently damaged, the BIOS should
isolate it by itself and you'll find yourself with one DIMM's worth of
memory less (seen in 'top'). But that could take a long time. The
current error was correctable, so it should not have an immediate impact
on your RAM content.
I would rather suspect some additional issues with your voltages, since
usually, after ten years, the caps of the power supply have degraded
sufficiently to generate fluctuations. But still, the BIOS protections
(if there are any) should kick in here if something really goes
haywire. At the least, there should be a BIOS page that shows you the
essential voltages (or monitor them with 'ipmitool sensor', if the
machine has a BMC). This problem will certainly be load-dependent.
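
If the machine does have a BMC, the voltage readings could be filtered
with something like the sketch below. It assumes the usual
pipe-separated layout of 'ipmitool sensor' (name, reading, unit, status,
then thresholds) and simply flags every voltage sensor whose status is
not "ok". The sensor names and values in the sample are invented; yours
will depend on the board.

```python
def parse_ipmitool_sensor(text):
    """Parse `ipmitool sensor` output into (name, reading, unit, status).

    Assumption: columns are separated by '|' and the first four are
    name, reading, unit and status. Rows without a numeric reading
    (e.g. "na" or discrete sensors) are skipped.
    """
    rows = []
    for line in text.splitlines():
        cols = [c.strip() for c in line.split("|")]
        if len(cols) < 4:
            continue
        name, value, unit, status = cols[:4]
        try:
            reading = float(value)
        except ValueError:
            continue
        rows.append((name, reading, unit, status))
    return rows


def suspicious_voltages(rows):
    """Return voltage sensors whose status is anything other than 'ok'."""
    return [r for r in rows if r[2] == "Volts" and r[3] != "ok"]


# Invented sample; sensor names and thresholds vary by board.
SAMPLE = """\
CPU Temp | 41.000 | degrees C | ok | na | na | na | 85.000 | 90.000 | 95.000
12V      | 11.236 | Volts     | nc | 10.600 | 10.800 | 11.300 | 12.700 | 13.200 | 13.400
5V       | 5.056  | Volts     | ok | 4.400 | 4.500 | 4.700 | 5.300 | 5.500 | 5.600
3.3V     | na     | Volts     | na | na | na | na | na | na | na
"""

if __name__ == "__main__":
    for name, reading, unit, status in suspicious_voltages(
        parse_ipmitool_sensor(SAMPLE)
    ):
        print(name, reading, unit, status)
```

Since the problem is load-dependent, it would make sense to capture the
readings several times while a scrub is running rather than once on an
idle system.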

The symptoms you describe usually cannot come from defective ECC memory;
only non-ECC DIMMs can cause such hassles.


