[OpenIndiana-discuss] How do I verify that fmd is actually able to detect and log ECC errors?

Tue Mar 16 21:58:41 UTC 2021

Guys, thanks for the comments.

I have confirmed through a friend who has designed and built a Zynq-based DSO with DDR3 memory that the ECC is only computed on a read operation, which is what I had always assumed. So if a pointer table initialized at boot time only gets accessed when you do a scrub, it would exactly match the symptoms. The kernel panic would prevent fmd from logging an ECC event. If all the bits fade to zero and you dereference the pointer in the kernel, there is no way to prevent a kernel panic.
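To make the "only on read" point concrete, here is a toy sketch in Python. It uses a Hamming(7,4) code as a stand-in for the much wider code on a real ECC DIMM, and the class name, method names and event list are all made up for illustration. They are not how fmd or any memory controller is actually implemented. The only thing it is meant to show is that a bit can flip in storage and nothing notices until that address is read.

```python
def encode(nibble):
    """Encode 4 data bits as a Hamming(7,4) codeword (bit positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    p1 = d[0] ^ d[1] ^ d[3]  # parity over positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # parity over positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # parity over positions 4, 5, 6, 7
    return [p1, p2, d[0], p4, d[1], d[2], d[3]]


def syndrome(word):
    """XOR of the positions of all set bits: 0 if valid, else the flipped position."""
    s = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            s ^= pos
    return s


class ToyEccMemory:
    """Hypothetical model: check bits stored on write, checked only on read."""

    def __init__(self):
        self.cells = {}
        self.events = []  # stand-in for an error log, NOT the real fmd log

    def write(self, addr, nibble):
        self.cells[addr] = encode(nibble)

    def decay(self, addr, position):
        # A bit flips in storage. No check runs here, so nothing is logged.
        self.cells[addr][position - 1] ^= 1

    def read(self, addr):
        word = self.cells[addr]
        s = syndrome(word)
        if s:
            self.events.append((addr, s))  # the error is first seen here
            word[s - 1] ^= 1               # correct the single-bit error
        return word[2] | word[4] << 1 | word[5] << 2 | word[6] << 3


mem = ToyEccMemory()
mem.write(0, 0b1011)
mem.decay(0, 5)          # silent corruption while the cell sits idle
print(mem.events)        # still empty: nobody has read address 0
print(bin(mem.read(0)))  # the read detects and corrects the flip
print(mem.events)        # now one event is recorded
```

One side effect worth noting in this toy code: a word that has faded entirely to zero has a zero syndrome, so it reads back as a valid 0 with no event recorded. Whether a real controller behaves the same way depends on the code it uses, which I have not verified.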

I have completed a scrub of all 3 pools in the Sol 10 u8 system. All pools and vdevs are clean, no errors. I am going to leave it idling until tomorrow and then start another scrub of the root pool, which is the smallest. My expectation is that it will kernel panic. I'll save the core file and see what I can divine from that.

I'm also going to make up a cable that will let me look at the PSU rails under load with a scope. As the Z400s have been such fine machines I don't see myself getting rid of them. Though I am flirting with getting a Z820 with 10-20 cores and 256-512 GB of RAM. Not that long ago I could have made a very good living off of such a machine processing seismic data. But now I shall need another job to justify feeding it electricity and cooling. It blows my mind that 20x 3 GHz cores, 512 GB of RAM and 30+ TB of triple parity RAIDZ is less than what I paid for my Ultra 20.

I've got a cheap Chinese PSU tester, but a DSO will do a much better job. With a modest amount of fiddle I can set up a repeatable PSU test to be done once a year. I also have an EDS-88A in-circuit cap tester and HP 4884A & 4285A LCR meters. So I'm rather heavy on the T&M kit. In the mid-90s my lab gear would have cost around $500k. All bought for pennies on the dollar via eBay and the T&M repair lists. I'm rather in awe of what it can do.

I know I can record at least 20 million samples, possibly more. I have several DSOs as well as analog scopes. So I'll set up a DSO to capture a single-shot trace when I start the scrub. Just in case there is a transient event.
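For what it's worth, the trade-off with a fixed record length is simple arithmetic: capture window = memory depth / sample rate. A rough sketch in Python follows. The 20 million sample depth is the figure above, while the sample rates and the one-hour window are placeholder assumptions, since I don't yet know how long the scrub runs before anything happens.

```python
def capture_window_s(memory_depth_samples, sample_rate_hz):
    """Seconds of signal that fit in the record at a given sample rate."""
    return memory_depth_samples / sample_rate_hz


def max_sample_rate_hz(memory_depth_samples, window_s):
    """Fastest sample rate that still covers the whole window."""
    return memory_depth_samples / window_s


DEPTH = 20_000_000  # samples, the minimum record length mentioned above

# Placeholder sample rates, purely for illustration.
for rate_hz in (1e3, 1e4, 1e5, 1e6):
    window = capture_window_s(DEPTH, rate_hz)
    print(f"{rate_hz:>9.0f} S/s covers {window:>8.0f} s ({window / 3600:.2f} h)")

# Assumed one-hour window: the real scrub time has to be measured.
print(f"{max_sample_rate_hz(DEPTH, 3600):.0f} S/s to cover one hour")
```

So a record started at the beginning of the scrub can either cover a long stretch at a low sample rate or resolve a fast transient over a short stretch, not both. Triggering on the rail dipping below a threshold instead would let the full sample rate be used.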

I think it worth noting that the most recent ECC error was from 3 years ago. I don't recall ever having a kernel panic on this system, which is running Hipster 2017.10 and is what I am using as I type this. And I have never had a POST error reported on any system. Strangely, the long POST times went away, at least on one of the 2x 4 slot machines, which is now my dedicated OS test and Windows/Linux machine.

The HP BIOS is maddeningly opaque. I am planning to build the EFI 2.0 shell on a USB stick as that will give me the functionality of a traditional ROM monitor program. 

I'm actually planning to replace the 2 GB DIMMs with 4 or 8 GB DIMMs in at least a couple of the Z400s. But it's become a grudge match with the machine. I want to find the bad DIMM. And I want to be able to do it easily and reliably at any time in the future, even though I've run into the issue 2 times in 30 years and am statistically unlikely to live long enough to have it happen a 3rd time. I don't like letting machines mess with me.

Have Fun!
Reg