[OpenIndiana-discuss] How do I verify that fmd is actually able to detect and log ECC errors?

Tue Mar 16 19:15:02 UTC 2021

AFAIK a scrub or ECC error shouldn't crash the kernel. Also, if the crash
is occurring on the error the error might not be logged. To me it sounds
like you might have a system board issue.

Also FWIW you shouldn't have to scrub otherwise healthy pools more than
once per month.

On Tue, Mar 16, 2021, 14:08 Reginald Beardsley via openindiana-discuss <
openindiana-discuss at openindiana.org> wrote:

> I suspect memory errors on my Sol 10 u8 system, but there are no memory
> errors reported by "fmdump -eV".  All the errors and events are zfs related.
>
> Initial symptom is starting a scrub on a freshly booted system will
> complete properly, but the same operation after the system has been up for
> a few days will cause a kernel panic. Immediately after a reboot a scrub
> will complete normally.  This behavior suggests bit fade to me.
>
> This has been very consistent  for the last few months.  The system is an
> HP Z400 which is 10 years old and generally has run 24x7.  It was certified
> by Sun for Solaris 10 which is why I bought it and uses unbuffered,
> unregistered ECC DDR3 DIMMs.  Since my initial purchase I have bought three
> more Z400s.
>
> Recently the system became unstable to the point I have not been able to
> complete a "zfs send -R" to a 12 TB WD USB drive.  My last attempt using a
> Hipster LiveImage died after ~25 hours.
>
> My Hipster 2017.10 system shows some events which appear to be ECC
> related, but I'm not able to interpret them.  I've attached a file with the
> last such event.  Not sure that will work, but worth trying.  This is from
> my regular internet access host.  So  it is up 24x7 with few exceptions.
>
> Except for the CPU and memory, the machines are almost identical.  The
> Hipster machine is an older 4 DIMM slot machine with the same 3 way mirror
> on s0 and 3 disk RAIDZ1 on s1.  The Sol 10 system is a 6 DIMM slot model
> and has a 3 TB mirrored scratch pool in addition to the s0 & s1 root and
> export pools.
>
> It seems unlikely that I could simply swap the disks between the two, but
> I can install Hipster on a single drive for rpool and attempt to copy the
> scratch pool, spool, with that and simply run it for a while for testing.
>
> I've read everything I can find about the Fault Manager, but it has
> produced more questions than answers.
>
> This is for Hipster 2017.10:
>
> sun_x86%rhb {82} fmadm config
> MODULE                   VERSION STATUS  DESCRIPTION
> cpumem-retire            1.1     active  CPU/Memory Retire Agent
> disk-lights              1.0     active  Disk Lights Agent
> disk-transport           1.0     active  Disk Transport Agent
> eft                      1.16    active  eft diagnosis engine
> ext-event-transport      0.2     active  External FM event transport
> fabric-xlate             1.0     active  Fabric Ereport Translater
> fmd-self-diagnosis       1.0     active  Fault Manager Self-Diagnosis
> io-retire                2.0     active  I/O Retire Agent
> sensor-transport         1.1     active  Sensor Transport Agent
> ses-log-transport        1.0     active  SES Log Transport Agent
> software-diagnosis       0.1     active  Software Diagnosis engine
> software-response        0.1     active  Software Response Agent
> sysevent-transport       1.0     active  SysEvent Transport Agent
> syslog-msgs              1.1     active  Syslog Messaging Agent
> zfs-diagnosis            1.0     active  ZFS Diagnosis Engine
> zfs-retire               1.0     active  ZFS Retire Agent
>
>
> It's a little longer than for Sol 10 u8, but the cpumem-retire V 1.1
> appears on both.
>
> Suggestions?
>
> Thanks,
> Reg
>
> _______________________________________________
> openindiana-discuss mailing list
> openindiana-discuss at openindiana.org
> https://openindiana.org/mailman/listinfo/openindiana-discuss
>