[OpenIndiana-discuss] Need help decoding fmd fault/error

Thu Jul 31 21:00:02 UTC 2014

*** Up front, my apologies for this long post, but I have found that most
forums ask for more detail, so I thought I'd try to provide it up
front. ***

I have a server at home running oi 151a9 and arrived home to find the
system locked up.  The keyboard and network were unresponsive, so I
rebooted.

As a quick aside, a little background on this box.  It has been running
a SuperMicro X8SAX motherboard with two SuperMicro MV88SX6081 8-port
SATA II PCI-X controllers for several years.  Between the two
controllers, I have nine 1TB drives.  In the past month or so, I have
occasionally had one drive quit responding to the daily smartd short
test and go offline.  The simplest way to fix this is to cold boot.
After a little resilvering, the raidz2 pool is back up and running.
This gave me the impression that one of the drives was slowly going
downhill, until today.  This morning was one of those days when I woke
up and had to cold boot the server to get a sleepy drive going again.

So I started digging for the error.

# fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Jul 31 15:11:51 5d4cac00-fad0-41fa-d52e-c6077ae5b4fe  PCIEX-8000-DJ  Major

Host        : nas2
Platform    : X8SAX     Chassis_id  : 1234567890
Product_sn  :

Fault class : fault.io.pciex.device-noresp 40%
              fault.io.pciex.device-interr 40%
              fault.io.pciex.bus-noresp 20%
Affects     : dev:////pci@0,0/pci8086,3408@1/pci8086,32c@0
                  faulted but still in service
FRU         : "MB" (hc://:product-id=X8SAX:server-id=nas2:chassis-id=1234567890/motherboard=0)
                  faulty

Description : A problem has been detected on one of the specified devices or on
              one of the specified connecting buses.
              Refer to http://illumos.org/msg/PCIEX-8000-DJ for more
              information.

Response    : One or more device instances may be disabled

Impact      : Loss of services provided by the device instances associated with
              this fault

Action      : If a plug-in card is involved check for badly-seated cards or
              bent pins. Otherwise schedule a repair procedure to replace the
              affected device(s).  Use fmadm faulty to identify the devices or
              contact Sun for support.

So it looks like the motherboard is at fault. Fascinating!

Digging a little deeper, I found this.

# fmdump -Vp -u 5d4cac00-fad0-41fa-d52e-c6077ae5b4fe
TIME                           UUID                                 SUNW-MSG-ID
Jul 31 2014 15:11:51.860270000 5d4cac00-fad0-41fa-d52e-c6077ae5b4fe PCIEX-8000-DJ

  TIME                 CLASS                                 ENA
  Jul 31 08:11:31.2377 ereport.io.pciex.tl.cto               0x051796e680200001

nvlist version: 0
        version = 0x0
        class = list.suspect
        uuid = 5d4cac00-fad0-41fa-d52e-c6077ae5b4fe
        code = PCIEX-8000-DJ
        diag-time = 1406833911 103306
        de = fmd:///module/eft
        fault-list-sz = 0x3
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
        nvlist version: 0
                version = 0x0
                class = fault.io.pciex.device-noresp
                certainty = 0x28
                resource = hc://:product-id=X8SAX:server-id=nas2:chassis-id=1234567890/motherboard=0/hostbridge=0/pciexrc=0/pciexbus=1/pciexdev=0/pciexfn=0
                asru = dev:////pci@0,0/pci8086,3408@1/pci8086,32c@0
                fru = hc://:product-id=X8SAX:server-id=nas2:chassis-id=1234567890/motherboard=0
                location = MB
        (end fault-list[0])
        (start fault-list[1])
        nvlist version: 0
                version = 0x0
                class = fault.io.pciex.device-interr
                certainty = 0x28
                resource = hc://:product-id=X8SAX:server-id=nas2:chassis-id=1234567890/motherboard=0/hostbridge=0/pciexrc=0/pciexbus=1/pciexdev=0/pciexfn=0
                asru = dev:////pci@0,0/pci8086,3408@1/pci8086,32c@0
                fru = hc://:product-id=X8SAX:server-id=nas2:chassis-id=1234567890/motherboard=0
                location = MB
        (end fault-list[1])
        (start fault-list[2])
        nvlist version: 0
                version = 0x0
                class = fault.io.pciex.bus-noresp
                certainty = 0x14
                resource = hc://:product-id=X8SAX:server-id=nas2:chassis-id=1234567890/motherboard=0/hostbridge=0/pciexrc=0/pciexbus=1/pciexdev=0/pciexfn=0
                asru = dev:////pci@0,0/pci8086,3408@1/pci8086,32c@0
                fru = hc://:product-id=X8SAX:server-id=nas2:chassis-id=1234567890/motherboard=0
                location = MB
        (end fault-list[2])

        fault-status = 0x1 0x1 0x1
        severity = Major
        __ttl = 0x1
        __tod = 0x53da94f7 0x3346adb0
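
As a sanity check on the hex fields in that dump, here is a minimal
Python sketch that decodes them.  It assumes that 'certainty' is a
percentage and that '__tod' is a (seconds, nanoseconds) pair counted
from the Unix epoch; both assumptions are at least consistent with the
rest of the output.  The helper name tod_to_utc is made up for this
example.

```python
from datetime import datetime, timezone

# Values copied from the fmdump -Vp output above.
certainties = {
    "fault.io.pciex.device-noresp": 0x28,
    "fault.io.pciex.device-interr": 0x28,
    "fault.io.pciex.bus-noresp": 0x14,
}
diag_time_seconds = 1406833911          # first field of diag-time
tod = (0x53da94f7, 0x3346adb0)          # assumed: (seconds, nanoseconds)


def tod_to_utc(seconds, nanoseconds):
    """Convert an assumed (seconds, nanoseconds) pair to a UTC datetime."""
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return stamp.replace(microsecond=nanoseconds // 1000)


# Hex certainty values as decimal percentages.
for fault_class, raw in certainties.items():
    print(f"{fault_class}: {raw}%")

# The __tod seconds field should equal the diag-time seconds field.
print("tod seconds == diag-time seconds:", tod[0] == diag_time_seconds)
print("diagnosis time (UTC):", tod_to_utc(*tod).isoformat())
```

The certainties come out as 40, 40 and 20, matching the percentages
that 'fmadm faulty' printed.  The timestamp decodes to 19:11:51 UTC on
Jul 31 2014, so the 15:11:51 shown by fmdump would be local time four
hours behind UTC.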

So it still looks like it's the motherboard, or at least something on
the motherboard.
Digging deeper still....

# fmdump -e
.....
Jul 31 04:48:52.8973 ereport.io.scsi.cmd.disk.tran   
Jul 31 04:48:52.9791 ereport.io.scsi.cmd.disk.tran   
Jul 31 04:48:52.9792 ereport.io.scsi.cmd.disk.tran   
Jul 31 04:48:52.9793 ereport.io.scsi.cmd.disk.tran   
Jul 31 04:53:06.4773 ereport.io.scsi.cmd.disk.tran   
Jul 31 04:53:06.4852 ereport.io.scsi.cmd.disk.tran   
Jul 31 04:53:06.4773 ereport.io.scsi.cmd.disk.tran   
Jul 31 04:53:06.4853 ereport.io.scsi.cmd.disk.tran   
Jul 31 04:53:07.0540 ereport.io.scsi.cmd.disk.tran   
Jul 31 04:53:07.0541 ereport.io.scsi.cmd.disk.tran   
Jul 31 04:53:07.0542 ereport.io.scsi.cmd.disk.tran   
Jul 31 04:53:07.0543 ereport.io.scsi.cmd.disk.tran   
Jul 31 08:06:59.8940 ereport.fs.zfs.checksum         
Jul 31 08:06:59.8940 ereport.fs.zfs.checksum         
Jul 31 08:06:59.8939 ereport.fs.zfs.checksum         
Jul 31 08:06:59.8939 ereport.fs.zfs.checksum         
Jul 31 08:06:59.8939 ereport.fs.zfs.checksum         
Jul 31 08:06:59.8939 ereport.fs.zfs.checksum         
Jul 31 08:06:59.8939 ereport.fs.zfs.checksum         
Jul 31 08:06:59.8939 ereport.fs.zfs.checksum         
Jul 31 08:11:31.2377 ereport.io.pci.fabric           
Jul 31 08:11:31.2377 ereport.io.pciex.tl.cto         
Jul 31 08:11:31.2377 ereport.io.pci.fabric           
Jul 31 08:11:31.2377 ereport.io.pciex.rc.nfe-msg     
Jul 31 08:11:31.2378 ereport.io.pci.fabric           
Jul 31 08:11:31.2377 ereport.io.pciex.rc.nfe-msg     
Jul 31 08:11:31.2378 ereport.io.pci.fabric        
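
To make the order of events in that listing easier to see, here is a
small Python sketch that groups 'fmdump -e' lines by ereport class and
records the count plus the first and last timestamp for each.  The
sample text is a shortened excerpt of the listing above, the function
name summarize is made up for this example, and the timestamp
comparison assumes all lines fall on the same day.

```python
# Shortened excerpt of the `fmdump -e` listing above.
sample = """\
.....
Jul 31 04:48:52.8973 ereport.io.scsi.cmd.disk.tran
Jul 31 04:53:06.4773 ereport.io.scsi.cmd.disk.tran
Jul 31 08:06:59.8940 ereport.fs.zfs.checksum
Jul 31 08:11:31.2377 ereport.io.pci.fabric
Jul 31 08:11:31.2377 ereport.io.pciex.tl.cto
Jul 31 08:11:31.2377 ereport.io.pciex.rc.nfe-msg
"""


def summarize(text):
    """Return {class: [count, first_time, last_time]} for fmdump -e lines."""
    summary = {}
    for line in text.splitlines():
        parts = line.split()
        # Expect: month, day, HH:MM:SS.ffff, class.  Skip anything else,
        # such as the "....." elision marker.
        if len(parts) != 4 or not parts[3].startswith("ereport."):
            continue
        stamp, ereport_class = parts[2], parts[3]
        entry = summary.setdefault(ereport_class, [0, stamp, stamp])
        entry[0] += 1
        # Fixed-width HH:MM:SS.ffff strings sort correctly as text
        # as long as every line is from the same day.
        entry[1] = min(entry[1], stamp)
        entry[2] = max(entry[2], stamp)
    return summary


for ereport_class, (count, first, last) in summarize(sample).items():
    print(f"{count:3d}  {first} .. {last}  {ereport_class}")
```

Run against the full listing, this would show the SCSI transport
errors first (04:48 to 04:53), then the ZFS checksum errors at 08:06,
and only then the PCI/PCIe ereports at 08:11.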

# fmdump -eVt 08:11 
.....
Jul 31 2014 08:11:31.237826547 ereport.io.pci.fabric
nvlist version: 0
        class = ereport.io.pci.fabric
        ena = 0x51796f92d000001
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci8086,3408@1/pci8086,32c@0/pci11ab,11ab@1
        (end detector)

        bdf = 0x208
        device_id = 0x6081
        vendor_id = 0x11ab
        rev_id = 0x9
        dev_type = 0x101
        pcie_off = 0x0
        pcix_off = 0x60
        aer_off = 0x0
        ecc_ver = 0x0
        pci_status = 0x2b8
        pci_command = 0x157
        pcix_status = 0x1830208
        pcix_command = 0x30
        remainder = 0x1
        severity = 0x1
        __ttl = 0x1
        __tod = 0x53da3273 0xe2cf1f3
Jul 31 2014 08:11:31.237790738 ereport.io.pciex.rc.nfe-msg
nvlist version: 0
        ena = 0x51796f06ef00001
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0
        (end detector)

        class = ereport.io.pciex.rc.nfe-msg
        rc-status = 0x24
        source-id = 0x100
        source-valid = 1
        __ttl = 0x1
        __tod = 0x53da3273 0xe2c6612

Jul 31 2014 08:11:31.237838941 ereport.io.pci.fabric
nvlist version: 0
        class = ereport.io.pci.fabric
        ena = 0x51796fc33a00001
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci8086,3408@1/pci8086,32c@0/pci11ab,11ab@2
        (end detector)

        bdf = 0x210
        device_id = 0x6081
        vendor_id = 0x11ab
        rev_id = 0x9
        dev_type = 0x101
        pcie_off = 0x0
        pcix_off = 0x60
        aer_off = 0x0
        ecc_ver = 0x0
        pci_status = 0x2b0
        pci_command = 0x157
        pcix_status = 0x1830210
        pcix_command = 0x30
        remainder = 0x0
        severity = 0x1
        __ttl = 0x1
        __tod = 0x53da3273 0xe2d225d

Looking up the 'device-path' values from the two pci.fabric messages
above in 'prtconf -v', I find that they refer to the HBAs and disk
drives. (I didn't include the 'prtconf -v' output as this is getting
rather long.)
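
The 'bdf', 'source-id' and 'pci_status' values in those ereports can
also be decoded by hand.  The sketch below assumes the usual PCI
layout for a 16-bit bus/device/function ID (bus in bits 15:8, device
in bits 7:3, function in bits 2:0) and assumes 'pci_status' is the
standard PCI status register; I have not confirmed how fmd fills in
these fields, and the function names are made up for this example.

```python
def decode_bdf(value):
    """Split a 16-bit PCI ID into (bus, device, function)."""
    return (value >> 8) & 0xFF, (value >> 3) & 0x1F, value & 0x7


# Error-related bits of the standard PCI status register.
PCI_STATUS_ERROR_BITS = {
    8: "master data parity error",
    11: "signaled target abort",
    12: "received target abort",
    13: "received master abort",
    14: "signaled system error",
    15: "detected parity error",
}


def status_errors(status):
    """Return the names of the error bits set in a PCI status value."""
    return [name for bit, name in sorted(PCI_STATUS_ERROR_BITS.items())
            if status & (1 << bit)]


# Values copied from the ereports above.
ids = [
    ("bdf, first pci.fabric", 0x208),
    ("bdf, second pci.fabric", 0x210),
    ("source-id, rc.nfe-msg", 0x100),
]
for label, value in ids:
    bus, dev, fn = decode_bdf(value)
    print(f"{label}: bus {bus}, device {dev}, function {fn}")

for status in (0x2b8, 0x2b0):
    print(f"pci_status {status:#x}: error bits = {status_errors(status)}")
```

Under those assumptions the two pci.fabric ereports decode to bus 2,
devices 1 and 2, which lines up with the '@1' and '@2' at the end of
their device paths.  The source-id decodes to bus 1, device 0,
function 0, which lines up with 'pciexbus=1/pciexdev=0/pciexfn=0' in
the fault list.  Neither pci_status value has any of the error bits
set.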

At this point I don't know where to point the finger.  Is it the
motherboard and/or the bus?  The HBAs?  The disks?  Am I looking in
the right spots?

Any assistance in understanding this is appreciated.

Thanks!
-- 
Scott LeFevre
317-696-1010