[OpenIndiana-discuss] Need help decoding fmd fault/error
Scott LeFevre
slefevre at indy.rr.com
Thu Jul 31 21:00:02 UTC 2014
*** Up front, my apologies for this long post, but I have found that most
forums ask for more detail, so I thought I'd try to provide it up
front. ***
I have a server at home running oi 151a9 and arrived home to find the
system locked up. Keyboard and network unresponsive. So I rebooted.
As a quick aside, a little background on this box. It has been running a
SuperMicro X8SAX motherboard with two SuperMicro MV88SX6081 8-port SATA
II PCI-X controllers for several years. Between the two controllers, I
have (9) 1TB drives. In the past month or so, I have occasionally had one
drive quit responding to the daily smartd short test and go offline. The
simplest way to fix this is to cold boot. After a little resilvering, the
raidz2 pool is back up and running. Until today, this gave me the
impression that one of the drives was slowly going downhill. This morning
was one of those days: I woke up and had to cold boot the server to get
a sleepy drive going again.
So I started digging for the error.
# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Jul 31 15:11:51 5d4cac00-fad0-41fa-d52e-c6077ae5b4fe PCIEX-8000-DJ  Major
Host : nas2
Platform : X8SAX Chassis_id : 1234567890
Product_sn :
Fault class : fault.io.pciex.device-noresp 40%
fault.io.pciex.device-interr 40%
fault.io.pciex.bus-noresp 20%
Affects : dev:////pci@0,0/pci8086,3408@1/pci8086,32c@0
faulted but still in service
FRU : "MB" (hc://:product-id=X8SAX:server-id=nas2:chassis-id=1234567890/motherboard=0)
faulty
Description : A problem has been detected on one of the specified devices or on
one of the specified connecting buses.
Refer to http://illumos.org/msg/PCIEX-8000-DJ for more
information.
Response : One or more device instances may be disabled
Impact : Loss of services provided by the device instances associated with
this fault
Action : If a plug-in card is involved check for badly-seated cards or
bent pins. Otherwise schedule a repair procedure to replace the
affected device(s). Use fmadm faulty to identify the devices or
contact Sun for support.
So it looks like the motherboard is at fault. Fascinating!
Digging a little deeper, I found this.
# fmdump -Vp -u 5d4cac00-fad0-41fa-d52e-c6077ae5b4fe
TIME UUID SUNW-MSG-ID
Jul 31 2014 15:11:51.860270000 5d4cac00-fad0-41fa-d52e-c6077ae5b4fe PCIEX-8000-DJ
TIME CLASS ENA
Jul 31 08:11:31.2377 ereport.io.pciex.tl.cto 0x051796e680200001
nvlist version: 0
version = 0x0
class = list.suspect
uuid = 5d4cac00-fad0-41fa-d52e-c6077ae5b4fe
code = PCIEX-8000-DJ
diag-time = 1406833911 103306
de = fmd:///module/eft
fault-list-sz = 0x3
fault-list = (array of embedded nvlists)
(start fault-list[0])
nvlist version: 0
version = 0x0
class = fault.io.pciex.device-noresp
certainty = 0x28
resource = hc://:product-id=X8SAX:server-id=nas2:chassis-id=1234567890/motherboard=0/hostbridge=0/pciexrc=0/pciexbus=1/pciexdev=0/pciexfn=0
asru = dev:////pci@0,0/pci8086,3408@1/pci8086,32c@0
fru = hc://:product-id=X8SAX:server-id=nas2:chassis-id=1234567890/motherboard=0
location = MB
(end fault-list[0])
(start fault-list[1])
nvlist version: 0
version = 0x0
class = fault.io.pciex.device-interr
certainty = 0x28
resource = hc://:product-id=X8SAX:server-id=nas2:chassis-id=1234567890/motherboard=0/hostbridge=0/pciexrc=0/pciexbus=1/pciexdev=0/pciexfn=0
asru = dev:////pci@0,0/pci8086,3408@1/pci8086,32c@0
fru = hc://:product-id=X8SAX:server-id=nas2:chassis-id=1234567890/motherboard=0
location = MB
(end fault-list[1])
(start fault-list[2])
nvlist version: 0
version = 0x0
class = fault.io.pciex.bus-noresp
certainty = 0x14
resource = hc://:product-id=X8SAX:server-id=nas2:chassis-id=1234567890/motherboard=0/hostbridge=0/pciexrc=0/pciexbus=1/pciexdev=0/pciexfn=0
asru = dev:////pci@0,0/pci8086,3408@1/pci8086,32c@0
fru = hc://:product-id=X8SAX:server-id=nas2:chassis-id=1234567890/motherboard=0
location = MB
(end fault-list[2])
fault-status = 0x1 0x1 0x1
severity = Major
__ttl = 0x1
__tod = 0x53da94f7 0x3346adb0
So it still looks like it's the motherboard, or at least something on the
motherboard.
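The hex fields in that dump can be cross-checked against the 'fmadm faulty' summary. Here is a minimal Python sketch, assuming 'certainty' is a plain percentage written in hex and that '__tod' holds seconds since the Unix epoch followed by nanoseconds (the first word does agree with 'diag-time = 1406833911' above). The helper names are my own, not part of any FMA tool.

```python
from datetime import datetime, timezone

def certainty_percent(hex_value):
    """Convert a 'certainty' field such as '0x28' to a percentage."""
    return int(hex_value, 16)

def tod_to_utc(tod):
    """Convert a '__tod' pair (hex seconds, hex nanoseconds) to a UTC datetime."""
    seconds_hex, nanos_hex = tod.split()
    seconds = int(seconds_hex, 16)
    nanos = int(nanos_hex, 16)
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
    # datetime only has microsecond resolution, so truncate the nanoseconds.
    return stamp.replace(microsecond=nanos // 1000)

# Values copied from the fault list above.
for field in ("0x28", "0x28", "0x14"):
    print(certainty_percent(field))        # 40, 40, 20

print(tod_to_utc("0x53da94f7 0x3346adb0"))  # 2014-07-31 19:11:51.860270+00:00
```

The three certainties come out as 40, 40 and 20, matching the percentages in the 'fmadm faulty' output. The timestamp is 19:11:51 UTC, which would be the 15:11:51 shown by fmadm if the box is four hours behind UTC (my assumption about the local time zone).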
Digging deeper still....
# fmdump -e
.....
Jul 31 04:48:52.8973 ereport.io.scsi.cmd.disk.tran
Jul 31 04:48:52.9791 ereport.io.scsi.cmd.disk.tran
Jul 31 04:48:52.9792 ereport.io.scsi.cmd.disk.tran
Jul 31 04:48:52.9793 ereport.io.scsi.cmd.disk.tran
Jul 31 04:53:06.4773 ereport.io.scsi.cmd.disk.tran
Jul 31 04:53:06.4852 ereport.io.scsi.cmd.disk.tran
Jul 31 04:53:06.4773 ereport.io.scsi.cmd.disk.tran
Jul 31 04:53:06.4853 ereport.io.scsi.cmd.disk.tran
Jul 31 04:53:07.0540 ereport.io.scsi.cmd.disk.tran
Jul 31 04:53:07.0541 ereport.io.scsi.cmd.disk.tran
Jul 31 04:53:07.0542 ereport.io.scsi.cmd.disk.tran
Jul 31 04:53:07.0543 ereport.io.scsi.cmd.disk.tran
Jul 31 08:06:59.8940 ereport.fs.zfs.checksum
Jul 31 08:06:59.8940 ereport.fs.zfs.checksum
Jul 31 08:06:59.8939 ereport.fs.zfs.checksum
Jul 31 08:06:59.8939 ereport.fs.zfs.checksum
Jul 31 08:06:59.8939 ereport.fs.zfs.checksum
Jul 31 08:06:59.8939 ereport.fs.zfs.checksum
Jul 31 08:06:59.8939 ereport.fs.zfs.checksum
Jul 31 08:06:59.8939 ereport.fs.zfs.checksum
Jul 31 08:11:31.2377 ereport.io.pci.fabric
Jul 31 08:11:31.2377 ereport.io.pciex.tl.cto
Jul 31 08:11:31.2377 ereport.io.pci.fabric
Jul 31 08:11:31.2377 ereport.io.pciex.rc.nfe-msg
Jul 31 08:11:31.2378 ereport.io.pci.fabric
Jul 31 08:11:31.2377 ereport.io.pciex.rc.nfe-msg
Jul 31 08:11:31.2378 ereport.io.pci.fabric
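To see the sequence at a glance, that listing can be collapsed per class. This is a rough Python sketch, assuming the four-column 'fmdump -e' layout shown above (month, day, time, class). SAMPLE is only a handful of lines copied from that output, and 'summarize' is a name I made up for illustration.

```python
SAMPLE = """\
.....
Jul 31 04:48:52.8973 ereport.io.scsi.cmd.disk.tran
Jul 31 04:53:07.0543 ereport.io.scsi.cmd.disk.tran
Jul 31 08:06:59.8940 ereport.fs.zfs.checksum
Jul 31 08:06:59.8939 ereport.fs.zfs.checksum
Jul 31 08:11:31.2377 ereport.io.pci.fabric
Jul 31 08:11:31.2377 ereport.io.pciex.tl.cto
Jul 31 08:11:31.2377 ereport.io.pciex.rc.nfe-msg
Jul 31 08:11:31.2378 ereport.io.pci.fabric
"""

def summarize(text):
    """Map each ereport class to its count and the first/last timestamps listed."""
    summary = {}
    for line in text.splitlines():
        parts = line.split()
        # Expect: month, day, time, class. Skip "....." and anything else.
        if len(parts) != 4 or not parts[3].startswith("ereport."):
            continue
        stamp, cls = " ".join(parts[:3]), parts[3]
        entry = summary.setdefault(cls, {"count": 0, "first": stamp, "last": stamp})
        entry["count"] += 1
        # "last" follows listing order, which is not strictly sorted by time.
        entry["last"] = stamp
    return summary

# Classes are printed in order of first appearance.
for cls, info in summarize(SAMPLE).items():
    print(f'{info["count"]:3d}  {info["first"]}  ->  {info["last"]}  {cls}')
```

Run over the full listing, this would show the disk transport errors first (04:48 to 04:53), then the ZFS checksum errors at 08:06, and the PCI/PCIe errors last at 08:11.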
# fmdump -eVt 08:11
.....
Jul 31 2014 08:11:31.237826547 ereport.io.pci.fabric
nvlist version: 0
class = ereport.io.pci.fabric
ena = 0x51796f92d000001
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
device-path = /pci@0,0/pci8086,3408@1/pci8086,32c@0/pci11ab,11ab@1
(end detector)
bdf = 0x208
device_id = 0x6081
vendor_id = 0x11ab
rev_id = 0x9
dev_type = 0x101
pcie_off = 0x0
pcix_off = 0x60
aer_off = 0x0
ecc_ver = 0x0
pci_status = 0x2b8
pci_command = 0x157
pcix_status = 0x1830208
pcix_command = 0x30
remainder = 0x1
severity = 0x1
__ttl = 0x1
__tod = 0x53da3273 0xe2cf1f3
Jul 31 2014 08:11:31.237790738 ereport.io.pciex.rc.nfe-msg
nvlist version: 0
ena = 0x51796f06ef00001
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
device-path = /pci@0,0
(end detector)
class = ereport.io.pciex.rc.nfe-msg
rc-status = 0x24
source-id = 0x100
source-valid = 1
__ttl = 0x1
__tod = 0x53da3273 0xe2c6612
Jul 31 2014 08:11:31.237838941 ereport.io.pci.fabric
nvlist version: 0
class = ereport.io.pci.fabric
ena = 0x51796fc33a00001
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
device-path = /pci@0,0/pci8086,3408@1/pci8086,32c@0/pci11ab,11ab@2
(end detector)
bdf = 0x210
device_id = 0x6081
vendor_id = 0x11ab
rev_id = 0x9
dev_type = 0x101
pcie_off = 0x0
pcix_off = 0x60
aer_off = 0x0
ecc_ver = 0x0
pci_status = 0x2b0
pci_command = 0x157
pcix_status = 0x1830210
pcix_command = 0x30
remainder = 0x0
severity = 0x1
__ttl = 0x1
__tod = 0x53da3273 0xe2d225d
Looking up the 'device-path' values from the two pci.fabric messages
above in 'prtconf -v', I find that they refer to the HBAs and their disk
drives. (I didn't include the 'prtconf -v' output, as this post is already
getting rather long.)
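The 'bdf', 'source-id' and 'pci_status' fields can also be unpacked by hand. The Python sketch below assumes 'bdf' and 'source-id' use the usual PCI packing (bus in bits 15:8, device in bits 7:3, function in bits 2:0) and that 'pci_status' is the raw conventional PCI status register. The function names and the bit table are mine, written from the standard register layout rather than taken from the FMA sources.

```python
def decode_bdf(value):
    """Split a packed bus/device/function value into its three parts."""
    bus = (value >> 8) & 0xFF
    dev = (value >> 3) & 0x1F
    fn = value & 0x7
    return bus, dev, fn

# Error-reporting bits of the conventional PCI status register.
PCI_STATUS_ERROR_BITS = {
    8: "master data parity error",
    11: "signaled target abort",
    12: "received target abort",
    13: "received master abort",
    14: "signaled system error",
    15: "detected parity error",
}

def status_errors(pci_status):
    """List the error bits that are set in a PCI status register value."""
    return [name for bit, name in PCI_STATUS_ERROR_BITS.items()
            if pci_status & (1 << bit)]

# Values copied from the ereports above.
print(decode_bdf(0x208))     # (2, 1, 0)  first HBA
print(decode_bdf(0x210))     # (2, 2, 0)  second HBA
print(decode_bdf(0x100))     # (1, 0, 0)  source-id from rc.nfe-msg
print(status_errors(0x2b8))  # []
print(status_errors(0x2b0))  # []
```

Under those assumptions the two HBAs sit at bus 2, devices 1 and 2, which agrees with the '@1' and '@2' at the end of their device paths, and the root complex's 'source-id' points at bus 1, device 0, function 0, the same 'pciexbus=1/pciexdev=0/pciexfn=0' named in the fault list. Neither HBA has an error bit set in 'pci_status', although that only describes the register at the moment the ereport was captured.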
At this point I don't know where to point the finger. Is it the
motherboard and/or the bus? The HBAs? The disks? Am I looking in
the right spots?
Any assistance in understanding this is appreciated.
Thanks!
--
Scott LeFevre
317-696-1010