[OpenIndiana-discuss] mask pci-e errors

jason matthews jason at broken.net
Mon Jan 13 17:59:20 UTC 2014



I have 40 identically configured systems that hit the PCIe error below. Roughly every six months, they go through a cycle in which they generate this error, usually all forty within about three weeks, and then they are good for months. Bad juju.

The systems are Intel SR2625URLXR with 9207-8i, Intel 910, and 9205-8e cards, on L5630 CPUs with 96 GB of RAM. The result of the failure is that zfs and zpool commands hang on the Intel 910 card, while regular file system disk I/O is okay.

I am looking for a workaround, as the storage continues to work for applications despite the error. Perhaps the error could be masked before FMD takes action? Maybe ZFS gets internally hosed before FMD takes action; I don't know. The hang-up seems to be in ZFS, where the system thinks the storage is hosed and zfs/zpool commands hang. As I say, regular file system I/O works just peachy. Does anyone have any ideas on how to overcome this problem without rebooting?

I use clones of file systems to stand up short-lived databases to run long batch queries against, and when this happens I tend to have fairly crappy workday satisfaction.

Perhaps this is related to:
https://www.illumos.org/issues/315

http://h20565.www2.hp.com/portal/site/hpsc/template.PAGE/public/psi/mostViewedDisplay?javax.portlet.begCacheTok=com.vignette.cachetoken&javax.portlet.endCacheTok=com.vignette.cachetoken&javax.portlet.prp_efb5c0793523e51970c8fa22b053ce01=wsrp-navigationalState%3DdocId%253Demr_na-c03652921-1%257CdocLocale%253Den_US&javax.portlet.tpst=efb5c0793523e51970c8fa22b053ce01&sp4ts.oid=4091412&ac.admitted=1389635734908.876444892.492883150

It seems Oracle may have patched similar issues.
thanks,
j.


root@db020:~# fmadm faulty -ai
--------------- ------------------------------------  -------------- ---------
TIME            CACHE-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Jan 08 13:47:15 2a74a865-ba4e-c3b0-e437-e0e34ba53623  PCIEX-8000-0A  Critical  

Host        : db020
Platform    : S5520UR   Chassis_id  : ............
Product_sn  : 

Fault class : fault.io.pciex.device-interr
Affects     : dev:////pci@0,0/pci8086,340c@5/pci111d,806a@0/pci111d,806a@4/pci1000,3020@0
                  faulted and taken out of service
FRU         : "FH PCIE-SLOT2 x8" (hc://:product-id=S5520UR:server-id=db020:chassis-id=............/motherboard=0/hostbridge=2/pciexrc=2/pciexbus=4/pciexdev=0)
                  faulty

Description : A problem was detected for a PCIEX device.
              Refer to http://sun.com/msg/PCIEX-8000-0A for more information.

Response    : One or more device instances may be disabled

Impact      : Loss of services provided by the device instances associated with
              this fault

Action      : Schedule a repair procedure to replace the affected device.  Use
              fmadm faulty to identify the device or contact Sun for support.
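
With forty hosts cycling through this, it helps to pull the key fields out of the report above so faults can be compared across machines. The following is only a rough sketch: the function name parse_fmadm_faulty and the dictionary keys are made up for illustration, and it assumes the output has the same layout as the sample shown here.

```python
import re

def parse_fmadm_faulty(text):
    """Pull a few fields out of `fmadm faulty` output.

    Hypothetical helper: assumes the layout of the sample report above.
    Fields that are not found are simply left out of the result.
    """
    fields = {}

    # Summary row, e.g. "Jan 08 13:47:15 <uuid>  PCIEX-8000-0A  Critical"
    row = re.search(
        r"^\w{3} \d{2} [\d:]{8}\s+([0-9a-f-]{36})\s+(\S+)\s+(\S+)",
        text,
        re.M,
    )
    if row:
        fields["uuid"], fields["msg_id"], fields["severity"] = row.groups()

    # "Label : value" lines; only the first token of the value is kept.
    labels = (
        ("host", "Host"),
        ("fault_class", "Fault class"),
        ("affects", "Affects"),
    )
    for key, label in labels:
        m = re.search(r"^%s\s*:\s*(\S+)" % re.escape(label), text, re.M)
        if m:
            fields[key] = m.group(1)

    return fields
```

Fed the saved output of fmadm faulty -ai from each host, this returns a dictionary per host, which makes it easy to check whether the same device path is being faulted everywhere.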



