[OpenIndiana-discuss] SATA device errors, possibly due to IRQ conflict

Thu Sep 29 09:31:37 UTC 2011

Hi,

I've got a server running OpenIndiana 148 on a Supermicro *X8ST3-F* that has
been working perfectly for months right up until I added some more storage.

The board has 6 * SATA ports and 8 * SAS ports. Previously all the drives in
my storage pool were attached to the 8 SAS ports and only my rpool drive was
using one of the SATA ports.

Now that I have added another 4 drives I've had to connect them to the SATA
ports - this is when the system started to become unstable.

I have had periods of very heavy usage that have caused no problems
whatsoever (for example, I copied 4 TB of data onto the pool, most of which
would have had to go on the new drives, then did several scrubs over the next
few days). The system seems perfectly happy to sustain 350 MB/s+ of reads or
writes (or a bit of both) for hours on end with no errors at all. Then at
other times, typically overnight or early morning when it's just ticking over
with < 500 KB/s read/write, it will fall apart.

There are three kinds of failure I'm experiencing, seemingly randomly:

1. Errors about failed reads/writes on 2 or 4 SATA drives in /var/adm/messages
and system I/O hung. The system has to have its power cut to recover: ssh
won't connect and I can't get past the username prompt on the terminal. No ZFS
errors are reported.
2. Errors about failed read/write, system io NOT hung, ZFS reporting faulted
drives (2 or 4) and hundreds of thousands of errors. In this scenario, the
machine can be rebooted cleanly BUT the failed drives don't get detected by
BIOS. Usually a full power down, wait 30 seconds, power back up will allow
the drives to be detected again. When it powers back up ZFS will report lots
of errors but sort itself out after a resilver - I haven't actually had any
permanent data loss yet; ZFS has always recovered.
3. No errors at all in either /var/adm/messages or zpool status but hung io.

I've swapped the drive connections around to prove it isn't the new disks
that are at fault, and this has confirmed that it's whichever devices are
connected to the SATA controller that are having the problem.

When I rebooted the machine after the latest failure I checked
/var/adm/messages and there are thousands of messages (9995 in total, though
that may span several reboots) identical to the following:

"[ID 954099 kern.info] NOTICE: IRQ19 is being shared by drivers with
different interrupt levels."
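
For anyone wanting to reproduce that count, here is a minimal sketch. It
assumes each notice sits on a single line in the log (it is only wrapped in
this mail), and count_shared_irq_notices is just an illustrative helper name,
not an existing tool:

```shell
# Count the "IRQ<n> is being shared ..." notices in a log file.
# count_shared_irq_notices is a hypothetical helper; the default path is
# the log file mentioned above.
count_shared_irq_notices() {
    log="${1:-/var/adm/messages}"
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c 'is being shared by drivers with different interrupt levels' "$log"
}
```

Run as "count_shared_irq_notices /var/adm/messages". Note that it counts
every IRQ number, not just IRQ19.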

In case it's useful:

cs2dsb@chronos:~$ echo ::interrupts -d | pfexec mdb -k
IRQ  Vect IPL Bus    Trg Type   CPU Share APIC/INT# Driver Name(s)
9    0x80 9   PCI    Lvl Fixed  1   1     0x0/0x9   acpi_wrapper_isr
11   0xd1 14  PCI    Lvl Fixed  2   1     0x0/0xb   hpet_isr
16   0x84 9   PCI    Lvl Fixed  7   1     0x0/0x10  uhci#0
18   0x82 9   PCI    Lvl Fixed  5   2     0x0/0x12  uhci#5, ehci#0
19   0x86 9   PCI    Lvl Fixed  3   6     0x0/0x13  uhci#4, uhci#2, pci-ide#0, pci-ide#1, pci-ide#1, pci-ide#0
21   0x85 9   PCI    Lvl Fixed  0   1     0x0/0x15  uhci#1
23   0x83 9   PCI    Lvl Fixed  6   2     0x0/0x17  uhci#3, ehci#1
24   0x81 7   PCI    Edg MSI    4   1     -         pcieb#4
25   0x60 6   PCI    Edg MSI    1   1     -         e1000g#0
26   0x61 6   PCI    Edg MSI    2   1     -         e1000g#1
27   0x40 5   PCI    Edg MSI    3   1     -         mpt#0
32   0x20 2          Edg IPI    all 1     -         cmi_cmci_trap
160  0xa0 0          Edg IPI    all 0     -         poke_cpu
208  0xd0 14         Edg IPI    all 1     -         kcpc_hw_overflow_intr
209  0xd3 14         Edg IPI    all 1     -         cbe_fire
210  0xd4 14         Edg IPI    all 1     -         cbe_fire
240  0xe0 15         Edg IPI    all 1     -         xc_serv
241  0xe1 15         Edg IPI    all 1     -         apic_error_intr

cs2dsb@chronos:~$ echo ::interrupts | pfexec mdb -k
IRQ  Vect IPL Bus    Trg Type   CPU Share APIC/INT# ISR(s)
9    0x80 9   PCI    Lvl Fixed  1   1     0x0/0x9   acpi_wrapper_isr
11   0xd1 14  PCI    Lvl Fixed  2   1     0x0/0xb   hpet_isr
16   0x84 9   PCI    Lvl Fixed  7   1     0x0/0x10  uhci_intr
18   0x82 9   PCI    Lvl Fixed  5   2     0x0/0x12  uhci_intr, ehci_intr
19   0x86 9   PCI    Lvl Fixed  3   6     0x0/0x13  uhci_intr, uhci_intr, ata_intr, ata_intr, ata_intr, ata_intr
21   0x85 9   PCI    Lvl Fixed  0   1     0x0/0x15  uhci_intr
23   0x83 9   PCI    Lvl Fixed  6   2     0x0/0x17  uhci_intr, ehci_intr
24   0x81 7   PCI    Edg MSI    4   1     -         pcieb_intr_handler
25   0x60 6   PCI    Edg MSI    1   1     -         e1000g_intr_pciexpress
26   0x61 6   PCI    Edg MSI    2   1     -         e1000g_intr_pciexpress
27   0x40 5   PCI    Edg MSI    3   1     -         mpt_intr
32   0x20 2          Edg IPI    all 1     -         cmi_cmci_trap
160  0xa0 0          Edg IPI    all 0     -         poke_cpu
208  0xd0 14         Edg IPI    all 1     -         kcpc_hw_overflow_intr
209  0xd3 14         Edg IPI    all 1     -         cbe_fire
210  0xd4 14         Edg IPI    all 1     -         cbe_fire
240  0xe0 15         Edg IPI    all 1     -         xc_serv
241  0xe1 15         Edg IPI    all 1     -         apic_error_intr
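
To pick out just the shared lines from those tables, something like the
following rough sketch can be used. It assumes the column layout shown above
(Bus in column 4, Share in column 8), and shared_irqs is only an illustrative
name:

```shell
# List PCI interrupts whose Share count is greater than 1.
# shared_irqs is a hypothetical helper; it reads ::interrupts output on
# stdin and relies on the column layout shown above.
shared_irqs() {
    # Only rows with a Bus column of "PCI" have Share in field 8; header,
    # IPI and wrapped continuation lines fail the first test and are skipped.
    awk '$4 == "PCI" && $8 + 0 > 1 {
        print "IRQ " $1 " shared by " $8 " handlers (IPL " $3 ")"
    }'
}
```

Usage: echo ::interrupts -d | pfexec mdb -k | shared_irqs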

So, basically two questions:

1. How do I fix this IRQ issue so that I don't get those warnings during
boot up?
2. Is this likely to be the cause of the drive problems described above?

Any advice would be much appreciated.

Thanks,

Daniel