[OpenIndiana-discuss] SATA device errors, possibly due to IRQ conflict
Daniel
cs2dsb at gmail.com
Thu Sep 29 09:31:37 UTC 2011
Hi,
I've got a server running OpenIndiana 148 on a Supermicro *X8ST3-F* that has
been working perfectly for months right up until I added some more storage.
The board has 6 * SATA ports and 8 * SAS ports. Previously all the drives in
my storage pool were attached to the 8 SAS ports and only my rpool drive was
using one of the SATA ports.
Now that I have added another 4 drives I've had to connect them to the SATA
ports - this is when the system started to become unstable.
I have had periods of very heavy usage that have caused no problems
whatsoever (for example, I copied 4 TB of data on to the pool, most of which
would have had to go on the new drives then did several scrubs over the next
few days). The system seems perfectly happy to sustain 350 MB/s+ of reads or
writes (or a bit of both) for hours on end with no errors at all. At other
times, typically overnight or early morning when it's just ticking over with
< 500 KB/s of read/write, it will fall apart.
There are three kinds of failure I'm experiencing, seemingly randomly:
1. Errors about failed reads/writes on 2 or 4 SATA drives in /var/adm/messages
and system I/O hung. The system has to have the power cut to recover: ssh won't
connect and I can't get past the username prompt on the terminal. No ZFS errors
are reported.
2. Errors about failed reads/writes, system I/O NOT hung, ZFS reporting faulted
drives (2 or 4) and hundreds of thousands of errors. In this scenario, the
machine can be rebooted cleanly BUT the failed drives don't get detected by
the BIOS. Usually a full power down, wait 30 seconds, power back up will allow
the drives to be detected again. When it powers back up ZFS will report lots
of errors but sort itself out after a resilver. I haven't actually had any
permanent data loss yet; ZFS has always recovered.
3. No errors at all in either /var/adm/messages or zpool status, but hung I/O.
I've swapped the drive connections around to prove it isn't the new disks
that are at fault and this has confirmed that it's whichever devices are
connected to the SATA controller that are having the problem.
When I rebooted the machine after the latest failure I checked the
/var/adm/messages and there are thousands of messages (9995 in total, but that
may be from several reboots) identical to the following:
"[ID 954099 kern.info] NOTICE: IRQ19 is being shared by drivers with
different interrupt levels."
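For anyone wanting to reproduce that count, here is a minimal sketch. The function name count_irq_share_notices is purely illustrative, and the default path assumes the log lives in /var/adm/messages as it does on this machine:

```shell
# Count "IRQ is being shared" notices in a log file.
# The function name is hypothetical; the default path is an assumption.
count_irq_share_notices() {
    log="${1:-/var/adm/messages}"
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c 'is being shared by drivers with different interrupt levels' "$log"
}
```

Running count_irq_share_notices with no argument reads /var/adm/messages; passing a rotated log file as the first argument counts that file instead. It gives a total across the whole file, so it cannot separate one boot from another.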
In case it's useful:
cs2dsb@chronos:~$ echo ::interrupts -d | pfexec mdb -k
IRQ Vect IPL Bus Trg Type CPU Share APIC/INT# Driver Name(s)
9 0x80 9 PCI Lvl Fixed 1 1 0x0/0x9 acpi_wrapper_isr
11 0xd1 14 PCI Lvl Fixed 2 1 0x0/0xb hpet_isr
16 0x84 9 PCI Lvl Fixed 7 1 0x0/0x10 uhci#0
18 0x82 9 PCI Lvl Fixed 5 2 0x0/0x12 uhci#5, ehci#0
19 0x86 9 PCI Lvl Fixed 3 6 0x0/0x13 uhci#4, uhci#2, pci-ide#0, pci-ide#1, pci-ide#1, pci-ide#0
21 0x85 9 PCI Lvl Fixed 0 1 0x0/0x15 uhci#1
23 0x83 9 PCI Lvl Fixed 6 2 0x0/0x17 uhci#3, ehci#1
24 0x81 7 PCI Edg MSI 4 1 - pcieb#4
25 0x60 6 PCI Edg MSI 1 1 - e1000g#0
26 0x61 6 PCI Edg MSI 2 1 - e1000g#1
27 0x40 5 PCI Edg MSI 3 1 - mpt#0
32 0x20 2 Edg IPI all 1 - cmi_cmci_trap
160 0xa0 0 Edg IPI all 0 - poke_cpu
208 0xd0 14 Edg IPI all 1 - kcpc_hw_overflow_intr
209 0xd3 14 Edg IPI all 1 - cbe_fire
210 0xd4 14 Edg IPI all 1 - cbe_fire
240 0xe0 15 Edg IPI all 1 - xc_serv
241 0xe1 15 Edg IPI all 1 - apic_error_intr
cs2dsb@chronos:~$ echo ::interrupts | pfexec mdb -k
IRQ Vect IPL Bus Trg Type CPU Share APIC/INT# ISR(s)
9 0x80 9 PCI Lvl Fixed 1 1 0x0/0x9 acpi_wrapper_isr
11 0xd1 14 PCI Lvl Fixed 2 1 0x0/0xb hpet_isr
16 0x84 9 PCI Lvl Fixed 7 1 0x0/0x10 uhci_intr
18 0x82 9 PCI Lvl Fixed 5 2 0x0/0x12 uhci_intr, ehci_intr
19 0x86 9 PCI Lvl Fixed 3 6 0x0/0x13 uhci_intr, uhci_intr, ata_intr, ata_intr, ata_intr, ata_intr
21 0x85 9 PCI Lvl Fixed 0 1 0x0/0x15 uhci_intr
23 0x83 9 PCI Lvl Fixed 6 2 0x0/0x17 uhci_intr, ehci_intr
24 0x81 7 PCI Edg MSI 4 1 - pcieb_intr_handler
25 0x60 6 PCI Edg MSI 1 1 - e1000g_intr_pciexpress
26 0x61 6 PCI Edg MSI 2 1 - e1000g_intr_pciexpress
27 0x40 5 PCI Edg MSI 3 1 - mpt_intr
32 0x20 2 Edg IPI all 1 - cmi_cmci_trap
160 0xa0 0 Edg IPI all 0 - poke_cpu
208 0xd0 14 Edg IPI all 1 - kcpc_hw_overflow_intr
209 0xd3 14 Edg IPI all 1 - cbe_fire
210 0xd4 14 Edg IPI all 1 - cbe_fire
240 0xe0 15 Edg IPI all 1 - xc_serv
241 0xe1 15 Edg IPI all 1 - apic_error_intr
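To pick the shared lines out of that output quickly, a small filter along these lines should work. This is only a sketch: the function name shared_irqs is made up, and it assumes the column layout shown above, where rows of type Fixed have the share count in the eighth whitespace-separated field:

```shell
# Print "IRQ share-count" for each Fixed interrupt with more than one handler.
# Reads `::interrupts` output on stdin. The function name is hypothetical and
# the field positions assume the column layout shown in the output above.
shared_irqs() {
    # Field 6 is the Type column and field 8 the Share column on PCI rows;
    # header, MSI, IPI and wrapped continuation lines do not match.
    awk '$6 == "Fixed" && $8 + 0 > 1 { print $1, $8 }'
}
```

It would be used as: echo ::interrupts | pfexec mdb -k | shared_irqs. Against the output above it lists IRQ 18 and 23 with two handlers each and IRQ 19 with six.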
So, basically two questions:
1. How do I fix this IRQ issue so that I don't get those warnings during
boot up?
2. Is this likely to be the cause of the drive problems described above?
Any advice would be much appreciated.
Thanks,
Daniel