[OpenIndiana-discuss] SATA device errors, possibly due to IRQ conflict

Thu Sep 29 20:18:05 UTC 2011

Right, finally got it all booting again ;)

I have upgraded the BIOS to the latest version and set SATA#1 to ACPI
version 3.0.

It's now back up and both tank and rpool are happy so far and the IRQ errors
are gone from /var/adm/messages.
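
For anyone wanting to confirm the same thing on their own box, here is a rough sketch of how the IRQ sharing notices could be counted per IRQ. It assumes the notice text matches the line quoted further down this thread, and it matches only the start of the message in case the log wraps it. The function name is made up for illustration.

```python
import re
from collections import Counter

# Assumption: the notice starts with this text, as quoted later in the thread.
# Only the start of the message is matched in case the log wraps the rest.
NOTICE = re.compile(r"NOTICE: (IRQ\d+) is being shared by drivers")


def count_irq_notices(lines):
    """Return a Counter mapping IRQ name (e.g. 'IRQ19') to notice count."""
    counts = Counter()
    for line in lines:
        match = NOTICE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing it an open file handle for /var/adm/messages (read as root or via pfexec) should give the totals. An empty result after a reboot would suggest the notices really are gone, not just scrolled out of view.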

The only lines that concern me now are:

Sep 29 21:05:34 chronos acpica: [ID 642512 kern.notice] ACPI Warning:
Incorrect checksum in table [OEMB] - 8B, should be 88 (20091112/tbutils-351)

And:

Sep 29 21:06:36 chronos scsi: [ID 365881 kern.info] /pci@0,0/pci8086,3410@9/pci15d9,5@0 (mpt0):
Sep 29 21:06:36 chronos         Rev. 8 LSI, Inc. 1068E found.
Sep 29 21:06:36 chronos scsi: [ID 365881 kern.info] /pci@0,0/pci8086,3410@9/pci15d9,5@0 (mpt0):
Sep 29 21:06:36 chronos         mpt0 supports power management.
Sep 29 21:06:36 chronos scsi: [ID 243001 kern.info] /pci@0,0/pci8086,3410@9/pci15d9,5@0 (mpt0):
Sep 29 21:06:36 chronos         DMA can't cross 4GB boundary due to errata

But I've searched for both and they have been present on boot ever since the
system was built. Hopefully they aren't issues I need to worry about?

I'm going to kick off a scrub before I go to bed tonight and hopefully that
will be clear and the pool is still functional in the morning!
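
A minimal sketch of how the morning-after check could be scripted, assuming `zpool status -x` prints "all pools are healthy" when nothing is wrong. The function names are hypothetical and this only covers the healthy/unhealthy distinction, not the scrub result itself.

```python
import subprocess


def pools_healthy(status_output):
    """True if the text from `zpool status -x` reports no problems."""
    return status_output.strip() == "all pools are healthy"


def check_pools():
    """Run `zpool status -x` and report whether every pool is healthy."""
    result = subprocess.run(
        ["zpool", "status", "-x"],
        capture_output=True,
        text=True,
        check=True,
    )
    return pools_healthy(result.stdout)
```

If `check_pools()` returns False, the full `zpool status` output is still the place to look for which drives faulted.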

Thanks for the help, I'll post back if the issue persists.

Thanks,

Daniel

On 29 September 2011 17:53, Daniel <cs2dsb at gmail.com> wrote:

>> Since your BIOS doesn't reliably detect drives on reboot, I would start by
>> looking for firmware upgrades.
>>
>> SMC likes to say don't upgrade your firmware unless you have a problem; I
>> bet you meet that requirement.
>>
>> j
>>
>
> Thanks, I'm in the process of trying to do this but I'm having some trouble
> getting it to recognize any of the bootable disks I've created. I'll dig out
> a USB stick and try that since I'm having no joy with the virtual CD/floppy
> options.
>
>
>> In your BIOS you should have a setting for "Interrupt 19 Capture" and, I
>> believe, the default setting is "Enable".  Change it to "Disable".  This
>> will disable your ability to boot off your SAS controller but you don't do
>> that right now anyway.
>>
>> Good luck!
>>
>> -Russ
>>
>
> I've done this but it gives the same errors.
>
>
>
>> After some further digging, there appears to be a similar issue on the FreeBSD
>> side of things with the Tylersburg chipset (found on the X8ST3-F).
>>
>> http://lists.freebsd.org/pipermail/freebsd-current/2009-July/009946.html
>>
>> The USB devices and the SATA devices all contend with IRQ19 (as seen by
>> uhci and pci-ide all piled together).  Might it be possible to switch your
>> SATA mode to AHCI rather than IDE?  That will use a different driver and
>> subsequently might use a different interrupt.
>>
>> -Russ
>
>
> I believe it is possible to change the SATA port types in the BIOS but, from
> memory, when I did this previously it prevented the OS (OpenSolaris at the
> time) from booting because the rpool physical path changed, and I had to go
> in and modify something with a Live disk to make it work. I will see if I
> can dig out my notes or find an article on this.
>
> Thanks for the advice so far guys. I'll let you know when I make some
> progress.
>
> Cheers,
>
> Daniel
> On 29 September 2011 17:36, Russell Hansen <russhan at new-swankton.net> wrote:
>
>> After some further digging, there appears to be a similar issue on the FreeBSD
>> side of things with the Tylersburg chipset (found on the X8ST3-F).
>>
>> http://lists.freebsd.org/pipermail/freebsd-current/2009-July/009946.html
>>
>> The USB devices and the SATA devices all contend with IRQ19 (as seen by
>> uhci and pci-ide all piled together).  Might it be possible to switch your
>> SATA mode to AHCI rather than IDE?  That will use a different driver and
>> subsequently might use a different interrupt.
>>
>> -Russ
>>
>> ________________________________
>>
>> From: Russell Hansen [mailto:russhan at new-swankton.net]
>> Sent: Thu 9/29/2011 8:59 AM
>> To: Discussion list for OpenIndiana
>> Subject: Re: [OpenIndiana-discuss] SATA device errors,possibly due to IRQ
>> conflict
>>
>>
>>
>> In your BIOS you should have a setting for "Interrupt 19 Capture" and, I
>> believe, the default setting is "Enable".  Change it to "Disable".  This
>> will disable your ability to boot off your SAS controller but you don't do
>> that right now anyway.
>>
>> Good luck!
>>
>> -Russ
>>
>> ________________________________
>>
>> From: Daniel [mailto:cs2dsb at gmail.com]
>> Sent: Thu 9/29/2011 2:31 AM
>> To: openindiana-discuss at openindiana.org
>> Subject: [OpenIndiana-discuss] SATA device errors,possibly due to IRQ
>> conflict
>>
>>
>>
>> Hi,
>>
>> I've got a server running OpenIndiana 148 on a Supermicro X8ST3-F that has
>> been working perfectly for months right up until I added some more storage.
>>
>> The board has 6 * SATA ports and 8 * SAS ports. Previously all the drives in
>> my storage pool were attached to the 8 SAS ports and only my rpool drive was
>> using one of the SATA ports.
>>
>> Now that I have added another 4 drives I've had to connect them to the SATA
>> ports - this is when the system started to become unstable.
>>
>> I have had periods of very heavy usage that have caused no problems
>> whatsoever (for example, I copied 4 TB of data on to the pool, most of which
>> would have had to go on the new drives, then did several scrubs over the
>> next few days). The system seems perfectly happy to sustain a 350 MB/s+ read
>> or write (or a bit of both) for hours on end with no errors at all. Then
>> other times, typically overnight or early morning when it's just ticking
>> over with < 500 KB/s read/write, it will fall apart.
>>
>> There are three kinds of failure I'm experiencing, seemingly randomly:
>>
>> 1. Errors about failed read/write on 2 or 4 SATA drives in /var/adm/messages
>> and system I/O hung - system has to have the power cut to recover - ssh
>> won't connect, can't get past the username prompt on the terminal. No ZFS
>> errors reported.
>> 2. Errors about failed read/write, system I/O NOT hung, ZFS reporting faulted
>> drives (2 or 4) and hundreds of thousands of errors. In this scenario, the
>> machine can be rebooted cleanly BUT the failed drives don't get detected by
>> the BIOS. Usually a full power down, wait 30 seconds, power back up will
>> allow the drives to be detected again. When it powers back up ZFS will
>> report lots of errors but sort itself out after a resilver - I haven't
>> actually had any permanent data loss yet, ZFS has always recovered.
>> 3. No errors at all in either /var/adm/messages or zpool status but hung
>> I/O.
>>
>>
>> I've swapped the drive connections around to prove it isn't the new disks
>> that are at fault and this has confirmed that it's whichever devices are
>> connected to the SATA controller that are having the problem.
>>
>> When I rebooted the machine after the latest failure I checked
>> /var/adm/messages and there are thousands of messages (9995 in total, but
>> that may be from several reboots) identical to the following:
>>
>> "[ID 954099 kern.info] NOTICE: IRQ19 is being shared by drivers with
>> different interrupt levels."
>>
>> In case it's useful:
>>
>> cs2dsb@chronos:~$ echo ::interrupts -d | pfexec mdb -k
>> IRQ  Vect IPL Bus    Trg Type   CPU Share APIC/INT# Driver Name(s)
>> 9    0x80 9   PCI    Lvl Fixed  1   1     0x0/0x9   acpi_wrapper_isr
>> 11   0xd1 14  PCI    Lvl Fixed  2   1     0x0/0xb   hpet_isr
>> 16   0x84 9   PCI    Lvl Fixed  7   1     0x0/0x10  uhci#0
>> 18   0x82 9   PCI    Lvl Fixed  5   2     0x0/0x12  uhci#5, ehci#0
>> 19   0x86 9   PCI    Lvl Fixed  3   6     0x0/0x13  uhci#4, uhci#2, pci-ide#0, pci-ide#1, pci-ide#1, pci-ide#0
>> 21   0x85 9   PCI    Lvl Fixed  0   1     0x0/0x15  uhci#1
>> 23   0x83 9   PCI    Lvl Fixed  6   2     0x0/0x17  uhci#3, ehci#1
>> 24   0x81 7   PCI    Edg MSI    4   1     -         pcieb#4
>> 25   0x60 6   PCI    Edg MSI    1   1     -         e1000g#0
>> 26   0x61 6   PCI    Edg MSI    2   1     -         e1000g#1
>> 27   0x40 5   PCI    Edg MSI    3   1     -         mpt#0
>> 32   0x20 2          Edg IPI    all 1     -         cmi_cmci_trap
>> 160  0xa0 0          Edg IPI    all 0     -         poke_cpu
>> 208  0xd0 14         Edg IPI    all 1     -         kcpc_hw_overflow_intr
>> 209  0xd3 14         Edg IPI    all 1     -         cbe_fire
>> 210  0xd4 14         Edg IPI    all 1     -         cbe_fire
>> 240  0xe0 15         Edg IPI    all 1     -         xc_serv
>> 241  0xe1 15         Edg IPI    all 1     -         apic_error_intr
>>
>> cs2dsb@chronos:~$ echo ::interrupts | pfexec mdb -k
>> IRQ  Vect IPL Bus    Trg Type   CPU Share APIC/INT# ISR(s)
>> 9    0x80 9   PCI    Lvl Fixed  1   1     0x0/0x9   acpi_wrapper_isr
>> 11   0xd1 14  PCI    Lvl Fixed  2   1     0x0/0xb   hpet_isr
>> 16   0x84 9   PCI    Lvl Fixed  7   1     0x0/0x10  uhci_intr
>> 18   0x82 9   PCI    Lvl Fixed  5   2     0x0/0x12  uhci_intr, ehci_intr
>> 19   0x86 9   PCI    Lvl Fixed  3   6     0x0/0x13  uhci_intr, uhci_intr, ata_intr, ata_intr, ata_intr, ata_intr
>> 21   0x85 9   PCI    Lvl Fixed  0   1     0x0/0x15  uhci_intr
>> 23   0x83 9   PCI    Lvl Fixed  6   2     0x0/0x17  uhci_intr, ehci_intr
>> 24   0x81 7   PCI    Edg MSI    4   1     -         pcieb_intr_handler
>> 25   0x60 6   PCI    Edg MSI    1   1     -         e1000g_intr_pciexpress
>> 26   0x61 6   PCI    Edg MSI    2   1     -         e1000g_intr_pciexpress
>> 27   0x40 5   PCI    Edg MSI    3   1     -         mpt_intr
>> 32   0x20 2          Edg IPI    all 1     -         cmi_cmci_trap
>> 160  0xa0 0          Edg IPI    all 0     -         poke_cpu
>> 208  0xd0 14         Edg IPI    all 1     -         kcpc_hw_overflow_intr
>> 209  0xd3 14         Edg IPI    all 1     -         cbe_fire
>> 210  0xd4 14         Edg IPI    all 1     -         cbe_fire
>> 240  0xe0 15         Edg IPI    all 1     -         xc_serv
>> 241  0xe1 15         Edg IPI    all 1     -         apic_error_intr
>>
>>
>> So, basically two questions:
>>
>> 1. How do I fix this IRQ issue so that I don't get those warnings during
>> boot up?
>> 2. Is this likely to be the cause of the drive problems described above?
>>
>> Any advice would be much appreciated.
>>
>> Thanks,
>>
>> Daniel
>> _______________________________________________
>> OpenIndiana-discuss mailing list
>> OpenIndiana-discuss at openindiana.org
>> http://openindiana.org/mailman/listinfo/openindiana-discuss
>