[OpenIndiana-discuss] Laptop locking up, possibly on rge device - rings a bell to anyone?

Sat Oct 20 11:21:59 UTC 2012

Update on fight with rge on my laptop:

I've tried to set acpi and rge options in "eeprom" (bootenv.rc),
and indeed bound the NIC to a particular CPU core. I've also
tried an alternate driver (gani), and disabled HW checksums.
None of these helped much.

The intermittent hangs still appear with both the stock rge and
the gani-2.6.9 drivers. Almost any activity on the NIC (about 180
to 220 Kb worth of downloads with wget or ssh) makes the driver
process or issue(?) between 75k and 110k interrupts per second,
eating 25%-35% of a CPU core and often locking up the mouse/kbd.
The lockups don't happen 100% of the time, and there is not always
even a noticeable lag. X11 screen updates still come through, so
I can watch this storm of interrupts in "vmstat" and "intrstat".
Sometimes the storms dissipate on their own (a watchdog timer in
the driver?), and often they disappear when I unplug the network,
wait several seconds (>5) and plug it back in. Apparently, this
requires any TCP sessions (ssh, rsync, wget, etc.) to be restarted.

Now I have not yet seen the networking disappear completely until
reboot, but its usability is still unpredictable and usually bad.

Ideas welcome, e.g. how can I trace what's happening during the
interrupt storm? I'd guess it is some infinite loop, small enough to
spin very quickly. It may be hardware-related, since it happens with
two drivers (rge and gani; I didn't check how close their code is)...
I wonder if ultimately this condition can be detected and aborted
early, i.e. within a second and without causing any loss of
networking for the upper layers.
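
To make the "detect it early" idea concrete, here is a minimal userland
sketch in Python. It is not how rge or gani work, and it is not driver
code. It assumes something already samples a cumulative interrupt counter
for the NIC a few times per second; the function names, the 50k/sec
threshold and the "two consecutive samples" rule are all hypothetical
choices made for illustration.

```python
from typing import List, Optional, Tuple

# One sample is (timestamp in seconds, cumulative interrupt count).
Sample = Tuple[float, int]


def interrupt_rates(samples: List[Sample]) -> List[Tuple[float, float]]:
    """Turn cumulative counter samples into (timestamp, interrupts/sec)."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip out-of-order or duplicate timestamps
        rates.append((t1, (c1 - c0) / dt))
    return rates


def detect_storm(samples: List[Sample],
                 threshold_per_sec: float = 50_000.0,  # assumed cutoff
                 min_consecutive: int = 2) -> Optional[float]:
    """Return the timestamp at which a storm is first confirmed, else None.

    A storm is "confirmed" once the rate has stayed at or above the
    threshold for min_consecutive intervals in a row, so a single busy
    interval does not trigger it.
    """
    streak = 0
    for t, rate in interrupt_rates(samples):
        if rate >= threshold_per_sec:
            streak += 1
            if streak >= min_consecutive:
                return t
        else:
            streak = 0
    return None


if __name__ == "__main__":
    # Made-up counter readings taken every 0.5 s; the middle intervals
    # run at 90k and 100k interrupts/sec, inside the range seen above.
    readings = [(0.0, 0), (0.5, 500), (1.0, 45_500), (1.5, 95_500), (2.0, 96_000)]
    print(detect_storm(readings))
```

With half-second samples the storm is confirmed one second after it
starts, which matches the "within a second" goal. What to do once it is
detected (reset the chip, mask the interrupt, etc.) is a separate
question that this sketch does not address.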

2012-09-28 10:48, Jim Klimov wrote:
> Thanks Marion for the pointers, I also figured it looks like
> an interrupt problem (back in MSDOS times, "kicking" a computer
> hung in a game by moving the mouse could unhang it ;) )
>
> However, now that I've checked, I don't see rge sharing an
> IRQ vector with anything. It is of an MSI type currently bound
> to CPU1 (and a pcieb is bound to CPU0 being the only other MSI
> interrupt); I wonder if the CPU binding matters for this bug,
> and if it can be controlled to test.
>
> The only shared interrupts I see are two ehci driver instances
> on one IRQ and three ohci instances on another.
>
> Also, this is not an NVidia box but an AMD/ATI one (with the
> integrated APU = CPU+GPU).
>
> I've had a boot after my email where again I worked for hours
> and intensively used the net without problems; then I had
> boots where net hung upon IOs (i.e. I could start an "scp"
> session from another machine and authenticate with a password,
> but the actual file copy hung it), and currently the link is
> down right from the bootup...
>
> Thanks for more ideas,
> //Jim
>
> 2012-09-27 6:50, Marion Hakanson wrote:
>> jimklimov at cos.ru said:
>>> . . .
>>>     Ultimately, after about one hour of such intermittent work with
>>> no actual usage on my behalf, the LAN interface went down and did
>>> not come back up until a full reboot (I did not try fastboot though).
>>> I have no idea if this will be reproducible :)
>>> . . .
>>
>> I wonder if this is similar to something I've seen, which I think was
>> eventually categorized as an interrupt-sharing problem.  On my systems,
>> the graphics locked up, along with USB mouse, and on one of them an
>> internal disk interface also had timeouts during the "freeze".  All
>> the affected devices were sharing the same IRQ.
>>
>> You can see what OI thinks your laptop is doing, interrupt-wise, via:
>>     echo "::interrupts -d" | mdb -k
>>
>> I'm not sure how one can fix it.  I was able to disable enough USB ports
>> in the BIOS on one machine to alleviate the IRQ-sharing, and the problem
>> stopped happening there.  Here's the (closed) bug report:
>>     https://www.illumos.org/issues/1625
>>
>> Regards,
>>
>> Marion
>