[OpenIndiana-discuss] received unsolicited ack for DL_UNITDATA_REQ on bnx0arl_dlpi_pending

James Carlson carlsonj at workingcode.com
Mon Apr 7 16:52:28 UTC 2014


On 04/07/14 09:02, Schweiss, Chip wrote:
> Were you able to resolve the cause of this?
> 
> I found this thread today when one of my servers started having the same
> problem.   Hard lockups with nearly the same message on the console.
> 
> My affected server is a Supermicro 6037R-TXRF with the X9DRX+-F
> motherboard.   It has 2 Intel NICs.  The console message reads:
> 
> received unsolicited ack for DL_UNITDATA_REQ on igp0arl_dlpi_pending
> 
> It repeats 4 times and the system locks.
> 
> This system is completely unloaded and the problem occurs after only a few
> minutes of uptime.
> 
> Thanks for any additional info you can provide.

You replied to my message, and I don't think you were expecting an
answer from me, but just in case: no, I haven't seen any resolution to
this posted anywhere.

I assume you typed the message above by hand.  It says "igp" in your
message, but I know of no such driver on any Solaris derivative.  The
closest match is "igb" (not "igp"), which is supposed to bind to the
Intel i350 hardware.  I believe that's what your motherboard has, but I
don't have that motherboard, nor do I have the output of something
definitive like "scanpci," so I can't tell for sure what's there.

Looking at the code, it seems that this message can be generated in
exactly two cases (7 being the value of DL_UNITDATA_REQ):

  - driver sent DL_ERROR_ACK with dl_error_primitive == 7
  - driver sent DL_OK_ACK with dl_correct_primitive == 7

(In all other cases, the primitive value we print is hard-wired to
DL_NOTIFY_REQ, DL_INFO_REQ, or DL_BIND_REQ, so the message you see
wouldn't have said "DL_UNITDATA_REQ.")
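
To make the two cases concrete, here is a minimal C sketch of the check
they describe.  It is not the actual ARP/IP code: the struct names and
the helper sketch_acks_unitdata_req() are hypothetical, and uint32_t
stands in for t_uscalar_t.  Only the primitive values and the field
names (dl_error_primitive, dl_correct_primitive) follow <sys/dlpi.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Primitive values as defined in <sys/dlpi.h>. */
#define DL_ERROR_ACK    0x05
#define DL_OK_ACK       0x06
#define DL_UNITDATA_REQ 0x07

/*
 * Simplified stand-ins for dl_error_ack_t and dl_ok_ack_t.
 * Assumption: uint32_t approximates t_uscalar_t.
 */
typedef struct {
    uint32_t dl_primitive;        /* DL_ERROR_ACK */
    uint32_t dl_error_primitive;  /* primitive that was in error */
    uint32_t dl_errno;            /* DLPI error code */
    uint32_t dl_unix_errno;       /* UNIX errno, if any */
} sketch_error_ack_t;

typedef struct {
    uint32_t dl_primitive;          /* DL_OK_ACK */
    uint32_t dl_correct_primitive;  /* primitive being acknowledged */
} sketch_ok_ack_t;

/*
 * Hypothetical helper: returns 1 if the message is an ack (positive or
 * negative) that names DL_UNITDATA_REQ as the acknowledged primitive,
 * i.e. one of the two cases that would print "DL_UNITDATA_REQ" in the
 * console message.  Returns 0 for anything else, including short
 * messages.
 */
int
sketch_acks_unitdata_req(const void *msg, size_t len)
{
    uint32_t prim;

    assert(msg != NULL);
    if (len < sizeof (prim))
        return (0);
    memcpy(&prim, msg, sizeof (prim));

    if (prim == DL_ERROR_ACK && len >= sizeof (sketch_error_ack_t)) {
        sketch_error_ack_t ea;
        memcpy(&ea, msg, sizeof (ea));
        return (ea.dl_error_primitive == DL_UNITDATA_REQ);
    }
    if (prim == DL_OK_ACK && len >= sizeof (sketch_ok_ack_t)) {
        sketch_ok_ack_t oa;
        memcpy(&oa, msg, sizeof (oa));
        return (oa.dl_correct_primitive == DL_UNITDATA_REQ);
    }
    return (0);
}
```

A check along these lines, run against the acks a driver sends
upstream, would show which of the two cases is actually occurring.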

In either event, we sure weren't expecting that to happen.  My guess
(and it's just a guess based on roughly 9 years working on that code in
the past) is that the driver is erroneously sending DL_ERROR_ACK due to
some internal race condition.  It shouldn't be doing that, as all of
these interfaces are supposed to be in DLPI "connectionless service" mode.

Either way, it looks like the code inside ARP (buried in the IP module
for somewhat historical reasons) does something entirely reasonable: it
discards the message and drives on.  I don't see how that code could
cause the system to lock up.  It might cause an affected Ethernet
interface to become non-responsive if the IP module and the DLPI driver
were out of sync about the current state (e.g., the driver thought it
replied to a request that required a response, but the response was
malformed so the IP module dropped it), but it shouldn't cause the
system itself to lock up.

In any event, it sounds like a driver bug of some sort.  If careful
dtrace work and source code inspection don't reveal the problem, then
someone who has a system experiencing this problem is going to have to
instrument the driver and find out what's going on.

-- 
James Carlson         42.703N 71.076W         <carlsonj at workingcode.com>


