[OpenIndiana-discuss] VMware

Mon Aug 12 20:46:50 UTC 2013

> I think we found your smoking gun.  You're getting ping loss on a local network, and you're using 4x 10Gb LACP bonded network.  And for some reason you say "should be pretty solid."  What you've described is basically the definition of unstable, if you ask me.

No, we're not getting any ping loss; that's the thing.  The network looks entirely faultless.  We've run pings for 24 hours without losing a single one.

> Before anything else, know this:  In LACP, only one network interface can be used per data stream.  So if you have a server with LACP, then each client can go up to 10Gb, but if you have 4 clients simultaneously, they can each go up to 10Gb.  You cannot push 40Gb to a single client.

Each storage server has 5 clients.
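
To make the per-stream point concrete for our 5-clients-per-server case, here is a rough sketch of how a bond pins each flow to one member link.  The hash below (CRC32 over the address pair) and the IP addresses are made up for illustration; real switches and NICs use their own vendor-specific hash policy over MAC/IP/port fields.

```python
import zlib

LINKS = 4        # member links in the bond (from the thread)
LINK_GBIT = 10   # speed of each member link in Gbit/s (from the thread)

def pick_link(src_ip: str, dst_ip: str, links: int = LINKS) -> int:
    """Hypothetical hash policy: the same address pair always lands on the same link."""
    key = f"{src_ip}->{dst_ip}".encode()
    return zlib.crc32(key) % links

def spread(server_ip: str, client_ips: list[str]) -> dict[int, list[str]]:
    """Group clients by the member link their flow to the server hashes onto."""
    usage: dict[int, list[str]] = {i: [] for i in range(LINKS)}
    for client in client_ips:
        usage[pick_link(server_ip, client)].append(client)
    return usage

# Five made-up client addresses talking to one made-up storage server.
clients = [f"10.0.0.{n}" for n in range(11, 16)]
usage = spread("10.0.0.1", clients)

for link, members in usage.items():
    # Clients sharing a link split that link's 10 Gbit/s between them.
    print(f"link {link}: {len(members)} client(s) {members}")
```

Whatever the hash, 5 clients over 4 links means at least two clients end up sharing one 10Gb link, and no single client ever gets more than one link's worth.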

> Also, your hard disks are all 1Gbit.  So every 10 disks you have in the server add up to a single 10Gb network interface.  It is absolutely pointless to use LACP in this situation unless you have a huge honking server.  (Meaning >40 disks).

They've got 38 disks.
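
Taking the quoted 1Gbit-per-disk rule of thumb at face value (that figure is the assumption here, not something we measured), the arithmetic for 38 disks against a 4x 10Gb bond works out as follows:

```python
DISKS = 38        # disks per storage server (from the thread)
DISK_GBIT = 1.0   # per-disk throughput in Gbit/s, the rule of thumb quoted above
LINK_GBIT = 10    # speed of one bond member in Gbit/s
LINKS = 4         # members in the bond

aggregate_gbit = DISKS * DISK_GBIT          # what the disks could deliver in total
links_filled = aggregate_gbit / LINK_GBIT   # how many 10Gb links that would fill
bond_gbit = LINKS * LINK_GBIT               # nominal capacity of the bond
headroom_gbit = bond_gbit - aggregate_gbit  # spare bond capacity on paper

print(f"disks: {aggregate_gbit:.0f} Gbit/s, about {links_filled:.1f} links' worth")
print(f"bond:  {bond_gbit} Gbit/s, headroom {headroom_gbit:.0f} Gbit/s")
```

So on that rule of thumb the disks come to just under the four links' worth.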

> In my experience, LACP is usually unstable, unless you buy a really expensive switch

The switches are pretty expensive, we've got Arista switches and SolarFlare NICs in the servers (well, the bond is across a SolarFlare NIC and an Intel NIC).

> and QA test the hell out of your configuration before using it.  I hear lots of people say their LACP is stable and reliable where they are - but it's only because they have never tested it and haven't noticed the problems.  The problems are specifically as you've described.  Occasional packet loss, which people tend to think is ok, but in reality, the only acceptable level of packet loss is 0%.

Yep, 0% packet loss.  Sorry if I've mis-worded something somewhere, but there are definitely no dropped packets.

> 
> Figure out how to observe & clear the error counters on all the network interfaces.  Login to the switch to measure them there ...  Login to the server to measure them there ...  Login to each client to measure them there.  Reset them all to 0.  And then start hammering the shit out of the whole system.  Get all the clients to drive the network hard, both transmit and receive.  If you see error counters increasing, you have a problem.

I'll double-check, but I'm pretty sure that we've reset the counters and witnessed no CRC errors over test periods, even when hammering the system.
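
For the double-check, the comparison itself is simple: snapshot the counters before the load test, snapshot them again afterwards, and flag anything that grew.  A minimal sketch follows; the interface names, counter names and numbers are invented for illustration, and how the snapshots are collected from the switch, server and clients is left out because it depends on each platform's own tooling.

```python
Snapshot = dict[str, dict[str, int]]  # interface name -> counter name -> value

def counter_increases(before: Snapshot, after: Snapshot) -> dict[tuple[str, str], int]:
    """Return every (interface, counter) whose value grew between two snapshots."""
    grew: dict[tuple[str, str], int] = {}
    for iface, counters in after.items():
        baseline = before.get(iface, {})
        for name, value in counters.items():
            # A counter missing from the first snapshot is treated as starting at 0.
            delta = value - baseline.get(name, 0)
            if delta > 0:
                grew[(iface, name)] = delta
    return grew

# Made-up snapshots: everything reset to 0, then one counter moves under load.
before = {
    "nic0": {"crc_errors": 0, "rx_dropped": 0},
    "nic1": {"crc_errors": 0, "rx_dropped": 0},
}
after = {
    "nic0": {"crc_errors": 0, "rx_dropped": 0},
    "nic1": {"crc_errors": 3, "rx_dropped": 0},
}

problems = counter_increases(before, after)
for (iface, name), delta in problems.items():
    print(f"{iface}: {name} increased by {delta}")
```

An empty result across the switch, the server and every client is what "no errors" should look like after a hard test run.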

James.