[OpenIndiana-discuss] VMware
Edward Ned Harvey (openindiana)
openindiana at nedharvey.com
Mon Aug 12 10:51:48 UTC 2013
> -----Original Message-----
> From: James Relph [mailto:james at themacplace.co.uk]
> Sent: Sunday, August 11, 2013 10:59 AM
>
> although would we lose pings
> with that (had pings running to test for a network issue and never had packet
> loss)? It's a bit of a puzzler!
Hold on now ...
> From: James Relph [mailto:james at themacplace.co.uk]
> Sent: Sunday, August 11, 2013 12:59 PM
>
>dedicated physical 10Gb network for iSCSI/NFS traffic, with 4x 10Gb
> links (in an LACP bond) per device. Should be pretty solid really.
I think we found your smoking gun. You're getting ping loss on a local network, and you're using 4x 10Gb LACP bonded network. And for some reason you say "should be pretty solid." What you've described is basically the definition of unstable, if you ask me.
Before anything else, know this: In LACP, only one network interface can be used per data stream. So if you have a server with LACP, a single client can go up to 10Gb, and if you have 4 clients simultaneously, they can each go up to 10Gb. You cannot push 40Gb to a single client.
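To make the one-interface-per-stream point concrete, here is a rough sketch of how a bond might pin a stream to a member link. The link names and the CRC32-over-address-and-port hash are made up for illustration. Real switches and OSes each use their own hashing policy, but the consequence is the same: a given stream always lands on the same single link.

```python
import zlib

# Hypothetical names for the four bonded 10Gb member links.
LINKS = ["link0", "link1", "link2", "link3"]


def pick_link(src_ip, dst_ip, src_port, dst_port, links=LINKS):
    """Pick one member link for a stream using a deterministic hash.

    Illustrative policy only: the same stream always maps to the same
    link, so one stream can never exceed one link's speed.
    """
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return links[zlib.crc32(key) % len(links)]


# Every packet of this (made-up) iSCSI stream goes out the same link.
stream = ("10.0.0.5", "10.0.0.1", 51234, 3260)
print(pick_link(*stream))
print(pick_link(*stream))
```

Note that two different clients can also hash onto the same link, in which case they share that one link's 10Gb.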
Also, each of your hard disks can sustain only about 1Gbit/sec. So every 10 disks you have in the server add up to a single 10Gb network interface. It is absolutely pointless to use LACP in this situation unless you have a huge honking server. (Meaning >40 disks).
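The arithmetic behind that, as a quick sketch. The 1Gbit-per-disk figure is the rule of thumb from this post, not a measurement of your disks, and the function names are just for illustration.

```python
import math

DISK_GBIT = 1.0   # assumed sustained throughput per disk (rule of thumb)
LINK_GBIT = 10.0  # speed of one bonded member link


def links_filled(num_disks, disk_gbit=DISK_GBIT, link_gbit=LINK_GBIT):
    """How many 10Gb links' worth of traffic the disks can supply."""
    return num_disks * disk_gbit / link_gbit


def disks_to_fill(num_links, disk_gbit=DISK_GBIT, link_gbit=LINK_GBIT):
    """How many disks it takes to saturate the given number of links."""
    return math.ceil(num_links * link_gbit / disk_gbit)


print(links_filled(10))   # ten disks against one link
print(disks_to_fill(4))   # disks needed before a 4x 10Gb bond matters
```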
In my experience, LACP is usually unstable, unless you buy a really expensive switch and QA test the hell out of your configuration before using it. I hear lots of people say their LACP is stable and reliable where they are - but it's only because they have never tested it and haven't noticed the problems. The problems are specifically as you've described. Occasional packet loss, which people tend to think is ok, but in reality, the only acceptable level of packet loss is 0%.
Here's what you need to do:
Figure out how to observe & clear the error counters on all the network interfaces. Log in to the switch to measure them there ... Log in to the server to measure them there ... Log in to each client to measure them there. Reset them all to 0. And then start hammering the shit out of the whole system. Get all the clients to drive the network hard, both transmit and receive. If you see error counters increasing, you have a problem.
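If some device won't let you reset its counters, you can get the same answer by snapshotting them before and after the load test and comparing. A minimal sketch, assuming you have already collected the error counters into dictionaries by whatever means each box offers. The interface and counter names below are placeholders, not output from any real tool.

```python
def counter_deltas(before, after):
    """Return {(interface, counter): increase} for counters that grew.

    Pass in error counters only; packet and byte counters are supposed
    to grow under load.
    """
    deltas = {}
    for iface, counters in after.items():
        for name, value in counters.items():
            grew = value - before.get(iface, {}).get(name, 0)
            if grew > 0:
                deltas[(iface, name)] = grew
    return deltas


# Placeholder snapshots taken before and after the load test.
before = {"nic0": {"ierrors": 0, "oerrors": 0},
          "nic1": {"ierrors": 0, "oerrors": 0}}
after = {"nic0": {"ierrors": 0, "oerrors": 0},
         "nic1": {"ierrors": 17, "oerrors": 0}}

for (iface, name), grew in sorted(counter_deltas(before, after).items()):
    print(f"{iface} {name} increased by {grew}")
```

Anything this prints is a problem to chase down. An empty result on every switch port, server port and client is what you want.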
Based on what you've said so far, I guarantee you're going to see error counters increasing. Unless you ignore my advice and don't do these tests ... because these tests are difficult to do, and like I said, several times I've seen sysadmins swear their system was reliable, only to be proven wrong when *actually* put to the test.
Something else I encounter a lot: On mailing lists exactly like this one, I say something just like the above, and other people come back and argue about it, *insisting* that it's ok to have occasional packet loss, on a LAN or WAN. I swear to you, as an IT consultant, this provides a lot of my sustenance - I get called into places with either storage problems or internet problems, and if there is packet loss >0%, that is ultimately the root cause of their problem. Never seen an exception.
Because this claim invariably leads to argument, I won't respond to any of it. I've simply grown tired of arguing about it with other people elsewhere. It's definitely a trending pattern. The way I see it, I provide you free advice on a mailing list; if you don't take it, so be it. I continue to get paid.