[OpenIndiana-discuss] Troubleshooting OpenIndiana network on vSphere 5.5
Chris Murray
chrismurray84 at gmail.com
Sun Oct 20 13:52:34 UTC 2013
Hi all,
I'm hoping for some troubleshooting advice. I have an OpenIndiana
oi_151a8 virtual machine which was functioning correctly on vSphere 5.1
but isn't now that the host runs vSphere 5.5 (ESXi-5.5.0-1331820-standard).
A small corner of my network infrastructure has a vSphere host upon
which live two virtual machines:
ape - "Debian Linux ape 2.6.32-5-amd64 #1 SMP Sun Sep 23 10:07:46 UTC
2012 x86_64 GNU/Linux", uses USB passthrough to read from a APC UPS and
e-mail me when power is lost
giraffe - oi_151a8, serves up virtual machine images over NFS.
Since the upgrade of vSphere from 5.1 to 5.5, virtual machines on other
hosts whose VMDKs are on this NFS mount are now very slow. PuTTY
sessions to the oi_151a8 VM also 'stutter', and I see patterns in ping
such as this:
Reply from 192.168.0.13: bytes=32 time=1367ms TTL=255
Reply from 192.168.0.13: bytes=32 time<1ms TTL=255
Reply from 192.168.0.13: bytes=32 time<1ms TTL=255
Reply from 192.168.0.13: bytes=32 time=1369ms TTL=255
Reply from 192.168.0.13: bytes=32 time<1ms TTL=255
Reply from 192.168.0.13: bytes=32 time<1ms TTL=255
Reply from 192.168.0.13: bytes=32 time=1356ms TTL=255
Reply from 192.168.0.13: bytes=32 time<1ms TTL=255
Reply from 192.168.0.13: bytes=32 time<1ms TTL=255
Reply from 192.168.0.13: bytes=32 time=1376ms TTL=255
Reply from 192.168.0.13: bytes=32 time<1ms TTL=255
Reply from 192.168.0.13: bytes=32 time<1ms TTL=255
Reply from 192.168.0.13: bytes=32 time<1ms TTL=255
Request timed out.
At the same time, pings to the neighbouring VM (ape), or to the host
itself, follow the normal "time<1ms" pattern, as do pings to other
random machines on the network. I've therefore ruled out the switch
infrastructure, and probably also the vSwitch inside this vSphere host,
given that the 'giraffe' VM exhibits the problem whereas 'ape' does not.
Interestingly, if I power down the VMs whose storage lives on giraffe,
the pings return to sub-1ms.
My working conclusion is that this is some symptom of the combination of
OI, vSphere 5.5 and network load, although I'm not sure where to turn
next.
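One thing I plan to try next is watching per-second link statistics and
driver error counters on giraffe while the other VMs are hammering the
NFS share, to see whether anything in the guest's network stack stands
out during a latency spike. A rough sketch of what I have in mind
(assuming the interface is e1000g0; adjust to whatever dladm reports):

# confirm the link name first
dladm show-link

# per-second packet/byte counters on the suspect interface
dladm show-link -s -i 1 e1000g0

# driver-level error and drop counters
kstat -m e1000g | egrep -i 'error|drop'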
Tried:
"zpool scrub rpool" - to induce high read load on the SSD in the vSphere
host. This may look like a strange thing to test, but I've seen odd
effects on Windows machines whose storage is struggling in the past.
Created a test pool on SSD and induced write load using "cat /dev/zero >
/testpool/zerofile".
"zpool scrub giraffepool" - to induce high read load on the spinning
drives. Still no effect from these three tests, further hinting that
it's network load which is a trigger.
Checked that ipfilter is disabled (output below), yet dmesg still shows
the message "IP Filter: v4.1.9, running." (see the sketch after this
list).
chris at giraffe:~# svcs -xv ipfilter
svc:/network/ipfilter:default (IP Filter)
State: disabled since October 20, 2013 12:17:02 PM UTC
Reason: Disabled by an administrator.
See: http://illumos.org/msg/SMF-8000-05
See: man -M /usr/share/man -s 5 ipfilter
Impact: This service is not running.
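For the ipfilter message, I might try confirming whether the ipf kernel
module is actually loaded and holding any rules, despite the SMF service
being disabled. A rough sketch (run as root; I'm assuming the stock
ipfilter userland tools are present):

# is the ipf kernel module loaded?
modinfo | grep -i 'ip filter'

# any ipfilter-related SMF instances in another state?
svcs -a | grep ipfilter

# if the module is active, list loaded rules and state table statistics
ipfstat -io
ipfstat -s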
Haven't tried yet:
Installing OI again in another VM to see whether the problem is
localised to giraffe, although I'd also have to induce load against it
to be confident whether the issue exists there or not.
I'm using the e1000 NIC in vSphere and don't have VMware Tools installed.
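Something else I could do during one of the ping spikes is capture ICMP
on giraffe itself, to see whether the echo replies actually leave the
guest late or are delayed somewhere outside it. A rough sketch with
snoop (again assuming the interface is e1000g0; the capture path is just
an example):

# capture only ICMP on the suspect interface to a file
snoop -d e1000g0 -o /tmp/icmp.snoop icmp

# afterwards, review the capture with absolute timestamps
snoop -i /tmp/icmp.snoop -t a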
Any troubleshooting advice to help me focus somewhere would be
appreciated.
Many thanks,
Chris