Intermittent and excessive loss of UDP traffic

This is a call for help to get an understanding of what happened. We are still flabbergasted and confused.

Let's start from the beginning. We have a system consisting of a number of servers that communicate with each other primarily over TCP. On very few occasions we use UDP. The system runs in a virtual environment (VMware). From one of the services, let's call it US (UDP Sender), we send UDP to another server, UR (UDP Receiver). There are one or more US instances, and the UR sits behind a load balancer even though it is currently a single server.

After upgrading VMware from 5.0 to 5.1, UDP packets are lost excessively and intermittently on the UR end. We are using RHEL 5.5, a quite old release, as the guest OS. Checking with tcpdump on the US side, we see that all servers send packets as expected. Through divide and conquer we ruled out switches, load balancers, firewalls, and physical NICs until only the virtual NIC was left.

So far so good - now for the strange part. In an attempt to replicate the error in a simpler and easier-to-test fashion, I wrote a small Python script that sends the exact same packets. Through tcpdump we verified that the packets are identical (and the UR is also fooled by them). The strange thing is that the packets sent by the Java app at the US end are lost as before, while the same packets sent from US by the Python script aren't - not even once. Both the Java and the Python packets were captured with the following command:

tcpdump -c 2 -nn -Ss0 -i eth0 udp and UR-loadbalancer-IP
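For readers who want to try something similar, here is a minimal sketch of what such a replay script could look like. It is not the original script: the function name send_udp and its parameters (payload, host, port, count) are made up for illustration, and the payload would have to be the exact bytes the Java app sends.

```python
import socket


def send_udp(payload: bytes, host: str, port: int, count: int = 1) -> int:
    """Send `payload` as `count` separate UDP datagrams; return how many were sent.

    Hypothetical helper: host and port stand in for the UR load balancer
    address, and payload for the bytes captured from the Java sender.
    """
    sent = 0
    # SOCK_DGRAM gives a plain UDP socket; each sendto() is one datagram.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for _ in range(count):
            sock.sendto(payload, (host, port))
            sent += 1
    return sent
```

Calling send_udp(captured_bytes, "UR-loadbalancer-IP", some_port, count=1000) while tcpdump runs on both ends would then show whether any of the replayed datagrams go missing. The return value only counts what was handed to the kernel, not what arrived.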

The fix was to switch the guest NIC from the e1000 driver to the VMXNET 3 driver. We are still waiting for both VMware and Red Hat to clarify the issue.

Can someone explain why packets sent by Java are sometimes lost while identical packets sent by Python aren't? The obvious explanation would be that the packets aren't identical, but as far as I can tell they are.
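One way to double-check the "identical" claim is to compare the raw bytes of a Java packet and a Python packet, headers included. tcpdump without -v or -x prints only a summary line, so fields such as the IP ID, TTL, flags, or checksums could differ without showing up. This is a possibility to rule out, not a diagnosis. The sketch below assumes the two packets have already been dumped to byte strings (for example from tcpdump -xx output); the name diff_offsets is made up for illustration.

```python
from typing import List


def diff_offsets(a: bytes, b: bytes) -> List[int]:
    """Return the byte offsets at which two captured packets differ."""
    # Compare the overlapping part byte by byte.
    offsets = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    # Any bytes beyond the shorter packet also count as differences.
    offsets.extend(range(min(len(a), len(b)), max(len(a), len(b))))
    return offsets
```

An empty list means the two captures really are byte-for-byte identical. Otherwise the offsets show which header or payload bytes to look at.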
