
Showing posts from January, 2013

Part 2 - Disaster Recovery with SRM and vSphere Replication

In the previous article we went through the installation and configuration of the SRM and vSphere infrastructure. The time has now come to actually do some tests and fail over some VMs. In the simple scenario I expect everything to go smoothly, but there are a few things I'm concerned about at this point, since the protected environment I'm ultimately failing over isn't so simple: it has multiple dvSwitches, VLANs, a load balancer, a firewall, LDAP, a set of test drivers to verify the integrity of the system, and sensible administrator access to the system, so a few things need to be resolved.

Protected Setup

Test Failover

The small test

In this scenario I do a test failover of a single machine and verify that it starts and that I can log into it (a scripted version of that check is sketched below). The steps are:

- Set up the machine running RHEL 5.5.
- Install VMware Tools.
- Kick off the vSphere Replication.
- Create a Protection Group and ... no that didn...
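Since the small test boils down to "the VM starts and I can log into it", that check is easy to script instead of eyeballing a console. Below is a minimal sketch in Python; the host name, port and timeout are made-up values for illustration, not anything SRM provides.

#!/usr/bin/env python
# Smoke test for a failed-over VM: confirm it answers on SSH.
# Host, port and timeout are hypothetical; adjust for your test bubble.
import socket
import sys

HOST = "vm01.testbubble.local"   # assumed address of the recovered VM
PORT = 22                        # SSH, or any service the VM must expose
TIMEOUT = 10                     # seconds before declaring failure

try:
    sock = socket.create_connection((HOST, PORT), TIMEOUT)
    banner = sock.recv(64)       # sshd announces its version string first
    sock.close()
    print("OK: %s:%d answered with %r" % (HOST, PORT, banner.strip()))
except socket.error as exc:
    print("FAIL: %s:%d unreachable (%s)" % (HOST, PORT, exc))
    sys.exit(1)

Run it from a machine attached to the test bubble network once the test failover completes; getting a banner back from sshd is a decent proxy for "the VM booted and came up far enough to log into".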

Part 3 - Disaster Recovery with SRM and vSphere Replication

In the previous articles I have described the setup of SRM and vSphere Replication as well as configuring a small and simple recovery plan (one single machine), and bursting the test bubble when doing a test failover through an additional machine. In this part the focus will be on the more complex situation where you have a system spanning multiple VLANs that you want to fail over. In the protected site, load balancing and routing are done by the load balancer, and multicast by the firewall. Unfortunately these are still hardware appliances in my case, and I don't feel like running to the datacenter and moving them each time someone does a test failover :-) So we need to figure out a way to provide that functionality in the test bubble. Who knows, in the end the network guys might actually be willing to discuss virtual load balancers and firewalls.

Protected Setup

To keep it simple, let us only consider two VLANs in the Protected Setup: Vlan01 and Vlan50. VM01 communic...
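While waiting for that discussion, a throwaway software stand-in can provide just enough of the balancer's job to make test failovers self-contained. Below is a minimal sketch of a round-robin TCP forwarder in Python, meant to run on a small Linux VM with a leg in both test-bubble VLANs; the listen address and backend addresses are assumptions for illustration, and it does no health checks and no UDP or multicast handling.

#!/usr/bin/env python
# Stand-in for the hardware load balancer inside the test bubble:
# round-robin TCP forwarding only. All addresses are hypothetical.
import socket
import threading

LISTEN = ("0.0.0.0", 80)                              # clients on Vlan01 connect here
BACKENDS = [("10.0.50.11", 80), ("10.0.50.12", 80)]   # assumed servers on Vlan50

def pump(src, dst):
    # Copy bytes one way until either side goes away.
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    except socket.error:
        pass
    finally:
        dst.close()

def serve():
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(LISTEN)
    listener.listen(5)
    turn = 0
    while True:
        client, _ = listener.accept()
        # Pick the next backend round-robin and splice the two sockets.
        backend = socket.create_connection(BACKENDS[turn % len(BACKENDS)])
        turn += 1
        threading.Thread(target=pump, args=(client, backend)).start()
        threading.Thread(target=pump, args=(backend, client)).start()

if __name__ == "__main__":
    serve()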

Part 1 - Disaster Recovery with SRM and vSphere Replication

Abstract

The company I work for has a need for disaster recovery, and there are external demands for it as well. Not being very big, the company has so far not been able to implement any automated disaster recovery solution, although we do have a documented disaster recovery plan. The plan, regardless of how good or bad it is, hasn't been tested in a long time, and while management wishes the RTO (Recovery Time Objective) to be days, I can't see it being anything other than weeks. Since we virtualized our production system I've been looking at VMware vCenter Site Recovery Manager as a driver, but the cost of array-based replication has stopped any attempts dead in their tracks. VMware vSphere® Replication has been around for a while now, and we hope that it's hardened enough for a slightly bigger, production-critical implementation. In this article I will try to document and explain how we experiment with, design and implement disaster recovery and, even more important, in my opinion...

Intermittent and excessive loss of UDP traffic

This is a call for help to get an understanding of what happened. We are still flabbergasted and confused. Let's start from the beginning. We have a system consisting of a number of servers that primarily communicate with each other through TCP. On a very few occasions we use UDP. The system is running in a virtual environment (VMware). From one of the services, let's call it US (Udp Sender), we send UDP to another server, UR (Udp Receiver). There are one or more US instances, and the UR is behind a load balancer even though it is currently a single server. After upgrading VMware from 5.0 to 5.1, the UDP packets on the UR end are lost excessively and intermittently. We are using RHEL 5.5, a quite old version, as the guest OS. Checking with tcpdump on the US side we see that all servers send packets as expected. Through divide and conquer we rule out switches, load balancers, firewall and physical NICs until we only have the virtual NIC left. So far so good - now for the stra...
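To take the application out of the equation entirely, a sequence-numbered probe makes the loss measurable at every hop. Below is a minimal sketch in Python; the port, packet count and send rate are arbitrary choices for illustration, not values from our system.

#!/usr/bin/env python
# Sequence-numbered UDP probe: run "receive" on UR, then "send <host>"
# on US, and compare what was sent with what arrived.
import socket
import sys
import time

PORT = 9999        # arbitrary test port
COUNT = 10000      # datagrams per run

def send(host):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for seq in range(COUNT):
        sock.sendto(str(seq).encode(), (host, PORT))
        time.sleep(0.001)            # ~1000 pps; raise the rate to provoke bursts

def receive():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", PORT))
    sock.settimeout(5)               # stop after 5 quiet seconds
    seen = set()
    try:
        while True:
            data, _ = sock.recvfrom(64)
            seen.add(int(data))
    except socket.timeout:
        pass
    print("received %d/%d, lost %d" % (len(seen), COUNT, COUNT - len(seen)))

if __name__ == "__main__":
    if sys.argv[1] == "send":
        send(sys.argv[2])
    else:
        receive()

Running the receiver directly on UR, then on hosts in front of and behind each middlebox, pins down exactly where the datagrams disappear and at what send rates the loss kicks in.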

I absolutely HATE *nix memory metrics

Today was yet another day at the office and large-scale PANIC erupted. Managers were running around as if the end of the world was just around the corner. All the fuss came from the interpretation of free on a server in our production system: the server had only 85 MB of free memory left and the end was near! When taking a closer look it was obvious that the PANIC was unwarranted and that the server had 6.3 GB "free". Still, every time it has been a few weeks since I last looked at the metrics, I have to go through the following mental process to sort it out: "CRAP!!! The end is near ... wait! ... calm down ... it's Linux ... it's the second line that counts. ... so ... this .. is .. good?! .... sigh of relief". Thus I think it's worth repeating the memory metrics, what they mean and how to interpret them, in yet another blog entry.

free -m
             total       used       free     shared    buffers     cached
Mem:          5963       5581       3...
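The rule of thumb: buffers and cache are reclaimable, so the number that matters is free + buffers + cached, which is exactly what the second ("-/+ buffers/cache") line of free reports. Below is a minimal sketch that computes it straight from /proc/meminfo; the field names are the classic ones from RHEL 5-era kernels, which predate MemAvailable.

#!/usr/bin/env python
# Report memory the way the second line of `free` does: treat buffers
# and page cache as available rather than "used".
meminfo = {}
for line in open("/proc/meminfo"):
    key, value = line.split(":", 1)
    meminfo[key] = int(value.split()[0])     # values are in kB

total = meminfo["MemTotal"]
free = meminfo["MemFree"] + meminfo["Buffers"] + meminfo["Cached"]
used = total - free

print("total %5d MB   used %5d MB   actually free %5d MB"
      % (total // 1024, used // 1024, free // 1024))

If that "actually free" number is healthy, the scary first line of free is just the kernel putting otherwise idle RAM to work as cache, and nobody needs to run down any corridors.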