Part 2 - Disaster Recovery with SRM and vSphere Replication


In the previous article we went through the installation and configuration of the SRM and vSphere infrastructure. The time has now come to actually run some tests and fail over some VMs.

In the simple scenario I expect everything to go smoothly, but a few things concern me at this point, since the protected environment I will ultimately fail over isn't so simple. It has multiple dvSwitches, VLANs, a load balancer, a firewall, LDAP and a set of test drivers to verify the integrity of the system, and administrators need sensible access to it, so there are a few things that need to be resolved.

Protected Setup


Test Failover

The small test

In this scenario I do a test failover of a single machine and verify that it starts and that I can log into it.

The steps are:

  1. Set up the machine running RHEL 5.5.
  2. Install VMware Tools
  3. Kick off the vSphere Replication
  4. Create a Protection Group and ... no, that didn't work.
    Fortunately this is a simple one to fix: I had forgotten to unmount the VMware ISO from the DVD/CD drive. I removed it and then the Protection Group configuration accepted the machine.
  5. Create the Recovery Plan
    As for the Recovery Plan, it's just a matter of clicking through the wizard.
  6. Do "Test"
    The machine starts nicely and, as expected, it can't reach anything outside its test bubble. Nor anything in it. This is obviously a problem if you actually want to do some serious testing of the services that you failed over.
  7. Do "Cleanup" to remove the test failover.
The big question now is how to get access to the servers from an external place?

Bursting the Test Bubble

The first step is to create a new machine at the recovery site and install VMware Tools. In my case I created a CentOS 6.3 machine (bubblebridge). The intention is to let this new machine act as a bridge between the real world and the test bubble.

Now the machine is created and I can log in through the console. Next we need to do some configuration of the machine to make it useful.
  1. As root
    cp /etc/sysconfig/network-scripts/ifcfg-lo /etc/sysconfig/network-scripts/ifcfg-eth0
    vi /etc/sysconfig/network-scripts/ifcfg-eth0
  2. Make it look like this, substituting your own values for IPADDR, GATEWAY and NETMASK:
    DEVICE=eth0
    BOOTPROTO=static
    NM_CONTROLLED=yes
    ONBOOT=yes
    TYPE=Ethernet
    IPADDR=
    GATEWAY=
    NETMASK=255.255.255.X
    DNS1=
    DNS2=8.8.8.8 # Google DNS
  3. Change the network config with vi /etc/sysconfig/network and make it look like:
    NETWORKING=yes
    HOSTNAME=bubblebridge
    GATEWAY=
    
  4. At this point it should be possible to ping google.com
  5. CentOS 6+ comes with NetworkManager, which I don't agree with, so I chose to stop it
    As root stop NetworkManager and remove it from startup
    service NetworkManager stop
    chkconfig NetworkManager off
    chkconfig network on
  6. I'd like to ssh into the machine, so let's get that up and running too.
    Install the packages needed (the sshd daemon itself comes with openssh-server)
      yum install openssh-server openssh-clients perl
    
    Turn on ssh
      chkconfig sshd on
      service sshd start
  7. Now you should be able to ssh bubblebridge@
Ok - so far so good. 
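
Since the template above deliberately leaves IPADDR, GATEWAY and DNS1 blank, it is easy to forget one of them. The following is a minimal sketch of a check for that, not something from the original setup: check_ifcfg is a hypothetical helper name, and it only assumes the plain KEY=value layout shown above.

```shell
# Hypothetical helper (not part of CentOS): list the keys that are missing
# or empty in an ifcfg-style file. Usage: check_ifcfg FILE KEY...
check_ifcfg() {
    cfg_file=$1
    shift
    cfg_missing=""
    for cfg_key in "$@"; do
        # Take the last KEY= assignment, then strip trailing comments and blanks
        cfg_value=$(sed -n "s/^[[:space:]]*${cfg_key}=//p" "$cfg_file" \
            | tail -n 1 \
            | sed 's/[[:space:]]*#.*$//; s/[[:space:]]*$//')
        if [ -z "$cfg_value" ]; then
            cfg_missing="$cfg_missing $cfg_key"
        fi
    done
    if [ -n "$cfg_missing" ]; then
        echo "unset:$cfg_missing"
        return 1
    fi
    echo "ok"
}
```

Running check_ifcfg /etc/sysconfig/network-scripts/ifcfg-eth0 IPADDR GATEWAY NETMASK before restarting the network would then name any key that is still blank. It does not validate the values themselves.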

Now let's do a new test failover using the Simply Single Recovery Plan (that's the recovery plan created with only one machine in it). When the machine has failed over, the only way to talk to it is through the console. Looking at the virtual machine's summary, you see that it has a network attached to it.

To get access to the machine I add a NIC to the bubblebridge. In the picture above I use the virtual switch created by the failover. Note that this will cause issues further down the line: when cleaning up the test bubble you will get errors such as "Remove virtual switch... The resource srmvs-recovery... is in use". This happens because the bubblebridge machine isn't part of the test setup, and if it is attached to the test network (which can only be done once a test failover has been run, since the network won't be there otherwise) it will block the removal of the resources.

To mitigate that, I check in the protected site which dvSwitch and VLAN the machine is connected to and create the same setup at the recovery site (I use the suffix -R to denote the recovery network). Don't forget that you have to go to SRM and do the Network Mappings in order for it to take effect.

Edit settings and press the "Add..." button
Select Ethernet Adapter

In the Network Type dialog, under Network Connection, select the network label that the failed-over test machine has, and then click through to finish. After completion the bubblebridge machine should have two network cards associated.

Now it's time to ssh into bubblebridge again. When you reach the machine you can run ifconfig and discover that you now have two NICs, eth0 and eth1, with the same address. That was not what was intended, so let's fix that. Switch to root and do the following:
  1. As root
    cp /etc/sysconfig/network-scripts/ifcfg-eth0 /etc/sysconfig/network-scripts/ifcfg-eth1
    cp /etc/sysconfig/network-scripts/ifcfg-eth1 /etc/sysconfig/network-scripts/ifcfg-eth1.664
    vi /etc/sysconfig/network-scripts/ifcfg-eth1
    
  2. And make ifcfg-eth1 look like
    DEVICE=eth1
    BOOTPROTO=static
    NM_CONTROLLED=no
    ONBOOT=yes
    TYPE=Ethernet
    IPADDR=
    NETMASK=255.255.255.0
  3. Now you should be able to ping the test server and access it if it's configured to accept ssh. In my case it seems that the server doesn't have sshd running, so let's clean up, fix the issue and do a new test failover, and everything should work perfectly.
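
Because ifcfg-eth1 starts out as a copy of ifcfg-eth0, it is worth confirming that the two files no longer carry the same address before bringing the interface up. Here is a small sketch of such a check. The function names ifcfg_ip and has_duplicate_ip are made up for illustration, and the check only compares the IPADDR lines in the files, not what the kernel has actually configured.

```shell
# Hypothetical helper: print the IPADDR value from an ifcfg-style file
# (prints an empty line if the key is unset).
ifcfg_ip() {
    sed -n 's/^[[:space:]]*IPADDR=//p' "$1" \
        | tail -n 1 \
        | sed 's/[[:space:]]*#.*$//; s/[[:space:]]*$//'
}

# Hypothetical helper: return 0 (and say which file) if any two of the
# given files share the same non-empty IPADDR, otherwise return 1.
has_duplicate_ip() {
    dup_seen=""
    for dup_file in "$@"; do
        dup_ip=$(ifcfg_ip "$dup_file")
        if [ -z "$dup_ip" ]; then
            continue
        fi
        case " $dup_seen " in
            *" $dup_ip "*)
                echo "duplicate IPADDR $dup_ip in $dup_file"
                return 0
                ;;
        esac
        dup_seen="$dup_seen $dup_ip"
    done
    return 1
}
```

For the setup above the call would be has_duplicate_ip /etc/sysconfig/network-scripts/ifcfg-eth0 /etc/sysconfig/network-scripts/ifcfg-eth1. Any output means eth1 still needs its own address.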

Current State of the Setup

Recover Test Setup
To summarize where we are: we have created dvSwitch-R and Vlan1-R to mimic the network setup of the Protected Site, and mapped the resources in SRM so that the appropriate resources are also used in the test failover. We have created and configured the bubblebridge VM so that it has one NIC outside the test bubble and one NIC tied to Vlan1-R. Now when we do a test failover, the "TEST BUBBLE" gets created by SRM inside the Recovery Site. We can now use ssh from outside the test bubble to connect to the bubblebridge server and from there jump to the VM in the test bubble.
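
The two-hop login can be wrapped in a small function so that it is a single command from the outside. This is only a sketch: bubble_ssh and the DRY_RUN switch are names invented here, and the user and address arguments are placeholders for whatever your bridge and test VM actually use.

```shell
# Hypothetical wrapper: log in to a VM inside the test bubble by first
# connecting to the bridge and running a second ssh from there.
# Usage: bubble_ssh USER@BRIDGE USER@TARGET
# Set DRY_RUN to any non-empty value to print the command instead of running it.
bubble_ssh() {
    hop_bridge=$1   # address of bubblebridge on the outside network
    hop_target=$2   # address of the failed-over VM inside the test bubble
    if [ -n "${DRY_RUN:-}" ]; then
        echo "ssh -t $hop_bridge ssh $hop_target"
    else
        # -t forces a terminal on the first hop so the inner ssh is interactive
        ssh -t "$hop_bridge" ssh "$hop_target"
    fi
}
```

The inner ssh runs on bubblebridge itself, which is why the openssh-clients package was installed there.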

Rudimentary access to the test bubble is established.

