Part 1 - Disaster Recovery with SRM and vSphere Replication

Abstract 

The company I work for needs disaster recovery for its own sake, and there are also external demands for it. Being a fairly small company, we have so far not been able to implement any automated disaster recovery solution, although we do have a documented disaster recovery plan. Regardless of how good or bad that plan is, it hasn't been tested in a long time. Management wishes for an RTO (Recovery Time Objective) measured in days, but I can't see it being anything less than weeks.

Since we virtualized our production system I've been looking at VMware vCenter Site Recovery Manager (SRM) as a driver, but the cost of array-based replication has stopped every attempt dead in its tracks. VMware vSphere® Replication has been around for a while now, and we hope that it's hardened enough for a slightly bigger, production-critical implementation.

In this article I will try to document and explain how we experiment with, design and implement disaster recovery and, even more important in my opinion, disaster avoidance based on SRM and vSphere Replication.

The Setup

The intended target is our production systems. This setup consists of a number of data centers distributed around the globe. The intention is to fail all of our production systems over to one location, DC3 (PF, primary failover), and to fail DC3 itself over to a secondary failover site, DC4 (SF, secondary failover). Please note that if DC3 suffers a disaster, DC3 will fail over to DC4, and DC1, DC2 and DC4 will no longer have a failover site.


The setup obviously has latency of various degrees between the sites but that is not a major concern, at least not at this time. The major concern that I have is the bandwidth consumption of vSphere Replication. Some data centers suffer from absolutely ludicrous bandwidth costs, and my fear is that the amount of data replicated will be too high so I believe that the test system would have to include some WAN-acceleration and bandwidth throttling in the end.
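To get a first feeling for the bandwidth question, a back-of-the-envelope calculation can help. The sketch below is my own simplification, not anything from VMware's tooling: it assumes the daily amount of changed data is known (the figure used here is made up) and that it is shipped evenly over 24 hours. It ignores compression, protocol overhead, bursts, and the fact that a block rewritten several times within one RPO window only needs to be sent once, so real numbers will differ.

```python
SECONDS_PER_DAY = 86_400


def average_replication_mbps(changed_gb_per_day: float) -> float:
    """Average link rate in Mbit/s needed to ship one day's changes in one day.

    Uses decimal units (1 GB = 1e9 bytes, 1 Mbit = 1e6 bits).
    """
    bits_per_day = changed_gb_per_day * 1e9 * 8
    return bits_per_day / SECONDS_PER_DAY / 1e6


def monthly_transfer_tb(changed_gb_per_day: float, days: int = 30) -> float:
    """Total data shipped per month in TB, useful when bandwidth is billed by volume."""
    return changed_gb_per_day * days / 1000


if __name__ == "__main__":
    churn = 10.8  # hypothetical GB of changed data per day for one site
    print(f"average rate: {average_replication_mbps(churn):.2f} Mbit/s")
    print(f"monthly volume: {monthly_transfer_tb(churn):.3f} TB")
```

Even a crude estimate like this makes it easier to compare sites and to decide where throttling or WAN acceleration is worth testing first.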

The test setup consists of our staging environment, which I will set up to fail over to a small temporary system. Most likely the failover system won't be able to run the entire environment after a failover, but I hope it will be enough to start the minimal set of services needed to prove a successful failover. At the very least the exercise will answer questions about how to set up SRM, the bandwidth requirements for replication, and a ton of other things.

Installation

I will describe the installation of the vCenter, vSphere Replication and Site Recovery Manager appliances only very briefly, since they are described in detail by VMware in:

Installation phases

The basic requirements for installing SRM are pretty obvious. You need two sites, each with its own hardware, its own vCenter and its own SRM installation.
  1. Install vCenter in both locations. 
  2. vCenter inventory configuration
  3. Install and configure SRM database
  4. SRM installation at both locations
  5. SRM plug-in installation
  6. Pair the sites
Use fully qualified domain names during the installation to avoid confusion and to achieve a reliable setup.
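Before typing host names into the installers, it can be worth sanity-checking that every name you plan to use really is fully qualified. The following is a minimal sketch of such a check, purely syntactic and written by me for illustration; the host names are hypothetical, and it does not verify that the names actually resolve in DNS.

```python
import re

# One DNS label: 1-63 letters, digits or hyphens, not starting or ending with a hyphen.
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


def looks_like_fqdn(name: str) -> bool:
    """Return True if name is syntactically a fully qualified domain name."""
    if name.endswith("."):
        name = name[:-1]  # a single trailing dot is allowed
    if not name or len(name) > 253:
        return False
    labels = name.split(".")
    if len(labels) < 2:
        return False  # a short host name such as "vcenter" is not fully qualified
    if labels[-1].isdigit():
        return False  # rejects IPv4 addresses such as 10.0.0.5
    return all(_LABEL.match(label) for label in labels)


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for host in ["vcenter-dc3.example.com", "vcenter", "10.0.0.5"]:
        print(host, looks_like_fqdn(host))
```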

vSphere Replication 

vSphere Replication is bundled with vSphere at no additional cost, although in that form it is managed on a per-VM basis. With SRM you get orchestration and management of vSphere Replication at a higher level. vSphere Replication manages disk replication in a storage-agnostic way, i.e. you can replicate between different storage vendors, local disks or whatever you need.


The magic comes from the VR agent installed in each ESXi host. It records the actual changes, at block level, per VM, and those changes are replicated as the RPO limit approaches.

For instructions on the installation in an SRM scenario, please see the excellent explanation in:

Inventory Mapping

Inventory mapping is simply housekeeping: making sure that the oranges aren't mixed with the apples. If you end up with a fruit salad of VMs, folders, datastores and networks in a disaster situation, you are surely going to be even sadder than before. So to make your life simple, spend some time mapping the protected site so that it is kept separate from the things living at the recovery site.

Protection Groups

What is it? A group of machines that fail over together.

In an array-based replication scenario the protection groups are pretty much defined by the datastores and the VMs living together in a datastore, since you fail over entire datastores.

With vSphere Replication, the protection group takes on a slightly different meaning. vSphere Replication is not dictated by the storage setup; it is dictated by you. This means that you can freely create protection groups for VMs that should fail over together from a service perspective.
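One way to plan such service-oriented groups before touching SRM is to start from a simple list of which VM belongs to which service. The sketch below is only a planning aid of my own; the VM and service names are hypothetical, and it does not talk to SRM or vCenter in any way.

```python
from collections import defaultdict


def group_by_service(vm_services: dict[str, str]) -> dict[str, list[str]]:
    """Turn a VM -> service mapping into service -> sorted list of VMs.

    Each resulting entry is a candidate protection group.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for vm, service in vm_services.items():
        groups[service].append(vm)
    return {service: sorted(vms) for service, vms in sorted(groups.items())}


if __name__ == "__main__":
    # Hypothetical inventory, for illustration only.
    inventory = {
        "web01": "webshop",
        "web02": "webshop",
        "db01": "webshop",
        "mail01": "mail",
    }
    for service, vms in group_by_service(inventory).items():
        print(service, vms)
```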

Configuring vSphere Replication

Once you have completed the installation above you can start setting up vSphere Replication. This can be done either by selecting the VM to replicate and choosing vSphere Replication in the drop-down menu, or by selecting the vSphere Replication tab and clicking the Configure Replication link.

The replication wizard will pop up and ask a number of questions:
Configure Replication - page 1
On the first page you have to specify the RPO (Recovery Point Objective, i.e. how much data can be lost at worst) and the storage at the Target Location. The RPO can be anywhere from 24 hours down to 15 minutes. In our case even 15 minutes is not a tight enough RPO for our data, so that data has to be protected by other means, and I leave the default of 4 hours for now. As for the Target Location, make sure that you have enough space.
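The reasoning about which RPO to pick can be written down as a small rule. This is just my own sketch of the decision described above, using the 15 minute and 24 hour limits mentioned in the wizard; the function names are made up for illustration.

```python
MIN_RPO_MINUTES = 15        # tightest RPO the wizard accepts
MAX_RPO_MINUTES = 24 * 60   # loosest RPO the wizard accepts


def fits_vsphere_replication(required_rpo_minutes: int) -> bool:
    """True if the required RPO can be met by vSphere Replication alone."""
    return required_rpo_minutes >= MIN_RPO_MINUTES


def wizard_rpo_minutes(required_rpo_minutes: int) -> int:
    """RPO to enter in the wizard, clamped to the range it accepts."""
    return max(MIN_RPO_MINUTES, min(required_rpo_minutes, MAX_RPO_MINUTES))


if __name__ == "__main__":
    for need in (5, 15, 240, 3000):
        print(need, fits_vsphere_replication(need), wizard_rpo_minutes(need))
```

A service that needs less than 15 minutes fails the first check, which is the signal that its data must be replicated by some other mechanism.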
Configure Replication - page 2
On the second page you specify whether each disk should be replicated, as well as its location and disk type. As I'm specifying this I realize that the test machine only has one disk... This means that a lot of data that I absolutely do not care about in a disaster situation, such as log files and swap, will be replicated, consuming effort and bandwidth for nothing. In the production systems we have separate disks for OS, application, logs, swap and data, so I can choose to replicate only the disks I'm interested in, as long as the separation is done in a clever way. The bad partitioning of the test system would also mean that services with a lot of high-churn data, i.e. those for which an RPO of 15 minutes is insufficient, would replicate all their changes even though that data has to be propagated by other means.
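To illustrate why the disk layout matters, here is a small sketch comparing the churn that gets replicated with a single disk against a layout where logs and swap are excluded. The churn figures are invented for the example, and the Disk class is just a helper of mine, not part of any VMware API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Disk:
    purpose: str
    churn_gb_per_day: float  # hypothetical amount of changed data per day
    replicate: bool          # whether the disk is selected in the wizard


def replicated_churn(disks: list[Disk]) -> float:
    """Sum the daily churn of the disks that are selected for replication."""
    return sum(d.churn_gb_per_day for d in disks if d.replicate)


if __name__ == "__main__":
    # Production-style layout: logs and swap are left out of replication.
    split = [
        Disk("os", 1.0, True),
        Disk("application", 0.5, True),
        Disk("logs", 4.0, False),
        Disk("swap", 8.0, False),
        Disk("data", 2.0, True),
    ]
    # Test-machine layout: everything lives on one disk, so everything is sent.
    single = [Disk("everything", sum(d.churn_gb_per_day for d in split), True)]

    print("split layout:", replicated_churn(split), "GB/day")
    print("single disk:", replicated_churn(single), "GB/day")
```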


Configure Replication - page 3
In my case we go with the Auto-assign default.

Configure Replication - page 4
Final page - review the specified options and go for it.

Recovery Plan

The recovery plan, or rather the plans, let you specify how to recover from different disasters. When you look at this, it's not the cataclysmic disaster that is the common one; rather it is the smaller ones, such as a set of firewalls burning up so that you can't handle the traffic, or a power outage in part of the data center. When you analyze what can go wrong and how it affects your business, it's likely that you will end up with a number of smaller scenarios.

Furthermore you can, and most likely should, specify the start order of the VMs that have been recovered. Basically it makes sense to build the foundation before you raise the walls and put up the roof.
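When working out the start order on paper, it helps to write down which VM depends on which and let a topological sort produce the waves. The sketch below uses Python's standard graphlib module and is only a planning aid; the VM names are hypothetical, and this is not how SRM itself is configured.

```python
from graphlib import TopologicalSorter


def start_waves(depends_on: dict[str, list[str]]) -> list[list[str]]:
    """Group VMs into start-up waves.

    depends_on maps each VM to the VMs that must be running before it starts.
    Every VM in a wave can start once all earlier waves are up.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable, readable result
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    # Hypothetical dependencies: foundation first, then walls, then roof.
    plan = {
        "dns": [],
        "db": [],
        "app": ["db", "dns"],
        "web": ["app"],
    }
    for number, wave in enumerate(start_waves(plan), start=1):
        print(f"wave {number}: {wave}")
```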

The recovery plan also allows you to shut down VMs running at the recovery site to free up resources in a failover scenario.

The most notable thing about recovery plans is that the plan should be created at the recovery site to protect against cataclysmic disasters at the protected site.

References

VMware vCenter Site Recovery Manager: Concepts/Architecture

