High Availability, Disaster Recovery, and Windows Azure

03/03/2014

High Availability (HA) and Disaster Recovery (DR) have long been essential IT topics. Fundamentally, HA is about fault tolerance in the hardware and software of a given application, while DR rests on the ability to resume operations in the aftermath of a catastrophic event. For many IT shops, both HA and DR have been high-risk, high-cost items. Either one requires solving tough technical problems with very significant, long-term commitments of resources. Not only are they technically challenging, but the continual cost-cutting that has become standard IT practice over the past two decades often pushes HA and DR further from IT’s financial reality.

Too often, the technical challenges and resource commitments overwhelm IT and turn HA and DR into academic discussions or symbolic items on a project checklist, and many businesses have learned to survive without a viable HA or DR solution. At the same time, information is growing explosively: progressively more data is processed, in transit, and stored because of mobility and the increasing adoption of social media. For many businesses, the need for HA and DR has become real and urgent for managing risk, and a technically sound and financially affordable HA and DR solution is becoming increasingly critical.

The good news is that cloud computing has fundamentally changed how an HA or DR solution can be implemented. Windows Azure is a vivid example, offering capabilities that significantly simplify the process and reduce the complexity of implementing HA and DR for IT pros. The traditional approach of establishing redundancy and acquiring a physical DR site with long-term resource and financial commitments is now largely replaced by consumable services with a usage-based cost structure. HA and DR are now much more realistic and within reach for businesses of all sizes.

HA, Redundancy, and Windows Azure LRS

HA aims to eliminate a single point of failure in a given component. It denotes a strategy of introducing redundancy so that a target application instance can and will continue without downtime while experiencing a hardware or software failure. There are various well-developed HA solutions, such as a Hyper-V host cluster for eliminating a hardware single point of failure and a guest VM cluster for eliminating a VM instance as a single point of failure. Although HA implementations vary, the fundamental principle remains the same: HA establishes fault tolerance by employing redundancy.

HA becomes dramatically simpler in Windows Azure. All data written to disk in Windows Azure is kept in Locally Redundant Storage (LRS). LRS replicates each transaction synchronously to three different storage nodes across fault domains and upgrade domains within the same region for durability. In layman’s terms, Windows Azure by default maintains three copies of user data to ensure HA.

In other words, if you deploy a database server to Windows Azure, the data it stores is highly available by default upon deployment. For a Microsoft SQL Server deployment, multiple server instances can also form a high availability SQL Server system with AlwaysOn. In this case, a Windows Azure subscription is all that is needed to configure, deploy, test, and roll out a high availability system.
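The synchronous three-way replication described above can be illustrated with a small toy model. This is only a sketch of the idea, not how Windows Azure implements LRS: the class names `StorageNode` and `LocallyRedundantStore`, the `fault_domain` label, and the `healthy` flag are all hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class StorageNode:
    """Hypothetical storage node; 'fault_domain' is an illustrative label."""
    name: str
    fault_domain: int
    healthy: bool = True
    data: dict = field(default_factory=dict)

    def write(self, key: str, value: bytes) -> bool:
        # An unhealthy node cannot store anything.
        if not self.healthy:
            return False
        self.data[key] = value
        return True


class LocallyRedundantStore:
    """Toy model: a write succeeds only when every replica has stored it."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def put(self, key: str, value: bytes) -> bool:
        # Synchronous replication: attempt the write on every node and
        # acknowledge only if all of them confirm.
        return all([node.write(key, value) for node in self.nodes])

    def get(self, key: str) -> bytes:
        # Any healthy replica that holds the key can serve the read.
        for node in self.nodes:
            if node.healthy and key in node.data:
                return node.data[key]
        raise KeyError(key)


if __name__ == "__main__":
    nodes = [StorageNode(f"node{i}", fault_domain=i) for i in range(3)]
    store = LocallyRedundantStore(nodes)
    store.put("order-1", b"payload")
    nodes[0].healthy = False  # simulate a single node failure
    print(store.get("order-1"))
```

In this model a read still succeeds after one or even two nodes fail, which is the sense in which three synchronous copies remove a single point of failure for stored data.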

DR, Replication, and Windows Azure GRS

DR is about having a plan and backups in place to resume operations in the aftermath of a catastrophic event. An outage is assumed in a DR scenario, so real-time integrity is not expected. This differs from HA, which aims to maintain real-time integrity and eliminate outages. They are different business problems and are addressed differently. DR uses replication or backups; although either is a form of redundancy, real-time integrity between the source and a replica or backup is not expected.

For a critical workload, one approach to DR is to establish geo-replication to address an outage across a geographic area caused by, for example, a natural disaster. The idea is that a catastrophic event may impact an entire geographic area, taking down the datacenter where a mission-critical application is hosted.

In Windows Azure, geo-replication of data is optional but enabled by default: Geo Redundant Storage (GRS) is the default setting when configuring a storage account. Once selected, GRS queues each transaction committed to LRS for asynchronous replication to a secondary region at least 400 miles away from the primary region, i.e. the source. At the secondary site, data is also stored in LRS, i.e. made durable by replicating it to three storage nodes.

Specifically, a Windows Azure storage account is configured with GRS by default, which in effect maintains three copies of the same content locally for high availability, replicates the content to a secondary datacenter at least 400 miles away, and maintains three copies there for DR. In all, there are six copies: three local and three remote.
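The flow described above, a synchronous local commit followed by queued replication to the secondary region, can be sketched as a toy model. The names `Region` and `GeoRedundantAccount` and their methods are hypothetical and only illustrate the idea; they are not part of any Windows Azure API.

```python
from collections import deque

REPLICAS_PER_REGION = 3  # three copies per region, as described above


class Region:
    """Hypothetical region holding three replica dictionaries."""

    def __init__(self, name):
        self.name = name
        self.replicas = [dict() for _ in range(REPLICAS_PER_REGION)]

    def commit(self, key, value):
        # Local commit: every replica in the region stores the value.
        for replica in self.replicas:
            replica[key] = value

    def copies_of(self, key):
        return sum(1 for replica in self.replicas if key in replica)


class GeoRedundantAccount:
    """Toy model of a primary region with a queued secondary copy."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary
        self.pending = deque()  # transactions awaiting geo-replication

    def put(self, key, value):
        # The caller waits only for the primary commit; the secondary
        # copy is queued and shipped later.
        self.primary.commit(key, value)
        self.pending.append((key, value))

    def replicate_pending(self):
        # Drain the queue in commit order; returns how many were shipped.
        shipped = 0
        while self.pending:
            key, value = self.pending.popleft()
            self.secondary.commit(key, value)
            shipped += 1
        return shipped

    def total_copies(self, key):
        return self.primary.copies_of(key) + self.secondary.copies_of(key)


if __name__ == "__main__":
    account = GeoRedundantAccount(Region("primary"), Region("secondary"))
    account.put("invoice-7", b"data")
    print(account.total_copies("invoice-7"))  # copies before the queue drains
    account.replicate_pending()
    print(account.total_copies("invoice-7"))  # copies after the queue drains
```

Until the queue drains, only the three local copies exist, which is why real-time integrity between the primary and the secondary is not expected in a DR scenario.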

A GRS setting has little performance impact on an application because application data is committed to LRS in real time while replication to the secondary region is queued. The cost implication includes the storage at the secondary datacenter and, as applicable, the transmission cost for egress traffic. Ingress traffic is free for Windows Azure. The Windows Azure Storage SLA offers 99.9% availability, and a cost calculator is also available.
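To put the 99.9% figure in perspective, the following sketch converts an availability percentage into a downtime budget. It assumes the percentage can be read as a simple time-based uptime target, which is a simplification; the actual SLA text defines how availability is measured. The function name `allowed_downtime_minutes` is hypothetical.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: float) -> float:
    """Minutes of downtime permitted by an availability target over a period."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * period_days * 24 * 60


if __name__ == "__main__":
    per_month = allowed_downtime_minutes(99.9, 30)
    per_year = allowed_downtime_minutes(99.9, 365)
    print(f"30-day month: {per_month:.1f} minutes")
    print(f"365-day year: {per_year / 60:.2f} hours")
```

Under that assumption, 99.9% availability corresponds to roughly 43.2 minutes of downtime in a 30-day month, or about 8.76 hours in a 365-day year.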

Windows Azure Recovery Services

So far, the discussion has been about backing up or replicating data. To restore successfully, a DR plan must be in place, and its availability and correctness must be ensured. Windows Azure is an ideal choice in this case.

Hyper-V Recovery Manager (HVRM)

This component essentially acts as the director of a DR process. It orchestrates and manages the protection and failover of Hyper-V VMs deployed to clouds based on System Center 2012 SP1 (or later) Virtual Machine Manager. Once HVRM is configured, VMs are replicated with Hyper-V Replica. A notable advantage of HVRM is the ability to test a recovery configuration, exercise a proactive failover and recovery, and automate recovery in the event of a site outage. The SLA offers 99.9% availability for HVRM to ensure that, when needed, a configured DR plan is in place with the expected updates.

Backup to Windows Azure

Backing up servers and content to the cloud has become a simple routine with the Windows Azure Backup service. As with other secure communication with Windows Azure, you first upload a public certificate to Windows Azure. Next, download the backup agent to register the target server with the backup vault. Finally, select what is to be backed up. Both the Windows Azure Backup SLA and a cost calculator are available to help assess the solution.

Hyper-V Extended Replication

Although this feature is not directly associated with Windows Azure, it is relevant to DR and included for completeness.

Hyper-V Replica was introduced in Windows Server 2012 as a DR solution for small and medium businesses. Extended Replication is a feature added in Windows Server 2012 R2 to better fit business needs. Essentially, it configures a backup’s backup. The consideration is that some prefer to keep both the source and a backup (or replica) under direct control and perhaps located in relatively close proximity. There may then be a need for a third copy stored in a geographically remote location in case the first two copies both become unavailable in a DR scenario.
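The "backup’s backup" arrangement can be pictured as a three-link chain. The sketch below is purely illustrative: the site labels and the `failover_target` helper are hypothetical and do not correspond to any Hyper-V cmdlet or API.

```python
from typing import Iterable, Optional, Sequence

# Hypothetical site labels, in failover order: the source, its nearby
# replica, and the geographically remote extended replica.
REPLICATION_CHAIN = ("primary", "replica", "extended-replica")


def failover_target(chain: Sequence[str], unavailable: Iterable[str]) -> Optional[str]:
    """Return the first site in the chain that is still available, if any."""
    down = set(unavailable)
    for site in chain:
        if site not in down:
            return site
    return None


if __name__ == "__main__":
    # A regional event takes out both nearby copies at once.
    print(failover_target(REPLICATION_CHAIN, ["primary", "replica"]))
```

When the primary and its nearby replica are both lost, the remote extended replica is the only remaining candidate, which is the scenario the third copy is meant to cover.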

To set up Extended Replication, first establish replication between a primary server and a replica server. Once that is complete, go to the replica server and, in Hyper-V Manager, select the replicated VM for which you want to extend replication. The following is a sample of setting up VM01 for Extended Replication.

Windows Azure Trial Subscription

For those who would like to acquire a free 30-day trial subscription and assess Windows Azure for HA and DR solutions, go to https://aka.ms/R2 and click the dropdown list to select Windows Server 2012 R2 Datacenter on Windows Azure. This will kick off a registration process.

Closing Thoughts

From an application’s point of view, HA is an ongoing event while DR is an anticipation. HA and DR are different business problems and should be addressed differently. Nevertheless, Windows Azure provides a single platform to address HA with LRS, DR with GRS, and DR orchestration with HVRM, all with published SLAs and a predictable cost structure. Going forward, IT pros can employ HA and DR as viable solutions with Windows Azure as a solution platform.