Disaster Recovery

While the Azure platform provides high availability within a single data center, as discussed in the previous post, it currently does not explicitly support or enable disaster recovery or geographically distributed high availability. This post and the next one will discuss how the service developer can provide these two capabilities, respectively.

Disaster Recovery

Disaster recovery typically involves a pair of processes called fail over and fail back. Both involve the use of redundant service components. A fail over moves service load from one set of components to another. In the case of an actual outage, a fail over is performed because the original components have failed. However, planned fail overs may be performed for testing, upgrade, or other purposes. Because most issues within a single data center are addressed by the high availability features of SQL Azure and Windows Azure, fail over will generally be used to move load between data centers. A fail back restores the original service load distribution.

For stateful services, these processes target both compute and storage components. For stateless services, only compute components are targeted. Compute fail overs reroute client requests to different compute components, usually by changing DNS CNAME records to bind a given logical URI to different physical URIs. Storage fail overs cause service requests to be served from different storage components, typically by changing connection strings used internally by the service.
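
To make the mechanics concrete, here is a minimal sketch of fail over as re-pointing logical names at physical endpoints; all of the names, endpoints, and connection strings below are hypothetical examples, and in production the compute binding would be an actual DNS CNAME record rather than an in-memory table:

    # Minimal sketch of fail over as re-pointing logical names at physical endpoints.
    # All names, endpoints, and connection strings are hypothetical examples.

    # Compute: a logical URI is bound to the physical URI of a hosted service.
    # In production this binding would be a DNS CNAME record.
    compute_bindings = {
        "api.example.com": "myservice-du1-uswest.cloudapp.net",
    }

    # Storage: the service reads its connection strings from configuration.
    storage_bindings = {
        "OrdersDb": "Server=tcp:server-uswest.database.windows.net;Database=Orders",
    }

    def fail_over_compute(logical_uri, standby_physical_uri):
        """Reroute client requests by rebinding the logical URI (the CNAME change)."""
        compute_bindings[logical_uri] = standby_physical_uri

    def fail_over_storage(name, standby_connection_string):
        """Redirect the service's data access by swapping the connection string."""
        storage_bindings[name] = standby_connection_string

    # A planned fail over to a deployment unit in another data center.
    fail_over_compute("api.example.com", "myservice-du2-useast.cloudapp.net")
    fail_over_storage("OrdersDb",
                      "Server=tcp:server-useast.database.windows.net;Database=Orders")
    print(compute_bindings, storage_bindings)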

Both processes require redundant service components to be available and ready to receive the service load being moved. Satisfying this requirement can be challenging. Provisioning redundant components in advance may incur additional cost, for example, but provisioning them on demand during a fail over may incur unacceptable latency. There is also the question of how to make sure the redundant components are correctly deployed and configured, so that the service continues to behave correctly following the move.

Deployment Units

One way to simplify the task of providing redundant service components is to use deployment units. A deployment unit is a set of resources provisioned in advance and configured to support the execution of an instance of a given service. Conceptually, it's like a template for a service instance. For example, if a given service had two web roles, three worker roles, and three databases before loading production data, then a deployment unit for that service might consist of five hosted services with deployed executables, and one SQL Azure logical server containing three databases configured with appropriate schema and initial data.
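
To make the idea concrete, a deployment unit can be described declaratively. The sketch below is a hypothetical manifest for the example service just described; the component and database names are illustrative only:

    from dataclasses import dataclass, field

    # Hypothetical, declarative description of a deployment unit: the hosted
    # services and databases that make up one instance of the service.

    @dataclass
    class HostedService:
        name: str       # e.g. "webfrontend"
        role_type: str  # "web" or "worker"

    @dataclass
    class Database:
        name: str
        schema_script: str  # SQL script applying schema and initial data

    @dataclass
    class DeploymentUnit:
        unit_name: str    # e.g. "du1"
        data_center: str  # e.g. "uswest"
        hosted_services: list = field(default_factory=list)
        databases: list = field(default_factory=list)

    # The example service: two web roles, three worker roles, three databases,
    # giving five hosted services and one logical server with three databases.
    unit = DeploymentUnit(
        unit_name="du1",
        data_center="uswest",
        hosted_services=[
            HostedService("webfrontend", "web"),
            HostedService("webadmin", "web"),
            HostedService("workeringest", "worker"),
            HostedService("workerbilling", "worker"),
            HostedService("workerreports", "worker"),
        ],
        databases=[
            Database("orders", "orders_schema.sql"),
            Database("customers", "customers_schema.sql"),
            Database("audit", "audit_schema.sql"),
        ],
    )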

Configuring a deployment unit involves a variety of tasks, such as provisioning certificates, configuring DNS, setting configuration parameters in application and library configuration files, building executables from the configured sources, setting configuration parameters in deployment configuration files, building deployable packages from the executables and deployment settings, deploying the packages to the hosted services, and running SQL scripts to configure the databases.
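
These tasks lend themselves to a scripted pipeline. The sketch below outlines one possible ordering; each step is a hypothetical placeholder for the corresponding tooling (certificate store, DNS provider, build system, deployment API, SQL client):

    # Sketch of an automated provisioning pipeline for a deployment unit.
    # Every step is a placeholder; a real pipeline would call the relevant
    # certificate store, DNS provider, build system, deployment API, and SQL client.

    def provision_certificates(unit): print("certificates provisioned for", unit)
    def configure_dns(unit): print("DNS configured for", unit)
    def set_application_configuration(unit): print("app config written for", unit)
    def build_executables(unit): print("executables built for", unit)
    def set_deployment_configuration(unit): print("deployment config written for", unit)
    def build_packages(unit): print("packages built for", unit)
    def deploy_packages(unit): print("packages deployed for", unit)
    def run_database_scripts(unit): print("database scripts run for", unit)

    def provision_deployment_unit(unit):
        steps = [
            provision_certificates,         # certificates for SSL and management
            configure_dns,                  # CNAME records for each hosted service
            set_application_configuration,  # application and library config files
            build_executables,              # build from the configured sources
            set_deployment_configuration,   # deployment configuration files
            build_packages,                 # package executables + deployment settings
            deploy_packages,                # deploy packages to the hosted services
            run_database_scripts,           # apply schema and initial data
        ]
        for step in steps:
            step(unit)

    provision_deployment_unit("du1-uswest")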

The process should be automated, since the number of configuration parameters may be too large to set reliably by hand. Some parts of it may also have to run in Azure, or be performed by operations staff rather than developers, since some of the configuration parameters, such as SQL Azure connection strings and other credentials, may have to be sourced from secure locations rather than developer machines. Of course, the ability to rapidly provision configured deployment units provides significant value beyond disaster recovery. Deployment units can be used for development, test, staging, private releases, A/B testing, and upgrade scenarios.

The key to success with deployment units is systematic organization. When provisioning a deployment unit is merely a matter of running a tool, it doesn't take long before the number of hosted services and databases gets out of hand, and it can quickly become challenging to track which deployment units have been provisioned, which resources each one is using, and what purpose each one is serving. A key step toward service maturity is therefore building a deployment unit management system, with systematic naming of deployment units and the resources they contain. Typically, the DNS names of the hosted services will reflect the organizational structure, identifying, for example, the name of the component, the name of the deployment unit, and the name of the data center in which it resides, making it easy to identify components programmatically for fail over and fail back operations.
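
One possible naming scheme encodes the component, deployment unit, and data center directly into each hosted service name, so that fail over tooling can build and parse the names programmatically. The exact format below is an illustrative convention, not a platform requirement:

    # Sketch of a systematic naming scheme for hosted services. The
    # component-unit-datacenter format is a hypothetical convention.

    def hosted_service_name(component, unit, data_center):
        """Build a DNS-safe hosted service name, e.g. 'webfrontend-du2-useast'."""
        return f"{component}-{unit}-{data_center}"

    def parse_hosted_service_name(name):
        """Recover the component, deployment unit, and data center from a name."""
        component, unit, data_center = name.split("-")
        return {"component": component, "unit": unit, "data_center": data_center}

    name = hosted_service_name("webfrontend", "du2", "useast")
    print(name)                             # webfrontend-du2-useast
    print(parse_hosted_service_name(name))  # {'component': 'webfrontend', ...}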

Compute Redundancy

With Windows Azure, a cost-effective compute redundancy strategy is to operate redundant service instances in two or more data centers, and to fail over by moving load from one instance to another. Because the platform varies the number of worker roles running as the load changes, this approach adds capacity without incurring additional cost: the additional worker roles will not run until the load actually moves. This is called an active/active configuration, as opposed to an active/passive configuration, where the service instance in one data center does not carry any load until a fail over occurs.
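
One way to picture active/active operation is as a routing decision that spreads load across every healthy deployment unit and concentrates it on the survivors when a data center goes down. The sketch below uses hypothetical unit names, with health information supplied by the caller:

    # Sketch of an active/active routing decision across deployment units.
    # Health information would come from monitoring; here it is passed in directly.

    def healthy_units(units, health):
        """Return the deployment units currently able to take load."""
        return [u for u in units if health.get(u, False)]

    units = ["du1-uswest", "du2-useast"]

    # Normal operation: both units are healthy and share the load.
    print(healthy_units(units, {"du1-uswest": True, "du2-useast": True}))

    # Fail over: uswest is down, so all load is routed to useast, where the
    # platform starts additional worker roles as the load arrives.
    print(healthy_units(units, {"du1-uswest": False, "du2-useast": True}))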

Storage Redundancy

Storage fail overs are more complex than compute fail overs because of the challenge of maintaining consistency among copies of the data when the data is changing. Because of the time required to copy data between locations, there is a possibility of data loss if a location fails or is isolated from other locations while holding data that has not yet been replicated. On the other hand, forcing synchronous replication to multiple locations when the data is written can result in poor performance due to added latency. In the extreme, when too few locations are available, forcing synchronous replication causes writes to block until additional locations become available.

This trade-off, known as the CAP theorem[1], is the motivation behind the many technologies for managing geographically distributed data used on the Internet. The two major categories of distributed data technology are as follows:

  • Fully consistent stores, such as Google’s MegaStore[2], use synchronous replication, based on distributed consensus algorithms such as Paxos, to write to a quorum of replicas. Synchronous replication avoids inconsistency, but is generally too slow to be used for high throughput, low latency applications. Also, to prevent loss of service and resynchronization issues, a quorum of three or more locations is typically required. Currently, on the Azure platform, there are only two data centers in each region, making it hard to build quorums with reasonable performance.
  • Eventually consistent stores, such as Amazon’s Dynamo[3], use asynchronous replication to copy data to replicas after it is written to a primary location, and provide mechanisms for dealing with data loss and inconsistency. With an eventually consistent store, service developers must deal with inconsistencies in business logic or use conflict resolution mechanisms to detect and resolve them, as sketched below.
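
As an illustration of the kind of conflict resolution an eventually consistent store pushes onto the service, the sketch below applies a simple last-writer-wins policy to two replicas of the same record. The record shape and timestamps are hypothetical, and real systems often use version vectors rather than wall-clock time:

    # Sketch of a last-writer-wins conflict resolution policy for two replicas
    # of the same record. The record shape and timestamps are hypothetical.

    def resolve_last_writer_wins(replica_a, replica_b):
        """Pick the copy with the later modification time; ties favor replica_a."""
        return replica_a if replica_a["modified"] >= replica_b["modified"] else replica_b

    copy_in_uswest = {"id": 42, "status": "shipped", "modified": 1002}
    copy_in_useast = {"id": 42, "status": "pending", "modified": 998}

    print(resolve_last_writer_wins(copy_in_uswest, copy_in_useast))
    # {'id': 42, 'status': 'shipped', 'modified': 1002}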

Asynchronous Replication

In a multi-master architecture, writes can occur in multiple locations, and continuous conflict resolution is required to resolve the resulting inconsistencies. In a single-master architecture, writes can occur in only one location. However, conflict resolution is still required if the master is allowed to change, as it must in the case of a fail over, and must be performed after the master changes to build a consistent view of service state. An alternative to conflict resolution is to flatten and rebuild from scratch any replicas holding a later copy of the data than the new master.
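
A minimal sketch of the single-master case, assuming each replica carries a replication watermark (a hypothetical, monotonically increasing position such as a log sequence number): after a fail over, replicas that are ahead of the new master are flattened and rebuilt, while the rest simply catch up.

    # Sketch of promoting a new master after a fail over in a single-master
    # architecture. The watermark values are hypothetical replication positions.

    def plan_promotion(new_master, replicas):
        """Decide what to do with each replica relative to the new master."""
        master_watermark = replicas[new_master]
        actions = {}
        for name, watermark in replicas.items():
            if name == new_master:
                continue
            if watermark > master_watermark:
                # The replica applied writes the new master never saw:
                # reconcile via conflict resolution, or rebuild from scratch.
                actions[name] = "flatten and rebuild"
            else:
                actions[name] = "catch up from new master"
        return actions

    replicas = {"useast": 1500, "uswest": 1530, "europe": 1480}
    print(plan_promotion("useast", replicas))
    # {'uswest': 'flatten and rebuild', 'europe': 'catch up from new master'}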

On the Azure platform, one of the best bets for achieving asynchronous replication is the Sync Framework, which handles the complexities of virtual clocks, knowledge vectors, and difference computation, while providing enormous flexibility through the use of conflict resolution policies, application level conflict resolution, and a provider architecture for data and metadata sources. The SQL Azure Data Sync Service, now in CTP, hosts the Sync Framework on Windows Azure to provide synchronization as a service. Using the service, instead of rolling your own solution with the Sync Framework, means living with some limitations, but also offloads the work of building, maintaining, and operating an asynchronous replication service. Check this blog for more on these two technologies and the trade-off between them in upcoming posts.

In the extreme, asynchronous replication becomes backup and restore, which offers a simple form of storage redundancy, if customers are willing to accept a recovery point objective (RPO) defined by the window between backups. If a service cannot block writes for extended periods of time, then data may change while a backup is taken, and the contents of different storage components captured by the backup may reflect different views of the state of the service. In other words, backups may also contain inconsistencies, and recovery processing is therefore generally required following a restore.

Some services use data mastered by other services, and update their copies when data changes at the source. In these cases, recovery processing involves replaying changes from the source to ensure that the copies are up to date. All updates must be idempotent, so that the replayed data can safely overlap with the data already stored in the copies; the overlap ensures that no updates are missed.
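
The sketch below shows why idempotence matters during this kind of recovery: because applying a change is an upsert keyed by record id, a replay window that overlaps data already stored in the copy leaves the copy correct while guaranteeing that nothing in between was missed. The record shapes and change log are hypothetical:

    # Sketch of idempotent replay during recovery processing. Applying a change
    # is an upsert keyed by id, so replaying an already-applied change is harmless.

    def apply_change(copy, change):
        """Idempotent upsert: the same change can be applied any number of times."""
        copy[change["id"]] = change["value"]

    # The local copy before recovery, and the changes replayed from the source.
    # The replay window deliberately overlaps the change already applied for id 1.
    local_copy = {1: "a"}
    replayed_changes = [
        {"id": 1, "value": "a"},  # already applied: replaying it changes nothing
        {"id": 2, "value": "b"},  # missed during the outage: now applied
    ]

    for change in replayed_changes:
        apply_change(local_copy, change)

    print(local_copy)  # {1: 'a', 2: 'b'}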

Cross Data Center Operation

While compute and storage fail overs generally occur in tandem, so that the entire service fails over from one data center to another, there are situations where it may make sense to fail over one but not the other. Since storage fail over may cause data loss, for example, it may make sense to fail over only the compute components, allowing them to access the primary storage facilities, assuming they’re still accessible.

Following a compute-only fail over, compute and storage components may be located in two different data centers. Storage access calls will therefore traverse the backbone network between data centers, instead of staying within a single data center, causing performance degradation. On the Azure platform, round trip times between data centers within a region are about 6 times higher than they are within a data center. Between regions, the performance penalty grows to a factor of 30. Services will also incur data egress costs when running across data centers. Cross data center operation therefore really only makes sense for short-lived outages, or for services that only move small amounts of data.
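
To get a feel for what those multipliers mean, the back-of-the-envelope sketch below applies them to a request that makes several storage calls; the 1 ms in-data-center round trip time and the 10 calls per request are assumed values for illustration, while the 6x and 30x factors come from the discussion above:

    # Back-of-the-envelope estimate of the latency cost of cross data center
    # operation. The baseline round trip time and call count are assumptions;
    # the 6x and 30x multipliers come from the text above.

    baseline_rtt_ms = 1.0           # assumed round trip time within a data center
    storage_calls_per_request = 10  # assumed storage calls per service request

    for label, multiplier in [("same data center", 1),
                              ("cross data center, same region", 6),
                              ("cross data center, cross region", 30)]:
        added_ms = baseline_rtt_ms * multiplier * storage_calls_per_request
        print(f"{label}: ~{added_ms:.0f} ms of storage round trips per request")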


[1]
See https://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf.

[2]
See https://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf.

[3]
See https://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf.