September 2015

Volume 30 Number 9

Microsoft Azure - Fault Tolerance Pitfalls and Resolutions in the Cloud

By Mario Szpuszta | September 2015

There are several mechanisms built into Microsoft Azure to ensure services and applications remain available in the event of a failure. Such failures range from hardware failures, such as hard-disk crashes, to temporary availability issues of dependent services, such as storage or networking. Azure and its software-controlled infrastructure are designed to anticipate and manage such failures.

In the event of a failure, the Azure infrastructure (the Fabric Controller) reacts immediately to restore services and infrastructure. For example, if a virtual machine (VM) fails due to a hardware failure on the physical host, the Fabric Controller moves that VM to another physical node, attaching the same virtual hard disk stored in Azure Storage. Azure is similarly capable of coordinating upgrades and updates in such a way as to avoid service downtime.

For compute resources (such as cloud services, traditional IaaS VMs and VM scale sets), the most important and fundamental concepts for enabling high availability are fault domains and upgrade domains. These have been part of Azure since its inception. This article clarifies some of the less well-known aspects of those concepts.

The Azure Datacenter Architecture

To fully understand fault domains and upgrade domains, it helps to visualize a high-level view of how Azure datacenters are structured. Azure datacenters use an architecture referred to within Microsoft as Quantum 10. It supports higher throughput than previous datacenter architectures. Its topology implements a full, non-blocking, meshed network that provides a high-bandwidth aggregate backplane for each Azure datacenter, as shown in Figure 1.

Figure 1 High-Level Azure Datacenter Architecture

The nodes are arranged into racks. A group of racks then forms a cluster. Each datacenter has numerous clusters of different types. Some clusters are responsible for Storage, while others are responsible for Compute, SQL and so on. Each rack connects to the rest of the datacenter through a Top-Of-Rack (TOR) switch, which is a single point of failure for the entire rack.

The cluster’s Fabric Controller manages all the machines or nodes. The Fabric Controller orchestrates deployments across nodes within a cluster. Every cluster has more than one Fabric Controller for fault tolerance. The Fabric Controller must be aware of the health of every node within the cluster. This helps it determine if the node can run deployments. It also helps the Fabric Controller detect failures so it can automatically heal deployments by re-provisioning the affected VMs on different physical nodes.

To better assist the Fabric Controller in determining the health of a node, every machine running within a cluster hosts several agents that continuously monitor node health and report it back to the Fabric Controller. It’s essential to understand how the different components work together to make this happen. The important components, as shown in Figure 2, are:

  • Host OS: OS running on the physical machine.
  • Host Agent: A process running on individual nodes that provides a point of communication from that machine to the Fabric Controller.
  • Guest OS: OS running within the VM.
  • Guest Agent: Resides in the VM and communicates with the Host Agent to monitor and maintain the health of the VM.

Figure 2 Inside the Physical Machine of a Microsoft Azure Cluster

Fault Domains and Upgrade Domains

To maintain high availability of any Platform-as-a-Service (PaaS) application, the Fabric Controller spreads the instances of every PaaS application it hosts across different fault domains and upgrade domains.

A fault domain is a physical unit of failure, closely tied to the physical infrastructure in the datacenters. In Azure, every rack of servers corresponds to a fault domain. Azure guarantees that any PaaS application with more than one instance is spread across multiple fault domains, but the total number of fault domains over which the instances are spread is determined by the Fabric Controller, based on the availability of machines within the datacenter.

An upgrade domain is a logical unit that helps maintain application availability when you push updates to the system. For PaaS applications, this is a user-configurable setting. An application on Azure can have its instances spread across a maximum of five upgrade domains (see Figure 3).

Figure 3 Fault Domain and Upgrade Domain Configuration

Fault Domains, Upgrade Domains and IaaS VMs

To spread Infrastructure-as-a-Service (IaaS) VMs across fault domains and upgrade domains, Azure introduced the concept of Availability Sets. All instances within an availability set are spread across two or more fault domains and assigned separate upgrade domain values.
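
As a minimal sketch of how that assignment looks with the classic (version 1) Azure PowerShell cmdlets, the availability set is simply named when the VM configuration is created. The service name, availability set name, image filter and credentials below are placeholders, not values from the original deployment:

# Pick a Windows Server image (any recent image will do for this sketch).
$imageName = (Get-AzureVMImage |
    Where-Object { $_.Label -like "*Windows Server 2012 R2 Datacenter*" } |
    Select-Object -First 1).ImageName

# Both SQL nodes name the same availability set, so Azure spreads them across
# fault domains and assigns them separate upgrade domains.
$sql1 = New-AzureVMConfig -Name "sql1" -InstanceSize "Large" `
          -ImageName $imageName -AvailabilitySetName "sqlAvSet" |
        Add-AzureProvisioningConfig -Windows -AdminUsername "azureadmin" -Password "<password>"
$sql2 = New-AzureVMConfig -Name "sql2" -InstanceSize "Large" `
          -ImageName $imageName -AvailabilitySetName "sqlAvSet" |
        Add-AzureProvisioningConfig -Windows -AdminUsername "azureadmin" -Password "<password>"

New-AzureVM -ServiceName "sqlagdemo" -Location "West Europe" -VMs $sql1, $sql2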

If you don’t assign VMs to an availability set, you’re not eligible for service-level agreements (SLAs) for those VMs. It’s important to understand this because it defines how you can achieve high availability for your services and applications even when failures happen or upgrades are pushed out to Azure datacenters. Only by assigning your VMs to an availability set can you avoid being affected by such failures.

To demonstrate the importance of this, consider this scenario: The Azure product team pushes OS updates across all datacenters on a regular basis. To update an entire datacenter, both the host OS (on the physical machines) and the guest OS (in the VMs hosting PaaS applications or your own IaaS VMs) must be updated. To roll out the updates without affecting availability of the applications:

  1. The host OS updates are performed across the datacenter one fault domain at a time, for all available fault domains.
  2. Guest OS updates are performed on every user application one upgrade domain at a time, for all the available upgrade domains.

With these approaches, Azure can push upgrades to its own infrastructure while maintaining service availability—as long as you run at least two instances per service or at least two VMs as part of an availability set (such as a load-balanced Web service, SQL Server AlwaysOn Availability Group Nodes and so on).

How Many Fault Domains?

Fault and upgrade domains help maintain availability when it comes to PaaS-like workloads, which are mostly stateless. Stateless Web applications fit this model well: even if a subset of the nodes becomes unavailable during upgrade cycles or temporary downtimes, the overall Web application or service remains available.

The situation gets trickier when it comes to infrastructure of a more stateful nature, such as database servers (be it RDBMS or NoSQL). In these cases, knowing your servers are spread across multiple fault domains might not be enough. A database cluster might require a minimum number of nodes to be up at all times to remain healthy. Consider, for example, quorum-based approaches for electing new master nodes in case of failures.

For IaaS VMs, Azure guarantees that VMs within the same availability set will be deployed on at least two fault domains (and therefore two racks). While there’s some probability that VMs within an availability set are deployed across more than two fault domains, there’s no guarantee. In practical tests over the past few years, deployments to North Europe, West Europe and several U.S. regions always ended up on exactly two fault domains (as shown in Figure 4).

Figure 4 Fault Domains for a Sample SQL AlwaysOn AG Cluster

Figure 4 shows the result of a Get-AzureVM command issued through Azure PowerShell and displayed in the PowerShell GridView control. It shows that the VMs within the cluster—which are all part of the same availability set (not shown in the grid)—are deployed across two fault domains and three upgrade domains.
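
A short pipeline like the following reproduces that view; the cloud service name is a placeholder, and Out-GridView renders the grid:

# List the fault domain and upgrade domain placement of every VM in a cloud service.
Get-AzureVM -ServiceName "sqlagdemo" |
    Select-Object Name, InstanceStatus, InstanceSize,
                  InstanceFaultDomain, InstanceUpgradeDomain |
    Out-GridView -Title "Fault and upgrade domains"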

The sql1 and sqlwitness nodes of that sample deployment reside on the same physical rack. The same TOR connects that rack to the rest of the datacenter. The sql2 node sits on a different rack.

Only Two Fault Domains?

For stateless applications such as Web APIs or Web applications, being deployed across only two fault domains shouldn’t be a problem, because there’s no state that needs to be kept consistent across instances. For stateful workloads such as database servers, it’s a different story, at least from an availability perspective.

Depending on how a cluster works, it could be important to know how many nodes can go down in case of a failure before it affects the cluster’s health. If a cluster depends on quorum votes or majority-based votes for certain operations, such as electing new masters or confirming consistency for read requests, the question of how many nodes can go down in a worst-case scenario is more important.

Even though Azure automatically recovers your VMs, the question of how many nodes could fail at the same time is relevant. A recovery operation might take time, depending on how long it takes the Fabric Controller to recover the VM itself and the database system running on the VM.

The whole topic becomes even more important when you understand the internal behavior of Azure’s automated upgrade process. In the case of IaaS VMs, all upgrades to the host OS running on the hypervisor of each physical node that hosts your VMs happen based on fault domains—not upgrade domains, as is widely assumed in the developer community. Upgrade domains are only used for updating applications running inside PaaS VMs. That means you’ll be affected by host OS upgrades every quarter, which is the typical interval. If your cluster is deployed across two fault domains, but depends on majority votes and the like, you could see intermittent downtime.

A common example of this is deploying a MongoDB replica set in Azure. Each MongoDB replica set requires exactly one master. If that master node fails, a new master is elected from the remaining nodes. That election requires a majority of votes. If not enough nodes are up to elect a new master, the whole replica set is declared unhealthy and can be considered “down.”

The MongoDB documentation (bit.ly/1SxKrYI) clearly states the fault tolerance for each replica set size. Only one node can fail in a replica set of three nodes. In a set of five nodes, two is the maximum number of nodes that can fail without the whole cluster going down, as shown in Figure 5.

Figure 5 MongoDB Replica Set Fault Tolerance

Number of Members    Majority Required to Elect a New Primary    Fault Tolerance
3                    2                                           1
4                    3                                           1
5                    3                                           2
6                    4                                           2

Because an even number of database nodes in a replica set (for example, two database nodes) doesn’t make for a robust majority vote, MongoDB introduced the concept of an arbiter. An arbiter acts as a voting server for elections, but doesn’t run the entire database stack (saving costs and resources). So if you end up with a MongoDB replica set where two database nodes are sufficient, you still need a third node—the arbiter—which is only there to provide an additional vote for majority-based master elections in case of failures.
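
The arithmetic behind Figure 5 is simple majority math: the majority is floor(n / 2) + 1 and the fault tolerance is n minus that majority. The small sketch below (the function name is just for illustration) reproduces the numbers from Figure 5 and shows why an arbiter helps a two-node replica set:

# Majority = floor(n / 2) + 1; fault tolerance = n - majority.
function Get-ReplicaSetFaultTolerance {
    param([int]$MemberCount)
    $majority = [int][math]::Floor($MemberCount / 2) + 1
    [pscustomobject]@{
        Members        = $MemberCount
        Majority       = $majority
        FaultTolerance = $MemberCount - $majority
    }
}

3..6 | ForEach-Object { Get-ReplicaSetFaultTolerance $_ }   # same values as Figure 5

# Two data nodes alone tolerate no failure; adding an arbiter (a third vote)
# raises the fault tolerance to one.
Get-ReplicaSetFaultTolerance 2   # Majority 2, FaultTolerance 0
Get-ReplicaSetFaultTolerance 3   # Majority 2, FaultTolerance 1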

The situation is similar with SQL Server AlwaysOn Availability Groups, where a majority of nodes is required to elect a new primary node. The principle of a voting-only member applies here, too; it’s just called a witness in the world of SQL Server (instead of arbiter, as in MongoDB).

Looking back at the SQL Server deployment shown in Figure 4, the grid clearly shows that sql1 and sqlwitness are on one fault domain and sql2 is on another. If fault domain “0” fails, the master and witness are both down—only sql2 is left. However, sql2 alone isn’t a sufficient majority for electing a new master in the cluster. That means if fault domain “0” fails, your whole cluster is unhealthy.

The situation would be worse if sql1 and sql2 ended up on the same fault domain. Then both database nodes would be down until the Fabric Controller recovered the nodes from a potential failure or completed the host OS upgrade process.

The situation is similar with a MongoDB replica set. The table in Figure 5, taken from the official MongoDB docs, clearly states that in a replica set of three nodes, only one node can fail if the whole cluster is to remain active and available. Therefore, how your nodes are spread across a given number of fault domains is critical. It can affect you both during Azure host OS upgrades and in the event of failures.

High Availability in Stateful Services

A valid question, then, is how you can achieve high availability when Azure mostly deploys VMs across two fault domains. There are mid-term and short-term answers to this question.

Mid-term, the Azure product group is working to improve the situation dramatically. When you deploy version 2 IaaS VMs (based on the new Azure Resource Manager API), Azure can deploy your workloads across a minimum of three fault domains. That’s a good reason to use version 2 VMs and the Azure Resource Manager.
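
As a rough sketch of what that looks like with the Resource Manager PowerShell cmdlets (cmdlet names and default domain counts vary by module version; the resource group, names and region below are placeholders), you can request three fault domains explicitly when creating the availability set:

# Create a version 2 (Resource Manager) availability set spanning three fault domains.
New-AzureRmResourceGroup -Name "sqlag-rg" -Location "West Europe"

New-AzureRmAvailabilitySet -ResourceGroupName "sqlag-rg" -Name "sqlAvSet" `
    -Location "West Europe" `
    -PlatformFaultDomainCount 3 `
    -PlatformUpdateDomainCount 5

# VM configurations created with New-AzureRmVMConfig can then reference this
# availability set through its -AvailabilitySetId parameter.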

Short-term, or as long as you’re still dependent on traditional Azure Service Management and version 1 IaaS VMs, it’s not that simple. Depending on your SLA, recovery time objective (RTO) and recovery point objective (RPO) targets, you have two options that reduce the risk of downtime. Both approaches are shown in Figure 6, and are based on a SQL Server AlwaysOn Availability Group.

Figure 6 SQL Server AlwaysOn Availability Group Deployment

The goal is always to reduce the impact of both regularly occurring host OS upgrades and occasional failures. For a three-node cluster in a single datacenter, deploy one node outside the VM availability set and two nodes as part of the same VM availability set, as in the sketch that follows.
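
Building on the earlier version 1 sketch (same placeholder names, and $imageName as defined there), the only difference for the witness is that its configuration omits the availability set:

# sql1 and sql2 stay in the "sqlAvSet" availability set (see the earlier sketch).
# sqlwitness deliberately omits -AvailabilitySetName, so it runs as a single VM
# and is upgraded on a different host OS upgrade schedule.
$witness = New-AzureVMConfig -Name "sqlwitness" -InstanceSize "Small" `
             -ImageName $imageName |
           Add-AzureProvisioningConfig -Windows -AdminUsername "azureadmin" -Password "<password>"

New-AzureVM -ServiceName "sqlagdemo" -VMs $witness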

The effects of host OS upgrades are nearly completely mitigated with that approach. The timing of upgrades of VMs inside availability sets is different from single VM host OS upgrades. Host OSes running single VMs without availability sets are typically upgraded approximately a week earlier than those with VMs in availability sets.

In the case of fault domain failures, you can only reduce the probability of being affected. There’s always a probability the node outside the availability set lands on one of the fault domains of the VMs within the availability set. Much depends on the resources consumed and available in an Azure datacenter.

For an Active Directory Domain as in Figure 6, there’s no need for such a solution. There’s only a primary and backup domain controller for high availability, anyway. That’s two nodes perfectly spread across two fault domains, which is what Azure guarantees for version 1 IaaS VMs.

That still leaves you with the challenge of better protecting the SQL nodes shown in Figure 6. You can reduce that risk by not deploying sqlwitness in the same datacenter. That doesn’t just reduce the probability; it essentially eliminates the risk. That solution is also expressed in Figure 6: Distribute your deployment across two regions.

Two Options

Depending on your SLA, RPO and RTO needs and what you’re willing to pay for high availability, again you have two major options: a fully functional backup deployment in a second region, or just the arbiter/witness in the secondary region.

A fully functional secondary deployment means you replicate your entire deployment in a secondary region. That would also include your front-end and middle-tier applications and services. In the event of a major failure in the primary region, you could then redirect customers to the secondary region.

Such deployments are typically built for strong RPO/RTO targets. For database systems such as MongoDB or SQL AlwaysOn with short RPOs and RTOs, such needs typically result in spanning the replica set or SQL AG cluster across two regions with ongoing replication enabled across those regions. Although the replication across regions will probably be asynchronous due to latency and performance issues, replication will happen anywhere from milliseconds to minutes, as opposed to double-digit minutes or hours.

On the other hand, running just a witness or arbiter in the secondary region as a single VM is a much cheaper alternative. It’s good enough when you just need to keep your primary cluster alive in case of fault domain failure. It doesn’t give you the option of immediately failing over to an entire secondary region without some serious additional steps, such as spinning up new nodes and VMs in the secondary region.

In the example shown in Figure 6, you could also run a full SQL node in the secondary region as the only node. Because it runs as a single VM, it would have different upgrade cycles. Also, because it runs in a different datacenter, the probability of it being upgraded or failing at exactly the same time as the nodes in the primary availability set is low.

Wrapping Up

Achieving high availability and fault tolerance for your applications and services isn’t a simple process. It requires understanding and adjusting to fundamental concepts. You need to understand fault domains, upgrade domains and availability sets. You especially need to understand the fault-tolerance requirements of stateful systems you’re using in your infrastructure when moving to Azure. Map those fault-tolerance requirements to behaviors of fault domains and upgrade domains in Azure.

For entirely new IaaS deployments, be sure to leverage IaaS VMs v2 as part of the Azure Resource Manager and Resource Group efforts. That way, you’ll benefit from the fault tolerance of being deployed on at least three fault domains. For deployments using traditional service management, make sure you understand and embrace the realities outlined in this article. These suggestions can help reduce the impact of fault domain downtimes and maintenance events such as host OS upgrades. By embracing and adjusting to the concepts outlined here, you’ll be able to achieve high availability without unwanted surprises.


Mario Szpuszta is a principal program manager for the DX Corp. Global ISV team. He works with independent software vendors across the world to enable their solutions and services on Microsoft Azure. You can reach him through his blog (blog.mszcool.com), on Twitter (twitter.com/mszcool) or via marioszp@microsoft.com.

Srikumar Vaitinadin is a software development engineer for the DX Corp. Global ISV team. Before that he was involved in migrating Microsoft properties to Azure. He and his team of architects played a major role in onboarding Azure China and Azure federal government clouds. You can get in touch with him via email at srivaiti@microsoft.com.

Thanks to the following technical experts for reviewing this article: Guadalupe Casuso and Jeremiah Talkar
Guada Casuso is a technology evangelist working for Microsoft Azure and experienced in cloud and cross-platform mobile development. When she’s not working, she’s on a paddleboard or flying drones. Guada shares her thoughts on her blog at atomosybitsenlanube.net and on Twitter at twitter.com/guadacasuso.