Availability Sets - Choosing number of fault and update domains

Chris Hocking 1 Reputation point
2021-03-11T11:58:05.78+00:00

Hello,

I have come across availability sets many times during study and have a good understanding of them. However, there is one thing I've not been able to understand:
Is there a reason not to set the number of fault domains and update domains to their maximums?
Or, put another way, why would you choose to have a higher proportion of your VMs in the same fault domain or in the same update domain?

Thanks

Azure Virtual Machines
An Azure service that is used to provision Windows and Linux virtual machines.

2 answers

  1. Andriy Bilous 8,736 Reputation points
    2021-03-11T12:25:14.62+00:00

    Hello @Chris Hocking

    Here is how update domains are assigned, according to the Azure documentation:

    • Each virtual machine in an availability set is assigned an update domain and a fault domain by the underlying Azure platform. For a given availability set, five non-user-configurable update domains are assigned by default (Resource Manager deployments can then be increased to provide up to 20 update domains) to indicate groups of virtual machines and underlying physical hardware that can be rebooted at the same time. When more than five virtual machines are configured within a single availability set, the sixth virtual machine is placed into the same update domain as the first virtual machine, the seventh in the same update domain as the second virtual machine, and so on. The order of update domains being rebooted may not proceed sequentially during planned maintenance, but only one update domain is rebooted at a time. A rebooted update domain is given 30 minutes to recover before maintenance is initiated on a different update domain.
      https://learn.microsoft.com/en-us/azure/virtual-machines/availability-set-overview#how-do-availability-sets-work
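
    The wrap-around placement described in that passage can be modelled in a few lines. This is only a sketch of the documented behaviour, not how the platform decides placement internally; the function name `assign_update_domains` and the VM names are made up for illustration.

```python
from collections import defaultdict


def assign_update_domains(vm_names, update_domain_count=5):
    """Model the documented wrap-around: the VM at position i lands in
    update domain (i mod update_domain_count). Hypothetical helper."""
    if update_domain_count < 1:
        raise ValueError("update_domain_count must be at least 1")
    placement = defaultdict(list)
    for index, name in enumerate(vm_names):
        placement[index % update_domain_count].append(name)
    return dict(placement)


# Seven VMs in an availability set with the default five update domains.
vms = [f"vm{i}" for i in range(1, 8)]
placement = assign_update_domains(vms)

for ud, members in sorted(placement.items()):
    print(f"UD {ud}: {members}")
```

    With seven VMs and five update domains, the sixth VM shares an update domain with the first and the seventh with the second, so a reboot of either of those domains takes two VMs down together.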

    There are some best practices:

    • Put each application tier into a separate availability set. In an N-tier application, don't put VMs from different tiers into the same availability set. VMs in an availability set are placed across fault domains (FDs) and update domains (UDs). However, to get the redundancy benefit of FDs and UDs, every VM in the availability set must be able to handle the same client requests.
    • The availability set should have the number of fault domains set to 3 and the number of update domains set to 20.
      Azure supports a maximum of 3 fault domains (only 2 in some regions) and 20 update domains. We recommend the maximum of 20 update domains as that will minimize the number of nodes down at any one time.
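
    To put a rough number on "nodes down at any one time": assuming VMs are spread evenly across domains, the worst case for a single domain going down is the VM count divided by the domain count, rounded up. The helper `max_vms_down` below is hypothetical and only does that arithmetic; the figure of 40 VMs is an arbitrary example.

```python
import math


def max_vms_down(vm_count, domain_count):
    """Worst-case number of VMs sharing one domain, assuming an even
    spread across domains. Hypothetical helper for illustration."""
    if vm_count < 0 or domain_count < 1:
        raise ValueError("vm_count must be >= 0 and domain_count >= 1")
    return math.ceil(vm_count / domain_count)


vm_count = 40  # arbitrary example size
for update_domains in (5, 20):
    down = max_vms_down(vm_count, update_domains)
    print(f"{update_domains} update domains: up to {down} of {vm_count} VMs rebooted together")
```

    For 40 VMs, five update domains means up to 8 VMs reboot together during planned maintenance, while 20 update domains brings that down to 2. The same arithmetic applies to fault domains and unplanned hardware failures.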

    https://github.com/DSPN/azure-deployment-guide/blob/master/bestpractices.md
    https://learn.microsoft.com/en-us/azure/architecture/checklist/resiliency-per-service#virtual-machines


  2. kobulloc-MSFT 14,661 Reputation points Microsoft Employee
    2021-03-24T05:54:48.42+00:00

    Hello! That's a good question. If increasing the number of fault domains and update domains increases reliability, why not spread your solution out over as many as possible if there's little or no impact on your latency? While it initially makes sense to max out fault domains and update domains, it may end up being more important to set a minimum threshold instead.

    Enterprise and solutions at scale
    When you are dealing with just a handful of VMs, you have a lot of flexibility in where each VM is placed. When you are working with a much larger solution, being able to quickly secure a large number of VMs may matter more than whether some of those VMs end up in the same fault domain or update domain, as long as you can ensure a minimum number of fault domains and update domains to keep your solution running. In this case, a minimum threshold is more important than a maximum.

    Azure Stack, hybrid, and modular data centers
    Azure Stack is used on cruise ships, and modular data centers bring the cloud to remote areas. In these cases, you don't have the luxury of a large datacenter, and setting a minimum number of fault domains and update domains may be more practical than setting a maximum.

    I hope that answers your question. If you are interested, there's more reading on the subject here:
    https://learn.microsoft.com/en-us/archive/msdn-magazine/2015/september/microsoft-azure-fault-tolerance-pitfalls-and-resolutions-in-the-cloud#how-many-fault-domains
