Reliability in Virtual Machine Scale Sets

This article contains specific reliability recommendations and information on availability zones support for Virtual Machine Scale Sets.

Note

Virtual Machine Scale Sets can only be deployed into one region. If you want to deploy VMs across multiple regions, see Virtual Machines-Disaster recovery: cross-region failover.

For an architectural overview of reliability in Azure, see Azure reliability.

Reliability recommendations

This section contains recommendations for achieving resiliency and availability. Each recommendation falls into one of two categories:

  • Health items cover areas such as configuration items and the proper function of the major components that make up your Azure Workload, such as Azure Resource configuration settings, dependencies on other services, and so on.

  • Risk items cover areas such as availability and recovery requirements, testing, monitoring, deployment, and other items that, if left unresolved, increase the chances of problems in the environment.

Reliability recommendations priority matrix

Each recommendation is marked in accordance with the following priority matrix:

Image Priority Description
High Immediate fix needed.
Medium Fix within 3-6 months.
Low Needs to be reviewed.

Reliability recommendations summary

Category Priority Recommendation
High Availability Enable automatic repair policy
Deploy Virtual Machine Scale Sets across availability zones with Virtual Machine Scale Sets Flex
Scalability VMSS-1: Deploy VMs with flexible orchestration mode
Configure Virtual Machine Scale Sets Autoscale to Automatic
Set Virtual Machine Scale Sets custom scale-in policies to default
Disaster Recovery Enable Protection Policy for all Virtual Machine Scale Set VMs
Monitoring Enable Virtual Machine Scale Sets application health monitoring
System Efficiency Configure Allocation Policy Spreading algorithm to max spreading
Automation Set patch orchestration options to Azure-orchestrated

High availability

Enable automatic repair policy

To achieve high availability for applications, enable automatic instance repairs to maintain a set of healthy VMs. When the Application Health extension or Load Balancer health probes find that an instance is unhealthy, automatic instance repair deletes the unhealthy instance and creates a new one to replace it.

A grace period can be set using the property automaticRepairsPolicy.gracePeriod. The grace period, specified in minutes and in ISO 8601 format, can range between 10 to 90 minutes, and has a default value of 30 minutes.

// Azure Resource Graph Query
// Find VMSS instances associated with autoscale settings when autoscale is disabled
resources
| where type == "microsoft.compute/virtualmachinescalesets"
| project name, id, tags
| join kind=leftouter  (
    resources
    | where type == "microsoft.insights/autoscalesettings"
    | where tostring(properties.targetResourceUri) contains "Microsoft.Compute/virtualMachineScaleSets"
    | project id = tostring(properties.targetResourceUri), autoscalesettings = properties
) on id
| where isnull(autoscalesettings) or autoscalesettings.enabled == "false"
| project recommendationId = "vmss-4", name, id, tags, param1 = "autoscalesettings: Disabled"
| order by id asc

Deploy Virtual Machine Scale Sets across availability zones with Virtual Machine Scale Sets Flex

When you create your Virtual Machine Scale Sets, use availability zones to protect your applications and data against unlikely datacenter failure. For more information, see Availability zone support.

// Azure Resource Graph Query
// Find VMSS instances with one or no Zones selected
resources
| where type == "microsoft.compute/virtualmachinescalesets"
| where array_length(zones) <= 1 or isnull(zones)
| project recommendationId = "vmss-8", name, id, tags, param1 = "AvailabilityZones: Single Zone"
| order by id asc

Scalability

Deploy VMs with flexible orchestration mode

All VMs, including single instance VMs, should be deployed into a scale set using flexible orchestration mode to future-proof your application for scaling and availability. Flexible orchestration offers high availability guarantees (up to 1000 VMs) by spreading VMs across fault domains in a region or within an availability zone.

For more information on how to use scale sets appropriately, see When to use Virtual Machine Scale Sets instead of VMs

// Azure Resource Graph Query
// Find all zonal VMs that are NOT deployed with Flex orchestration mode
resources
| where type == "microsoft.compute/virtualmachinescalesets"
| where properties.orchestrationMode != "Flexible"
| project recommendationId = "vmss-1", name, id, tags, param1 = strcat("orchestrationMode: ", tostring(properties.orchestrationMode))

Configure Virtual Machine Scale Sets Autoscale to Automatic

Autoscale is a built-in feature of Azure Monitor that helps the performance and cost-effectiveness of your resources by adding and removing scale set VMs based on demand. In addition, you can choose to scale your resources manually to a specific instance count or in accordance with metrics thresholds. You can also schedule instance counts that scale during designated time windows.

To learn how to enable automatic OS image upgrades, see Azure Virtual Machine Scale Set automatic OS image upgrades.

// Azure Resource Graph Query
// Find VMSS instances associated with autoscale settings when predictiveAutoscalePolicy_scaleMode is disabled
resources
| where type == "microsoft.compute/virtualmachinescalesets"
| project name, id, tags
| join kind=leftouter  (
    resources
    | where type == "microsoft.insights/autoscalesettings"
    | where tostring(properties.targetResourceUri) contains "Microsoft.Compute/virtualMachineScaleSets"
    | project id = tostring(properties.targetResourceUri), autoscalesettings = properties
) on id
| where autoscalesettings.enabled == "true" and autoscalesettings.predictiveAutoscalePolicy.scaleMode == "Disabled"
| project recommendationId = "vmss-5", name, id, tags, param1 = "predictiveAutoscalePolicy_scaleMode: Disabled"
| order by id asc

Set Virtual Machine Scale Sets custom scale-in policies to default

The Virtual Machine Scale Sets custom scale-in policy feature gives you a way to configure the order in which virtual machines are scaled-in. There are three scale-in policy configurations:

A Virtual Machine Scale Set deployment can be scaled-out or scaled-in based on an array of metrics, including platform and user-defined custom metrics. While a scale-out creates new virtual machines based on the scale set model, a scale-in affects running virtual machines that may have different configurations and/or functions as the scale set workload evolves.

It's not necessary that you specify a scale-in policy if you only want the default ordering to be followed, as the default custom scale-in policy provides the best algorithm and flexibility for most of the scenarios. The default ordering is as follows:

  1. Balance virtual machines across availability zones (if the scale set is deployed with availability zone support).
  2. Balance virtual machines across fault domains (best effort).
  3. Delete virtual machine with the highest instance ID.

Only use the Newest and Oldest policies when your workload requires that the oldest or newest VMs should be deleted after balancing across availability zones.

Note

Balancing across availability zones or fault domains doesn't move VMs across availability zones or fault domains. The balancing is achieved through the deletion of virtual machines from the unbalanced availability zones or fault domains until the distribution of virtual machines becomes balanced.

// Azure Resource Graph Query
// Find VMSS instances where strictly zoneBalance is set to True
resources
| where type == "microsoft.compute/virtualmachinescalesets"
| where properties.orchestrationMode == "Uniform" and properties.zoneBalance == true
| project recommendationId = "vmss-6", name, id, tags, param1 = "strictly zoneBalance: Enabled"
| order by id asc

Disaster recovery

Enable Protection Policy for all Virtual Machine Scale Set VMs

Use Virtual Machine Scale Sets Protection Policy if you want specific VMs to be treated differently from the rest of the scale set instance.

As your application processes traffic, there can be situations where you want specific VMs to be treated differently from the rest of the scale set instance. For example, certain VMs in the scale set could be performing long-running operations, and you don’t want these VMs to be scaled-in until the operations complete. You might also have specialized a few VMs in the scale set to perform different tasks than other members of the scale set. You require these special VMs not to be modified with the other VMs in the scale set. Instance protection provides the extra controls to enable these and other scenarios for your application.

// Azure Resource Graph Query
// Find all VMs that do NOT have health monitoring enabled
resources
| where type == "microsoft.compute/virtualmachinescalesets"
| join kind=leftouter  (
    resources
    | where type == "microsoft.compute/virtualmachinescalesets"
    | mv-expand extension=properties.virtualMachineProfile.extensionProfile.extensions
    | where extension.properties.type in ( "ApplicationHealthWindows", "ApplicationHealthLinux" )
    | project id
) on id
| where id1 == ""
| project recommendationId = "vmss-2", name, id, tags, param1 = "extension: null"

Monitoring

Enable Virtual Machine Scale Sets application health monitoring

Monitoring your application health is an important signal for managing and upgrading your deployment. Azure Virtual Machine Scale Sets provides support for rolling upgrades, including:

// Azure Resource Graph Query
// Find all VMs that do NOT have automatic repair policy enabled
resources
| where type == "microsoft.compute/virtualmachinescalesets"
| where properties.automaticRepairsPolicy.enabled == false
| project recommendationId = "vmss-3", name, id, tags, param1 = "automaticRepairsPolicy: Disabled"

System Efficiency

Configure Allocation Policy Spreading algorithm to max spreading

With max spreading, the scale set spreads your VMs across as many fault domains as possible within each zone. This spreading could be across greater or fewer than five fault domains per zone. With static fixed spreading, the scale set spreads your VMs across exactly five fault domains per zone. If the scale set can't find five distinct fault domains per zone to satisfy the allocation request, the request fails.

For more information, see Spreading options.

// Azure Resource Graph Query
// Find VMSS instances where Spreading algorithm is set to Static
resources
| where type == "microsoft.compute/virtualmachinescalesets"
| where properties.platformFaultDomainCount > 1
| project recommendationId = "vmss-7", name, id, tags, param1 = "platformFaultDomainCount: Static"
| order by id asc

Automation

Set patch orchestration options to Azure-orchestrated

Enable automatic VM guest patching for your Azure VMs. Automatic VM guest patching helps ease update management by safely and automatically patching VMs to maintain security compliance, while limiting the blast radius of VMs.

resources
| where type == "microsoft.compute/virtualmachinescalesets"
| join kind=inner (
    resources
    | where type == "microsoft.compute/virtualmachines"
    | project id = tostring(properties.virtualMachineScaleSet.id), vmproperties = properties
) on id
| extend recommendationId = "vmss-9", param1 = "patchMode: Manual", vmproperties.osProfile.linuxConfiguration.patchSettings.patchMode
| where isnotnull(vmproperties.osProfile.linuxConfiguration) and vmproperties.osProfile.linuxConfiguration.patchSettings.patchMode !in ("AutomaticByPlatform", "AutomaticByOS")
| distinct recommendationId, name, id, param1
| union (resources
| where type == "microsoft.compute/virtualmachinescalesets"
| join kind=inner (
    resources
    | where type == "microsoft.compute/virtualmachines"
    | project id = tostring(properties.virtualMachineScaleSet.id), vmproperties = properties
) on id
| extend recommendationId = "vmss-9", param1 = "patchMode: Manual", vmproperties.osProfile.windowsConfiguration.patchSettings.patchMode
| where isnotnull(vmproperties.osProfile.windowsConfiguration) and vmproperties.osProfile.windowsConfiguration.patchSettings.patchMode !in ("AutomaticByPlatform", "AutomaticByOS")
| distinct recommendationId, name, id, param1)

Availability zone support

Azure availability zones are at least three physically separate groups of datacenters within each Azure region. Datacenters within each zone are equipped with independent power, cooling, and networking infrastructure. In the case of a local zone failure, availability zones are designed so that if the one zone is affected, regional services, capacity, and high availability are supported by the remaining two zones.

Failures can range from software and hardware failures to events such as earthquakes, floods, and fires. Tolerance to failures is achieved with redundancy and logical isolation of Azure services. For more detailed information on availability zones in Azure, see Regions and availability zones.

Azure availability zones-enabled services are designed to provide the right level of reliability and flexibility. They can be configured in two ways. They can be either zone redundant, with automatic replication across zones, or zonal, with instances pinned to a specific zone. You can also combine these approaches. For more information on zonal vs. zone-redundant architecture, see Recommendations for using availability zones and regions.

With Azure Virtual Machine Scale Sets, you can create and manage a group of load balanced VMs. The number of VMs can automatically increase or decrease in response to demand or a defined schedule. Scale sets provide high availability to your applications, and allow you to centrally manage, configure, and update many VMs. There's no cost for the scale set itself. You only pay for each VM instance that you create.

Virtual Machine Scale Sets supports both zonal and zone-redundant deployments within a region:

  • Zonal deployment. When you create a scale set in a single zone, you control which zone all the VMs of that set run in. The scale set is managed and autoscales only within that zone.

  • Zone-redundant deployment. A zone-redundant scale set lets you create a single scale set that spans multiple zones. By default, as VMs are created, they're evenly balanced across zones.

Prerequisites

  1. To use availability zones, your scale set must be created in a supported Azure region.

  2. All VMs - even single instance VMs - should be deployed into a scale set using flexible orchestration mode to future-proof your application for scaling and availability.

SLA

Because availability zones are physically separate and provide distinct power sources, network, and cooling - service-level agreements (SLAs) are increased. For more information, see the SLA for Microsoft Online Services.

Create a Virtual Machine Scale Set with availability zones enabled

You can create a scale set that uses availability zones with one of the following methods:

The process to create a scale set that uses a zonal deployment is the same as detailed in the getting started article. When you select a supported Azure region, you can create a scale set in one or more available zones, as shown in the following example:

Create a scale set in a single availability zone

The scale set and supporting resources, such as the Azure load balancer and public IP address, are created in the single zone that you specify.

Zonal failover support

Virtual Machine Scale Sets are created with five fault domains by default in Azure regions with no zones. For the regions that support availability zone deployment of Virtual Machine Scale Sets and this option is selected, the default value of the fault domain count is 1 for each of the zones. In this case, FD=1 implies that the VM instances belonging to the scale set are spread across many racks on a best effort basis. For more information, see Choosing the right number of fault domains for Virtual Machine Scale Set.

Low-latency design

It's recommended that you configure Virtual Machine Scale Sets with zone-redundancy. However, if your application has strict low latency requirements, you may need to implement a zonal for your scale sets VMs. With a zonal scale sets deployment, it's recommended that you create multiple scale set VMs across more than one zone. For example, you can create one scale sets instance that's pinned to zone 1 and one instance pinned to zone 2 or 3. You also need to use a load balancer or other application logic to direct traffic to the appropriate scale sets during a zone outage.

Important

If you opt out of zone-aware deployment, you forego protection from isolation of underlying faults. Opting out from availability zone configuration forces reliance on resources that don't obey zone placement and separation (including underlying dependencies of these resources). These resources shouldn't be expected to survive zone-down scenarios. Solutions that leverage such resources should define a disaster recovery strategy and configure a recovery of the solution in another region.

Safe deployment techniques

To have more control over where you deploy your VMs, you should deploy zonal, instead of regional, scale set VMs. However, zonal VMs only provide zone isolation and not zone redundancy. To achieve full zone-redundancy with zonal VMs, there should be two or more VMs across different zones.

It's also recommended that you use the max spreading deployment option for your zone-redundant VMs. For more information, see the spreading options.

Spreading options

When you deploy a scale set into one or more availability zones, you have the following spreading options (as of API version 2017-12-01):

  • Max spreading (platformFaultDomainCount = 1). Max spreading is the recommended deployment option, as it provides the best spreading in most cases. If you to spread replicas across distinct hardware isolation units, it's recommended that you spread across availability zones and utilize max spreading within each zone.

    With max spreading, the scale set spreads your VMs across as many fault domains as possible within each zone. This spreading could be across greater or fewer than five fault domains per zone.

    Note

    With max spreading, regardless of how many fault domains the VMs are spread across, you can only see one fault domain in both the scale set VM instance view and the instance metadata. The spreading within each zone is implicit.

  • Static fixed spreading (platformFaultDomainCount = 5). With static fixed spreading, the scale set spreads your VMs exactly across five fault domains per zone. If the scale set can't find five distinct fault domains per zone to satisfy the allocation request, the request fails.

  • Spreading aligned with managed disks fault domains (platformFaultDomainCount = 2 or 3) You can consider aligning the number of scale set fault domains with the number of managed disks fault domains. This alignment can help prevent loss of quorum if an entire managed disks fault domain goes down. The fault domain count can be set to less than or equal to the number of managed disks fault domains available in each of the regions. To learn about the number of Managed Disks fault domains by region, see [insert doc here](link here).

Zone balancing

For scale sets deployed across multiple zones (zone-redundant), you can choose either best effort zone balance or strict zone balance. A scale set is considered "balanced" if each zone has the same number of VMs (plus or minus one VM) as all other zones in the scale set. For example:

Scale Set VMs in Zone 1 VMs in Zone 2 VMs in Zone 3 Zone Balancing
Balanced scale set 2 3 3 This scale set is considered balanced. There's only one zone with a different VM count and it's only 1 less than the other zones.
Unbalanced scale set 1 3 3 This scale set is considered unbalanced. Zone 1 has 2 fewer VMs than zones 2 and 3.

It's possible that VMs in the scale set are successfully created, but extensions on those VMs fail to deploy. The VMs with extension failures are still counted when determining if a scale set is balanced. For instance, a scale set with 3 VMs in zone 1, 3 VMs in zone 2, and 3 VMs in zone 3 is considered balanced even if all extensions failed in zone 1 and all extensions succeeded in zones 2 and 3.

With best-effort zone balance, the scale set attempts to scale in and out while maintaining balance. However, if for some reason the balancing isn't possible (for example, if one zone goes down, the scale set can't create a new VM in that zone), the scale set allows temporary imbalance to successfully scale in or out. On subsequent scale-out attempts, the scale set adds VMs to zones that need more VMs for the scale set to be balanced. Similarly, on subsequent scale in attempts, the scale set removes VMs from zones that need fewer VMs for the scale set to be balanced. With "strict zone balance", the scale set fails any attempts to scale in or out if doing so would cause unbalance.

To use best-effort zone balance, set zoneBalance to false. The zoneBalance setting is the default in API version 2017-12-01. To use strict zone balance, set zoneBalance to true.

Migrate to availability zone support

To learn how to redeploy a regional scale set to availability zone support, see Migrate Virtual Machines and Virtual Machine Scale Sets to availability zone support.

Additional guidance

Placement groups

Important

Placement groups only apply to Virtual Machine Scale Sets running in Uniform orchestration mode.

When you deploy a Virtual Machine Scale Set, you have the option to deploy with a single or multiple placement groups per availability zone. For regional scale sets, the choice is to have a single placement group in the region or to have multiple placement groups in the region. If the scale set property singlePlacementGroup is set to false, the scale set can be composed of multiple placement groups and has a range of 0-1000 VMs. When set to the default value of true, the scale set is composed of a single placement group and has a range of 0-100 VMs. For most workloads, we recommend multiple placement groups, which allows for greater scale. In API version 2017-12-01, scale sets default to multiple placement groups for single-zone and cross-zone scale sets, but they default to single placement group for regional scale sets.

Next steps