Reliability in Virtual Machines

This article contains specific reliability recommendations for Virtual Machines, as well as detailed information on VM regional resiliency with availability zones and cross-region disaster recovery and business continuity.

For an architectural overview of reliability in Azure, see Azure reliability.

Reliability recommendations

This section contains recommendations for achieving resiliency and availability. Each recommendation falls into one of two categories:

  • Health items cover areas such as configuration items and the proper function of the major components that make up your Azure Workload, such as Azure Resource configuration settings, dependencies on other services, and so on.

  • Risk items cover areas such as availability and recovery requirements, testing, monitoring, deployment, and other items that, if left unresolved, increase the chances of problems in the environment.

Reliability recommendations priority matrix

Each recommendation is marked in accordance with the following priority matrix:

Image Priority Description
High Immediate fix needed.
Medium Fix within 3-6 months.
Low Needs to be reviewed.

Reliability recommendations summary

Category Priority Recommendation
High Availability Run production workloads on two or more VMs using Azure Virtual Machine Scale Sets Flex
Deploy VMs across availability zones or use Virtual Machine Scale Sets Flex with zones
Migrate VMs using availability sets to Virtual Machine Scale Sets Flex
Use managed disks for VM disks
Disaster Recovery Replicate VMs using Azure Site Recovery
Back up data on your VMs with Azure Backup service
Performance Host application and database data on a data disk
Production VMs should be using SSD disks
Enable Accelerated Networking (AccelNet)
When AccelNet is enabled, you must manually update the GuestOS NIC drive
Management VM-9: Watch for VMs in Stopped state
Use maintenance configurations for the VM
Security VVMs shouldn't have a Public IP directly associated
Virtual Network Interfaces have an NSG associated
IP Forwarding should only be enabled for Network Virtual Appliances
Network access to the VM disk should be set to "Disable public access and enable private access"
Enable disk encryption and data at rest encryption by default
Networking Customer DNS Servers should be configured in the Virtual Network level
Storage Shared disks should only be enabled in clustered servers
Compliance Ensure that your VMs are compliant with Azure Policies
Monitoring Enable VM Insights
Configure diagnostic settings for all Azure resources

High availability

Run production workloads on two or more VMs using Virtual Machine Scale Sets Flex

To safeguard application workloads from downtime due to the temporary unavailability of a disk or VM, it's recommended that you run production workloads on two or more VMs using Virtual Machine Scale Sets Flex.

To run production workloads, you can use:

  • Azure Virtual Machine Scale Sets to create and manage a group of load balanced VMs. The number of VM instances can automatically increase or decrease in response to demand or a defined schedule.

  • Availability zones. For more information on availability zones and VMs, see Availability zone support.

// Azure Resource Graph Query
// Find all VMs that are not associated with a VMSS Flex instance
resources
| where type =~ 'Microsoft.Compute/virtualMachines'
| where isnull(properties.virtualMachineScaleSet.id)
| project recommendationId="vm-1", name, id, tags

Deploy VMs across availability zones or use Virtual Machine Scale Sets Flex with zones*

When you create your VMs, use availability zones to protect your applications and data against unlikely datacenter failure. For more information about availability zones for VMs, see Availability zone support in this document.

For information on how to enable availability zones support when you create your VM, see create availability zone support.

For information on how to migrate your existing VMs to availability zone support, see migrate to availability zone support.

// Azure Resource Graph Query
// Find all VMs that are not assigned to a Zone
Resources
| where type =~ 'Microsoft.Compute/virtualMachines'
| where isnull(zones)
| project recommendationId="vm-2", name, id, tags, param1="No Zone"

Migrate VMs using availability sets to Virtual Machine Scale Sets Flex

Modernize your workloads by migrating them from VMs to Virtual Machine Scale Sets Flex.

With Virtual Machine Scale Sets Flex, you can deploy your VMs in one of two ways:

  • Across zones
  • In the same zone, but across fault domains (FDs) and update domains (UD) automatically.

In an N-tier application, it's recommended that you place each application tier into its own Virtual Machine Scale Sets Flex.

// Azure Resource Graph Query
// Find all VMs using Availability Sets
resources
| where type =~ 'Microsoft.Compute/virtualMachines'
| where isnotnull(properties.availabilitySet)
| project recommendationId = "vm-3", name, id, tags, param1=strcat("availabilitySet: ",properties.availabilitySet.id)

Use managed disks for VM disks*

To provide better reliability for VMs in an availability set, use managed disks. Managed disks are sufficiently isolated from each other to avoid single points of failure. Also, managed disks aren’t subject to the IOPS limits of VHDs created in a storage account.

// Azure Resource Graph Query
// Find all VMs that are not using Managed Disks
Resources
| where type =~ 'Microsoft.Compute/virtualMachines'
| where isnull(properties.storageProfile.osDisk.managedDisk)
| project recommendationId = "vm-5", name, id, tags

Disaster recovery

Replicate VMs using Azure Site Recovery

When you replicate Azure VMs using Site Recovery, all VM disks are continuously replicated to the target region asynchronously. The recovery points are created every few minutes, which gives you a Recovery Point Objective (RPO) in the order of minutes. You can conduct disaster recovery drills as many times as you want, without affecting the production application or the ongoing replication.

To learn how to run a disaster recovery drill, see Run a test failover.

// Azure Resource Graph Query
// Find all VMs that do NOT have replication with ASR enabled
// Run query to see results.
resources
| where type =~ 'Microsoft.Compute/virtualMachines'
| project name, id, tags
| join kind=leftouter (
    recoveryservicesresources
    | where type =~ 'Microsoft.RecoveryServices/vaults/replicationFabrics/replicationProtectionContainers/replicationProtectedItems'
    | where properties.providerSpecificDetails.dataSourceInfo.datasourceType =~ 'AzureVm'
    | project id=properties.providerSpecificDetails.dataSourceInfo.resourceId
    | extend name=strcat_array(array_slice(split(id, '/'), 8, -1), '/')
) on name
| where isnull(id1)
| project-away id1
| project-away name1
| project recommendationId = "vm-4", name, id, tags
| order by id asc

Back up data on your VMs with Azure Backup service

The Azure Backup service provides simple, secure, and cost-effective solutions to back up your data and recover it from the Microsoft Azure cloud. For more information, see What is the Azure Backup Service.

// Azure Resource Graph Query
// Find all VMs that do NOT have Backup enabled
// Run query to see results.
resources
| where type =~ 'Microsoft.Compute/virtualMachines'
| project name, id, tags
| join kind=leftouter (
    recoveryservicesresources
    | where type =~ 'Microsoft.RecoveryServices/vaults/backupFabrics/protectionContainers/protectedItems'
    | where properties.dataSourceInfo.datasourceType =~ 'Microsoft.Compute/virtualMachines'
    | project idBackupEnabled=properties.sourceResourceId
    | extend name=strcat_array(array_slice(split(idBackupEnabled, '/'), 8, -1), '/')
) on name
| where isnull(idBackupEnabled)
| project-away idBackupEnabled
| project-away name1
| project recommendationId = "vm-7", name, id, tags
| order by id asc

Performance

Host application and database data on a data disk

A data disk is a managed disk that’s attached to a VM. Use the data disk to store application data, or other data you need to keep. Data disks are registered as SCSI drives and are labeled with a letter that you choose. Hosting your data on a data disk makes it easy to back up or restore your data. You can also migrate the disk without having to move the entire VM and Operating System. Also, you can select a different disk SKU, with different type, size, and performance that meet your requirements. For more information on data disks, see Data Disks.

// Azure Resource Graph Query
// Find all VMs that only have OS Disk
Resources
| where type =~ 'Microsoft.Compute/virtualMachines'
| where array_length(properties.storageProfile.dataDisks) < 1
| project recommendationId = "vm-6", name, id, tags

Production VMs should be using SSD disks

Premium SSD disks offer high-performance, low-latency disk support for I/O-intensive applications and production workloads. Standard SSD Disks are a cost-effective storage option optimized for workloads that need consistent performance at lower IOPS levels.

It's recommended that you:

  • Use Standard HDD disks for Dev/Test scenarios and less critical workloads at lowest cost.
  • Use Premium SSD disks instead of Standard HDD disks with your premium-capable VMs. For any Single Instance VM using premium storage for all Operating System Disks and Data Disks, Azure guarantees VM connectivity of at least 99.9%.

If you want to upgrade from Standard HDD to Premium SSD disks, consider the following issues:

  • Upgrading requires a VM reboot and this process takes 3-5 minutes to complete.
  • If VMs are mission-critical production VMs, evaluate the improved availability against the cost of premium disks.

For more information on Azure managed disks and disks types, see Azure managed disk types.

// Azure Resource Graph Query
// Find all disks with StandardHDD sku attached to VMs
Resources
| where type =~ 'Microsoft.Compute/disks'
| where sku.name == 'Standard_LRS' and sku.tier == 'Standard'
| where managedBy != ""
| project recommendationId = "vm-8", name, id, tags, param1=strcat("managedBy: ", managedBy)

Enable Accelerated Networking (AccelNet)

AccelNet enables single root I/O virtualization (SR-IOV) to a VM, greatly improving its networking performance. This high-performance path bypasses the host from the data path, which reduces latency, jitter, and CPU utilization for the most demanding network workloads on supported VM types.

For more information on Accelerated Networking, see Accelerated Networking

// Azure Resource Graph Query
// Find all VM NICs that do not have Accelerated Networking enabled
resources
| where type =~ 'Microsoft.Compute/virtualMachines'
| mv-expand nic = properties.networkProfile.networkInterfaces
| project name, id, tags, lowerCaseNicId = tolower(nic.id), vmSize = tostring(properties.hardwareProfile.vmSize)
| join kind = inner (
    resources
    | where type =~ 'Microsoft.Network/networkInterfaces'
    | where properties.enableAcceleratedNetworking == false
    | project nicName = split(id, "/")[8], lowerCaseNicId = tolower(id)
    )
    on lowerCaseNicId
| summarize nicNames = make_set(nicName) by name, id, tostring(tags), vmSize
| extend param1 = strcat("NicName: ", strcat_array(nicNames, ", ")), param2 = strcat("VMSize: ", vmSize)
| project recommendationId = "vm-10", name, id, tags, param1, param2
| order by id asc

When AccelNet is enabled, you must manually update the GuestOS NIC driver

When AccelNet is enabled, the default Azure Virtual Network interface in the GuestOS is replaced for a Mellanox interface. As a result, the GuestOS NIC driver is provided from Mellanox, a third party vendor. Although Marketplace images maintained by Microsoft are offered with the latest version of Mellanox drivers, once the VM is deployed, you need to manually update GuestOS NIC driver every six months.

// cannot-be-validated-with-arg

Management

Review VMs in stopped state

VM instances go through different states, including provisioning and power states. If a VM is in a stopped state, the VM may be facing an issue or is no longer necessary and could be removed to help reduce costs.

// Azure Resource Graph Query
// Find all VMs that are NOT running
Resources
| where type =~ 'Microsoft.Compute/virtualMachines'
| where properties.extended.instanceView.powerState.displayStatus != 'VM running'
| project recommendationId = "vm-9", name, id, tags

Use maintenance configurations for the VM

To ensure that VM updates/interruptions are done in a planned time frame, use maintenance configuration settings to schedule and manage updates. For more information on managing VM updates with maintenance configurations, see Managing VM updates with Maintenance Configurations.

// Azure Resource Graph Query
// Find VMS that do not have maintenance configuration assigned
Resources
| extend resourceId = tolower(id)
| project name, location, type, id, tags, resourceId, properties
| where type =~ 'Microsoft.Compute/virtualMachines'
| join kind=leftouter (
maintenanceresources
| where type =~ "microsoft.maintenance/configurationassignments"
| project planName = name, type, maintenanceProps = properties
| extend resourceId = tostring(maintenanceProps.resourceId)
) on resourceId
| where isnull(maintenanceProps)
| project recommendationId = "vm-22",name, id, tags
| order by id asc

Security

VMs shouldn't have a Public IP directly associated

If a VM requires outbound internet connectivity, it's recommended that you use NAT Gateway or Azure Firewall. NAT Gateway or Azure Firewall help to increase security and resiliency of the service, since both services have higher availability and Source Network Address Translation (SNAT) ports. For inbound internet connectivity, it's recommended that you use a load balancing solution such as Azure Load Balancer and Application Gateway.

// Azure Resource Graph Query
// Find all VMs with PublicIPs directly associated with them
Resources
| where type =~ 'Microsoft.Compute/virtualMachines'
| where isnotnull(properties.networkProfile.networkInterfaces)
| mv-expand nic=properties.networkProfile.networkInterfaces
| project name, id, tags, nicId = nic.id
| extend nicId = tostring(nicId)
| join kind=inner (
    Resources
    | where type =~ 'Microsoft.Network/networkInterfaces'
    | where isnotnull(properties.ipConfigurations)
    | mv-expand ipconfig=properties.ipConfigurations
    | extend publicIp = tostring(ipconfig.properties.publicIPAddress.id)
    | where publicIp != ""
    | project name, nicId = tostring(id), publicIp
) on nicId
| project recommendationId = "vm-12", name, id, tags
| order by id asc

VM network interfaces have a Network Security Group (NSG) associated*

It's recommended that you associate an NSG to a subnet, or a network interface, but not both. Since rules in an NSG associated to a subnet can conflict with rules in an NSG associated to a network interface, you can have unexpected communication problems that require troubleshooting. For more information, see Intra-Subnet traffic.

// Azure Resource Graph Query
// Provides a list of virtual machines and associated NICs that do have an NSG associated to them and also an NSG associated to the subnet.
Resources
| where type =~ 'Microsoft.Network/networkInterfaces'
| where isnotnull(properties.networkSecurityGroup)
| mv-expand ipConfigurations = properties.ipConfigurations, nsg = properties.networkSecurityGroup
| project nicId = tostring(id), subnetId = tostring(ipConfigurations.properties.subnet.id), nsgName=split(nsg.id, '/')[8]
| parse kind=regex subnetId with '/virtualNetworks/' virtualNetwork '/subnets/' subnet
    | join kind=inner (
        Resources
        | where type =~ 'Microsoft.Network/NetworkSecurityGroups' and isnotnull(properties.subnets)
        | project name, resourceGroup, subnet=properties.subnets
        | mv-expand subnet
        | project subnetId=tostring(subnet.id)
    ) on subnetId
    | project nicId
| join kind=leftouter (
    Resources
    | where type =~ 'Microsoft.Compute/virtualMachines'
    | where isnotnull(properties.networkProfile.networkInterfaces)
    | mv-expand nic=properties.networkProfile.networkInterfaces
    | project vmName = name, vmId = id, tags, nicId = nic.id, nicName=split(nic.id, '/')[8]
    | extend nicId = tostring(nicId)
) on nicId
| project recommendationId = "vm-13", name=vmName, id = vmId, tags, param1 = strcat("nic-name=", nicName)

IP forwarding should only be enabled for network virtual appliances

IP forwarding enables the virtual machine network interface to:

  • Receive network traffic not destined for one of the IP addresses assigned to any of the IP configurations assigned to the network interface.

  • Send network traffic with a different source IP address than the one assigned to one of a network interface’s IP configurations.

The IP forwarding setting must be enabled for every network interface that's attached to the VM receiving traffic to be forwarded. A VM can forward traffic whether it has multiple network interfaces, or a single network interface attached to it. While IP forwarding is an Azure setting, the VM must also run an application that's able to forward the traffic, such as firewall, WAN optimization, and load balancing applications.

To learn how to enable or disable IP forwarding, see Enable or disable IP forwarding.

// Azure Resource Graph Query
// Find all VM NICs that have IPForwarding enabled. This feature is usually only required for Network Virtual Appliances
Resources
| where type =~ 'Microsoft.Compute/virtualMachines'
| where isnotnull(properties.networkProfile.networkInterfaces)
| mv-expand nic=properties.networkProfile.networkInterfaces
| project name, id, tags, nicId = nic.id
| extend nicId = tostring(nicId)
| join kind=inner (
    Resources
    | where type =~ 'Microsoft.Network/networkInterfaces'
    | where properties.enableIPForwarding == true
    | project nicId = tostring(id)
) on nicId
| project recommendationId = "vm-14", name, id, tags
| order by id asc

Network access to the VM disk should be set to "Disable public access and enable private access"

It's recommended that you set VM disk network access to “Disable public access and enable private access” and create a private endpoint. To learn how to create a private endpoint, see Create a private endpoint.

// Azure Resource Graph Query
// Find all Disks with "Enable public access from all networks" enabled
resources
| where type =~ 'Microsoft.Compute/disks'
| where properties.publicNetworkAccess == "Enabled"
| project id, name, tags, lowerCaseDiskId = tolower(id)
| join kind = leftouter (
    resources
    | where type =~ 'Microsoft.Compute/virtualMachines'
    | project osDiskVmName = name, lowerCaseOsDiskId = tolower(properties.storageProfile.osDisk.managedDisk.id)
    | join kind = fullouter (
        resources
        | where type =~ 'Microsoft.Compute/virtualMachines'
        | mv-expand dataDisks = properties.storageProfile.dataDisks
        | project dataDiskVmName = name, lowerCaseDataDiskId = tolower(dataDisks.managedDisk.id)
        )
        on $left.lowerCaseOsDiskId == $right.lowerCaseDataDiskId
    | project lowerCaseDiskId = coalesce(lowerCaseOsDiskId, lowerCaseDataDiskId), vmName = coalesce(osDiskVmName, dataDiskVmName)
    )
    on lowerCaseDiskId
| summarize vmNames = make_set(vmName) by name, id, tostring(tags)
| extend param1 = iif(isempty(vmNames[0]), "VMName: n/a", strcat("VMName: ", strcat_array(vmNames, ", ")))
| project recommendationId = "vm-17", name, id, tags, param1
| order by id asc

Enable disk encryption and data at rest encryption by default

There are several types of encryption available for your managed disks, including Azure Disk Encryption (ADE), Server-Side Encryption (SSE) and encryption at host.

  • Azure Disk Encryption helps protect and safeguard your data to meet your organizational security and compliance commitments.
  • Azure Disk Storage Server-Side Encryption (also referred to as encryption-at-rest or Azure Storage encryption) automatically encrypts data stored on Azure managed disks (OS and data disks) when persisting on the Storage Clusters.
  • Encryption at host ensures that data stored on the VM host hosting your VM is encrypted at rest and flows encrypted to the Storage clusters.
  • Confidential disk encryption binds disk encryption keys to the VM’s TPM and makes the protected disk content accessible only to the VM.

For more information about managed disk encryption options, see Overview of managed disk encryption options.

// under-development

Networking

DNS Servers should be configured in the Virtual Network level

Configure the DNS Server in the Virtual Network to avoid name resolution inconsistency across the environment. For more information on name resolution for resources in Azure virtual networks, see Name resolution for VMs and cloud services.

// Azure Resource Graph Query
// Find all VM NICs that have DNS Server settings configured in any of the NICs
Resources
| where type =~ 'Microsoft.Compute/virtualMachines'
| where isnotnull(properties.networkProfile.networkInterfaces)
| mv-expand nic=properties.networkProfile.networkInterfaces
| project name, id, tags, nicId = nic.id
| extend nicId = tostring(nicId)
| join kind=inner (
    Resources
    | where type =~ 'Microsoft.Network/networkInterfaces'
    | project name, id, dnsServers = properties.dnsSettings.dnsServers
    | extend hasDns = array_length(dnsServers) >= 1
    | where hasDns != 0
    | project name, nicId = tostring(id)
) on nicId
| project recommendationId = "vm-15", name, id, tags
| order by id asc

Storage

Shared disks should only be enabled in clustered servers

Azure shared disks is a feature of Azure managed disks that enables you to attach a managed disk to multiple VMs simultaneously. When you attach a managed disk to multiple VMs, you can either deploy new or migrate existing clustered applications to Azure. Shared disks should only be used in those situations where the disk is assigned to more than one VM member of a cluster.

To learn more about how to enable shared disks for managed disks, see Enable shared disk.

// Azure Resource Graph Query
// Find all Disks configured to be Shared. This is not an indication of an issue, but if a disk with this configuration is assigned to two or more VMs without a proper disk control mechanism (like a WSFC) it can lead to data loss
resources
| where type =~ 'Microsoft.Compute/disks'
| where isnotnull(properties.maxShares)
| project id, name, tags, lowerCaseDiskId = tolower(id), diskState = tostring(properties.diskState)
| join kind = leftouter (
    resources
    | where type =~ 'Microsoft.Compute/virtualMachines'
    | project osDiskVmName = name, lowerCaseOsDiskId = tolower(properties.storageProfile.osDisk.managedDisk.id)
    | join kind = fullouter (
        resources
        | where type =~ 'Microsoft.Compute/virtualMachines'
        | mv-expand dataDisks = properties.storageProfile.dataDisks
        | project dataDiskVmName = name, lowerCaseDataDiskId = tolower(dataDisks.managedDisk.id)
        )
        on $left.lowerCaseOsDiskId == $right.lowerCaseDataDiskId
    | project lowerCaseDiskId = coalesce(lowerCaseOsDiskId, lowerCaseDataDiskId), vmName = coalesce(osDiskVmName, dataDiskVmName)
    )
    on lowerCaseDiskId
| summarize vmNames = make_set(vmName) by name, id, tostring(tags), diskState
| extend param1 = strcat("DiskState: ", diskState), param2 = iif(isempty(vmNames[0]), "VMName: n/a", strcat("VMName: ", strcat_array(vmNames, ", ")))
| project recommendationId = "vm-16", name, id, tags, param1, param2
| order by id asc

Compliance

Ensure that your VMs are compliant with Azure Policies

It’s important to keep your virtual machine (VM) secure for the applications that you run. Securing your VMs can include one or more Azure services and features that cover secure access to your VMs and secure storage of your data. For more information on how to keep your VM and applications secure, see Azure Policy Regulatory Compliance controls for Azure Virtual Machines.

// Azure Resource Graph Query
// Find all VMs in "NonCompliant" state with Azure Policies
PolicyResources
| where type =~ "Microsoft.PolicyInsights/policyStates" and properties.resourceType =~ "Microsoft.Compute/virtualMachines" and properties.complianceState =~ "NonCompliant"
| project
    policyAssignmentName = properties.policyAssignmentName,
    policyDefinitionName = properties.policyDefinitionName,
    lowerCasePolicyDefinitionIdOfPolicyState = tolower(properties.policyDefinitionId),
    lowerCaseVmIdOfPolicyState = tolower(properties.resourceId)
| join kind = leftouter (
    PolicyResources
    | where type =~ "Microsoft.Authorization/policyDefinitions"
    | project lowerCasePolicyDefinitionId = tolower(id), policyDefinitionDisplayName = properties.displayName
    )
    on $left.lowerCasePolicyDefinitionIdOfPolicyState == $right.lowerCasePolicyDefinitionId
| project policyAssignmentName, policyDefinitionName, policyDefinitionDisplayName, lowerCaseVmIdOfPolicyState
| join kind = leftouter (
    Resources
    | where type =~ "Microsoft.Compute/virtualMachines"
    | project vmName = name, vmId = id, vmTags = tags, lowerCaseVmId = tolower(id)
    )
    on $left.lowerCaseVmIdOfPolicyState == $right.lowerCaseVmId
| extend
    param1 = strcat("AssignmentName: ", policyAssignmentName),
    param2 = strcat("DefinitionName: ", policyDefinitionDisplayName),  // Align to Azure portal's term.
    param3 = strcat("DefinitionID: ", policyDefinitionName)            // Align to Azure portal's term.
| project recommendationId = "vm-18", name = vmName, id = vmId, tags = vmTags, param1, param2, param3

Monitoring

Enable VM Insights

Enable VM Insights to get more visibility into the health and performance of your virtual machine. VM Insights gives you information on the performance and health of your VMs and virtual machine scale sets, by monitoring their running processes and dependencies on other resources. VM Insights can help deliver predictable performance and availability of vital applications by identifying performance bottlenecks and network issues. Insights can also help you understand whether an issue is related to other dependencies.

// Azure Resource Graph Query
// Check for VMs without Azure Monitoring Agent extension installed, missing Data Collection Rule or Data Collection Rule without performance enabled.
Resources
| where type == 'microsoft.compute/virtualmachines'
| project idVm = tolower(id), name, tags
| join kind=leftouter (
    InsightsResources
    | where type =~ "Microsoft.Insights/dataCollectionRuleAssociations" and id has "Microsoft.Compute/virtualMachines"
    | project idDcr = tolower(properties.dataCollectionRuleId), idVmDcr = tolower(substring(id, 0, indexof(id, "/providers/Microsoft.Insights/dataCollectionRuleAssociations/"))))
on $left.idVm == $right.idVmDcr
| join kind=leftouter (
    Resources
    | where type =~ "Microsoft.Insights/dataCollectionRules"
    | extend
        isPerformanceEnabled = iif(properties.dataSources.performanceCounters contains "Microsoft-InsightsMetrics" and properties.dataFlows contains "Microsoft-InsightsMetrics", true, false),
        isMapEnabled = iif(properties.dataSources.extensions contains "Microsoft-ServiceMap" and properties.dataSources.extensions contains "DependencyAgent" and properties.dataFlows contains "Microsoft-ServiceMap", true, false)//,
    | where isPerformanceEnabled or isMapEnabled
    | project dcrName = name, isPerformanceEnabled, isMapEnabled, idDcr = tolower(id))
on $left.idDcr == $right.idDcr
| join kind=leftouter (
    Resources
        | where type == 'microsoft.compute/virtualmachines/extensions' and (name contains 'AzureMonitorWindowsAgent' or name contains 'AzureMonitorLinuxAgent')
        | extend idVmExtension = tolower(substring(id, 0, indexof(id, '/extensions'))), extensionName = name)
on $left.idVm == $right.idVmExtension
| where isPerformanceEnabled != 1 or (extensionName != 'AzureMonitorWindowsAgent' and extensionName != 'AzureMonitorLinuxAgent')
| project recommendationId = "vm-20", name, id = idVm, tags, param1 = strcat('MonitoringExtension:', extensionName), param2 = strcat('DataCollectionRuleId:', idDcr), param3 = strcat('isPerformanceEnabled:', isPerformanceEnabled)

Configure diagnostic settings for all Azure resources

Platform metrics are sent automatically to Azure Monitor Metrics by default and without configuration. Platform logs provide detailed diagnostic and auditing information for Azure resources and the Azure platform they depend on and are one of the following types:

  • Resource logs that aren’t collected until they’re routed to a destination.
  • Activity logs that exist on their own but can be routed to other locations.

Each Azure resource requires its own diagnostic setting, which defines the following criteria:

  • Sources The type of metric and log data to send to the destinations defined in the setting. The available types vary by resource type.
  • Destinations: One or more destinations to send to.

A single diagnostic setting can define no more than one of each of the destinations. If you want to send data to more than one of a particular destination type (for example, two different Log Analytics workspaces), create multiple settings. Each resource can have up to five diagnostic settings.

Fore information, see Diagnostic settings in Azure Monitor.

// Azure Resource Graph Query
// Find all Virtual Machines without diagnostic settings enabled/with diagnostic settings enabled but not configured both performance counters and event logs/syslogs.
resources
| where type =~ "microsoft.compute/virtualmachines"
| project name, id, tags, lowerCaseVmId = tolower(id)
| join kind = leftouter (
    resources
    | where type =~ "Microsoft.Compute/virtualMachines/extensions" and properties.publisher =~ "Microsoft.Azure.Diagnostics"
    | project
        lowerCaseVmIdOfExtension = tolower(substring(id, 0, indexof(id, "/extensions/"))),
        extensionType = properties.type,
        provisioningState = properties.provisioningState,
        storageAccount = properties.settings.StorageAccount,
        // Windows
        wadPerfCounters = properties.settings.WadCfg.DiagnosticMonitorConfiguration.PerformanceCounters.PerformanceCounterConfiguration,
        wadEventLogs = properties.settings.WadCfg.DiagnosticMonitorConfiguration.WindowsEventLog,
        // Linux
        ladPerfCounters = properties.settings.ladCfg.diagnosticMonitorConfiguration.performanceCounters.performanceCounterConfiguration,
        ladSyslog = properties.settings.ladCfg.diagnosticMonitorConfiguration.syslogEvents
    | extend
        // Windows
        isWadPerfCountersConfigured = iif(array_length(wadPerfCounters) > 0, true, false),
        isWadEventLogsConfigured = iif(isnotnull(wadEventLogs) and array_length(wadEventLogs.DataSource) > 0, true, false),
        // Linux
        isLadPerfCountersConfigured = iif(array_length(ladPerfCounters) > 0, true, false),
        isLadSyslogConfigured = isnotnull(ladSyslog)
    | project
        lowerCaseVmIdOfExtension,
        extensionType,
        provisioningState,
        storageAccount,
        isPerfCountersConfigured = case(extensionType =~ "IaaSDiagnostics", isWadPerfCountersConfigured, extensionType =~ "LinuxDiagnostic", isLadPerfCountersConfigured, false),
        isEventLogsConfigured = case(extensionType =~ "IaaSDiagnostics", isWadEventLogsConfigured, extensionType =~ "LinuxDiagnostic", isLadSyslogConfigured, false)
    )
    on $left.lowerCaseVmId == $right.lowerCaseVmIdOfExtension
| where isempty(lowerCaseVmIdOfExtension) or provisioningState !~ "Succeeded" or not(isPerfCountersConfigured and isEventLogsConfigured)
| extend
    param1 = strcat("DiagnosticSetting: ", iif(isnotnull(extensionType), strcat("Enabled, partially configured (", extensionType, ")"), "Not enabled")),
    param2 = strcat("ProvisioningState: ", iif(isnotnull(provisioningState), provisioningState, "n/a")),
    param3 = strcat("storageAccount: ", iif(isnotnull(storageAccount), storageAccount, "n/a")),
    param4 = strcat("PerformanceCounters: ", case(isnull(isPerfCountersConfigured), "n/a", isPerfCountersConfigured, "Configured", "Not configured")),
    param5 = strcat("EventLogs/Syslogs: ", case(isnull(isEventLogsConfigured), "n/a", isEventLogsConfigured, "Configured", "Not configured"))
| project recommendationId = "vm-21", name, id, tags, param1, param2, param3, param4, param5

Availability zone support

Azure availability zones are at least three physically separate groups of datacenters within each Azure region. Datacenters within each zone are equipped with independent power, cooling, and networking infrastructure. In the case of a local zone failure, availability zones are designed so that if the one zone is affected, regional services, capacity, and high availability are supported by the remaining two zones.

Failures can range from software and hardware failures to events such as earthquakes, floods, and fires. Tolerance to failures is achieved with redundancy and logical isolation of Azure services. For more detailed information on availability zones in Azure, see Regions and availability zones.

Azure availability zones-enabled services are designed to provide the right level of reliability and flexibility. They can be configured in two ways. They can be either zone redundant, with automatic replication across zones, or zonal, with instances pinned to a specific zone. You can also combine these approaches. For more information on zonal vs. zone-redundant architecture, see Recommendations for using availability zones and regions.

Virtual machines support availability zones with three availability zones per supported Azure region and are also zone-redundant and zonal. For more information, see availability zones support. The customer is responsible for configuring and migrating their virtual machines for availability.

To learn more about availability zone readiness options, see:

Prerequisites

SLA improvements

Because availability zones are physically separate and provide distinct power source, network, and cooling, SLAs (Service-level agreements) increase. For more information, see the SLA for Virtual Machines.

Create a resource with availability zones enabled

Get started by creating a virtual machine (VM) with availability zone enabled from the following deployment options below:

Zonal failover support

You can set up virtual machines to fail over to another zone using the Site Recovery service. For more information, see Site Recovery.

Fault tolerance

Virtual machines can fail over to another server in a cluster, with the VM's operating system restarting on the new server. You should refer to the failover process for disaster recovery, gathering virtual machines in recovery planning, and running disaster recovery drills to ensure their fault tolerance solution is successful.

For more information, see the site recovery processes.

Zone down experience

During a zone-wide outage, you should expect a brief degradation of performance until the virtual machine service self-healing rebalances underlying capacity to adjust to healthy zones. Self-healing isn't dependent on zone restoration; it's expected that the Microsoft-managed service self-healing state compensates for a lost zone, using capacity from other zones.

You should also prepare for the possibility that there's an outage of an entire region. If there's a service disruption for an entire region, the locally redundant copies of your data would temporarily be unavailable. If geo-replication is enabled, three other copies of your Azure Storage blobs and tables are stored in a different region. When there's a complete regional outage or a disaster in which the primary region isn't recoverable, Azure remaps all of the DNS entries to the geo-replicated region.

Zone outage preparation and recovery

The following guidance is provided for Azure virtual machines during a service disruption of the entire region where your Azure virtual machine application is deployed:

Low-latency design

Cross Region (secondary region), Cross Subscription (preview), and Cross Zonal (preview) are available options to consider when designing a low-latency virtual machine solution. For more information on these options, see the supported restore methods.

Important

By opting out of zone-aware deployment, you forego protection from isolation of underlying faults. Use of SKUs that don't support availability zones or opting out from availability zone configuration forces reliance on resources that don't obey zone placement and separation (including underlying dependencies of these resources). These resources shouldn't be expected to survive zone-down scenarios. Solutions that leverage such resources should define a disaster recovery strategy and configure a recovery of the solution in another region.

Safe deployment techniques

When you opt for availability zones isolation, you should utilize safe deployment techniques for application code and application upgrades. In addition to configuring Azure Site Recovery, and implement any one of the following safe deployment techniques for VMs:

As Microsoft periodically performs planned maintenance updates, there may be rare instances when these updates require a reboot of your virtual machine to apply the required updates to the underlying infrastructure. To learn more, see availability considerations during scheduled maintenance.

Before you upgrade your next set of nodes in another zone, you should perform the following tasks:

Migrate to availability zone support

To learn how to migrate a VM to availability zone support, see Migrate Virtual Machines and Virtual Machine Scale Sets to availability zone support.

Cross-region disaster recovery and business continuity

Disaster recovery (DR) is about recovering from high-impact events, such as natural disasters or failed deployments that result in downtime and data loss. Regardless of the cause, the best remedy for a disaster is a well-defined and tested DR plan and an application design that actively supports DR. Before you begin to think about creating your disaster recovery plan, see Recommendations for designing a disaster recovery strategy.

When it comes to DR, Microsoft uses the shared responsibility model. In a shared responsibility model, Microsoft ensures that the baseline infrastructure and platform services are available. At the same time, many Azure services don't automatically replicate data or fall back from a failed region to cross-replicate to another enabled region. For those services, you are responsible for setting up a disaster recovery plan that works for your workload. Most services that run on Azure platform as a service (PaaS) offerings provide features and guidance to support DR and you can use service-specific features to support fast recovery to help develop your DR plan.

You can use Cross Region restore to restore Azure VMs via paired regions. With Cross Region restore, you can restore all the Azure VMs for the selected recovery point if the backup is done in the secondary region. For more information on Cross Region restore, refer to the Cross Region table row entry in our restore options.

Disaster recovery in multi-region geography

In the case of a region-wide service disruption, Microsoft works diligently to restore the virtual machine service. However, you still must rely on other application-specific backup strategies to achieve the highest level of availability. For more information, see the section on Data strategies for disaster recovery.

Outage detection, notification, and management

Hardware or physical infrastructure for the virtual machine may fail unexpectedly. Unexpected failures can include local network failures, local disk failures, or other rack level failures. When detected, the Azure platform automatically migrates (heals) your virtual machine to a healthy physical machine in the same data center. During the healing procedure, virtual machines experience downtime (reboot) and in some cases loss of the temporary drive. The attached OS and data disks are always preserved.

For more detailed information on virtual machine service disruptions, see disaster recovery guidance.

Set up disaster recovery and outage detection

When setting up disaster recovery for virtual machines, understand what Azure Site Recovery provides. Enable disaster recovery for virtual machines with the below methods:

Disaster recovery in single-region geography

With disaster recovery setup, Azure VMs continuously replicate to a different target region. If an outage occurs, you can fail over VMs to the secondary region, and access them from there.

When you replicate Azure VMs using Site Recovery, all the VM disks are continuously replicated to the target region asynchronously. The recovery points are created every few minutes, which grants you a Recovery Point Objective (RPO) in the order of minutes. You can conduct disaster recovery drills as many times as you want, without affecting the production application or the ongoing replication. For more information, see Run a disaster recovery drill to Azure.

For more information, see Azure VMs architectural components and region pairing.

Capacity and proactive disaster recovery resiliency

Microsoft and its customers operate under the Shared Responsibility Model. Shared responsibility means that for customer-enabled DR (customer-responsible services), you must address DR for any service they deploy and control. To ensure that recovery is proactive, you should always pre-deploy secondaries because there's no guarantee of capacity at time of impact for those who haven't preallocated.

For deploying virtual machines, you can use flexible orchestration mode on Virtual Machine Scale Sets. All VM sizes can be used with flexible orchestration mode. Flexible orchestration mode also offers high availability guarantees (up to 1000 VMs) by spreading VMs across fault domains either within a region or within an availability zone.

Next steps