High availability for Azure SQL Managed Instance

Applies to: Azure SQL Managed Instance

This article describes high availability in Azure SQL Managed Instance.

Important

Zone-redundant configuration is in public preview for the General Purpose service tier and generally available for the Business Critical service tier.

Overview

The goal of the high availability architecture in Azure SQL Managed Instance is to minimize the impact on customer workloads of customer-initiated management operations that result in brief downtime, service maintenance operations, and unplanned outages. For more information about the SLAs for specific service tiers, see the SLA for Azure SQL Managed Instance.

High availability protects your workload from failures of the:

  • Availability zone that forms the datacenter (in a multi-zone region)
  • Rack where nodes powering your service are running
  • Node itself
  • Application layer

To minimize the impact of regional or larger outages, you can use one of the techniques covered in our overview of business continuity.

SQL Managed Instance runs on the latest stable version of the SQL Server Database Engine on the Windows operating system with all applicable patches. SQL Managed Instance automatically handles critical servicing tasks, such as patching, backups, and Windows and SQL engine upgrades, as well as unplanned events such as underlying hardware, software, or network failures. When an instance is patched or fails over, the downtime has little impact if you employ retry logic in your app. SQL Managed Instance can quickly recover even in the most critical circumstances, ensuring that your data is always available. Most users do not notice that upgrades are performed continuously.
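The retry logic mentioned above can be sketched as follows. This is a minimal illustration, not an official client pattern: `TransientError`, `run_with_retry`, and the default attempt count and delays are hypothetical names and values chosen for the example, and a real application would catch whichever transient errors its database driver actually raises.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a driver error raised during a failover."""


def run_with_retry(operation, max_attempts=5, base_delay=1.0, max_delay=30.0,
                   sleep=time.sleep):
    """Call operation(); on TransientError, wait and try again.

    The delay doubles after each failed attempt (capped at max_delay), and a
    small random jitter is added so many clients do not retry in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay * 0.1))
```

Wrapping each unit of work (open a connection, run the statement, commit) in `run_with_retry` lets the application ride out a brief failover. The operation must be safe to run again from the start.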

The high availability solution is designed to ensure that committed data is never lost due to failures, that maintenance operations don't affect your workload, and that the instance won't be a single point of failure in your software architecture.

There are two different high availability architectural models based on the service tier:

  • The remote storage model is based on a separation of compute and storage in the General Purpose and Next-gen General Purpose service tiers that relies on the high availability and reliability of the remote storage and the high availability of compute clusters managed by Azure Service Fabric. This high availability model targets budget-oriented business applications that can tolerate some performance degradation during maintenance activities.
  • The local storage model is based on a cluster of database engine processes that rely on a quorum of available database engine nodes in the Business Critical service tier that have local storage. This local storage model targets mission-critical applications that have a high transaction rate and require high IO performance. The high availability architecture guarantees minimal performance impact on your workload during maintenance activities.

Locally redundant availability

Locally redundant availability is based on storing your compute nodes and data within a single datacenter in the primary region, and protects your data in the event of a local failure, such as a small-scale network or power failure. If a large-scale disaster such as fire or flooding occurs within a region, all replicas of a storage account or data on the compute nodes might be lost or unrecoverable. As such, to further protect your data when using the locally redundant availability option, consider using a more resilient storage option for your database backups.

General Purpose service tier

The General Purpose service tier uses the remote storage availability architecture. The following figure shows four different nodes with the separated compute and storage layers.

Diagram showing separation of compute and storage.

The remote storage availability model includes two layers:

  • A stateless compute layer that runs the database engine process and contains only transient and cached data, such as the tempdb and model databases on the attached SSD, and the plan cache, buffer pool, and columnstore pool in memory. This stateless node is operated by Azure Service Fabric, which initializes the database engine, monitors the health of the node, and performs failover to another node if necessary.
  • A stateful data layer with the database files (.mdf and .ldf) stored in Azure Blob Storage. Azure Blob Storage has built-in data availability and redundancy features. Locally redundant availability is based on storing your data on locally redundant storage (LRS), which copies your data three times within a single datacenter in the primary region. It guarantees that every record in the log file or page in the data file is preserved even if the database engine process crashes.

Whenever the database engine or the operating system is upgraded, or a failure is detected, Azure Service Fabric moves the stateless database engine process to another stateless compute node with sufficient free capacity. Data in Azure Blob Storage isn't affected by the move, and the data and log files are attached to the newly initialized database engine process. This process guarantees high availability, but a heavy workload might experience some performance degradation during the transition because the new database engine process starts with a cold cache.

Next-gen General Purpose service tier

Note

The Next-gen General Purpose service tier upgrade is currently in preview.

Next-gen General Purpose is an architectural upgrade to the existing General Purpose service tier that uses an upgraded remote storage layer which stores instance data and log files on managed disks instead of page blobs.

Business Critical service tier

The Business Critical service tier uses the local storage availability model, which integrates compute resources (database engine process) and storage (locally attached SSD) on a single node. High availability is achieved by replicating both compute and storage to additional nodes.

Diagram of a cluster of database engine nodes.

The underlying database files (.mdf/.ldf) are placed on attached SSD storage to provide very low latency IO for your workload. High availability is implemented using a technology similar to SQL Server Always On availability groups. The cluster includes a single primary replica that is accessible for read-write customer workloads, and up to three secondary replicas (compute and storage) that contain copies of data. The primary replica constantly pushes changes to the secondary replicas sequentially to ensure that data is persisted on a sufficient number of secondary replicas before committing each transaction. This process guarantees that, if the primary replica or a readable secondary replica becomes unavailable for any reason, a fully synchronized replica is always available to fail over to. Failover is initiated by Azure Service Fabric. Once a secondary replica becomes the new primary replica, another secondary replica is created to ensure the cluster has a sufficient number of replicas to maintain quorum. Once failover completes, Azure SQL connections are automatically redirected to the new primary replica (or to a readable secondary replica, based on the connection string).
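The commit rule described above (a transaction commits only once enough secondary replicas hold the change) can be illustrated with a toy model. This is a simplified sketch, not the service's actual replication protocol: the `Replica` class, the `commit` and `failover_candidates` functions, and the `required_acks` threshold are all assumptions made for the example, because the text only says "a sufficient number" of replicas.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Toy replica: a name, a health flag, and an in-memory log of records."""
    name: str
    healthy: bool = True
    log: list = field(default_factory=list)


def commit(record, primary, secondaries, required_acks):
    """Commit the record only if enough healthy secondaries can persist it.

    required_acks is an assumed threshold, not a documented value.
    """
    healthy = [replica for replica in secondaries if replica.healthy]
    if len(healthy) < required_acks:
        return False  # too few durable copies possible: nothing is written
    for replica in healthy:
        replica.log.append(record)  # the secondary persists the change first
    primary.log.append(record)      # only then does the primary commit
    return True


def failover_candidates(primary, secondaries):
    """Names of healthy secondaries whose log matches the primary's log."""
    return [replica.name for replica in secondaries
            if replica.healthy and replica.log == primary.log]
```

Because every committed record already exists on the acknowledging secondaries, any of the names returned by `failover_candidates` could take over as primary without losing committed data.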

As an extra benefit, the local storage availability model includes the ability to redirect read-only Azure SQL connections to one of the secondary replicas. This feature is called Read Scale-Out. It provides 100% additional compute capacity at no extra charge to off-load read-only operations, such as analytical workloads, from the primary replica.
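As a rough illustration of how a client might opt in to Read Scale-Out, the sketch below builds two ODBC-style connection strings that differ only in the `ApplicationIntent=ReadOnly` keyword. The helper name `build_connection_string`, the host name, and the database name are hypothetical, the driver name assumes the Microsoft ODBC Driver 18 is installed, and authentication settings are omitted for brevity.

```python
def build_connection_string(server, database, read_only=False):
    """Return an ODBC-style connection string for the given server and database."""
    parts = [
        "Driver={ODBC Driver 18 for SQL Server}",
        f"Server={server}",
        f"Database={database}",
        "Encrypt=yes",
    ]
    # ApplicationIntent=ReadOnly asks for routing to a readable secondary;
    # without it, the connection targets the read-write primary replica.
    if read_only:
        parts.append("ApplicationIntent=ReadOnly")
    return ";".join(parts) + ";"


# Hypothetical host and database names, for illustration only.
HOST = "example-mi.contoso.database.windows.net"
read_write = build_connection_string(HOST, "SalesDb")
read_only = build_connection_string(HOST, "SalesDb", read_only=True)
```

Reporting or analytical jobs would use the `read_only` string, while transactional code keeps using `read_write`.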

Zone-redundant availability

Zone-redundant availability is based on placing compute node and storage replicas across three Azure availability zones in the primary region. Each availability zone is a separate physical location with independent power, cooling, and networking.

By default, the cluster of nodes for the local storage availability model is created in the same datacenter. With the introduction of Azure Availability Zones, SQL Managed Instance can place different replicas of a Business Critical instance in different availability zones in the same region. In the same way, the stateless compute nodes of a General Purpose instance are placed in different availability zones, while stateful storage uses the zone-redundant storage (ZRS) configuration.

To eliminate a single point of failure, the control ring is also duplicated across multiple zones as three gateway rings (GW). The routing to a specific gateway ring is controlled by Azure Traffic Manager (ATM). By selecting a zone-redundant configuration, you can make your Business Critical or General Purpose instances resilient to a much larger set of failures, including catastrophic datacenter outages, without any changes to the application logic. You can also convert any existing Business Critical or General Purpose instances to zone-redundant configuration.

Because zone-redundant instances have replicas in different datacenters with some distance between them, the increased network latency might increase the transaction commit time, and thus impact the performance of some OLTP workloads. You can always return to the single-zone configuration by disabling the zone-redundancy setting. This process is an online operation similar to the regular service tier objective upgrade. At the end of the process, the instance is migrated from a zone-redundant ring to a single-zone ring or vice versa.

The zone-redundant version of the high availability architecture is illustrated by the following diagram:

Diagram of the zone-redundant high availability architecture.

Consider the following when using zone-redundancy:

  • Zone-redundancy is not available for the Next-gen General Purpose service tier.
  • For up-to-date information about the regions that support zone-redundant configurations, see Services support by region.
  • For zone-redundant availability, choosing a maintenance window other than the default is currently available only in select regions.

Supported regions for Business Critical instances

Zone redundancy for Business Critical SQL Managed Instance is supported in the following regions:

  • Americas: Brazil South, Canada Central, Central US, East US, East US 2, South Central US, West US 2, West US 3
  • Europe: France Central, Italy North, Germany West Central, Norway East, North Europe, UK South, Sweden Central, Switzerland North, Poland Central
  • Middle East: Qatar Central, Israel Central
  • Africa: South Africa North
  • Asia Pacific: Australia East, Central India, Japan East, Korea Central, Southeast Asia, East Asia

Supported regions for General Purpose instances

Note

Zone-redundant configuration is in public preview for the General Purpose service tier.

  • Americas: Brazil South, East US, East US 2, South Central US, West US 2, West US 3
  • Europe: France Central, Italy North, Germany West Central, Norway East, North Europe, UK South, Sweden Central, Switzerland North, Poland Central
  • Middle East: Qatar Central, Israel Central
  • Africa: South Africa North
  • Asia Pacific: Australia East, Central India, Japan East, Korea Central, Southeast Asia, East Asia

Test application fault resiliency

High availability is a fundamental part of the SQL Managed Instance platform that works transparently for your database application. However, we recognize that you might want to test how the automatic failover operations initiated during planned or unplanned events would impact an application before you deploy it to production. You can manually trigger a failover by calling a special API to restart a managed instance. In the case of a zone-redundant instance, the API call would result in redirecting client connections to the new primary in an Availability Zone different from the Availability Zone of the old primary. So in addition to testing how failover impacts existing database sessions, you can also verify if it changes the end-to-end performance due to changes in network latency. Because the restart operation is intrusive and a large number of them could stress the platform, only one failover call is allowed every 15 minutes for each managed instance.

A failover can be initiated using PowerShell, REST API, or Azure CLI:

  • PowerShell: Invoke-AzSqlInstanceFailover
  • REST API: SQL Managed Instance - Failover
  • Azure CLI: az sql mi failover (can be used to invoke a REST API call from Azure CLI)
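A test harness that triggers failovers may want to respect the limit of one call every 15 minutes per instance before invoking the CLI. The sketch below shows one way to do that. The `FailoverThrottle` class and `failover_command` helper are hypothetical, and the resource group and instance names passed to them are placeholders. The guard is purely client-side bookkeeping and does not query the service.

```python
import time

# One failover call per 15 minutes per instance, as stated in this article.
FAILOVER_COOLDOWN_SECONDS = 15 * 60


def failover_command(resource_group, instance_name):
    """Build the Azure CLI argument list that requests a failover."""
    return ["az", "sql", "mi", "failover",
            "--resource-group", resource_group,
            "--name", instance_name]


class FailoverThrottle:
    """Client-side guard that remembers when each instance was last failed over."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_call = {}

    def seconds_until_allowed(self, instance_name):
        """Return 0.0 if a failover may be requested now, else the remaining wait."""
        last = self._last_call.get(instance_name)
        if last is None:
            return 0.0
        elapsed = self._clock() - last
        return max(0.0, FAILOVER_COOLDOWN_SECONDS - elapsed)

    def record(self, instance_name):
        """Note that a failover was just requested for this instance."""
        self._last_call[instance_name] = self._clock()
```

A harness would check `seconds_until_allowed(name)`, run the list returned by `failover_command(...)` with `subprocess.run(..., check=True)` when the result is `0.0`, and then call `record(name)`.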

Conclusion

Azure SQL Managed Instance features a built-in high availability solution that is deeply integrated with the Azure platform. The service depends on Service Fabric to detect failures and recover from them, on Azure Blob Storage to protect data, and on Availability Zones for higher fault tolerance. For the Business Critical service tier, SQL Managed Instance also uses SQL Server Always On availability group technology for database replication and failover. The combination of these technologies enables applications to fully realize the benefits of a mixed storage model and supports the most demanding SLAs.

Next steps