Reliability in Azure Cosmos DB for MongoDB vCore

APPLIES TO: MongoDB vCore

This article describes reliability support in Azure Cosmos DB for MongoDB vCore. It covers regional resiliency with availability zones, as well as cross-region disaster recovery and business continuity.

For an architectural overview of reliability in Azure, see Azure reliability.

Availability zone support

Azure availability zones are at least three physically separate groups of datacenters within each Azure region. Datacenters within each zone are equipped with independent power, cooling, and networking infrastructure. Availability zones are designed so that if one zone is affected by a local failure, the remaining two zones continue to support regional services, capacity, and high availability.

Failures can range from software and hardware failures to events such as earthquakes, floods, and fires. Tolerance to failures is achieved with redundancy and logical isolation of Azure services. For more detailed information on availability zones in Azure, see Regions and availability zones.

Azure availability zones-enabled services are designed to provide the right level of reliability and flexibility. They can be configured in two ways: zone redundant, with automatic replication across zones, or zonal, with instances pinned to a specific zone. You can also combine these approaches. For more information on zonal vs. zone-redundant architecture, see Recommendations for using availability zones and regions.

To gain availability zone support, you must enable High availability (HA).

HA avoids database downtime by maintaining standby replicas of every shard in a cluster. If a shard goes down, Azure Cosmos DB for MongoDB vCore switches incoming connections from the failed shard to its standby replica.

When HA is enabled in a region that supports availability zones, HA replica shards are provisioned in a different availability zone from their primary shards. HA replicas don't receive requests from clients unless their primary shard fails.
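
Client applications might still see transient errors while connections are switched to a standby replica. The source doesn't describe client-side behavior, so treat that as an assumption. Under that assumption, the following minimal Python sketch retries an operation with exponential backoff. The helper `run_with_retry` and its parameters are hypothetical names used for illustration and aren't part of any Azure or MongoDB SDK.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def run_with_retry(
    operation: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying on transient errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller.
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

With a driver such as PyMongo, `retry_on` could be set to the driver's transient error type, for example `pymongo.errors.AutoReconnect`. The `sleep` parameter exists only so that the delay can be replaced in tests.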

If HA is disabled, each shard has its own locally redundant storage (LRS) with three synchronous replicas maintained by the Azure Storage service. If a single replica fails, the Azure Storage service detects the failure and transparently re-creates the relevant data. For LRS storage durability, see Summary of redundancy options. However, in the case of a region failure, you run the risk of extensive downtime and possible data loss.

Create a resource with availability zones enabled

To enable availability zones, you must enable High availability (HA). You can do so either when you create a cluster or, for an existing cluster, in the Scale section in the Azure portal.

Cross-region disaster recovery and business continuity

Disaster recovery (DR) is about recovering from high-impact events, such as natural disasters or failed deployments, that result in downtime and data loss. Regardless of the cause, the best remedy for a disaster is a well-defined and tested DR plan and an application design that actively supports DR. Before you create your disaster recovery plan, see Recommendations for designing a disaster recovery strategy.

When it comes to DR, Microsoft uses the shared responsibility model. In this model, Microsoft ensures that the baseline infrastructure and platform services are available. At the same time, many Azure services don't automatically replicate data to another enabled region or fail over from a failed region. For those services, you're responsible for setting up a disaster recovery plan that works for your workload. Most services that run on Azure platform as a service (PaaS) offerings provide features and guidance to support DR, and you can use these service-specific features to support fast recovery as you develop your DR plan.

Azure Cosmos DB for MongoDB vCore doesn't provide built-in automatic cross-region failover or disaster recovery. Planning for high availability is a critical step as your solution scales.

Disaster recovery in single-region geography

To maximize your uptime, plan ahead to maintain business continuity and prepare for disaster recovery with Azure Cosmos DB for MongoDB vCore.

While Azure services are designed to maximize uptime, unplanned service outages might occur. A disaster recovery plan ensures that you have a strategy in place for handling regional service outages.

Azure Cosmos DB for MongoDB vCore automatically takes backups of your data at regular intervals. The automatic backups are taken without affecting the performance or availability of the database operations. All backups are performed automatically in the background and stored separately from the source data in a storage service. These automatic backups are useful in scenarios when you accidentally delete or modify resources and later require the original versions.

Automatic backups are retained for different periods, depending on whether the cluster is currently active or was recently deleted.

Cluster state        Retention period
Active clusters      35 days
Deleted clusters     7 days
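
The retention periods above can be turned into a simple date calculation. The following Python sketch is illustrative only. The function names are hypothetical, and the source doesn't state which moment the retention period is measured from, so the caller supplies that reference time (for example, the backup time or the deletion time) as an assumption.

```python
from datetime import datetime, timedelta, timezone

# Retention periods as listed above.
RETENTION = {
    "active": timedelta(days=35),
    "deleted": timedelta(days=7),
}


def retention_deadline(cluster_state: str, reference_time: datetime) -> datetime:
    """Return the moment at which the retention period ends.

    reference_time is whatever moment the period is counted from; the
    source doesn't specify it, so choosing it is the caller's assumption.
    """
    return reference_time + RETENTION[cluster_state]


def is_within_retention(
    cluster_state: str, reference_time: datetime, now: datetime
) -> bool:
    """True if 'now' hasn't passed the retention deadline yet."""
    return now <= retention_deadline(cluster_state, reference_time)


if __name__ == "__main__":
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(retention_deadline("active", start).date())
    print(retention_deadline("deleted", start).date())
```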

Design for high availability

High availability (HA) should be enabled for critical Azure Cosmos DB for MongoDB vCore clusters running production workloads. In an HA-enabled cluster, each primary shard is paired with a hot-standby (secondary) shard provisioned in another availability zone. Replication between the primary and the secondary shard is synchronous by default. Any modification to the database is persisted on both the primary and the secondary shards before the database returns a response to the client.

The service maintains health checks and heartbeats to each primary and secondary shard of the cluster. If a primary shard becomes unavailable due to a zone or regional outage, the secondary shard is automatically promoted to become the new primary, and a new secondary shard is built for it. In addition, if a secondary shard becomes unavailable, the service automatically creates a new secondary shard with a full copy of the data from the primary.

If the service triggers a failover from the primary to the secondary shard, connections are transparently routed to the new primary shard.

Synchronous replication between the primary and secondary shards guarantees no data loss if there's a failover.
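
To make the behavior described in this section concrete, the following Python sketch models a single shard with one primary and one hot standby. It's a toy in-memory simulation under simplifying assumptions (at most one replica fails at a time, and writes are plain strings), not the service's actual implementation. All class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Replica:
    name: str
    healthy: bool = True
    log: List[str] = field(default_factory=list)


class ToyShard:
    """Toy model: one primary plus one hot standby with synchronous writes."""

    def __init__(self) -> None:
        self.primary = Replica("replica-1")
        self.standby = Replica("replica-2")
        self._next_id = 3

    def write(self, doc: str) -> bool:
        """Acknowledge a write only after both replicas have persisted it."""
        if not self.primary.healthy:
            return False  # No acknowledgement while the primary is down.
        self.primary.log.append(doc)
        self.standby.log.append(doc)
        return True

    def health_check(self) -> None:
        """Promote the standby if the primary is down, then rebuild a standby."""
        if not self.primary.healthy:
            self.primary = self.standby  # Promotion of the hot standby.
        if self.primary is self.standby or not self.standby.healthy:
            # The new standby starts with a full copy of the primary's data.
            self.standby = Replica(
                f"replica-{self._next_id}", log=list(self.primary.log)
            )
            self._next_id += 1


if __name__ == "__main__":
    shard = ToyShard()
    shard.write("order-1")
    shard.primary.healthy = False  # Simulate an outage of the primary.
    shard.health_check()
    print(shard.primary.name, shard.primary.log)
```

Because every acknowledged write is already on the standby before the acknowledgement is returned, the promoted replica in this model holds all acknowledged writes after the failover.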

Next steps