Passive-cold solution overview for Azure Kubernetes Service (AKS)

When you create an application in Azure Kubernetes Service (AKS) and choose an Azure region during resource creation, it's a single-region app. When the region becomes unavailable during a disaster, your application also becomes unavailable. If you create an identical deployment in a secondary Azure region, your application becomes less susceptible to a single-region disaster, which guarantees business continuity, and any data replication across the regions lets you recover your last application state.

This guide outlines a passive-cold solution for AKS. Within this solution, we deploy two independent and identical AKS clusters into two paired Azure regions with only one cluster actively serving traffic when the application is needed.

Note

The following practice has been reviewed internally and vetted in conjunction with our Microsoft partners.

Passive-cold solution overview

In this approach, we have two independent AKS clusters being deployed in two Azure regions. When the application is needed, we activate the passive cluster to receive traffic. If the passive cluster goes down, we must manually activate the cold cluster to take over the flow of traffic. We can set this condition through a manual input every time or to specify a certain event.

Scenarios and configurations

This solution is best implemented as a “use as needed” workload, which is useful for scenarios that require workloads to run at specific times of day or run on demand. Example use cases for a passive-cold approach include:

  • A manufacturing company that needs to run a complex and resource-intensive simulation on a large dataset. In this case, the passive cluster is located in a cloud region that offers high-performance computing and storage services. The passive cluster is only used when the simulation is triggered by the user or by a schedule. If the cluster doesn’t work upon triggering, the cold cluster can be used as a backup and the workload can run on it instead.
  • A government agency that needs to maintain a backup of its critical systems and data in case of a cyber attack or natural disaster. In this case, the passive cluster is located in a secure and isolated location that’s not accessible to the public.

Components

The passive-cold disaster recovery solution uses many Azure services. This example architecture involves the following components:

Multiple clusters and regions: You deploy multiple AKS clusters, each in a separate Azure region. When the app is needed, the passive cluster is activated to receive network traffic.

Key Vault: You provision an Azure Key Vault in each region to store secrets and keys.

Log Analytics: Regional Log Analytics instances store regional networking metrics and diagnostic logs. A shared instance stores metrics and diagnostic logs for all AKS instances.

Hub-spoke pair: A hub-spoke pair is deployed for each regional AKS instance. Azure Firewall Manager policies manage the firewall rules across each region.

Container Registry: The container images for the workload are stored in a managed container registry. With this solution, a single Azure Container Registry instance is used for all Kubernetes instances in the cluster. Geo-replication for Azure Container Registry enables you to replicate images to the selected Azure regions and provides continued access to images even if a region experiences an outage.

Failover process

If the passive cluster isn't functioning properly because of an issue in its specific Azure region, you can activate the cold cluster and redirect all traffic to that cluster's region. You can use this process while the passive cluster is deactivated until it starts working again. The cold cluster can take a couple minutes to come online, as it has been turned off and needs to complete the setup process. This approach isn't ideal for time-sensitive applications. In that case, we recommend considering an active-active failover.

Application Pods (Regional)

A Kubernetes deployment object creates multiple replicas of a pod (ReplicaSet). If one is unavailable, traffic is routed between the remaining replicas. The Kubernetes ReplicaSet attempts to keep the specified number of replicas up and running. If one instance goes down, a new instance should be recreated. Liveness probes can check the state of the application or process running in the pod. If the pod is unresponsive, the liveness probe removes the pod, which forces the ReplicaSet to create a new instance.

For more information, see Kubernetes ReplicaSet.

Application Pods (Global)

When an entire region becomes unavailable, the pods in the cluster are no longer available to serve requests. In this case, the Azure Front Door instance routes all traffic to the remaining health regions. The Kubernetes clusters and pods in these regions continue to serve requests. To compensate for increased traffic and requests to the remaining cluster, keep in mind the following guidance:

  • Make sure network and compute resources are right sized to absorb any sudden increase in traffic due to region failover. For example, when using Azure Container Network Interface (CNI), make sure you have a subnet that can support all pod IPs with a spiked traffic load.
  • Use the Horizontal Pod Autoscaler to increase the pod replica count to compensate for the increased regional demand.
  • Use the AKS Cluster Autoscaler to increase the Kubernetes instance node counts to compensate for the increased regional demand.

Kubernetes node pools (Regional)

Occasionally, localized failure can occur to compute resources, such as power becoming unavailable in a single rack of Azure servers. To protect your AKS nodes from becoming a single point regional failure, use Azure Availability Zones. Availability zones ensure that AKS nodes in each availability zone are physically separated from those defined in another availability zones.

Kubernetes node pools (Global)

In a complete regional failure, Azure Front Door routes traffic to the remaining healthy regions. Again, make sure to compensate for increased traffic and requests to the remaining cluster.

Failover testing strategy

While there are no mechanisms currently available within AKS to take down an entire region of deployment for testing purposes, Azure Chaos Studio offers the ability to create a chaos experiment on your cluster.

Next steps

If you're considering a different solution, see the following articles: