Hello, @thanakrit.r!
What guidance is there for disaster recovery (DR) for on-premises deployments of AKS?
Azure Kubernetes Service (AKS) on Azure Stack HCI and Windows Server is an on-premises implementation of AKS. In this model, the management cluster is deployed as a single standalone virtual machine (VM) per deployment.
Guidance for restoring the state of AKS on new hardware and recovering from management cluster corruption can be found in our AKS disaster recovery documentation:
https://learn.microsoft.com/en-us/azure/aks/hybrid/restore-aks-cluster
Because the management cluster is a single VM, it is a single point of failure. A management cluster outage does not, however, affect applications running in the workload clusters. When the management cluster VM fails, the workload clusters (and their workloads) continue running, but you cannot perform day-2 operations until the VM is restored. For example, you cannot create new workload clusters, create or scale a node pool, or upgrade Kubernetes versions.
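To confirm that a workload cluster is still serving while the management cluster is down, you can query it directly. The following is only a minimal sketch, not part of the official guidance: it assumes `kubectl` is on your PATH and that you already have a kubeconfig file for the workload cluster saved locally. The function names and the kubeconfig path are hypothetical.

```python
import json
import subprocess


def ready_node_names(nodes_json: str) -> list[str]:
    """Return the names of nodes whose Ready condition is True.

    Expects the JSON produced by `kubectl get nodes -o json`.
    """
    ready = []
    for node in json.loads(nodes_json).get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        if any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        ):
            ready.append(node["metadata"]["name"])
    return ready


def check_workload_cluster(kubeconfig_path: str) -> list[str]:
    """Ask the workload cluster's API server for its nodes.

    Assumption: kubeconfig_path points to a previously saved kubeconfig
    for the workload cluster (hypothetical path supplied by the caller).
    """
    result = subprocess.run(
        ["kubectl", "--kubeconfig", kubeconfig_path, "get", "nodes", "-o", "json"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if kubectl fails
    )
    return ready_node_names(result.stdout)
```

Calling `check_workload_cluster("my-workload-kubeconfig")` (a placeholder file name) returns the list of Ready nodes, or raises an error if the cluster cannot be reached.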
The management cluster VM is tracked by Windows failover clustering, which makes it resilient to host-level disruptions: if a host machine fails, failover clustering restarts the VM on a healthy host.
The linked article provides guidance on how to perform the following tasks:
- Restore the state of AKS on new hardware (could be a new site).
- Recover from corruption of the management cluster.
In either of these scenarios, the management cluster and all the workload clusters must be recreated.
I hope this has been helpful! Your feedback is important so please take a moment to accept answers. If you still have questions, please let us know what is needed in the comments so the question can be answered. Thank you for helping to improve Microsoft Q&A!