How does AKS replicate data if a node fails?

Tanul 1,251 Reputation points
2020-10-07T06:59:51.43+00:00

Team,

Let's say I set up AKS with only one node, call it N1, and install GPU drivers, an ingress controller, and one application of kind Deployment. After a month, I purchase another node, N2. A few days later N1 fails and, as per this link, in extreme cases Kubernetes recreates or reimages the machine. How does it then replicate data such as the GPU drivers, ingress controller, and deployments? Does AKS maintain snapshots of all the machines somewhere? Otherwise all my applications would be destroyed.

Could you please shed some light on this?

Thank you


Accepted answer
  1. prmanhas-MSFT 17,886 Reputation points Microsoft Employee
    2020-10-07T20:42:08.99+00:00

    @Tanul · All Kubernetes objects (pods, ReplicaSets, DaemonSets, StatefulSets, Deployments, Services, Ingresses, NetworkPolicies, Roles, ClusterRoles, RoleBindings, ClusterRoleBindings, PersistentVolumes, PersistentVolumeClaims, StorageClasses, CRDs, etc.) will be retained.

    If your node becomes NotReady, its workloads (pods) will be evicted from the bad node and scheduled on an available node. If no other node is available, the pods will stay in the Pending state until a functioning node is available for scheduling. The pod configuration is stored in the highly available etcd server.

    Best practice: Use PodDisruptionBudgets [Ref: https://learn.microsoft.com/en-us/azure/aks/operator-best-practices-scheduler#plan-for-availability-using-pod-disruption-budgets]
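
As a rough illustration of the best practice above, the sketch below builds a PodDisruptionBudget manifest in Python and prints it as JSON, which `kubectl apply -f` accepts as well as YAML. The names `my-app-pdb` and `my-app` and the `minAvailable` value are assumptions chosen for illustration, not taken from the question. It also assumes a cluster new enough to serve the `policy/v1` API (Kubernetes 1.21 or later); older clusters used `policy/v1beta1`.

```python
import json


def build_pdb(name: str, app_label: str, min_available: int) -> dict:
    """Return a PodDisruptionBudget manifest as a plain dict.

    The budget asks Kubernetes to keep at least `min_available` pods
    matching the label app=<app_label> running during voluntary
    disruptions such as a node drain.
    """
    return {
        "apiVersion": "policy/v1",  # assumes Kubernetes 1.21+
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


if __name__ == "__main__":
    # Hypothetical names; replace with the labels of your own deployment.
    manifest = build_pdb("my-app-pdb", "my-app", 1)
    print(json.dumps(manifest, indent=2))
```

One way to use it would be to save the script under a name of your choice and pipe its output into `kubectl apply -f -`. A budget like this only helps if the deployment runs more replicas than `minAvailable`; with a single replica it would block node drains instead.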

    Data stored on disks for PVCs will not be affected if the node runs into an error. There can be data loss only if the disks are deleted by the customer.

    But if a pod is moved to a new node, the PVC disk might take a while to detach from the earlier node and attach to the new node.

    Additional Links:

    https://learn.microsoft.com/en-us/azure/aks/node-auto-repair

    https://kubernetes.io/docs/tasks/run-application/configure-pdb/

    Hope it helps :)

    Please 'Accept as answer' if it helped, so that it can help others in the community looking for help on similar topics


1 additional answer

  1. prmanhas-MSFT 17,886 Reputation points Microsoft Employee
    2020-10-07T15:39:35.72+00:00

    @Tanul First, I recommend using at least 2 nodes in a production environment. To protect your system from region failure, deploy your application into multiple AKS clusters across different regions. A single node is fine for testing purposes.

    The two primary types of storage provided for volumes in AKS are backed by Azure Disks or Azure Files.

    The type of storage you use is defined using Kubernetes storage classes. The storage class is then referenced in the pod or deployment specification. These definitions work together to create the appropriate storage and connect it to pods.
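
To make that relationship concrete, here is a minimal sketch that builds the two manifests involved. In practice the storage class is usually named in a PersistentVolumeClaim, and the pod (or the pod template of a deployment) then refers to that claim by name. The storage class name `managed-csi`, the claim name, the image, and the mount path are all assumptions for illustration; check `kubectl get storageclass` for the classes your cluster actually offers.

```python
import json


def build_pvc(name: str, storage_class: str, size: str) -> dict:
    """PersistentVolumeClaim requesting `size` of storage from a storage class."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


def build_pod(name: str, image: str, claim_name: str, mount_path: str) -> dict:
    """Pod that mounts the volume bound to the claim `claim_name`."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "volumeMounts": [{"name": "data", "mountPath": mount_path}],
                }
            ],
            "volumes": [
                {"name": "data", "persistentVolumeClaim": {"claimName": claim_name}}
            ],
        },
    }


if __name__ == "__main__":
    # All names below are hypothetical placeholders.
    pvc = build_pvc("app-data", "managed-csi", "5Gi")
    pod = build_pod("my-app", "nginx", "app-data", "/mnt/data")
    for manifest in (pvc, pod):
        print(json.dumps(manifest, indent=2))
```

Because the disk belongs to the claim rather than to the node, a pod rescheduled onto another node can mount the same data again once the disk has been reattached.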

    Coming to your query: no, your data won't be lost. As mentioned in the link you provided, "If the reimage is unsuccessful, create and reimage a new node." AKS has access to the resources but does not delete the associated storage, so unless you delete the corresponding storage yourself, your application data is safe and can be reused for deployment as well.

    One quick point: if you are using dynamic Persistent Volumes, they are created in the infrastructure resource group of the AKS cluster. If you want to keep using the disks after the AKS cluster is deleted, create a snapshot of the disk and then a new disk from that snapshot in a different resource group. When the AKS cluster is deleted, the infrastructure resource group is deleted at the same time, along with all the resources in it.
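
The snapshot-then-copy procedure can be sketched with the Azure CLI. The Python helper below only assembles and prints the two commands (`az snapshot create` and `az disk create`, each with `--source`); it does not run them. Every identifier in it is a hypothetical placeholder: the subscription ID, the resource group names, and the disk and snapshot names must be replaced with your own, and the `MC_...` name merely follows the default naming pattern of the infrastructure resource group.

```python
import shlex
from typing import List


def resource_id(subscription: str, resource_group: str, kind: str, name: str) -> str:
    """Full Azure resource ID for a managed disk ('disks') or snapshot ('snapshots')."""
    return (
        f"/subscriptions/{subscription}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Compute/{kind}/{name}"
    )


def backup_commands(
    subscription: str,
    node_rg: str,
    disk_name: str,
    backup_rg: str,
    snapshot_name: str,
    new_disk_name: str,
) -> List[List[str]]:
    """Return the two az commands: snapshot the disk, then create a disk from it.

    Both the snapshot and the new disk are placed in `backup_rg`, so they
    survive deletion of the cluster's infrastructure resource group.
    """
    source_disk = resource_id(subscription, node_rg, "disks", disk_name)
    snapshot = resource_id(subscription, backup_rg, "snapshots", snapshot_name)
    return [
        ["az", "snapshot", "create", "--resource-group", backup_rg,
         "--name", snapshot_name, "--source", source_disk],
        ["az", "disk", "create", "--resource-group", backup_rg,
         "--name", new_disk_name, "--source", snapshot],
    ]


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    commands = backup_commands(
        subscription="00000000-0000-0000-0000-000000000000",
        node_rg="MC_myResourceGroup_myAKSCluster_eastus",
        disk_name="pvc-disk-example",
        backup_rg="myBackupResourceGroup",
        snapshot_name="pvc-disk-snapshot",
        new_disk_name="pvc-disk-copy",
    )
    for command in commands:
        print(shlex.join(command))
```

Printing the commands rather than executing them leaves room to review them first; the backup resource group is assumed to exist already.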

    For backup and replication related information, this article and this one are also good sources. I recommend going through them to get clarity on how recovery and backup run in the background.

    Hope it helps!!!
