AKS automatic Node repair with VMSS

Poluri, Venudhar 1 Reputation point
2021-06-18T15:33:55.987+00:00

In AKS, when we shut down a VM, the node is reported as NotReady, but it does not come back up even after 30 minutes. We are using availability zones, so Virtual Machine Scale Sets are enabled automatically. We therefore added a health extension (ApplicationHealthLinux) to the VMSS created by AKS. When we try to enable automatic repairs on the VMSS, it fails with the following error:
"Automatic repairs not supported for this Virtual Machine Scale Set because a health probe or health extension was not provided".

Is automatic node repair supported in AKS with VMSS? Are there any alternatives or workarounds?


3 answers

  1. SRIJIT-BOSE-MSFT 4,331 Reputation points Microsoft Employee
    2021-06-21T09:03:27.167+00:00

    @Poluri, Venudhar , Thank you for the question.

    While AKS has resilience mechanisms that let it withstand and recover from a node VM being stopped or deallocated, manually stopping or deallocating node VMs is not a supported configuration. Stop your cluster instead.
    AKS node auto-repair still applies, but it works differently from automatic instance repairs for Azure virtual machine scale sets.
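
    As a rough sketch of what stopping the cluster (rather than individual node VMs) could look like from the Azure CLI: the wrapper name aks_power and the AZ override variable are hypothetical and only here for illustration, and the resource group and cluster names are placeholders. It relies on the az aks stop and az aks start commands.

    ```shell
    # aks_power is a hypothetical wrapper around `az aks stop` / `az aks start`.
    # AZ is a hypothetical override (for example AZ=echo) to print the command
    # instead of running it.
    aks_power() {
        action="$1"; rg="$2"; cluster="$3"
        case "$action" in
            stop|start)
                "${AZ:-az}" aks "$action" --resource-group "$rg" --name "$cluster"
                ;;
            *)
                echo "usage: aks_power stop|start <resource-group> <cluster-name>" >&2
                return 2
                ;;
        esac
    }
    ```

    For a dry run, AZ=echo aks_power stop myResourceGroup myAKSCluster only prints the arguments that would be passed to az.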

    If the node stays in a NotReady state for a long time after the node VM has started, please try the following steps:

    1. SSH to the node. How-to
    2. Collect kubelet logs. How-to
    3. Check whether the Docker daemon is running with sudo systemctl status docker (for containerd, use sudo systemctl status containerd). For Windows nodes, use the Get-Service command.
    4. If it is inactive, try starting Docker with sudo systemctl start docker (for containerd, use sudo systemctl start containerd). For Windows nodes, use the Start-Service command.
    5. Check whether the kubelet service is running with sudo systemctl status kubelet. For Windows nodes, use Get-Service.
    6. If it is inactive, try starting the kubelet service with sudo systemctl start kubelet. For Windows nodes, use Start-Service.
    7. If the node is still in a NotReady state, try restarting the VM/VMSS instance.
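
    Steps 3 to 6 for a Linux node could be sketched roughly as below. The helper name ensure_running is hypothetical, and the sketch uses systemctl is-active --quiet as a scriptable stand-in for reading the systemctl status output by hand.

    ```shell
    # ensure_running is a hypothetical helper: report whether a systemd unit
    # is active, and try to start it if it is not.
    ensure_running() {
        unit="$1"
        if systemctl is-active --quiet "$unit"; then
            echo "$unit: active"
        else
            echo "$unit: not active, attempting start"
            sudo systemctl start "$unit"
        fi
    }
    ```

    On the node, one might then run ensure_running containerd (or ensure_running docker, depending on the container runtime) followed by ensure_running kubelet.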

    If you are still facing the issue please do let us know.

    ----------

    Hope this helps.

    Please "Accept as Answer" if it helped, so that it can help others in the community looking for help on similar topics.


  2. Poluri, Venudhar 1 Reputation point
    2021-06-21T09:56:18.363+00:00

    Thanks @SRIJIT-BOSE-MSFT for your inputs.

    We are actually trying to simulate a VM failure so that we can check how and when AKS brings the node back.
    We tried stopping the VMSS instance, and also logging in to a node and shutting it down. In both cases AKS did not bring the VM back up automatically, and we are not sure why.
    Is there a way to simulate a VM failure and verify automatic repair in AKS?


  3. Kaplingat, Nikhil 1 Reputation point
    2021-06-21T15:45:57.97+00:00

    I was looking for this information too. I forced a node out of my AKS cluster by running the command "az vmss deallocate". As expected, kubectl shows the node as "NotReady", but the node has not come back even after 30 minutes. It looks like AKS node auto-repair did not kick in for this case.

    Please let me know once the AKS team finds a predictable way to simulate a scenario that triggers the node repair feature.
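
    For anyone repeating this experiment, a minimal sketch for listing the nodes that kubectl does not report as Ready is shown below. The helper name not_ready_nodes is hypothetical; it assumes the default column layout of kubectl get nodes, where the second column is STATUS.

    ```shell
    # not_ready_nodes is a hypothetical helper. It reads the output of
    # `kubectl get nodes --no-headers` on stdin and prints the names of nodes
    # whose STATUS column does not start with "Ready".
    not_ready_nodes() {
        awk '$2 !~ /^Ready/ { print $1 }'
    }
    ```

    Usage: kubectl get nodes --no-headers | not_ready_nodes. Running it periodically shows whether a deallocated node ever leaves the NotReady state.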
