AKS automatic Node repair with VMSS

Question

In AKS, when we shut down a VM, the node is reported as NotReady, but it does not come back up even after 30 minutes. We are using availability zones, so Virtual Machine Scale Sets are enabled automatically. We therefore added a health extension (ApplicationHealthLinux) to the VMSS created by AKS. When we then try to enable automatic repairs on the VMSS, it fails with the following error:
"Automatic repairs not supported for this Virtual Machine Scale Set because a health probe or health extension was not provided".

Are automatic node repairs supported in AKS with VMSS? Are there any alternatives or workarounds?

Answer

@Poluri, Venudhar, thank you for the question.

While AKS has resilience mechanisms to withstand a node VM being stopped or deallocated and to recover from it, this isn't a supported configuration. Stop your cluster instead.
Azure Kubernetes Service node auto-repair does apply, but it works differently from automatic instance repairs for Azure virtual machine scale sets.
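As a rough illustration of stopping the cluster rather than individual VMs, the sketch below wraps the az aks stop and az aks start commands in small functions. The resource group and cluster names are placeholders assumed for the example, not values from this thread.

```shell
# Sketch: stop and restart a whole AKS cluster instead of deallocating single VMs.
# RESOURCE_GROUP and CLUSTER_NAME are placeholder values (assumptions).
RESOURCE_GROUP="${RESOURCE_GROUP:-my-resource-group}"
CLUSTER_NAME="${CLUSTER_NAME:-my-aks-cluster}"

# Stop the control plane and all agent nodes of the cluster.
stop_cluster() {
  az aks stop --resource-group "$RESOURCE_GROUP" --name "$CLUSTER_NAME"
}

# Start a previously stopped cluster again.
start_cluster() {
  az aks start --resource-group "$RESOURCE_GROUP" --name "$CLUSTER_NAME"
}

# Print the power state reported for the cluster (for example Running or Stopped).
cluster_power_state() {
  az aks show --resource-group "$RESOURCE_GROUP" --name "$CLUSTER_NAME" \
    --query "powerState.code" --output tsv
}
```

A typical sequence would be to call stop_cluster, confirm the state with cluster_power_state, and later call start_cluster.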

If the node stays in a NotReady state for a long time after the node VM has started, please try the following steps:

1. SSH to the node. (How-to)
2. Collect the kubelet logs. (How-to)
3. Check whether the Docker daemon is running with sudo systemctl status docker (for containerd, use sudo systemctl status containerd). For Windows nodes, use the Get-Service command.
4. If it is inactive, try starting Docker with sudo systemctl start docker (for containerd, use sudo systemctl start containerd). For Windows nodes, use the Start-Service command.
5. Check whether the kubelet service is running with sudo systemctl status kubelet. For Windows nodes, use Get-Service.
6. If it is inactive, try starting the kubelet service with sudo systemctl start kubelet. For Windows nodes, use Start-Service.
7. If the node is still in a NotReady state, try restarting the VM/VMSS instance.
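The Linux checks in the steps above can be sketched as a small script to run on the node itself after connecting over SSH. The helper names ensure_service_active and check_node_services are made up for this example, and defaulting to containerd as the runtime is an assumption; pass docker for Docker-based nodes.

```shell
# Sketch of the Linux service checks, meant to be run on the node after SSH.
# ensure_service_active and check_node_services are hypothetical helper names.

# Report whether a systemd service is active and start it if it is not.
ensure_service_active() {
  service="$1"
  if systemctl is-active --quiet "$service"; then
    echo "$service is active"
  else
    echo "$service is inactive, starting it"
    sudo systemctl start "$service"
  fi
}

# Check the container runtime first, then the kubelet.
# The runtime name defaults to containerd (assumption); pass docker if needed.
check_node_services() {
  runtime="${1:-containerd}"
  ensure_service_active "$runtime" && ensure_service_active kubelet
}
```

Calling check_node_services docker checks Docker and then the kubelet; if the node is still NotReady afterwards, the remaining option from the list is restarting the VM/VMSS instance.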

If you are still facing the issue, please let us know.

----------

Hope this helps.

Please "Accept as Answer" if it helped, so that it can help others in the community looking for help on similar topics.

Answer

Thanks @SRIJIT-BOSE-MSFT for your inputs.

We are actually trying to simulate a VM failure; we want to check how and when AKS brings the node back.
We tried stopping the VMSS instance, and also logging in to a node and shutting it down. In both cases AKS did not bring the VM back up automatically, and we are not sure why.
Is there any way to simulate a VM failure and verify automatic repair in AKS?

Answer

I too was looking for this info. I just forced a node removal from my AKS cluster by running the command "az vmss deallocate". As expected, kubectl shows the node as "NotReady". But the node has not come back even after 30 minutes. It looks like AKS node auto-repair did not kick in in this case.

Please let me know once the AKS team finds a predictable way to simulate a scenario that triggers the node repair feature.
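For anyone reproducing the experiment described in this comment, a minimal sketch might look like the following. NODE_RESOURCE_GROUP, VMSS_NAME and INSTANCE_ID are placeholders assumed for the example, and the earlier answer notes that deallocating a node VM is not a supported configuration.

```shell
# Sketch: deallocate one VMSS instance and list the nodes kubectl reports as NotReady.
# NODE_RESOURCE_GROUP, VMSS_NAME and INSTANCE_ID are placeholders (assumptions).

# Deallocate a single instance of the scale set that backs the node pool.
deallocate_instance() {
  az vmss deallocate --resource-group "$NODE_RESOURCE_GROUP" \
    --name "$VMSS_NAME" --instance-ids "$INSTANCE_ID"
}

# Print the names of nodes whose STATUS column contains NotReady.
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 ~ /NotReady/ { print $1 }'
}
```

Running not_ready_nodes at intervals after deallocate_instance shows whether the node ever leaves the NotReady state.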
