AKS node reboot

Question

Naidu, Nitin

Our AKS node keeps rebooting automatically. For one of the reboots we received the notice below, but for the other reboots there was no explanation. Would using multiple availability zones in the node pool help with this issue?

VM Availability

The Azure monitoring and diagnostics systems identified that your VM XXXXXX became unavailable at 2025-03-10 11:37:45 (UTC) and availability was restored at 2025-03-10 11:38:09 (UTC). During this time RDP and SSH connections to the VM, or requests to any other services running inside the VM, could have failed.

Root Cause

This unexpected occurrence was caused by an Azure initiated VM shutdown triggered by detection of temporary IO transaction timeouts between the physical host node where your VM was running, and the Azure Storage services where your Virtual Hard Disks (VHDs) reside. Azure platform continuously monitors reads and writes (IO transactions) from your VMs to Azure Storage. If transactions do not complete successfully within 180 seconds (inclusive of retries), the connectivity is considered to be lost and a temporary VM shutdown is initiated to preserve data integrity and prevent corruption of your VM. After the platform detects that the storage service connectivity is restored, the VM is automatically restarted.

Resolution

VM was restored following reboot of the host node.

1 answer

Anonymous

2025-03-13T09:05:26.7133333+00:00

Hi Naidu, Nitin,

Just checking in to see if you have had a chance to review my response to your question.

If it was helpful, please click "Upvote" on this post to let us know.

Thank You.
Anonymous

2025-03-17T10:16:12.69+00:00

Hi Naidu, Nitin,

I wanted to check whether you have had the opportunity to review the information provided in my previous comment.

If it was helpful, please click "Upvote" on this post to let us know.

Thank You.

Answer 1

Hi Naidu, Nitin,

The unexpected reboot of your AKS node was due to an Azure-initiated VM shutdown triggered by temporary I/O transaction timeouts between the physical host node and Azure Storage, where your Virtual Hard Disks (VHDs) reside. This mechanism helps prevent data corruption by shutting down and then restarting the VM once connectivity is restored.

Does Using Multiple Availability Zones Help?

Using multiple availability zones in your AKS node pool can improve resilience and availability by distributing nodes across different zones. This ensures that if one zone experiences issues, workloads can still run on nodes in other zones. However, availability zones do not directly prevent node reboots caused by I/O transaction timeouts, as this issue is related to the underlying infrastructure rather than the zone itself.
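To see how your nodes are currently spread across zones, one option is to count them by the standard Kubernetes zone label. The snippet below is a minimal sketch, assuming you feed it the output of `kubectl get nodes -o json`; the function name `nodes_per_zone` and the sample node names and zone values are hypothetical and only there for illustration.

```python
import json
from collections import Counter

# Well-known Kubernetes node label that carries the zone of a node.
ZONE_LABEL = "topology.kubernetes.io/zone"


def nodes_per_zone(node_list_json: str) -> Counter:
    """Count nodes per zone from the JSON printed by `kubectl get nodes -o json`.

    Nodes without the zone label are counted under "unknown".
    """
    items = json.loads(node_list_json).get("items", [])
    return Counter(
        item.get("metadata", {}).get("labels", {}).get(ZONE_LABEL, "unknown")
        for item in items
    )


# Hypothetical sample data standing in for real kubectl output.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "node-a", "labels": {ZONE_LABEL: "eastus-1"}}},
        {"metadata": {"name": "node-b", "labels": {ZONE_LABEL: "eastus-1"}}},
        {"metadata": {"name": "node-c", "labels": {ZONE_LABEL: "eastus-2"}}},
        {"metadata": {"name": "node-d", "labels": {}}},
    ]
})

if __name__ == "__main__":
    for zone, count in sorted(nodes_per_zone(SAMPLE).items()):
        print(f"{zone}: {count} node(s)")
```

If all nodes land in a single zone, a host-level or zone-level problem can affect every node at once, so an uneven count is worth a closer look.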

To minimize the impact of node failures, consider the following best practices:

Use multiple node pools across availability zones for redundancy.

Enable AKS node auto-repair, which detects unhealthy nodes and replaces them automatically.

Monitor Azure Resource Health to get insights into VM availability issues.
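Since only one of your reboots came with an explanation, it may help to line up each reboot time against the unavailability windows reported by Resource Health. The snippet below is a rough sketch of that comparison, not an official tool: the window is taken from the notice in your question, while the reboot timestamps, the helper names, and the 5-minute slack are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def parse_utc(text: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM:SS' timestamp and mark it as UTC."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)


def unexplained_reboots(reboots, windows, slack=timedelta(minutes=5)):
    """Return reboot times that fall outside every reported window.

    A reboot counts as explained if it lies within `slack` of a window,
    because the node may report its boot slightly after the VM recovers.
    """
    return [
        reboot
        for reboot in reboots
        if not any(start - slack <= reboot <= end + slack for start, end in windows)
    ]


# Unavailability window copied from the notice in the question.
WINDOWS = [(parse_utc("2025-03-10 11:37:45"), parse_utc("2025-03-10 11:38:09"))]

# Hypothetical reboot times, e.g. collected from node events or boot logs.
REBOOTS = [parse_utc("2025-03-10 11:38:30"), parse_utc("2025-03-12 04:15:00")]

if __name__ == "__main__":
    for reboot in unexplained_reboots(REBOOTS, WINDOWS):
        print("No matching Resource Health window for reboot at", reboot.isoformat())
```

Reboots that do not match any window are the ones worth investigating separately, for example by checking node events or raising a support request for those timestamps.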

Relevant Microsoft Documentation:

Availability Zones in AKS: https://learn.microsoft.com/en-us/azure/aks/availability-zones-overview

Azure VM Auto-Recovery: https://learn.microsoft.com/en-us/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-orchestration-modes

Azure Resource Health for VM Monitoring: https://learn.microsoft.com/en-us/azure/service-health/resource-health-overview

If it was helpful, please click "Upvote" on this post to let us know.

Thank You.
