AKS node reboot

Naidu, Nitin 0 Reputation points
2025-03-12T05:15:51.71+00:00

Our AKS node keeps rebooting automatically. For one of the reboots we received the notification below, but for the other reboots there was no explanation. Would using multiple availability zones in the node pool help with this issue?

VM Availability

The Azure monitoring and diagnostics systems identified that your VM XXXXXX became unavailable at 2025-03-10 11:37:45 (UTC) and availability was restored at 2025-03-10 11:38:09 (UTC). During this time RDP and SSH connections to the VM, or requests to any other services running inside the VM, could have failed.

Root Cause

This unexpected occurrence was caused by an Azure initiated VM shutdown triggered by detection of temporary IO transaction timeouts between the physical host node where your VM was running, and the Azure Storage services where your Virtual Hard Disks (VHDs) reside. Azure platform continuously monitors reads and writes (IO transactions) from your VMs to Azure Storage. If transactions do not complete successfully within 180 seconds (inclusive of retries), the connectivity is considered to be lost and a temporary VM shutdown is initiated to preserve data integrity and prevent corruption of your VM. After the platform detects that the storage service connectivity is restored, the VM is automatically restarted.

Resolution

VM was restored following reboot of the host node.

We apologize for any inconvenience this may have caused you.

Microsoft Azure Team

Azure Virtual Machines
An Azure service that is used to provision Windows and Linux virtual machines.

1 answer

  1. Anonymous
    2025-03-12T17:00:27.49+00:00

    Hi Naidu, Nitin,

    The unexpected reboot of your AKS node was due to an Azure-initiated VM shutdown triggered by temporary I/O transaction timeouts between the physical host node and Azure Storage, where your Virtual Hard Disks (VHDs) reside. This mechanism helps prevent data corruption by shutting down and then restarting the VM once connectivity is restored.

    Does Using Multiple Availability Zones Help?

    Using multiple availability zones in your AKS node pool can improve resilience and availability by distributing nodes across different zones. This ensures that if one zone experiences issues, workloads can still run on nodes in other zones. However, availability zones do not directly prevent node reboots caused by I/O transaction timeouts, as this issue is related to the underlying infrastructure rather than the zone itself.
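    To see how much capacity a zone outage would actually take away, it can help to count nodes per zone. The following is a minimal Python sketch under stated assumptions: the node names and zone values are hypothetical sample data, and the only real identifier is the well-known Kubernetes label `topology.kubernetes.io/zone`. It does not call any Azure or Kubernetes API; it only works on a dictionary of node labels you have already collected.

```python
from collections import Counter

# Well-known Kubernetes label that carries a node's availability zone.
ZONE_LABEL = "topology.kubernetes.io/zone"


def zone_spread(nodes):
    """Count nodes per zone; nodes without the label are counted as 'unknown'."""
    return Counter(labels.get(ZONE_LABEL, "unknown") for labels in nodes.values())


def surviving_fraction(nodes, failed_zone):
    """Fraction of nodes still available if every node in failed_zone is lost."""
    spread = zone_spread(nodes)
    total = sum(spread.values())
    if total == 0:
        return 0.0
    return (total - spread.get(failed_zone, 0)) / total


# Hypothetical sample data (assumption): node name -> node labels.
sample_nodes = {
    "aks-pool1-vmss000000": {ZONE_LABEL: "eastus2-1"},
    "aks-pool1-vmss000001": {ZONE_LABEL: "eastus2-2"},
    "aks-pool1-vmss000002": {ZONE_LABEL: "eastus2-3"},
    "aks-pool1-vmss000003": {ZONE_LABEL: "eastus2-1"},
}

if __name__ == "__main__":
    print(dict(zone_spread(sample_nodes)))
    print(surviving_fraction(sample_nodes, "eastus2-1"))
```

    A single-node reboot like the one described above affects only that node regardless of zone, so this check is about zone-level failures, not about the storage timeout itself.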

    To minimize the impact of node failures, consider the following best practices:

    Use multiple node pools across availability zones for redundancy.

    Enable AKS node auto-repair, which detects unhealthy nodes and replaces them automatically.

    Monitor Azure Resource Health to get insights into VM availability issues.
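    For reboots that arrive without any notification, one way to at least confirm when they happened is to compare each node's boot ID over time and then look up that time window in Resource Health. This is only a rough sketch: it assumes you have already captured two snapshots mapping node name to boot ID (the value Kubernetes reports under `status.nodeInfo.bootID`), and the function name and sample values are hypothetical.

```python
def find_rebooted(before, after):
    """Return sorted names of nodes whose boot ID changed between two snapshots.

    Nodes that appear in only one snapshot are ignored, because a new or
    removed node is not the same thing as a reboot of an existing node.
    """
    return sorted(
        name
        for name, boot_id in after.items()
        if name in before and before[name] != boot_id
    )


# Hypothetical snapshots (assumption): node name -> boot ID.
snapshot_before = {
    "aks-pool1-vmss000000": "boot-aaa",
    "aks-pool1-vmss000001": "boot-bbb",
}
snapshot_after = {
    "aks-pool1-vmss000000": "boot-aaa",
    "aks-pool1-vmss000001": "boot-ccc",  # boot ID changed, so this node rebooted
    "aks-pool1-vmss000002": "boot-ddd",  # new node, not counted as a reboot
}

if __name__ == "__main__":
    print(find_rebooted(snapshot_before, snapshot_after))
```

    Running the comparison on a schedule narrows each unexplained reboot to a time window that can be checked against the Resource Health history for the affected VM.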

    Relevant Microsoft Documentation:

    Availability Zones in AKS: https://learn.microsoft.com/en-us/azure/aks/availability-zones-overview

    Azure VM Auto-Recovery: https://learn.microsoft.com/en-us/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-orchestration-modes

    Azure Resource Health for VM Monitoring: https://learn.microsoft.com/en-us/azure/service-health/resource-health-overview

    If it was helpful, please click "Upvote" on this post to let us know.

    Thank You.
