Hi Naidu, Nitin,
The unexpected reboot of your AKS node was caused by an Azure-initiated VM shutdown. The shutdown was triggered by temporary I/O transaction timeouts between the physical host node and Azure Storage, where your Virtual Hard Disks (VHDs) reside. When these timeouts occur, Azure shuts the VM down to prevent data corruption and restarts it once connectivity to storage is restored.
Does Using Multiple Availability Zones Help?
Using multiple availability zones in your AKS node pool can improve resilience and availability by distributing nodes across different zones. This ensures that if one zone experiences issues, workloads can still run on nodes in other zones. However, availability zones do not directly prevent node reboots caused by I/O transaction timeouts, as this issue is related to the underlying infrastructure rather than the zone itself.
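If you would like to confirm how your nodes are currently spread across zones, here is a minimal Python sketch. It assumes kubectl is installed and pointed at your cluster, and it relies on the well-known Kubernetes node label topology.kubernetes.io/zone. The function names are illustrative only and not part of any Azure tooling.

```python
import json
import subprocess
from collections import Counter

# Well-known Kubernetes label that records the zone a node runs in.
ZONE_LABEL = "topology.kubernetes.io/zone"


def count_nodes_per_zone(node_list: dict) -> Counter:
    """Count nodes per zone from the parsed output of `kubectl get nodes -o json`."""
    counts = Counter()
    for node in node_list.get("items", []):
        labels = node.get("metadata", {}).get("labels", {})
        # Nodes without the label are grouped under "unknown".
        counts[labels.get(ZONE_LABEL, "unknown")] += 1
    return counts


def fetch_nodes() -> dict:
    """Fetch the node list with kubectl (assumes kubectl is installed and configured)."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling count_nodes_per_zone(fetch_nodes()) returns a count per zone value. If every node reports the same value, a problem in that one zone would affect the whole pool.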
To minimize the impact of node failures, consider the following best practices:
Use multiple node pools across availability zones for redundancy.
Enable AKS node auto-repair, which detects unhealthy nodes and replaces them automatically.
Monitor Azure Resource Health to get insights into VM availability issues.
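As a rough illustration of the first recommendation, the sketch below builds an az aks nodepool add command that spreads a new node pool across zones 1, 2 and 3. The resource group, cluster and pool names are placeholders, the helper function is hypothetical, and the zone numbers assume your region supports three availability zones. Please check the flags against your installed Azure CLI version before running it. Note that zones are chosen when a node pool is created, so an existing pool would need to be replaced rather than updated.

```python
def build_nodepool_command(resource_group, cluster_name, pool_name,
                           zones=("1", "2", "3"), node_count=3):
    """Build the argument list for `az aks nodepool add` with zone spreading.

    All names passed in are placeholders chosen by the caller.
    """
    return [
        "az", "aks", "nodepool", "add",
        "--resource-group", resource_group,
        "--cluster-name", cluster_name,
        "--name", pool_name,
        "--node-count", str(node_count),
        # --zones takes a space-separated list, so each zone is its own argument.
        "--zones", *zones,
    ]


# Placeholder names for illustration only.
command = build_nodepool_command("myResourceGroup", "myAKSCluster", "zonalpool")
print(" ".join(command))
```

The printed command can be reviewed and run in a shell, or the list can be passed to subprocess.run(command, check=True) once the values are correct for your environment.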
Relevant Microsoft Documentation:
Availability Zones in AKS: https://learn.microsoft.com/en-us/azure/aks/availability-zones-overview
Virtual Machine Scale Sets Orchestration Modes: https://learn.microsoft.com/en-us/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-orchestration-modes
Azure Resource Health for VM Monitoring: https://learn.microsoft.com/en-us/azure/service-health/resource-health-overview
If this answer was helpful, please click "Upvote" on this post to let us know.
Thank you.