Node was gracefully shut down and restarted unexpectedly
Our node was gracefully shut down by Azure and then restarted unexpectedly. When we tried to work out what happened from the system logs, the only relevant entry was:
Dec 8 04:02:36 aks-nodepool1-65338737-vmss000003 kernel: [4115058.087822] hv_utils: Shutdown request received - graceful shutdown initiated
We don't know what triggered this. How can we find out why the node was terminated and then restarted? We haven't found any errors in the system logs.
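For anyone checking the same thing on other nodes, here is a minimal Python sketch that scans syslog lines for this hv_utils message. It assumes the standard syslog line layout shown in the entry above. The function name `find_shutdown_requests` and the regular expression are illustrative, not part of any Azure or AKS tooling.

```python
import re
from typing import Iterable, List, Tuple

# Matches lines shaped like the entry above:
# "<Mon> <day> <HH:MM:SS> <host> kernel: [<uptime>] hv_utils: Shutdown request received ..."
# The pattern is an assumption based on that single sample line.
SHUTDOWN_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+kernel:\s+\[\s*(?P<uptime>\d+\.\d+)\]\s+"
    r"hv_utils: Shutdown request received"
)


def find_shutdown_requests(lines: Iterable[str]) -> List[Tuple[str, str, float]]:
    """Return (timestamp, host, kernel uptime in seconds) for each match."""
    hits = []
    for line in lines:
        match = SHUTDOWN_RE.match(line)
        if match:
            hits.append(
                (match.group("ts"), match.group("host"), float(match.group("uptime")))
            )
    return hits
```

To use it, pass an open file object for the node's syslog, for example `find_shutdown_requests(open("/var/log/syslog"))`. The path is an assumption and may differ per image. The bracketed kernel value is normally seconds since boot, so the entry above suggests the node had been up for roughly 47.6 days when the request arrived.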
- Workspace Resource ID: /subscriptions/52ec665e-f75d-489a-b9c8-478eb54ce35d/resourcegroups/defaultresourcegroup-wus2/providers/microsoft.operationalinsights/workspaces/defaultworkspace-52ec665e-f75d-489a-b9c8-478eb54ce35d-wus2
- Resource ID: /subscriptions/52ec665e-f75d-489a-b9c8-478eb54ce35d/resourcegroups/mainnetResourceGroup/providers/Microsoft.ContainerService/managedClusters/mainnetCluster
- Nodepool: nodepool1
- Node: aks-nodepool1-65338737-vmss000003
Timeline:
- Dec 8 04:02 node3 went down; all services on this node were terminated.
- Dec 8 04:11 node3 came back up; the services recovered later.
From the health events in the Azure portal, we found an unexpected event at 04:02:
- "title": "Stopping and deallocating",
- "details": "This virtual machine is stopped and deallocated as requested by an authorized user or process."
The event history shows no authorized user performing any operation at that time, so it must have been an authorized process. However, OS auto-update and autoscaling are both disabled for this node pool and resource group.