Node restarted after being marked as NotReady

Question

Dimitris Bratsos 20 Reputation points

Posting this question again, as I was not allowed to comment under my already existing question.

Our node was suddenly marked as NotReady, which then caused it to restart. It had been running with no issues over the past weeks, and no changes had been made to its configuration.

Our node is running on version v1.34.2

We have disabled automatic OS upgrades and set a manual upgrade policy, so (in theory) this should not have been an automatic update of some sort.

From the journalctl through a debug pod in that node, we can see the following before the restart:

"Node became not ready" node="[REDACTED]" condition={"type":"Ready","status":"False","lastHeartbeatTime":"2026-04-08T13:20:44Z","lastTransitionTime":"2026-04-08T13:20:44Z","reason":"KubeletNotReady","message":"container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"}

We are trying to find the root cause and are not sure whether this is it. We also noticed a spike in our PVCs, which caused them to fill up. We would not expect this to be connected, as it should affect only the pods, not the node itself.
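For anyone who wants to pull the individual fields out of a kubelet line like the one quoted above, here is a minimal Python sketch. It assumes the `condition=` payload is valid JSON, as it is in the quoted line. The function name `parse_not_ready` and the keys it returns are illustrative only and not part of any kubelet tooling.

```python
import json
import re


def parse_not_ready(line: str) -> dict:
    """Extract the JSON condition object from a 'Node became not ready' line."""
    match = re.search(r"condition=(\{.*\})", line)
    if match is None:
        raise ValueError("no condition=... payload found")
    condition = json.loads(match.group(1))
    # The message embeds a second, runtime-level reason as 'reason:<Name>'.
    runtime = re.search(r"reason:(\w+)", condition.get("message", ""))
    return {
        "type": condition["type"],
        "status": condition["status"],
        "reason": condition["reason"],
        "runtime_reason": runtime.group(1) if runtime else None,
        "since": condition["lastTransitionTime"],
    }


# The line from the post, copied as-is (node name is redacted in the original).
line = '"Node became not ready" node="[REDACTED]" condition={"type":"Ready","status":"False","lastHeartbeatTime":"2026-04-08T13:20:44Z","lastTransitionTime":"2026-04-08T13:20:44Z","reason":"KubeletNotReady","message":"container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"}'

info = parse_not_ready(line)
print(info["reason"], info["runtime_reason"], info["since"])
```

Separating the kubelet-level reason from the runtime-level reason makes it easier to line the entry up against other events from the same time window.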

Is there something else we might have missed, or any reason why the node restarted/got marked as NotReady in the first place?

I compiled a more detailed timeline:

Azure scheduled a VM redeploy. We observed:

- VMEventScheduled: Redeploy Scheduled
- Resource Health: Redeploying to different host

Node disruption

- Warning VMEventScheduled
- Warning RedeployScheduled

Kubelet reports node NotReady

- KubeletNotReady
- NetworkPluginNotReady (cni plugin not initialized)

AKS auto-repair sequence

- NodeRebootStart
- NodeRebootError: DataDisksForceDetached
- NodeReimageStart
- NodeReimageError: DataDisksForceDetached
- NodeRedeployStart
- NodeRedeployEnd

From kubectl describe node:

- DiskPressure: False (KubeletHasNoDiskPressure)
- MemoryPressure: False
- PIDPressure: False

Resource usage was approximately:

- CPU: ~81%
- Memory: ~70%

I would expect that these levels do not trigger node-level instability or automatic repair actions.

Our current hypotheses are the following:

- underlying host hardware issues (https://learn.microsoft.com/en-us/answers/questions/2239110/vm-start-repeated-failures-redeploying-due-to-host)
- platform maintenance (https://learn.microsoft.com/en-us/azure/virtual-machines/understand-vm-reboots)

To better understand the root cause, could you confirm:

What triggered the initial VM redeploy (VMEventScheduled: Redeploy Scheduled) for this instance?
Was this due to one of the reasons mentioned above?

We are specifically trying to identify the event preceding the node becoming NotReady. We appreciate the support, and we want to make sure the issue allows comments to be posted afterwards, to ensure full resolution.

Himanshu Shekhar 6,065 Reputation points Microsoft External Staff Moderator

2026-04-10T01:08:40.65+00:00

Just checking whether the provided response was helpful. Please let me know if you have any queries.

Dimitris Bratsos 20 Reputation points

2026-04-13T20:59:13.0266667+00:00

Thank you for your response!

As replied privately, the automatic response was not the most helpful, but your response gave a lot of clarity. I am also waiting on the resolution of the ticket based on this post, to ensure the hypothesis above is correct.

Himanshu Shekhar 6,065 Reputation points Microsoft External Staff Moderator

2026-04-20T19:01:10.78+00:00

On April 8, the GPU node became unhealthy due to an underlying Azure host issue, not due to customer workloads or configuration.

The failure was caused by a physical host problem impacting the node’s availability.

Azure attempted an automatic restart, which failed due to detected hardware issues.

The node was successfully redeployed to a healthy host by Azure.

The node recovered at approximately 13:22 UTC with a total impact duration of ~42 minutes.

The outage impacted GPU workloads as the node pool had only one node with no redundancy.

No autoscaling or Availability Zones were configured, leaving no fallback capacity.

This was a platform (host-level) failure and not related to AKS configuration or application issues.

No corrective action is required from your side as recovery was handled automatically by Azure.

To reduce future risk, it is recommended to add multiple GPU nodes in the node pool.

Enabling Availability Zones can improve resilience against single host failures.

Configuring autoscaling can help maintain availability during node-level issues.

With redundancy in place, workloads can continue running even if one node fails.
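As a rough way to spot node pools with no fallback capacity, here is a minimal Python sketch that counts Ready nodes per pool from the output of `kubectl get nodes -o json`. It assumes AKS labels each node with its pool name under `kubernetes.azure.com/agentpool`; the function names and the minimum of two nodes are illustrative choices, not an official check.

```python
import json
import sys
from collections import Counter

# Assumption: AKS sets this label to the node pool name on every node.
POOL_LABEL = "kubernetes.azure.com/agentpool"


def is_ready(node: dict) -> bool:
    """True if the node's Ready condition has status 'True'."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond.get("status") == "True"
    return False


def ready_nodes_per_pool(node_list: dict) -> Counter:
    """Count Ready nodes per pool; pools with zero Ready nodes are kept."""
    counts = Counter()
    for node in node_list.get("items", []):
        labels = node.get("metadata", {}).get("labels", {})
        pool = labels.get(POOL_LABEL, "<unlabeled>")
        counts[pool] += 1 if is_ready(node) else 0
    return counts


def pools_without_redundancy(node_list: dict, minimum: int = 2) -> list:
    """Pools with fewer than `minimum` Ready nodes."""
    counts = ready_nodes_per_pool(node_list)
    return sorted(pool for pool, n in counts.items() if n < minimum)


def main() -> None:
    # Usage: kubectl get nodes -o json | python check_pools.py
    for pool in pools_without_redundancy(json.load(sys.stdin)):
        print(f"pool {pool} has fewer than 2 Ready nodes")
```

Calling `main()` with the JSON piped on stdin prints each pool that would lose all capacity if one node failed. This does not check Availability Zone spread or autoscaler settings, which would need to be verified separately.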

Answer accepted by question author

Answer 1

Dimitris Bratsos, based on the timeline and the events you have shared, I would like to clarify the behavior observed. The node failure was sudden and unplanned, and there was no prior drain or maintenance activity initiated on the node. This is confirmed by the absence of any drain or cordon events before the node transitioned to a NotReady state.

Following the node going down, Kubernetes automatically initiated pod eviction as part of its standard recovery mechanism. This is expected behavior: the platform tries to move workloads away from an unhealthy node.

Regarding the observed PVC (Persistent Volume Claim) spike, this is not due to increased disk usage. Instead, it is related to temporary disk attachment contention. Since the affected node became unresponsive, the disks attached to it could not be immediately detached. When Kubernetes attempted to reschedule workloads onto another node, the disk reattachment failed initially because the disks were still logically attached to the original node.

This condition persisted for a short duration until the platform completed backend remediation and safely detached the disks. Once this process was completed, the disks were successfully reattached to the new node and the workloads resumed normal operation.

In summary, the behavior observed (PVC spike and workload delay) is consistent with an unexpected node failure followed by standard recovery and disk reattachment handling, and not due to any pre-planned drain or workload-triggered spike.
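The attach contention described above can be illustrated with a toy Python model. This is only a sketch under stated assumptions: the `Disk` class, the `attach` and `reattach` functions, and the idea that the detach completes on a fixed attempt number are all hypothetical simplifications, and they do not represent how Azure or the CSI driver is actually implemented.

```python
class Disk:
    """Toy model of a disk that can be attached to at most one node at a time."""

    def __init__(self, name, attached_to=None):
        self.name = name
        self.attached_to = attached_to


def attach(disk, node):
    """Attach the disk, failing while another node still holds it."""
    if disk.attached_to is not None and disk.attached_to != node:
        raise RuntimeError(f"{disk.name} is still attached to {disk.attached_to}")
    disk.attached_to = node


def reattach(disk, new_node, detach_completes_at, max_attempts=10):
    """Retry attaching until the old attachment is released.

    `detach_completes_at` stands in for the moment the platform finishes
    its backend remediation. Returns the attempt number that succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        if attempt == detach_completes_at:
            disk.attached_to = None  # platform has detached the disk
        try:
            attach(disk, new_node)
            return attempt
        except RuntimeError:
            continue  # still held by the unresponsive node; try again
    raise TimeoutError(f"{disk.name} could not be attached to {new_node}")


disk = Disk("data-disk", attached_to="old-node")
print(reattach(disk, "new-node", detach_completes_at=3), disk.attached_to)
```

The early attempts fail only because the old attachment has not been released yet, which matches the explanation that the delay came from attachment state rather than from disk usage.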

Himanshu Shekhar 6,065 Reputation points Microsoft External Staff Moderator

2026-04-23T19:25:14.09+00:00

Dimitris Bratsos If you have any further queries, let me know. If the information is helpful, please click on Upvote and Accept Answer on it.

Answer 2

In AKS, a node entering NotReady and then being rebooted/reimaged/redeployed is consistent with the platform’s node auto-repair and underlying VM/platform maintenance behavior.

From the sequence described:

- VMEventScheduled: Redeploy Scheduled / Resource Health: Redeploying to different host
- Node disruption events
- KubeletNotReady with NetworkPluginNotReady (cni plugin not initialized)
- AKS auto-repair events: NodeRebootStart → NodeRebootError: DataDisksForceDetached → NodeReimageStart → NodeReimageError: DataDisksForceDetached → NodeRedeployStart → NodeRedeployEnd

the key points from the provided context are:

- When a node is NotReady for more than 5 minutes, AKS node auto-repair starts a sequence of actions: reboot → reimage → redeploy. Each step is surfaced as Kubernetes events such as NodeRebootStart, NodeReimageStart, and NodeRedeployStart from the aks-auto-repair source.
- These events are informational about the repair actions; they do not themselves indicate the original root cause. Auto-repair is a best-effort mechanism and may fail with errors, which are also emitted as events.
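The escalation just described can be sketched as a small Python model. It is an illustration, not the actual AKS logic: the 5-minute threshold and the step order come from the text above, while escalating only when a step reports an error, and the `NodeRebootEnd` and `NodeReimageEnd` names, are assumptions made by analogy with the `NodeRedeployEnd` event in the question.

```python
from datetime import timedelta

NOT_READY_THRESHOLD = timedelta(minutes=5)
REPAIR_STEPS = ("Reboot", "Reimage", "Redeploy")


def repair_events(not_ready_for, errors):
    """Return the event names a repair run would emit in this model.

    `errors` maps a step name to the error reason for steps that fail;
    steps missing from the mapping are treated as successful.
    """
    events = []
    if not_ready_for <= NOT_READY_THRESHOLD:
        return events  # node has not been NotReady long enough
    for step in REPAIR_STEPS:
        events.append(f"Node{step}Start")
        reason = errors.get(step)
        if reason is None:
            events.append(f"Node{step}End")
            break  # assumed: a successful step ends the escalation
        events.append(f"Node{step}Error: {reason}")
    return events


observed = repair_events(
    timedelta(minutes=6),
    {"Reboot": "DataDisksForceDetached", "Reimage": "DataDisksForceDetached"},
)
print("\n".join(observed))
```

With reboot and reimage both failing on DataDisksForceDetached, the model produces the same six events listed in the question, ending with the redeploy.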

From the context:

- AKS auto-repair behavior and events are documented, including the exact reasons and messages you are seeing (NodeRebootStart, NodeReimageStart, NodeRedeployStart, etc.).
- A node enters NotReady for several reasons, including VM host faults and API server unavailability. VM host faults and platform maintenance are explicitly called out as causes that can put a node into NotReady and require rescheduling or repair.

Given this, the answers to the specific questions are:

What triggered the initial VM redeploy (VMEventScheduled: Redeploy Scheduled) for this instance?
From the available context, the trigger category is:
- The node was in NotReady state long enough for AKS node auto-repair to run through reboot and reimage, then proceed to redeploy.
- VM host faults or platform maintenance are explicitly listed as causes that can lead to node NotReady and subsequent repair/redeploy actions.
The exact platform-internal reason (for example, a specific hardware fault vs. a specific maintenance operation) is not exposed in the provided documentation beyond “VM host faults” and “platform/network maintenance” as the classes of causes. To distinguish between those, the guidance is to check AKS diagnostics, Azure Status, and Azure notifications/maintenance events for that time window.

Was this due to underlying host hardware issues or platform maintenance?
The context confirms that both are valid and common causes for a node to become NotReady and for the platform to redeploy the VM to a different host. However, the context does not provide a way to definitively classify a specific incident as “hardware fault” vs. “scheduled maintenance” beyond correlating with Azure maintenance notifications and Resource Health.

Why did the node become NotReady and trigger auto-repair, given CPU ~81% and memory ~70%?
- The documented causes for NotReady include API server unavailability and VM host faults, not only resource pressure.
- Disk, memory, and PID pressure conditions being False aligns with the idea that this was not a resource-pressure-induced instability but rather a platform/host-level issue or network/control-plane issue.
- Once the node stayed NotReady for more than 5 minutes, the AKS node auto-repair sequence began, which matches the events you observed.

How to further validate root cause with the available guidance:
- Use kubectl describe node <node-name> (already done) to review events and confirm the aks-auto-repair events and any preceding warnings.
- Check AKS diagnostics, Azure Status, and Azure notifications for the timeframe of VMEventScheduled: Redeploy Scheduled to see if there were VM host faults or platform/network maintenance affecting that node.
- If the node continues to show NotReady or similar behavior, follow the basic node NotReady troubleshooting guidance and, if needed, delete the node to allow rescheduling to a different host.
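To review the node's events in time order, and to separate the aks-auto-repair events from whatever preceded them, a minimal Python sketch around `kubectl get events` could look like this. It assumes `kubectl` is on the PATH and already pointed at the cluster. The function names are illustrative, and the `aks-auto-repair` component name is taken from the answer text above rather than verified independently.

```python
import json
import subprocess


def fetch_node_events(node_name: str) -> dict:
    """Fetch events whose involved object is the given node, as parsed JSON."""
    selector = f"involvedObject.kind=Node,involvedObject.name={node_name}"
    result = subprocess.run(
        ["kubectl", "get", "events", "--all-namespaces",
         "--field-selector", selector, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def summarize(event_list: dict, component=None) -> list:
    """Return (timestamp, source, type, reason, message) rows sorted by time.

    If `component` is given, keep only events from that source component.
    """
    rows = []
    for ev in event_list.get("items", []):
        source = ev.get("source", {}).get("component", "")
        if component is not None and source != component:
            continue
        # lastTimestamp can be null on some events; fall back to eventTime.
        ts = ev.get("lastTimestamp") or ev.get("eventTime") or ""
        rows.append((ts, source, ev.get("type", ""),
                     ev.get("reason", ""), ev.get("message", "")))
    return sorted(rows)


# Example use (requires cluster access; <node-name> is a placeholder):
#   events = fetch_node_events("<node-name>")
#   for row in summarize(events):
#       print(*row, sep="  ")
#   repairs = summarize(events, component="aks-auto-repair")
```

Kubernetes keeps events only for a limited retention period, so this is mainly useful shortly after an incident. For older incidents, the Azure-side sources mentioned above (AKS diagnostics, Resource Health, notifications) are the place to look.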

The behavior described (node NotReady → auto-repair events → redeploy to a different host) is consistent with the documented AKS node auto-repair process reacting to a node that remained unhealthy, typically due to VM host/platform issues rather than node-level CPU/memory pressure.
