Share via

Node restarted after being marked as NotReady

Dimitris Bratsos 20 Reputation points
2026-04-08T18:36:39.43+00:00

Posting this question again, as I was not allowed to comment under my already existing question.

Our node was suddenly marked as NotReady, which then caused it to restart. It has been running with no issues over the past weeks, and no changes in its configuration are done.

Our node is running on version v1.34.2

We have disabled the automatic OS upgrades and we have a manual upgrade policy set. So it (in theory) should not be an automatic update of some sort.

From the journalctl through a debug pod in that node, we can see the following before the restart:

"Node became not ready" node="[REDACTED]" condition={"type":"Ready","status":"False","lastHeartbeatTime":"2026-04-08T13:20:44Z","lastTransitionTime":"2026-04-08T13:20:44Z","reason":"KubeletNotReady","message":"container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"}

We are trying to find out the root cause, and not sure if this is it. Another thing we noticed, was a spike in our PVCs, which caused them to fill up, but would not expect this to be connected, as I would only expect the pods to be affected, not the node itself.

Is there something else we might have missed, or any reason why the node restarted/got marked as NotReady in the first place?

I compiled a more detailed timeline:

  1. Azure scheduled a VM redeploy We observed:
  • VMEventScheduled: Redeploy Scheduled
    • Resource Health: Redeploying to different host
  1. Node disruption

Warning VMEventScheduled

Warning RedeployScheduled

  1. Kubelet reports node NotReady

KubeletNotReady

NetworkPluginNotReady (cni plugin not initialized)

  1. AKS auto-repair sequence
  • NodeRebootStart
  • NodeRebootError: DataDisksForceDetached
  • NodeReimageStart
  • NodeReimageError: DataDisksForceDetached
  • NodeRedeployStart
    • NodeRedeployEnd

From kubectl describe node:

DiskPressure: False (KubeletHasNoDiskPressure)

MemoryPressure: False

PIDPressure: False

Resource usage was approximately:

CPU: ~81%

  • Memory: ~70%

I would expect that these levels do not trigger node-level instability or automatic repair actions

Our current scenarios are the following:

To better understand the root cause, could you confirm:

  • What triggered the initial VM redeploy (VMEventScheduled: Redeploy Scheduled) for this instance?
  • Was this due to one of the reasons mentioned above?

We are specifically trying to identify the event preceding the node becoming NotReady. We appreciate the support, and we want to make sure the issue allows comments to be posted afterwards, to ensure full resolution.

Azure Kubernetes Service
Azure Kubernetes Service

An Azure service that provides serverless Kubernetes, an integrated continuous integration and continuous delivery experience, and enterprise-grade security and governance.


Answer accepted by question author

  1. Himanshu Shekhar 6,065 Reputation points Microsoft External Staff Moderator
    2026-04-08T18:44:12.2866667+00:00

    Dimitris Bratsos Based on the timeline and the events you have shared,

    Node event timeline and would like to clarify the behavior observed. The node failure was sudden and unplanned, and there was no prior drain or maintenance activity initiated on the node. This is confirmed by the absence of any drain or cordon events before the node transitioned to a NotReady state.

    Following the node going down, Kubernetes automatically initiated pod eviction as part of its standard recovery mechanism. This is an expected behavior where the platform tries to move workloads away from an unhealthy node.

    Regarding the observed PVC (Persistent Volume Claim) spike, this is not due to increased disk usage. Instead, it is related to temporary disk attachment contention. Since the affected node became unresponsive, the disks attached to it could not be immediately detached. When Kubernetes attempted to reschedule workloads onto another node, the disk reattachment failed initially because the disks were still logically attached to the original node.

    This condition persisted for a short duration until the platform completed backend remediation and safely detached the disks. Once this process was completed, the disks were successfully reattached to the new node and the workloads resumed normal operation.

    In summary, the behavior observed (PVC spike and workload delay) is consistent with an unexpected node failure followed by standard recovery and disk reattachment handling, and not due to any pre-planned drain or workload-triggered spike.

    Was this answer helpful?

    2 people found this answer helpful.

1 additional answer

Sort by: Most helpful
  1. Q&A Assist
    2026-04-08T18:37:05.8333333+00:00
    AI generated content. This question contains an answer created with AI from Q&A Assist. Learn more

    In AKS, a node entering NotReady and then being rebooted/reimaged/redeployed is consistent with the platform’s node auto-repair and underlying VM/platform maintenance behavior.

    From the sequence described:

    1. VMEventScheduled: Redeploy Scheduled / Resource Health: Redeploying to different host
    2. Node disruption events
    3. KubeletNotReady with NetworkPluginNotReady (cni plugin not initialized)
    4. AKS auto-repair events: NodeRebootStartNodeRebootError: DataDisksForceDetachedNodeReimageStartNodeReimageError: DataDisksForceDetachedNodeRedeployStartNodeRedeployEnd

    the key points from the provided context are:

    • When a node is NotReady for more than 5 minutes, AKS node auto-repair starts a sequence of actions: reboot → reimage → redeploy. Each step is surfaced as Kubernetes events such as NodeRebootStart, NodeReimageStart, and NodeRedeployStart from the aks-auto-repair source.
    • These events are informational about the repair actions; they do not themselves indicate the original root cause. Auto-repair is a best-effort mechanism and may fail with errors, which are also emitted as events.

    From the context:

    • AKS auto-repair behavior and events are documented, including the exact reasons and messages you are seeing (NodeRebootStart, NodeReimageStart, NodeRedeployStart, etc.).
    • A node enters NotReady for several reasons, including VM host faults and API server unavailability. VM host faults and platform maintenance are explicitly called out as causes that can put a node into NotReady and require rescheduling or repair.

    Given this, the answers to the specific questions are:

    1. What triggered the initial VM redeploy (VMEventScheduled: Redeploy Scheduled) for this instance?
      From the available context, the trigger category is:
      • The node was in NotReady state long enough for AKS node auto-repair to run through reboot and reimage, then proceed to redeploy.
      • VM host faults or platform maintenance are explicitly listed as causes that can lead to node NotReady and subsequent repair/redeploy actions.
      The exact platform-internal reason (for example, a specific hardware fault vs. a specific maintenance operation) is not exposed in the provided documentation beyond “VM host faults” and “platform/network maintenance” as the classes of causes. To distinguish between those, the guidance is to check AKS diagnostics, Azure Status, and Azure notifications/maintenance events for that time window.
    2. Was this due to underlying host hardware issues or platform maintenance?
      The context confirms that both are valid and common causes for a node to become NotReady and for the platform to redeploy the VM to a different host. However, the context does not provide a way to definitively classify a specific incident as “hardware fault” vs. “scheduled maintenance” beyond correlating with Azure maintenance notifications and Resource Health.
    3. Why did the node become NotReady and trigger auto-repair, given CPU ~81% and memory ~70%?
      • The documented causes for NotReady include API server unavailability and VM host faults, not only resource pressure.
      • Disk, memory, and PID pressure conditions being False aligns with the idea that this was not a resource-pressure-induced instability but rather a platform/host-level issue or network/control-plane issue.
      • Once the node stayed NotReady for more than 5 minutes, the AKS node auto-repair sequence began, which matches the events you observed.
    4. How to further validate root cause with the available guidance:
      • Use kubectl describe node <node-name> (already done) to review events and confirm the aks-auto-repair events and any preceding warnings.
      • Check AKS diagnostics, Azure Status, and Azure notifications for the timeframe of VMEventScheduled: Redeploy Scheduled to see if there were VM host faults or platform/network maintenance affecting that node.
      • If the node continues to show NotReady or similar behavior, follow the basic node NotReady troubleshooting guidance and, if needed, delete the node to allow rescheduling to a different host.

    The behavior described (node NotReady → auto-repair events → redeploy to a different host) is consistent with the documented AKS node auto-repair process reacting to a node that remained unhealthy, typically due to VM host/platform issues rather than node-level CPU/memory pressure.


    References:

    Was this answer helpful?

    1 person found this answer helpful.

Your answer

Answers can be marked as 'Accepted' by the question author and 'Recommended' by moderators, which helps users know the answer solved the author's problem.