A nodepool upgrade will cause downtime for your AKS cluster

Question

DisplayName42 56

The documentation of "Update an AKS cluster to use a managed identity" has the following warning:

A nodepool upgrade will cause downtime for your AKS cluster as the nodes in the nodepools will be cordoned/drained and then reimaged.

However the documentation in "Upgrade an AKS cluster" states that it will

Cordon and drain one of the old nodes to minimize disruption to running applications.

Does updating an AKS cluster to use managed identity cause downtime or will it carefully cordon and drain the nodes one by one? What does "downtime" mean in the first quote? Does it mean that the cluster will be completely offline? If yes, how can I estimate the duration of the downtime?

Prrudram-MSFT 28,281 Reputation points Moderator

2023-02-23T18:13:10.4433333+00:00

Hi @DisplayName42

If an answer has been helpful, please consider accepting the answer to help increase visibility of this question for other members of the Microsoft Q&A community. If not, please let us know what is still needed in the comments so the question can be answered. Thank you for helping to improve Microsoft Q&A!
Prrudram-MSFT 28,281 Reputation points Moderator

2023-03-07T13:06:16.1066667+00:00

Hi @DisplayName42

We noticed that you rated the above answer as not helpful. We value your feedback and want to see if there is anything we can do to improve it and make this a positive experience for you. Thanks!
Prrudram-MSFT 28,281 Reputation points Moderator

2023-03-07T16:05:10.72+00:00

@DisplayName42

Thanks for accepting the answer from Adrian Dobrescu. If you can upvote the same, it will be helpful in considering that this resolved the problem as per earlier feedback.

Accepted answer

Answer 1

Good day,

Thank you for reaching out!

As stated in the documentation, the upgrade process consists of the following steps:

Add a new buffer node (or as many nodes as configured in max surge) to the cluster that runs the specified Kubernetes version.
Cordon and drain one of the old nodes to minimize disruption to running applications. If you're using max surge, it will cordon and drain as many nodes at the same time as the number of buffer nodes specified.
When the old node is fully drained, it will be reimaged to receive the new version, and it will become the buffer node for the following node to be upgraded.
This process repeats until all nodes in the cluster have been upgraded.
At the end of the process, the last buffer node will be deleted, maintaining the existing agent node count and zone balance.
You can also customize the node surge for an upgrade depending on your requirements: a higher max surge gives a faster upgrade with more disruption (for example, in testing environments), while a max surge of 33% is recommended for production environments.
If you stick with the recommended option for production, there should be no noticeable downtime for your applications.
You can refer to this document as well for more information and examples:

https://learn.microsoft.com/en-us/azure/aks/upgrade-cluster?tabs=azure-cli#customize-node-surge-upgrade
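
The steps listed above can be sketched as a small simulation. This is only an illustration, not how AKS is implemented: the function names are made up for this example, and it assumes that a percentage max surge is rounded up to a whole node and that at least one buffer node is always used.

```python
from typing import List


def surge_nodes(node_count: int, max_surge: str) -> int:
    """Turn a max surge setting such as "1" or "33%" into a buffer node count.

    Assumption: percentages are rounded up, and at least one node is used.
    """
    if max_surge.endswith("%"):
        percent = int(max_surge[:-1])
        # Integer ceiling of node_count * percent / 100.
        return max(1, (node_count * percent + 99) // 100)
    return max(1, int(max_surge))


def simulate_upgrade(node_count: int, max_surge: str) -> List[int]:
    """Return the number of schedulable nodes after each cordon/drain round."""
    buffer = surge_nodes(node_count, max_surge)
    remaining_old = node_count
    upgraded = buffer  # buffer nodes are added first, on the new version
    history = []
    while remaining_old > 0:
        batch = min(buffer, remaining_old)
        remaining_old -= batch  # cordon and drain a batch of old nodes
        history.append(remaining_old + upgraded)  # nodes that can run pods now
        upgraded += batch  # reimaged nodes come back as the next buffer
    # After the loop, the surplus buffer nodes are deleted again.
    return history


print(simulate_upgrade(6, "33%"))
print(simulate_upgrade(4, "1"))
```

In this model the number of schedulable nodes never drops below the original node count, which is why the pods evicted from a drained node have somewhere to go.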

Please let us know if you have any further questions and we will be glad to assist you further. Thank you!

Please "Accept as Answer" and Upvote if it helped, so that it can help others in the community looking for help on similar topics.

Anthony Kamau 0 Reputation points

2025-04-21T00:55:07.0033333+00:00

I upgraded the nodepool from 2vcpu to 4vcpu and 3 of the 4 nodes all went down at the same time:

$ > k get nodes
NAME                                STATUS     ROLES    AGE    VERSION
aks-nodepool1-15524858-vmss000000   Ready      <none>   3d1h   v1.30.7
aks-nodepool1-15524858-vmss000001   NotReady   <none>   3d1h   v1.30.7
aks-nodepool1-15524858-vmss000009   NotReady   <none>   3d1h   v1.30.7
aks-nodepool1-15524858-vmss00000a   NotReady   <none>   3d1h   v1.30.7

The workloads that were currently running could not all fit on the 2vcpu node that was left in Ready state so there was downtime.

Can you explain why this happened and not what you describe in your answer above?
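
The capacity problem described in this comment comes down to simple arithmetic, which can be sketched as below. The pod CPU requests and the allocatable CPU per node are hypothetical values chosen for illustration and are not taken from the cluster above. The check compares totals only, so it is a necessary condition and not a full scheduling simulation.

```python
from typing import List


def has_enough_cpu(pod_requests_m: List[int], ready_allocatable_m: List[int]) -> bool:
    """Check whether the total CPU requests fit into the Ready nodes.

    Values are in millicores. A True result does not guarantee that every
    pod can be placed, but a False result means some pods must stay Pending.
    """
    return sum(pod_requests_m) <= sum(ready_allocatable_m)


# Hypothetical workload: 8 pods requesting 500m CPU each (4000m in total).
pods = [500] * 8

# Hypothetical allocatable CPU of a 2 vCPU node (less than 2000m in practice).
node_2vcpu = 1900

print(has_enough_cpu(pods, [node_2vcpu] * 4))  # all four nodes Ready
print(has_enough_cpu(pods, [node_2vcpu]))      # only one node Ready
```

With these numbers the workload fits on four nodes but not on a single Ready node, which matches the situation where three of four nodes are NotReady at the same time.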

Answer 2

Hello @DisplayName42

Thank you for your question. Once you update your AKS cluster to use a managed identity (MSI), the nodes are updated with the new client ID. This means the nodes are reimaged one by one until all of them use the new MSI.

If your application uses a Deployment with 3 replicas, this will not cause downtime for your application: the first node is reimaged, and only once that is done is the second node reimaged with the new changes.
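
The point about 3 replicas can be illustrated with a small worst-case calculation. This is a sketch under assumptions: the function name is made up, nodes are drained strictly one at a time, and the replicas evicted from the drained node are counted as unavailable until they run again somewhere else.

```python
from typing import List


def min_available(replicas_per_node: List[int]) -> int:
    """Lowest number of running replicas while nodes are drained one by one.

    Worst case: replicas on the node being drained are all unavailable
    for the duration of that drain.
    """
    if not replicas_per_node:
        return 0
    total = sum(replicas_per_node)
    return min(total - on_node for on_node in replicas_per_node)


print(min_available([1, 1, 1]))  # replicas spread over three nodes
print(min_available([3, 0, 0]))  # all replicas on the same node
```

With one replica per node, two replicas keep serving while each node is reimaged. If all three replicas happen to run on the same node, draining that node interrupts the application, so how the replicas are spread matters as much as their count.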

Upgrading a node pool in Azure Kubernetes Service (AKS) can cause downtime for your cluster. According to the documentation, during the upgrade process, AKS will cordon and drain one of the old nodes to minimize disruption to running applications. When the old node is fully drained, it will be reimaged to receive the new version, and it will become the buffer node for the following node to be upgraded. This process repeats until all nodes in the cluster have been upgraded. At the end of the process, the last buffer node will be deleted, maintaining the existing agent node count and zone balance.

Please "Accept as Answer" and Upvote if it helped, so that it can help others in the community looking for help on similar topics.
