@Thiagarajan Vasudevan, thank you for your question.
Is auto-upgrade safe to enable?
Cluster auto-upgrade only updates to GA versions of Kubernetes and will not update to preview versions.
Automatically upgrading a cluster follows the same process as manually upgrading a cluster.
However, cluster auto-upgrade for AKS clusters is a preview feature.
AKS preview features are available on a self-service, opt-in basis. Previews are provided "as is" and "as available," and they're excluded from the service-level agreements and limited warranty. AKS previews are partially covered by customer support on a best-effort basis. As such, these features aren't meant for production use. For more information, please check this article.
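As an illustrative sketch only (the resource group and cluster names are placeholders), cluster auto-upgrade is enabled by selecting an auto-upgrade channel on the cluster:

```bash
# Enable cluster auto-upgrade on an existing cluster by selecting an
# auto-upgrade channel (for example patch, stable, rapid, or node-image).
# "myResourceGroup" and "myAKSCluster" are placeholder names.
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --auto-upgrade-channel stable
```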
Do we need full regression on code changes or a basic sanity check?
Please refer to the Kubernetes API Deprecation Policy and the CHANGELOG that matches your target Kubernetes version to understand whether the new version introduces breaking changes. Also, please review the AKS release notes for changes in AKS system objects to understand the impact (if any). Based on these effects, you can decide whether a code regression or a sanity check is sufficient.
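For example, you can first list the versions your cluster can actually be upgraded to, and then review the CHANGELOG and AKS release notes for those versions (resource group and cluster names below are placeholders):

```bash
# List the Kubernetes versions available as upgrade targets for this cluster,
# so you know which changelog(s) and release notes to review beforehand.
az aks get-upgrades \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --output table
```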
Is there any outage expected during this change?
- An AKS cluster upgrade triggers a cordon and drain of your nodes. If you have a low compute quota available, the upgrade may fail. For more information, see increase quotas.
- Ensure that any PodDisruptionBudgets (PDBs) allow for at least 1 pod replica to be moved at a time, otherwise the drain/evict operation will fail (a quick check is sketched after this list). If the drain operation fails, the upgrade operation fails by design to ensure that applications are not disrupted. Please correct whatever caused the operation to stop (incorrect PDBs, lack of quota, and so on) and retry the operation.
- Also, please ensure that the endpoints required to be accessible for AKS clusters are not blocked. [Reference]
- If using a service principal as the identity for the control plane's authentication, please ensure that the service principal profile is updated with an unexpired client secret. [Reference]
- Please ensure that the certificates in the AKS cluster are valid. [Reference]
- Please ensure that the AKS cluster subnet has sufficient IP addresses so that a surge node can be provisioned during the upgrade process; otherwise, scale the cluster accordingly. If using Azure CNI, please refer to this article. For more information, please check here.
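As a rough pre-upgrade check sketch (the region "eastus" is a placeholder), you can review PDBs and regional compute quota before starting:

```bash
# Check PodDisruptionBudgets across all namespaces; "ALLOWED DISRUPTIONS"
# should be at least 1 for every PDB, otherwise the drain/evict will fail.
kubectl get pdb --all-namespaces

# Check regional vCPU quota and usage for the cluster's location
# ("eastus" is a placeholder) to confirm there is room for surge nodes.
az vm list-usage --location eastus --output table
```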
Running az aks upgrade gives you a zero-downtime way to apply updates. The command handles applying the latest updates to all your cluster's nodes, cordoning and draining traffic to the nodes, and restarting the nodes, then allowing traffic to the updated nodes. [Reference]
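A minimal sketch of the upgrade command itself (the version shown is a placeholder; use one returned by az aks get-upgrades):

```bash
# Upgrade the control plane and node pools to the target Kubernetes version.
# "1.24.9" is a placeholder; pick a version listed by `az aks get-upgrades`.
az aks upgrade \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --kubernetes-version 1.24.9
```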
We recommend performing these operations on non-production environments before production AKS clusters, to be safe.
Please check the AKS FAQ for more information on AKS upgrades.
What duration is expected on a per-node basis?
It takes a few minutes to upgrade the cluster, depending on how many nodes you have.
By default, AKS configures upgrades to surge with one additional node. A default value of one for the max surge setting enables AKS to minimize workload disruption by creating an additional node before the cordon/drain of existing applications, to replace an older-versioned node. The max surge value may be customized per node pool to enable a trade-off between upgrade speed and upgrade disruption. Increasing the max surge value makes the upgrade process complete faster, but setting a large value for max surge may cause disruptions during the upgrade process.
For example, a max surge value of 100% provides the fastest possible upgrade process (doubling the node count) but also causes all nodes in the node pool to be drained simultaneously. You may wish to use a higher value such as this for testing environments. For production node pools, we recommend a max_surge setting of 33%.
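As an illustrative sketch, max surge can be set per node pool with the Azure CLI (resource group, cluster, and node pool names are placeholders):

```bash
# Set max surge to 33% on an existing node pool, trading some extra surge
# capacity for a faster upgrade; "nodepool1" is a placeholder name.
az aks nodepool update \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name nodepool1 \
  --max-surge 33%
```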
For more information please check this article.
What are all the resources that need to be upgraded, like cluster, ingress controller, etc.?
During an AKS cluster upgrade, the control plane node image and components (kube-apiserver, kube-controller-manager, kube-scheduler, etcd), the agent pool node image and kubelet, and the CNI are upgraded. However, when using az aks upgrade, the AKS resource provider handles all of this. You might see changes in system objects like CSI driver pods, coredns pods, etc. based on changes described in the AKS release notes. Depending on the changes introduced by the new Kubernetes version as described in the corresponding CHANGELOG, you might have to modify objects existing on the Kubernetes cluster like pods, deployments, ingresses, etc.
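As an optional post-upgrade sanity sketch (resource names are placeholders), you can confirm the control plane and node versions afterwards:

```bash
# Show the control plane Kubernetes version after the upgrade.
az aks show \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --query kubernetesVersion \
  --output tsv

# Confirm every node reports the new kubelet version.
kubectl get nodes -o wide
```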
Additional Information
Why do upgrades to Kubernetes 1.16 fail when using node labels with a kubernetes.io prefix?
As of Kubernetes 1.16 only a defined subset of labels with the kubernetes.io prefix can be applied by the kubelet to nodes. AKS cannot remove active labels on your behalf without consent, as it may cause downtime to impacted workloads.
As a result, to mitigate this issue you can:
- Upgrade your cluster control plane to 1.16 or higher
- Add a new node pool on 1.16 or higher without the unsupported kubernetes.io labels
- Delete the older node pool (a sketch of these two steps follows below)
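A minimal sketch of those two steps with the Azure CLI (node pool names, node count, and version are placeholders; use the version your control plane is on):

```bash
# Add a new node pool on 1.16+ without any unsupported kubernetes.io labels.
# "newpool", the node count, and "1.16.10" are placeholder values.
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name newpool \
  --node-count 3 \
  --kubernetes-version 1.16.10

# After workloads have moved over, delete the older node pool.
az aks nodepool delete \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name oldpool
```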
AKS is investigating the capability to mutate active labels on a node pool to improve this mitigation.
----------
Hope this helps.
Please "Accept as Answer" if it helped, so that it can help others in the community looking for help on similar topics.