Node count in AKS increased without approval/manual operation after SP renewal

Question

Anonymous

After the SP renewal, nodes were scaled up in the agent pools even though the scale method was set to manual.

Anonymous

2025-02-27T17:03:08.7533333+00:00

Hi Geethasri,

Thanks for the update. I have verified that autoscaling is disabled (the default) and that the scale method of the node pools is still set to manual after the SP renewal. I also verified that no automation is configured around the SP renewal. This is the second time I have run into this issue this month!

Thank You.
Anonymous

2025-02-28T05:43:14.4233333+00:00

Hi Sajin Seby,

It seems that something out of the ordinary initiated the scaling, even though autoscaling is disabled and the scale method is set to manual. It is possible that the SP renewal triggered a reset or conflict in the AKS configuration, resulting in unexpected behavior.

It might also be caused by an unseen dependency or default behavior in Azure following the SP renewal, even if no automation is implemented on your side. Since this happened twice this month, I’d recommend closely monitoring the AKS activity logs around the renewal time to see if any other changes or system-level actions are occurring that could be influencing scaling.
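The activity log can be exported as JSON and filtered locally to the renewal window. The sketch below is a minimal example, assuming the array produced by az monitor activity-log list --output json with its operationName.value, eventTimestamp, status.value and caller fields. The script name, the one-hour window and the two resource-provider prefixes are illustrative choices, not AKS requirements.

```python
"""Filter an exported Azure activity log to AKS/VMSS operations near a given time.

Hypothetical usage (file and script names are illustrative):
  az monitor activity-log list --resource-group <RESOURCE_GROUP> \
      --start-time 2025-01-08T10:00:00Z --end-time 2025-01-08T14:00:00Z \
      --output json > activity.json
  python activity_window.py activity.json 2025-01-08T12:18:03Z
"""
import json
import re
import sys
from datetime import datetime, timedelta

# Resource-provider prefixes worth inspecting (assumption: AKS and its scale sets).
PREFIXES = (
    "Microsoft.ContainerService/managedClusters",
    "Microsoft.Compute/virtualMachineScaleSets",
)

_TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?(Z|[+-]\d{2}:\d{2})?$")


def parse_ts(value):
    """Parse ISO 8601 timestamps, including Azure's 7-digit fractional seconds."""
    match = _TS.match(value)
    if not match:
        raise ValueError(f"unrecognized timestamp: {value!r}")
    base, frac, tz = match.groups()
    # datetime stores microseconds only, and fromisoformat() rejects longer
    # fractions on some Python versions, so trim or pad to 6 digits.
    frac = (frac or "0")[:6].ljust(6, "0")
    # A missing offset is treated as UTC (assumption).
    tz = "+00:00" if tz in (None, "Z") else tz
    return datetime.fromisoformat(f"{base}.{frac}{tz}")


def entries_near(entries, center, window_minutes=60):
    """Return sorted (time, operation, status, caller) tuples inside the window."""
    low = center - timedelta(minutes=window_minutes)
    high = center + timedelta(minutes=window_minutes)
    hits = []
    for entry in entries:
        operation = (entry.get("operationName") or {}).get("value") or ""
        stamp = entry.get("eventTimestamp")
        if not stamp or not operation.startswith(PREFIXES):
            continue
        when = parse_ts(stamp)
        if low <= when <= high:
            status = (entry.get("status") or {}).get("value") or ""
            hits.append((when, operation, status, entry.get("caller") or ""))
    return sorted(hits)


if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1], encoding="utf-8") as handle:
        log = json.load(handle)
    for when, operation, status, caller in entries_near(log, parse_ts(sys.argv[2])):
        print(when.isoformat(), status, operation, caller)
```

Entries with a Failed status, or callers you do not recognize, are the natural first candidates to look at.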

If you have any further queries, please let us know; we are glad to help.
Anonymous

2025-02-28T05:54:46.8466667+00:00

Hi Geethasri,

I could not find the option to raise a ticket. Every time I try to create a support ticket, it takes me to either the documentation or the Q&A page.
Anonymous

2025-02-28T17:13:06.5733333+00:00

Hi Sajin Seby,

Please refer to this document:
https://learn.microsoft.com/en-us/azure/aks/scale-cluster?source=recommendations&tabs=azure-cli
If you have any further questions, let me know.

Thank you.
Anonymous

2025-03-06T08:47:13.97+00:00

Hi Sajin Seby,

Just checking in to see whether you have had a chance to review my response and whether it resolved the issue.

If it was helpful, please click "Upvote" on this post to let us know.

Thank You.
Anonymous

2025-03-07T04:03:16.7233333+00:00

Hi Geethasri,

None of this resolved my issue. Also, I cannot raise a support request through Azure even though we have a support plan; every time, it redirects me to the documentation page.

Thanks
Anonymous

2025-03-07T12:45:52.8633333+00:00

Hi Sajin Seby,

Reasons for Node Count Increase

Cluster Autoscaler:

Automatically adjusts the number of nodes in response to resource demands.

If workloads increase, the autoscaler may add nodes without manual intervention.

Scheduled Jobs:

If the cluster autoscaler is enabled, scheduled tasks that require more resources could trigger an increase in node count.

Configuration Changes:

Changes in configuration or policies related to scaling might have been applied, leading to an increase in nodes.

Service Principal Permissions:

The renewed service principal may have permissions that allow it to modify the cluster settings, including scaling operations.

Monitoring and Alerts:

Check if there are any monitoring tools or alerts that might have triggered scaling actions based on predefined thresholds.

Recommended Actions

Review Cluster Autoscaler Settings:

Verify if the autoscaler is enabled and check its configuration.

Audit Logs:

Examine the audit logs for any actions taken by the service principal or other users that could explain the increase.

Scaling Policies:

Review any scaling policies that may have been set up to ensure they align with your operational requirements.

Service Principal Permissions:

Ensure that the service principal has the appropriate permissions and that no unnecessary permissions are granted.

If you have any further queries, please let us know; we are glad to help. Thank you.
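One way to review the service principal's permissions is to list its role assignments and flag the broad ones. This is a minimal sketch, assuming the JSON from az role assignment list --assignee <SP_APP_ID> --all --output json (fields roleDefinitionName and scope). The set of "broad" roles is a hypothetical policy choice; a flagged entry is a prompt to review, not proof of a problem.

```python
"""Flag role assignments of the AKS service principal that deserve a second look.

Hypothetical usage (<SP_APP_ID> is a placeholder):
  az role assignment list --assignee <SP_APP_ID> --all --output json > roles.json
  python sp_roles.py roles.json
"""
import json
import re
import sys

# Illustrative list of broad built-in roles; adjust to your own policy.
BROAD_ROLES = {"Owner", "Contributor", "User Access Administrator"}
# A scope that is a whole subscription, e.g. "/subscriptions/<id>".
SUBSCRIPTION_SCOPE = re.compile(r"^/subscriptions/[^/]+/?$")


def review_assignments(assignments):
    """Return (role, scope) pairs that use a broad role or span a whole subscription."""
    flagged = []
    for item in assignments:
        role = item.get("roleDefinitionName") or ""
        scope = item.get("scope") or ""
        if role in BROAD_ROLES or SUBSCRIPTION_SCOPE.match(scope):
            flagged.append((role, scope))
    return flagged


if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1], encoding="utf-8") as handle:
        for role, scope in review_assignments(json.load(handle)):
            print(f"{role}: {scope}")
```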
Anonymous

2025-03-10T08:48:51.6166667+00:00

Hi Sajin Seby,

I wanted to check whether you had the opportunity to review the information provided in my previous comment.

If it was helpful, please click "Upvote" on this post to let us know.

Thank You.
Anonymous

2025-03-10T15:54:04.7433333+00:00

Hi Sajin Seby,

I wanted to check whether you had the opportunity to review the information provided in my previous comment.

If it was helpful, please click "Upvote" on this post to let us know.

Thank You.
Anonymous

2025-03-11T04:13:51.41+00:00

Hi,

Nothing seems to be helpful.

Thanks
Anonymous

2025-03-11T10:57:35.9+00:00

Hi Sajin Seby,

Check Cluster Autoscaler Settings

Ensure that the cluster autoscaler is configured correctly.

Review the autoscaler setting of each node pool (enableAutoScaling in the output of az aks nodepool list). If it is set to true, the cluster autoscaler automatically adjusts the number of nodes based on the demands of your AKS workloads.

Disable Autoscaler (if necessary)

If the autoscaler is enabled on a node pool, consider disabling it, for example:

az aks nodepool update --resource-group <RESOURCE_GROUP> --cluster-name <CLUSTER_NAME> --name <NODE_POOL_NAME> --disable-cluster-autoscaler

This will prevent automatic adjustments to the node count, allowing you to control scaling through manual operations only.

Check the AKS cluster settings to ensure the scaling configuration is indeed set to manual.

Review any logs and events around the time of the SP renewal to see whether something triggered the scaling action.

Ensure that the service principal permissions are correctly configured after the renewal, especially for scaling-related operations.

Double-check whether any automated scripts or Azure policies are in place that could have affected scaling behavior.

If you have any further queries, please let us know; we are glad to help. If this was helpful, please click "Upvote" on this post to let us know. Thank you.
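To confirm the scaling mode of every node pool at once, the JSON from az aks nodepool list can be checked for the enableAutoScaling flag. A minimal sketch, assuming that output format (fields name, count, enableAutoScaling, minCount, maxCount); resource group and cluster names are placeholders.

```python
"""List node pools on which the cluster autoscaler is enabled.

Hypothetical usage (placeholders in angle brackets):
  az aks nodepool list --resource-group <RESOURCE_GROUP> --cluster-name <CLUSTER_NAME> --output json > pools.json
  python autoscaler_check.py pools.json
"""
import json
import sys


def autoscaler_pools(pools):
    """Return (name, count, minCount, maxCount) for pools with enableAutoScaling set."""
    return [
        (p.get("name"), p.get("count"), p.get("minCount"), p.get("maxCount"))
        for p in pools
        if p.get("enableAutoScaling")  # false or null both mean manual scaling
    ]


if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1], encoding="utf-8") as handle:
        enabled = autoscaler_pools(json.load(handle))
    if not enabled:
        print("Cluster autoscaler is disabled on every node pool (manual scaling).")
    for name, count, low, high in enabled:
        print(f"{name}: count={count}, autoscaler range {low}-{high}")
```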
Anonymous

2025-03-11T12:02:22.6266667+00:00

Hi,

I have verified the above but could find nothing useful.

Thanks
Arko (Microsoft External Staff Moderator)

2025-03-12T16:33:47.25+00:00

Hello Sajin Seby,

From the conversation above, I understand that you had an unplanned increase in node count and want to find the root cause. To dig deeper, we will need some additional details from you.

First, is this a production cluster? If not, which environment is it in?
Second, which AKS version is this cluster running, and is Azure Monitor enabled on it?

Could you update your original question with the output of the commands below?

kubectl get events --sort-by=.metadata.creationTimestamp -A | grep -i scale

Look for any entries related to NodeAdded or ScaleUp
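The same search can be run on the JSON form of the events, which keeps the object kind and name next to each match. A minimal sketch, assuming the output of kubectl get events -A -o json (items with reason, message, involvedObject and metadata). Note that Kubernetes keeps events only for a limited time (one hour by default, set by the kube-apiserver --event-ttl flag), so this is only useful shortly after the scaling happens; the script name and keywords are illustrative.

```python
"""Search Kubernetes events for scaling-related keywords (similar to grep -i scale).

Hypothetical usage:
  kubectl get events -A -o json > events.json
  python events_scan.py events.json scale node
"""
import json
import sys


def matching_events(doc, keywords=("scale",)):
    """Return (timestamp, namespace, kind, name, reason) for events whose reason or message matches."""
    wanted = [k.lower() for k in keywords]
    hits = []
    for event in doc.get("items", []):
        text = f"{event.get('reason') or ''} {event.get('message') or ''}".lower()
        if any(k in text for k in wanted):
            meta = event.get("metadata") or {}
            obj = event.get("involvedObject") or {}
            hits.append((
                meta.get("creationTimestamp") or "",
                meta.get("namespace") or "",
                obj.get("kind") or "",
                obj.get("name") or "",
                event.get("reason") or "",
            ))
    # RFC 3339 timestamps in UTC ("...Z") sort correctly as strings.
    return sorted(hits)


if __name__ == "__main__" and len(sys.argv) >= 2:
    with open(sys.argv[1], encoding="utf-8") as handle:
        events = json.load(handle)
    for hit in matching_events(events, sys.argv[2:] or ("scale",)):
        print(*hit)
```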

az monitor activity-log list --resource-group arkrg --output table

Look for any actions related to "Microsoft.ContainerService/managedClusters/scale" or node-related activity.

Sometimes the virtual machine scale set (VMSS) behind a node pool can trigger scaling independently, so we need to check that part as well. If the VMSS encountered failures, Azure may also have triggered a repair, which can cause scaling changes.

az vmss list-instances --resource-group MC_<RESOURCE_GROUP>_<AKS_CLUSTER_NAME>_<REGION> --name <NODE_POOL_VMSS_NAME> --output table

and

az vmss get-instance-view --resource-group MC_<RESOURCE_GROUP>_<AKS_CLUSTER_NAME>_<REGION> --name <NODE_POOL_VMSS_NAME> --output json
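To see whether the scale set holds more instances than the node pool is configured for, the JSON form of az vmss list-instances can be compared with the pool's count. A minimal sketch, assuming that output (one object per instance with a provisioningState field); the desired count is passed in by hand, and all names in angle brackets are placeholders.

```python
"""Compare a node pool's configured count with the instances in its scale set.

Hypothetical usage (placeholders in angle brackets):
  az vmss list-instances --resource-group <NODE_RESOURCE_GROUP> --name <NODE_POOL_VMSS_NAME> --output json > instances.json
  python vmss_compare.py instances.json 1
"""
import json
import sys
from collections import Counter


def compare_instances(desired_count, instances):
    """Summarize actual versus desired instance counts and provisioning states."""
    states = Counter(i.get("provisioningState") or "Unknown" for i in instances)
    return {
        "desired": desired_count,
        "actual": len(instances),
        "extra": len(instances) - desired_count,
        "states": dict(states),
    }


if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1], encoding="utf-8") as handle:
        print(compare_instances(int(sys.argv[2]), json.load(handle)))
```

A positive extra value together with instances that are not in the Succeeded state would point toward an in-progress or interrupted operation rather than a deliberate scale-out.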

Microsoft sometimes applies background upgrades, which can modify the node count, so we need to cross-check that as well:

kubectl get events --sort-by=.metadata.creationTimestamp -A | grep -i "upgrade"

Finally, we also need the output of:

az aks nodepool list --resource-group arkrg --cluster-name arkoaks --output json
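Because the question is whether something changed during the renewal, saving the node pool list before and after the SP reset and diffing the two snapshots can make the change explicit. A minimal sketch, assuming the JSON from az aks nodepool list --output json; the compared fields are an illustrative selection, and file names are hypothetical.

```python
"""Diff two saved snapshots of az aks nodepool list output.

Hypothetical usage: save a snapshot before and after the SP reset, then compare.
  az aks nodepool list --resource-group <RESOURCE_GROUP> --cluster-name <CLUSTER_NAME> --output json > before.json
  (reset the service principal)
  az aks nodepool list --resource-group <RESOURCE_GROUP> --cluster-name <CLUSTER_NAME> --output json > after.json
  python pool_diff.py before.json after.json
"""
import json
import sys

# Fields compared per pool (illustrative selection).
FIELDS = ("count", "enableAutoScaling", "minCount", "maxCount",
          "orchestratorVersion", "provisioningState")


def diff_pools(before, after, fields=FIELDS):
    """Return (pool, field, old, new) tuples; field is 'added'/'removed' for whole pools."""
    old = {p["name"]: p for p in before}
    new = {p["name"]: p for p in after}
    changes = []
    for name in sorted(old.keys() | new.keys()):
        if name not in old:
            changes.append((name, "added", None, None))
        elif name not in new:
            changes.append((name, "removed", None, None))
        else:
            for field in fields:
                if old[name].get(field) != new[name].get(field):
                    changes.append((name, field, old[name].get(field), new[name].get(field)))
    return changes


if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1], encoding="utf-8") as f_before, \
            open(sys.argv[2], encoding="utf-8") as f_after:
        for change in diff_pools(json.load(f_before), json.load(f_after)):
            print(*change)
```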

Please share the requested details, along with the few additional details I requested in a private message, so that we can assist you better. Thanks.

Answer 1

Hi Sajin Seby,

Welcome to the Microsoft Q&A Platform! Thank you for asking your question here.

We understand from your query that the node count increased when you renewed the SP.

It appears that the growth in node numbers may be due to automated functionality within AKS, such as the Cluster Autoscaler, which may have been activated following the SP renewal. When the service principal was renewed, it may have reinstated permissions or activated a policy that allowed autoscaling to take place, particularly if there were resource requirements.

It's also conceivable that certain automated tasks or scripts involving the SP renewal inadvertently caused scaling activities. To avoid this in the future, I'd recommend verifying the autoscaling configuration and checking any automation associated with the SP to make sure everything is set up as desired.

If the answer was helpful, please don't forget to upvote and accept the answer.

Thank you.

Answer 2

Arko (Microsoft External Staff Moderator)

Hi Sajin Seby,

Thank you for sharing the details regarding the unexpected node count increase in your AKS cluster. After a thorough investigation and discussions with the Microsoft product team, here are the key findings:

Your AKS cluster is currently running Kubernetes version 1.28.5, which reached end-of-life in January 2025. As per Microsoft's AKS support policy, clusters running unsupported versions do not receive runtime guarantees, and unexpected behavior may occur. We recommend upgrading your cluster to at least Kubernetes 1.30 to ensure continued support and stability.

No forced scale-up events were detected in the timeframe of interest (27th February 2025, between 10 AM – 12 PM IST). The kubectl get events and az vmss list-instances commands did not reveal any unexpected scaling activities.

Most importantly, the node pool always had a count of 1.

From our internal logs for your agent pool snapshot data, it was confirmed that the discussed node pool had a consistent node count of 1 since 8th January 2025. There was no sudden scale-up on 27th February 2025, as the node count was already set to 1 before this date.

As discussed in private message, your cluster is managed via Terraform, and while no Terraform apply operation was explicitly identified during the timeframe, it is possible that a previous Terraform state enforcement may have contributed to the node count being maintained at 1.

Please be assured, there was no sudden scale-up event detected.

Anonymous

2025-03-26T05:22:23.23+00:00

Hi, I have encountered this in 2 different clusters during SP renewal. Can you check the events and logs around '2025-01-08T12:18:03Z'? The SP expiry of the qanew cluster shows as '2026-01-08T12:18:03Z'. I gave the wrong date for the issue in my earlier message. Please check this time and let me know if you have further queries.
Arko (Microsoft External Staff Moderator)

2025-03-26T06:11:41.98+00:00

Sure. Will check and update you.
Anonymous

2025-03-26T06:21:08.3766667+00:00

Thanks
Arko (Microsoft External Staff Moderator)

2025-03-26T15:52:59.4533333+00:00

Hello Sajin Seby, looking at the cluster status around January 8 in the timeframe you mentioned, we can see that there was an operation to reset the cluster SP, and that this operation failed while draining agent pool nodes because of a PodDisruptionBudget (PDB) error.
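Node drains evict pods through the Kubernetes Eviction API, which refuses an eviction that would violate a PDB, so any PDB that currently allows zero disruptions (for example a single-replica workload with minAvailable: 1) can block a drain. A minimal sketch for finding such PDBs, assuming the output of kubectl get pdb -A -o json; note that it reflects the state at the time you run it, not the state during the January operation, and the script name is hypothetical.

```python
"""List PodDisruptionBudgets that currently allow no voluntary disruptions.

Hypothetical usage:
  kubectl get pdb -A -o json > pdbs.json
  python pdb_check.py pdbs.json
"""
import json
import sys


def blocking_pdbs(doc):
    """Return 'namespace/name' for PDBs whose status.disruptionsAllowed is 0."""
    blocked = []
    for item in doc.get("items", []):
        status = item.get("status") or {}
        if status.get("disruptionsAllowed", 0) == 0:
            meta = item.get("metadata") or {}
            blocked.append(f"{meta.get('namespace')}/{meta.get('name')}")
    return sorted(blocked)


if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1], encoding="utf-8") as handle:
        for name in blocking_pdbs(json.load(handle)):
            print(name)
```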
Anonymous

2025-03-27T04:28:19.71+00:00

The node count went from 1 to 2: after the SP renewal, we found one extra node. This also occurred in another cluster on the date I gave you previously.
Arko (Microsoft External Staff Moderator)

2025-03-27T06:06:58.8333333+00:00

Can you share the details of the other cluster that encountered a similar situation? Please send them in a private message; I have pinged you there.
Arko (Microsoft External Staff Moderator)

2025-03-27T06:08:10.8566667+00:00

Thank you for the second cluster details. I will get that checked as well and update you.
Anonymous

2025-03-27T09:12:07.3533333+00:00

Sure, Thanks
Arko (Microsoft External Staff Moderator)

2025-03-27T15:25:30.69+00:00

As part of the SP update process, the nodes are reimaged: https://learn.microsoft.com/en-us/azure/aks/update-credentials#reset-the-existing-service-principal-credentials
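The sketch below only illustrates the arithmetic of surge-based node replacement. It rests on an assumption that the thread does not confirm: that the reimage triggered by the SP reset honours the node pool's surge setting (upgradeSettings.maxSurge in the az aks nodepool list output), where a percentage is rounded up to a whole node. Under that assumption, a one-node pool with any non-zero surge briefly runs two nodes, which matches the 1 to 2 change reported in this thread, although surge is not established here as the cause.

```python
"""Estimate the peak node count while a node pool is reimaged or upgraded with surge.

Assumption: maxSurge is either an absolute node count ("1") or a percentage of
the pool ("33%"), and a percentage is rounded up to a whole node.
"""


def surge_nodes(node_count, max_surge):
    """Number of extra nodes added during the operation."""
    value = str(max_surge).strip()
    if value.endswith("%"):
        percent = int(value[:-1])
        # Integer ceiling division: fractional nodes round up.
        return (node_count * percent + 99) // 100
    return int(value)


def peak_nodes(node_count, max_surge):
    """Node count while surge nodes exist alongside the original nodes."""
    return node_count + surge_nodes(node_count, max_surge)


if __name__ == "__main__":
    # A one-node pool, as in this thread, with two common surge settings.
    for setting in ("1", "10%"):
        print(setting, peak_nodes(1, setting))
```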
Anonymous

2025-04-21T11:28:44.71+00:00

Hi, please share your conclusion.
Bheemani Anji Babu (Microsoft External Staff Moderator)

2025-04-25T18:36:30.2366667+00:00

Hi @Anonymous
We are looking into it and will share steps soon. Meanwhile, it might help to check the Activity Log for any triggers, review the Service Principal permissions after renewal, and double-check autoscaler settings or any scripts that might've scaled the nodes. If you can share your AKS version and whether Azure Monitor is enabled, that'll help too. We’ll update you shortly! Thanks.

Answer 3

Hi Sajin Seby,

I hope you are doing well.

Could you please let us know whether this issue is still happening?

Also, after renewing a service principal (SP) in Azure, it's possible for nodes in your Azure Kubernetes Service (AKS) cluster to scale up, even if you're using a manual scaling configuration. This can happen because the SP renewal might inadvertently trigger a change in the cluster's authentication or authorization settings, leading to unexpected scaling behavior.

Here is the reference: https://learn.microsoft.com/en-us/azure/aks/update-credentials#reset-the-existing-service-principal-credentials

Troubleshooting:

If you've experienced this issue, you should:

Examine the scaling history: Check the Azure portal or the AKS cluster logs to see if there's any indication of when the node scaling occurred and what might have triggered it.

Review Cluster Autoscaler settings.

Verify SP Permissions.

Consider Managed Identities: Instead of relying on SPs, consider using managed identities for authentication, as they offer more automated management of credentials.

Consult Microsoft Support: If you're still struggling to resolve the issue, you may need to file a support ticket with Microsoft for deeper investigation.

Please do not forget to “Accept the answer” and “up-vote” wherever the information provided helps you; this can be beneficial to other community members.

If you have any other questions or are still running into more issues, let me know in the "comments" and I would be happy to help you.

Thank You.

Lisboa

Answer 4

LISBOA-4826 (Volunteer Moderator)

Hi Sajin Seby,

Can you please provide feedback on our last answer? By design, renewing the SP triggers the recreation of the instances in the node pool on AKS.

Thank You.

Lisboa
