AKS: Issues with Pod scaling in certain scenarios?

Question

Jim Zhang (XBOX) 20 Microsoft Employee

Has anyone noticed issues when trying to adjust the autoscale pod count using the interface below?
User's image

I noticed that it took 30-60 minutes to get any response. Is that normal?

We noticed this issue, and our autoscale was not working at all.
User's image

More context:

Node pool ind16 was stuck at 1 node even though BusyPercentage was over 95% for several hours. We had to scale it up manually, which took almost an hour. I am not sure why autoscale is not working here, but it seems like it should have scaled up automatically. Related incident: [REDACTED_SUBSCRIPTION_ID]

Manish Deshpande 7,255 Reputation points Microsoft External Staff Moderator

2026-02-11T02:01:58.62+00:00

Hello Jim Zhang

I wanted to check if my last response made sense. I’d be glad to assist further or explain anything in more detail. If the answer is helpful, please accept it and upvote so that it can help others in the community.

Manish Deshpande 7,255 Reputation points Microsoft External Staff Moderator

2026-02-13T00:59:32.09+00:00

Hello Jim Zhang

I wanted to check if my last response made sense. I’d be glad to assist further or explain anything in more detail. If the answer is helpful, please accept it and upvote so that it can help others in the community.

Answer accepted by question author

Manish Deshpande 7,255 Microsoft External Staff Moderator

Hello Jim Zhang

Thank you for contacting us about the AKS pod issue.

Steps to fix the issue:

1. Confirm HPA is actually receiving metrics

Run the commands listed below:

kubectl get hpa

kubectl describe hpa <hpa-name>

Check for:

Current CPU utilization (a value should be shown, not <unknown>)
No "metrics not available" warnings

If metrics are delayed or missing, HPA cannot scale.
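To illustrate why missing metrics block scaling, here is a minimal Python sketch of the replica calculation described in the upstream Kubernetes HPA documentation: desired replicas = ceil(current replicas x current metric / target metric), skipped when the ratio is within a tolerance (0.1 by default upstream). The function name and sample numbers are hypothetical, and the real controller handles partially missing metrics and pod readiness in more detail than shown here.

```python
import math
from typing import Optional


def desired_replicas(current_replicas: int,
                     current_utilization: Optional[float],
                     target_utilization: float,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA rule: ceil(current * currentMetric / targetMetric)."""
    if current_utilization is None:
        # No metric reported: nothing to compute, so the replica count stays put.
        return current_replicas
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: no scaling action.
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Hypothetical numbers: 4 replicas, 50% CPU target.
print(desired_replicas(4, 95.0, 50.0))  # 8 (4 * 1.9 = 7.6, rounded up)
print(desired_replicas(4, None, 50.0))  # 4 (metrics missing, no change)
print(desired_replicas(4, 52.0, 50.0))  # 4 (within tolerance)
```

If kubectl describe hpa shows no current utilization, the second case applies and the replica count does not move no matter how busy the pods are.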

2. Tune HPA scale‑up behavior

Add the following under spec in the HPA manifest (autoscaling/v2):

behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
    - type: Percent
      value: 300
      periodSeconds: 15

This lets the HPA add up to 300% of the current replica count every 15 seconds during a spike, instead of following the more gradual default policy.
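As a rough illustration of what the policy above changes, the following Python sketch counts how many 15-second periods it takes to grow from a starting replica count to a target. It assumes a simplified model in which each period may grow to the larger of the Percent limit and the Pods limit, and it assumes the upstream default scale-up policy is 100% or 4 pods per 15 seconds (my understanding of the Kubernetes defaults; check your cluster version). The function name and replica numbers are hypothetical.

```python
import math


def periods_to_reach(start: int, target: int, percent: int, pods: int = 0) -> int:
    """Count scale-up periods when each period may grow to the larger of
    the Percent limit and the Pods limit (as with selectPolicy: Max)."""
    if start < 1 or (percent <= 0 and pods <= 0):
        raise ValueError("need start >= 1 and a policy that allows growth")
    replicas, periods = start, 0
    while replicas < target:
        by_percent = math.ceil(replicas * (1 + percent / 100))
        by_pods = replicas + pods
        replicas = min(target, max(by_percent, by_pods))
        periods += 1
    return periods


# Hypothetical spike: 2 replicas must become 100.
print(periods_to_reach(2, 100, percent=300))          # 3 periods (2 -> 8 -> 32 -> 100)
print(periods_to_reach(2, 100, percent=100, pods=4))  # 6 periods under the assumed defaults
```

In this model the tuned policy reaches 100 replicas in about 45 seconds instead of about 90, which is a modest gain. If scaling took 30-60 minutes, the bottleneck is more likely missing metrics or node capacity than the HPA rate limit.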

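3. Verify pod resource requests are realistic

The cluster autoscaler adds nodes only when pods are pending because they cannot be scheduled. If pod CPU/memory requests are set too low, the scheduler can keep placing pods on the existing nodes, so no node scale‑up is triggered even when actual usage is very high.

Run the following and compare the Requests values with the pod's actual usage: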

kubectl describe pod <pod-name>
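The effect of undersized requests can be sketched in a few lines of Python. This is a simplified model, assuming the scheduler only checks that the sum of CPU requests fits within the node's allocatable CPU; all names and numbers are hypothetical.

```python
from typing import List


def fits_on_node(allocatable_m: int, scheduled_requests_m: List[int],
                 new_request_m: int) -> bool:
    """Scheduling looks at requested CPU (millicores), not at actual usage."""
    return sum(scheduled_requests_m) + new_request_m <= allocatable_m


ALLOCATABLE_M = 4000                   # hypothetical node with 4 allocatable cores
ACTUAL_USAGE_M = [950, 950, 950, 950]  # what the four pods really consume

busy_percent = 100 * sum(ACTUAL_USAGE_M) / ALLOCATABLE_M
print(busy_percent)  # 95.0

# Requests of 100m each: a fifth pod still "fits", so nothing goes Pending
# and the cluster autoscaler has no reason to add a node.
print(fits_on_node(ALLOCATABLE_M, [100, 100, 100, 100], 100))  # True

# Requests that match real usage: the fifth pod cannot be placed, becomes
# Pending, and that is the signal the cluster autoscaler reacts to.
print(fits_on_node(ALLOCATABLE_M, [950, 950, 950, 950], 950))  # False
```

This matches the symptom in the question: a node can sit above 95% busy for hours without any scale‑up if the requests say there is still room.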

4. Validate node pool autoscaler limits

Ensure:

Node pool max node count is high enough
No quota exhaustion at subscription or VM family level

If the max node count is reached, the autoscaler silently stops scaling up.
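The two limits above combine as a simple minimum, which the following Python sketch makes explicit. It is only arithmetic under stated assumptions: the function name, VM size, and quota figures are hypothetical, and real quota checks apply per region and per VM family.

```python
def addable_nodes(current_nodes: int, max_nodes: int, cores_per_node: int,
                  quota_limit_cores: int, quota_used_cores: int) -> int:
    """Nodes the autoscaler could still add before hitting either limit."""
    by_pool_max = max(0, max_nodes - current_nodes)
    by_quota = max(0, (quota_limit_cores - quota_used_cores) // cores_per_node)
    return min(by_pool_max, by_quota)


# Pool already at its max of 1 node: nothing can be added, whatever the quota.
print(addable_nodes(1, 1, 16, 100, 16))   # 0

# Max of 10 nodes, but only 24 cores of quota left for 16-core nodes.
print(addable_nodes(1, 10, 16, 40, 16))   # 1
```

A result of 0 from either limit explains a pool that stays at one node regardless of load.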

5. Temporary fix

When immediate recovery is required:

Manually scale the node pool to relieve pressure
Then apply the HPA tuning to prevent recurrence

Manual scaling is safe and commonly used during live incidents.

Conclusion:
This behavior is not a platform outage and not a bug in most cases. It is a result of:

AKS autoscaling design
Conservative default HPA policies
Metrics collection intervals under load

After tuning HPA behavior and validating resource requests, autoscaling stabilized and responded correctly in similar production environments.

Links:
https://learn.microsoft.com/en-us/azure/aks/concepts-scale

https://github.com/Azure/AKS/issues/5520

Thanks,
Manish

Answer 1

The issue you are experiencing with the AKS cluster autoscaler not scaling up despite the BusyPercentage being over 95% could be attributed to several factors. Here are some common causes and mitigation recommendations:

Core Quota Exhaustion: If the approved core quota has been exhausted, the cluster autoscaler will not be able to scale up. You can request a quota increase to resolve this issue.
Max Size of Node Pool: Ensure that the maximum number of nodes in your node pool is set appropriately. If it is reached, the autoscaler will not be able to add more nodes.
Requests/Calls Exceeding the Rate Limit: If you are hitting rate limits, it may prevent the autoscaler from functioning properly. You can check for 429 Too Many Requests errors to diagnose this.
Node Pool in Backoff: If there have been multiple failed scale-up attempts, the node pool may enter a backoff state. This means that the autoscaler will wait before attempting to scale again. You can reset the node pool's backoff state by disabling and then re-enabling autoscaling.
Pod Preventing Node Drain: If there are pods that cannot be evicted from nodes, this can prevent scaling down and subsequently affect scaling up. Ensure that your pod configurations allow for safe eviction.

To troubleshoot further, you may want to check the logs of the cluster autoscaler for any specific error messages that could provide more insight into why the scaling did not occur as expected.

References:

Cluster autoscaling in Azure Kubernetes Service (AKS) overview
