
AKS: Issues with Pod scaling in certain scenarios?

Jim Zhang (XBOX) 20 Reputation points Microsoft Employee
2026-02-02T23:41:02.75+00:00

Has anyone noticed issues when trying to adjust the autoscale pod count using the interface below?
User's image

I noticed that it took 30-60 minutes to get any response. Is that normal?

We noticed this issue, and our autoscale was not working at all:
User's image

More context:

Node pool ind16 was stuck at 1 node even though BusyPercentage was over 95% for several hours. We had to scale it up manually, which took almost an hour. I'm not sure why autoscale is not working here, but it seems like we should have scaled up automatically. Related incident: [REDACTED_SUBSCRIPTION_ID]

Azure Kubernetes Service

An Azure service that provides serverless Kubernetes, an integrated continuous integration and continuous delivery experience, and enterprise-grade security and governance.


Answer accepted by question author

Manish Deshpande 7,255 Reputation points Microsoft External Staff Moderator
2026-02-03T00:00:01.12+00:00

Hello Jim Zhang

Thank you for contacting us about the AKS pod issue.

Steps to fix the issue:

1. Confirm HPA is actually receiving metrics

Run the following commands:
kubectl get hpa

kubectl describe hpa <hpa-name>

Check for:

  • Current CPU utilization
  • No warnings that metrics are not available

If metrics are delayed or missing, HPA cannot scale.
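
To go one level below the HPA itself, the metrics pipeline can be checked directly. This is a minimal sketch, assuming the cluster uses the standard resource-metrics API served by metrics-server; the function name check_hpa_metrics and the namespace argument are made up for illustration.

```shell
# Sketch: check the resource-metrics pipeline that the HPA reads from.
# Usage: check_hpa_metrics <namespace>   (namespace is a placeholder)
check_hpa_metrics() {
  # The metrics API registration should report AVAILABLE=True
  kubectl get apiservice v1beta1.metrics.k8s.io
  # If this prints CPU/memory numbers, pod metrics are being served
  kubectl top pods -n "$1"
  # Current vs. target utilization for every HPA in the namespace
  kubectl get hpa -n "$1"
}
```

If the second command fails or the HPA targets show as unknown, the problem is likely in metrics collection rather than in the HPA policy.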

2. Tune HPA scale-up behavior

behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
    - type: Percent
      value: 300
      periodSeconds: 15

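With no stabilization window and a 300% policy, the HPA can grow to as much as four times the current replica count every 15 seconds, so pods scale out faster during spikes. For comparison, the Kubernetes default scale-up policy allows doubling the replicas or adding 4 pods per 15 seconds, whichever is greater.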
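
To get a feel for what a Percent policy means in practice, here is a rough arithmetic sketch. It assumes the per-period cap is ceil(replicas * (1 + percent/100)) and ignores everything else that bounds a real HPA (metric ratio, maxReplicas, metrics delay, pod startup time). The function name periods_to_reach is hypothetical.

```shell
# periods_to_reach START TARGET PERCENT
# Prints how many policy periods a Percent scale-up policy needs to grow
# from START replicas (must be >= 1) to at least TARGET replicas.
periods_to_reach() {
  replicas=$1
  periods=0
  while [ "$replicas" -lt "$2" ]; do
    # Per-period cap: ceil(replicas * (1 + PERCENT/100)), integer arithmetic
    replicas=$(( (replicas * (100 + $3) + 99) / 100 ))
    periods=$((periods + 1))
  done
  echo "$periods"
}
```

For example, going from 2 to 50 replicas takes 3 periods at 300% (2, 8, 32, 128) versus 5 periods at 100% (2, 4, 8, 16, 32, 64). With periodSeconds: 15 that is a difference of well under a minute, so a 30-60 minute delay points at something other than the HPA policy alone.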

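3. Verify pod resource requests are realistic

The cluster autoscaler adds nodes only when pods are pending because they cannot be scheduled. If pod CPU/memory requests are set too low, the scheduler keeps placing pods on the existing nodes, so no node scale-up is triggered even when actual utilization is high.

Run: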

kubectl describe pod <pod-name>
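
Describing pods one at a time gets tedious, so the requests can also be listed for a whole namespace together with any pods stuck in Pending. A small sketch; the function name and the namespace argument are placeholders, not anything from the original thread.

```shell
# Usage: show_requests_and_pending <namespace>
show_requests_and_pending() {
  # CPU and memory requests per pod (one value per container)
  kubectl get pods -n "$1" -o custom-columns='NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory'
  # Pods still in the Pending phase (unschedulable pods show up here)
  kubectl get pods -n "$1" --field-selector=status.phase=Pending
}
```

If the node is saturated but the second command returns nothing, the cluster autoscaler has no unschedulable pods to react to, which would be consistent with a node pool staying at 1 node.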

4. Validate node pool autoscaler limits

Ensure:

  • Node pool max node count is high enough
  • No quota exhaustion at subscription or VM family level

If the maximum is reached, the autoscaler silently stops scaling up.
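
These limits can be read from the command line. The sketch below assumes the Azure CLI and kubectl are already logged in to the right subscription and cluster; the function name and all four arguments are placeholders.

```shell
# Usage: show_autoscaler_limits <resource-group> <cluster> <nodepool> <region>
show_autoscaler_limits() {
  # Autoscaler flag plus min, max and current node count of the pool
  az aks nodepool show --resource-group "$1" --cluster-name "$2" --name "$3" \
    --query '{autoscale:enableAutoScaling, min:minCount, max:maxCount, current:count}' \
    --output table
  # Compute quota usage per VM family in the region
  az vm list-usage --location "$4" --output table
  # Status the cluster autoscaler publishes about each node group
  kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml
}
```

A pool where current already equals max, or a VM family whose usage is at its limit, would explain a scale-up that never happens.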

5. Temporary fix

When immediate recovery is required:

  • Manually scale the node pool to relieve pressure
  • Then apply the HPA tuning above to prevent recurrence

Manual scaling is safe and commonly used during live incidents.
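
One way to force more nodes while leaving the autoscaler enabled is to raise the pool's minimum count. This is only a sketch: as far as I know, a plain manual scale is rejected while the cluster autoscaler is enabled on the pool, so the example updates the autoscaler bounds instead. The function name and the argument values are placeholders.

```shell
# Usage: raise_nodepool_floor <resource-group> <cluster> <nodepool> <min> <max>
raise_nodepool_floor() {
  # Raising min-count makes the autoscaler bring the pool up to at least <min> nodes
  az aks nodepool update --resource-group "$1" --cluster-name "$2" --name "$3" \
    --update-cluster-autoscaler --min-count "$4" --max-count "$5"
}
```

Once the incident is over, the minimum can be lowered again the same way so the pool is free to scale back down.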

Conclusion:
In most cases this behavior is neither a platform outage nor a bug. It results from:

  • AKS autoscaling design
  • Conservative default HPA policies
  • Metrics collection intervals under load

After tuning HPA behavior and validating resource requests, autoscaling stabilized and responded correctly in similar production environments.

Links:
https://learn.microsoft.com/en-us/azure/aks/concepts-scale

https://github.com/Azure/AKS/issues/5520

Thanks,
Manish


1 additional answer

  1. AI answer

    2026-02-02T23:42:42.49+00:00

    The issue you are experiencing with the AKS cluster autoscaler not scaling up despite the BusyPercentage being over 95% could be attributed to several factors. Here are some common causes and mitigation recommendations:

    1. Core Quota Exhaustion: If the approved core quota has been exhausted, the cluster autoscaler will not be able to scale up. You can request a quota increase to resolve this issue.
    2. Max Size of Node Pool: Ensure that the maximum number of nodes in your node pool is set appropriately. If it is reached, the autoscaler will not be able to add more nodes.
    3. Requests/Calls Exceeding the Rate Limit: If you are hitting rate limits, it may prevent the autoscaler from functioning properly. You can check for 429 Too Many Requests errors to diagnose this.
    4. Node Pool in Backoff: If there have been multiple failed scale-up attempts, the node pool may enter a backoff state. This means that the autoscaler will wait before attempting to scale again. You can reset the node pool's backoff state by disabling and then re-enabling autoscaling.
    5. Pod Preventing Node Drain: If there are pods that cannot be evicted from nodes, this can prevent scaling down and subsequently affect scaling up. Ensure that your pod configurations allow for safe eviction.

    To troubleshoot further, you may want to check the logs of the cluster autoscaler for any specific error messages that could provide more insight into why the scaling did not occur as expected.
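
The last two suggestions (looking at autoscaler output and resetting backoff) might look roughly like this. A sketch only: the function names and arguments are placeholders, and the min/max values passed when re-enabling should match what the pool had before.

```shell
# Events where the cluster autoscaler decided NOT to scale up
autoscaler_scaleup_events() {
  kubectl get events --all-namespaces \
    --field-selector source=cluster-autoscaler,reason=NotTriggerScaleUp
}

# Usage: reset_autoscaler_backoff <resource-group> <cluster> <nodepool> <min> <max>
reset_autoscaler_backoff() {
  # Disable, then re-enable only if the disable step succeeded
  az aks nodepool update --resource-group "$1" --cluster-name "$2" --name "$3" \
    --disable-cluster-autoscaler &&
  az aks nodepool update --resource-group "$1" --cluster-name "$2" --name "$3" \
    --enable-cluster-autoscaler --min-count "$4" --max-count "$5"
}
```

The event messages usually state why a scale-up was skipped, which is worth reading before resetting anything.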


