Hello Jim Zhang
Thank you for contacting us about the AKS pod issue.
Steps to fix the issue:
1. Confirm the HPA is actually receiving metrics
Run the commands listed below:
kubectl get hpa
kubectl describe hpa <hpa-name>
Check for:
- Current CPU utilization is reported (a target shown as <unknown> means no metrics are arriving)
- No "metrics not available" warnings
If metrics are delayed or missing, HPA cannot scale.
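The check in step 1 can also be scripted. The sketch below is a minimal example, assuming the HPA object has been exported with `kubectl get hpa <hpa-name> -o json` and uses the `autoscaling/v2` status layout, where a `ScalingActive` condition with a status other than `True` indicates the HPA cannot compute a replica count. The function name and the sample data are hypothetical.

```python
def hpa_metric_problems(hpa: dict) -> list[str]:
    """Return readable problems found in an HPA's status conditions."""
    problems = []
    conditions = hpa.get("status", {}).get("conditions", [])
    if not conditions:
        problems.append("no status conditions reported yet")
    for cond in conditions:
        # ScalingActive != True means the HPA cannot calculate a desired
        # replica count, typically because metrics are missing.
        if cond.get("type") == "ScalingActive" and cond.get("status") != "True":
            problems.append(f"{cond.get('reason')}: {cond.get('message')}")
    return problems


# Hypothetical sample, trimmed to the fields the function reads.
sample = {
    "status": {
        "conditions": [
            {"type": "AbleToScale", "status": "True",
             "reason": "SucceededGetScale", "message": "ok"},
            {"type": "ScalingActive", "status": "False",
             "reason": "FailedGetResourceMetric",
             "message": "failed to get cpu utilization"},
        ]
    }
}

for problem in hpa_metric_problems(sample):
    print(problem)
```

An empty result suggests the HPA is receiving metrics; any entry is worth investigating before tuning scale-up behavior.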
2. Tune HPA scale-up behavior
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
    - type: Percent
      value: 300
      periodSeconds: 15
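This lets the HPA add up to 300% of the current replica count every 15 seconds during a spike, instead of the default scale-up limit of 100% or 4 pods (whichever is greater) per 15 seconds.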
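To get a feel for the difference, here is a rough simulation of how many 15-second periods a scale-up takes under each policy set. It is a simplified model, not the controller's actual code: it assumes the HPA always wants at least `target` replicas, that each period starts fresh, and that the highest limit among the policies applies. The function name and the replica counts are hypothetical.

```python
import math

# Kubernetes default scaleUp policies: +100% or +4 pods per 15 s.
DEFAULT_POLICIES = [("Percent", 100), ("Pods", 4)]
# The single policy from the snippet in step 2: +300% per 15 s.
TUNED_POLICIES = [("Percent", 300)]


def periods_to_reach(start: int, target: int, policies) -> int:
    """Count policy periods needed to grow from start to target replicas."""
    if start < 1:
        raise ValueError("start must be at least 1 replica")
    replicas, periods = start, 0
    while replicas < target:
        limits = []
        for kind, value in policies:
            if kind == "Percent":
                # Percent policies allow growth relative to current replicas.
                limits.append(math.ceil(replicas * (1 + value / 100)))
            else:
                # Pods policies allow a fixed number of extra replicas.
                limits.append(replicas + value)
        # Take the most permissive limit, capped at the desired target.
        replicas = min(target, max(limits))
        periods += 1
    return periods


print(periods_to_reach(2, 50, DEFAULT_POLICIES))  # 5 periods (about 75 s)
print(periods_to_reach(2, 50, TUNED_POLICIES))    # 3 periods (about 45 s)
```

In this model, growing from 2 to 50 replicas takes five periods with the defaults and three with the 300% policy. Real timings also depend on metrics latency and pod startup time.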
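3. Verify pod resource requests are realistic
The cluster autoscaler adds nodes only when pods are stuck in Pending because no node can fit their requests. If pod CPU/memory requests are set too low, the scheduler keeps placing new pods on already busy nodes, so nothing goes Pending and node scale-up is never triggered.
Run: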
kubectl describe pod <pod-name>
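A small calculation shows why low requests in step 3 can hide the need for more nodes. The numbers are hypothetical (a node with 3800 millicores allocatable), and the sketch looks at CPU only, ignoring memory, system pods, and the per-node pod limit.

```python
def pods_per_node(allocatable_millicores: int, request_millicores: int) -> int:
    """How many pods the scheduler can fit on one node, by CPU request alone."""
    return allocatable_millicores // request_millicores


ALLOCATABLE = 3800  # hypothetical allocatable CPU on a 4-vCPU node, in millicores

for request in (50, 500):
    fit = pods_per_node(ALLOCATABLE, request)
    print(f"request={request}m -> {fit} pods fit before any pod goes Pending")
```

With a 50m request the scheduler accepts 76 pods on the node before any pod becomes Pending, even if each pod actually burns far more CPU; with a 500m request only 7 fit, so the autoscaler is asked for a new node much sooner.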
4. Validate node pool autoscaler limits
Ensure:
- Node pool max node count is high enough
- No quota exhaustion at subscription or VM family level
If the max node count is reached, the cluster autoscaler stops adding nodes and any new pods that do not fit stay in Pending.
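The limit check in step 4 can be sketched as follows. This assumes the node pool description comes from `az aks nodepool show ... -o json` and that the output carries the fields `enableAutoScaling`, `count`, and `maxCount`; treat those field names as an assumption and confirm them against your own CLI output. The function name and the sample values are hypothetical.

```python
def at_autoscaler_ceiling(pool: dict) -> bool:
    """True when an autoscaled node pool is already at its maximum node count."""
    if not pool.get("enableAutoScaling"):
        # Autoscaler is off, so there is no autoscaler ceiling to hit.
        return False
    return pool["count"] >= pool["maxCount"]


# Hypothetical sample, trimmed to the assumed fields.
sample_pool = {"name": "nodepool1", "enableAutoScaling": True,
               "count": 5, "maxCount": 5}

if at_autoscaler_ceiling(sample_pool):
    print(f"{sample_pool['name']} is at max ({sample_pool['maxCount']} nodes)")
```

A pool that reports being at its ceiling will not grow further until the max node count is raised, and raising it only helps if subscription and VM family quota allow the extra nodes.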
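5. Temporary fix
When immediate recovery is required:
- Manually scale the node pool to relieve pressure
- Then apply the HPA tuning to prevent recurrence
Manual scaling is safe and commonly used during live incidents. If the cluster autoscaler is enabled on the node pool, you may need to raise its minimum node count or temporarily disable it before a manual scale is accepted.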
Conclusion:
In most cases this behavior is neither a platform outage nor a bug. It results from:
- AKS autoscaling design
- Conservative default HPA policies
- Metrics collection intervals under load
After tuning HPA behavior and validating resource requests, autoscaling stabilized and responded correctly in similar production environments.
Links:
https://learn.microsoft.com/en-us/azure/aks/concepts-scale
https://github.com/Azure/AKS/issues/5520
Thanks,
Manish