Hello Subin,
Apologies for the late response.
Thanks for the quick follow-up and for sharing the exact error, general-purpose.yaml (NodePool), and aksnodeclass.yaml (AKSNodeClass). This gives us everything we need to pinpoint the root cause.
Quick Summary of the Issue
Your NAP cluster scales out and in faster than the CA cluster (as designed), but the continuous load test with NeoLoad is reporting ~20× more NL-NETWORK-01 (Network IO error sending request) failures. These are not NAP provisioning failures—they are client-side symptoms of temporary pod or endpoint unavailability.
Root cause: Your NodePool uses an aggressive consolidation policy (WhenEmptyOrUnderutilized with consolidateAfter: 30s and a 30% disruption budget). NAP consolidates under-utilized nodes much more quickly than Cluster Autoscaler, which triggers more frequent node drains/evictions. During a high-traffic load test, this leads to brief interruptions in your application endpoints.
Link: https://learn.microsoft.com/en-us/azure/aks/node-auto-provisioning-disruption
This page explains exactly how consolidation, budgets, and consolidateAfter work and why aggressive settings can increase pod disruptions on bursty or load-tested workloads.
Steps to work through:
1. Make consolidation less aggressive (biggest impact). Update your NodePool with these changes:
- Switch to consolidationPolicy: WhenEmpty (only consolidate truly empty nodes).
- Increase consolidateAfter to at least 5m (or longer for production-like tests).
- Tighten the budget to 10% (or use a time-based schedule to block disruptions during test windows; see the schedule sketch after the snippet below).
Example updated snippet (replace the disruption section):
disruption:
  budgets:
    - nodes: 10%
  consolidateAfter: 5m
  consolidationPolicy: WhenEmpty
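If you prefer the time-based option mentioned above, Karpenter-style budgets also accept a cron schedule plus a duration. A sketch that blocks all voluntary disruption during a daily 4-hour test window (the schedule and duration values here are placeholders; adjust them to your actual test window, cron is interpreted in UTC):

disruption:
  budgets:
    - nodes: 10%              # normal cap outside the window
    - nodes: "0"              # block all voluntary disruption...
      schedule: "0 9 * * *"   # ...starting 09:00 UTC daily
      duration: 4h            # ...for 4 hours
  consolidateAfter: 5m
  consolidationPolicy: WhenEmpty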
After applying (kubectl apply -f general-purpose.yaml), monitor with:
kubectl get events -A --sort-by=.metadata.creationTimestamp | grep -iE "disrupt|consolidat"
2. Protect your application with a PodDisruptionBudget (PDB). Add a PDB to your sample app Deployment so NAP cannot evict too many pods at once. Example:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-service-pdb
spec:
  minAvailable: "80%"   # or a fixed number, e.g. 5
  selector:
    matchLabels:
      app: web-service   # your app's label
- Apply it before re-running the test. (Note: NAP respects PDBs for most voluntary disruptions.)
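Before re-running the test, it is worth confirming the PDB is active and has eviction headroom; for the example name above that would be:

kubectl get pdb web-service-pdb

The ALLOWED DISRUPTIONS column should be non-zero once enough replicas are ready.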
3. Quick validation steps:
- Run kubectl get nodes -l karpenter.sh/nodepool=general-purpose -o wide during the test to watch node churn (a small logging loop is sketched after this list).
- Check Karpenter controller logs for consolidation decisions:
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter --tail=100 | grep -i "consolidat\|disrupt"
- If you still see network errors after the above, share the output of the events command and any pod logs from your app pods, and we'll dig deeper immediately.
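If it helps to correlate node churn with the NeoLoad error spikes, a small shell loop like this (a sketch, reusing the NodePool label from the steps above) logs the node count over time:

# Record the general-purpose node count every 30 seconds during the test
while true; do
  echo "$(date -u +%H:%M:%S) $(kubectl get nodes -l karpenter.sh/nodepool=general-purpose --no-headers | wc -l) nodes"
  sleep 30
done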
4. Optional: broaden your instance-type options. Your current requirements list allows only a handful of small D-series VMs. This works, but at hundreds of nodes it can create more disruption points. Consider relaxing to the full D family (karpenter.azure.com/sku-family: D) unless you have a specific reason to pin exact SKUs; a sketch follows below.
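For reference, the relaxed family pin could look roughly like this in your NodePool requirements (a sketch; merge it with your existing entries rather than replacing them):

requirements:
  - key: karpenter.azure.com/sku-family
    operator: In
    values: ["D"]   # allow any D-family size instead of pinning exact SKUs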
These changes keep NAP’s speed and cost benefits while dramatically reducing the error rate you saw versus CA. Many customers run exactly this pattern: NAP in lower environments + tuned disruption settings for production safety.
Apply the NodePool + PDB updates, re-run your load test, and let me know the new error count (or paste any new events/logs). We’ll iterate until it’s rock-solid for your peak-day scaling needs.
Thanks,
Manish.