Vishnupriya S - A 5–10 minute scale-out latency for A100 nodes (Standard_NC96ads_A100_v4) on AKS is expected. For large LLM inference workloads, the recommended production pattern is pre-warmed capacity plus model startup optimizations.
- Is 5–10 min provisioning time expected?
Yes. Bringing up a GPU node in AKS involves VMSS provisioning, GPU driver setup, and node initialization, which adds overhead beyond what CPU nodes incur.
Real-world observations (including Microsoft Q&A discussions) show ~10–11 minutes before a new GPU node becomes ready after the autoscaler triggers scale-out. [Faster aut...rosoft Q&A | Learn.Microsoft.com]
Additionally, GPU SKUs such as Standard_NC96ads_A100_v4 are:
- capacity-constrained in many regions, which can add scheduling delays [Allocation...ds_A100_v4
- large SKUs (96 vCPU, 4 A100 GPUs), inherently slower to allocate
- Can AKS/VMSS tuning significantly reduce provisioning latency?
No. No major reduction is possible at the infrastructure layer; this is a design limitation.
Based on the AKS design:
- Cluster Autoscaler only triggers provisioning. It reacts quickly, but actual node creation depends on Azure Compute capacity and VMSS provisioning
- GPU nodes also require driver and device plugin initialization before pods can be scheduled
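To make the "driver + device plugin initialization before pod scheduling" step concrete, here is a minimal sketch of a readiness check. It assumes you already have a node object as a Python dict (for example, one item parsed from `kubectl get nodes -o json`) and that GPUs are advertised under the `nvidia.com/gpu` extended resource, which is what the NVIDIA device plugin publishes. The function name is hypothetical.

```python
def gpu_node_is_schedulable(node: dict) -> bool:
    """Return True once a node is Ready and advertises at least one GPU."""
    status = node.get("status", {})
    # A node can report Ready before the device plugin has registered its GPUs.
    ready = any(
        cond.get("type") == "Ready" and cond.get("status") == "True"
        for cond in status.get("conditions", [])
    )
    # Allocatable quantities are strings in the Kubernetes API, e.g. "4".
    gpus = int(status.get("allocatable", {}).get("nvidia.com/gpu", "0"))
    return ready and gpus > 0
```

Measuring the gap between "node Ready" and "GPU allocatable" on your own cluster shows how much of the 5–10 minutes is GPU initialization rather than VM provisioning.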
- Proven production patterns for large LLM workloads (MSFT-aligned)
Pattern 1: Maintain GPU capacity (recommended)
- Keep min nodes > 0 (often > 1 for HA)
- Avoid cold-start node provisioning during traffic spikes
Why:
- Node autoscaling is not instant
- The AKS autoscaler aims for eventual capacity, not immediate capacity
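As a rough way to choose the minimum node count, the sketch below estimates how many nodes must already be warm to absorb a traffic ramp while new nodes are still provisioning. Every input (request rates, per-node throughput, the HA floor of 2) is a hypothetical placeholder to replace with your own measurements; this is a back-of-the-envelope model, not an AKS formula.

```python
import math


def min_node_count(steady_rps: float, growth_rps_per_min: float,
                   provision_minutes: float, rps_per_node: float,
                   ha_floor: int = 2) -> int:
    """Estimate the node pool minimum that covers a ramp during provisioning."""
    # Nodes needed for the normal baseline load.
    steady_nodes = math.ceil(steady_rps / rps_per_node)
    # Extra load that arrives before any newly requested node becomes usable.
    uncovered_rps = growth_rps_per_min * provision_minutes
    headroom_nodes = math.ceil(uncovered_rps / rps_per_node)
    return max(ha_floor, steady_nodes + headroom_nodes)
```

For example, with a 10-minute provisioning time, a ramp of 2 requests/s per minute, and 8 requests/s per node, the ramp alone accounts for 3 nodes of headroom on top of the baseline.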
Pattern 2: Separate infra scaling vs workload scaling
Node scaling is slow (minutes); pod scaling is fast (seconds, via KEDA/HPA) - https://learn.microsoft.com/en-us/azure/aks/best-practices-gpu
Recommendation: Use KEDA only for pod scaling on existing nodes
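To illustrate keeping pod scaling within existing nodes, here is a small sketch based on the replica formula documented for the Kubernetes Horizontal Pod Autoscaler, desired = ceil(current × currentMetric / targetMetric), capped at what the already-ready GPU nodes can hold. The cap and the parameter names are assumptions for illustration; the real HPA also applies a tolerance band and stabilization windows, which are omitted here.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, ready_nodes: int,
                     pods_per_node: int = 1) -> int:
    """HPA-style replica count, limited to capacity on nodes that already exist."""
    wanted = math.ceil(current_replicas * (current_metric / target_metric))
    # Anything above this would sit Pending until a new node is provisioned.
    capacity = ready_nodes * pods_per_node
    return min(wanted, capacity)
```

If `wanted` regularly exceeds `capacity`, that is a signal to raise the warm node count rather than rely on node scale-out during the spike.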
Pattern 3: Reduce GPU-per-pod coupling (if possible)
Currently: 1 pod = 1 full node (4 GPUs)
This forces node-per-replica scaling (the worst case)
Optimization (if supported): MIG or smaller GPU slices (A100 supports MIG)
This allows:
- faster scale‑out (no full node dependency)
- better packing efficiency
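To show the packing effect, the sketch below counts how many replicas fit on one 4-GPU node for each MIG profile. The per-GPU instance counts are the ones NVIDIA publishes for the A100 80GB as far as I know; please verify them against the current MIG documentation, and note this only helps if a replica actually fits in a slice. The function and dictionary names are hypothetical.

```python
# Maximum MIG instances per A100 80GB GPU for each profile (verify before relying on it).
A100_80GB_MIG_INSTANCES_PER_GPU = {
    "1g.10gb": 7,
    "2g.20gb": 3,
    "3g.40gb": 2,
    "4g.40gb": 1,
    "7g.80gb": 1,
}


def replicas_per_node(profile: str, gpus_per_node: int = 4,
                      slices_per_replica: int = 1) -> int:
    """Number of replicas one node can host when each replica uses whole MIG slices."""
    slices = A100_80GB_MIG_INSTANCES_PER_GPU[profile] * gpus_per_node
    return slices // slices_per_replica
```

With `3g.40gb` slices, a node that previously held one 4-GPU replica could host up to 8 smaller replicas, so most scale-out events would no longer need a new node.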
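Pattern 4: Node pre-provisioning strategies
- Always maintain buffer nodes (idle but ready), or
- Use scale-down mode = Deallocate (nodes are stopped instead of deleted, so their disks are retained and they can be restarted rather than provisioned from scratch)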
NVMe/local disk caching - NC A100 v4 supports local NVMe storage [NC_A100_v4...soft Learn | Learn.Microsoft.com]
Best practice: Cache model weights locally:
- avoids repeated Blob/PVC fetch
- significantly reduces cold start
Recommendation hierarchy:
- NVMe preload (best)
- node-level cache (daemonset/init)
- PVC/Blob (slowest)
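A minimal sketch of the node-level cache idea: copy the weights to local disk once, then reuse them on every later pod start. The paths are hypothetical, and `remote_dir` stands in for wherever the weights are mounted (a PVC or a Blob mount); this is an illustration, not a specific AKS feature.

```python
import shutil
from pathlib import Path


def ensure_local_weights(remote_dir: Path, cache_dir: Path) -> Path:
    """Copy model weights to local disk once; later calls reuse the cached copy."""
    marker = cache_dir / ".complete"
    if marker.exists():
        return cache_dir  # cache hit: skip the slow remote fetch

    # Stage into a sibling directory so an interrupted copy never looks complete.
    staging = cache_dir.with_name(cache_dir.name + ".partial")
    shutil.rmtree(staging, ignore_errors=True)
    shutil.rmtree(cache_dir, ignore_errors=True)
    shutil.copytree(remote_dir, staging)
    staging.rename(cache_dir)
    marker.touch()
    return cache_dir
```

Run from an init container or DaemonSet with `cache_dir` on the local NVMe mount, this means only the first pod on a node pays the full download cost.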
Also use the latest AKS GPU node image or the managed GPU node pool feature. Managed GPU node pools remove manual setup overhead and help the GPU become ready sooner.
- Create a fully managed GPU node pool on Azure Kubernetes Service (AKS) - https://learn.microsoft.com/en-us/azure/aks/aks-managed-gpu-nodes?tabs=add-ubuntu-gpu-node-pool%2Cmig-single%2Cdriver-only
- Use GPUs for compute-intensive workloads on Azure Kubernetes Service (AKS) - https://learn.microsoft.com/en-us/azure/aks/use-nvidia-gpu?tabs=add-ubuntu-gpu-node-pool
- Use NVIDIA GPU Operator on Azure Kubernetes Service (AKS) - https://learn.microsoft.com/en-us/azure/aks/nvidia-gpu-operator
- GPU best practices for Azure Kubernetes Service (AKS) - https://learn.microsoft.com/en-us/azure/aks/best-practices-gpu
- Use Azure Kubernetes Service to host GPU-based workloads - https://learn.microsoft.com/en-us/azure/architecture/reference-architectures/containers/aks-gpu/gpu-aks