AKS GPU Node Autoscaling Delay for vLLM LLM Workloads on A100 Nodes

Question

Vishnupriya S 20

I am trying to implement autoscaling for an AKS-based LLM inference workload using vLLM, where each replica serves a GPT OSS 120B model using tensor parallelism (tensor-parallel-size=4) across 4 A100 GPUs (Standard_NC96ads_A100_v4).

Current setup:

AKS Cluster Autoscaler enabled
KEDA-based autoscaling
GPU nodepool using Standard_NC96ads_A100_v4
Each pod requests all 4 GPUs on a node
Min nodes: 1
Max nodes: 4

Issue:
During scale-out, provisioning a new GPU node followed by pod startup takes nearly 10 minutes end-to-end.

I have already optimized Cluster Autoscaler settings:

scanInterval=10s
newPodScaleUpDelay=0s
expander=least-waste

Cluster Autoscaler reacts within seconds, but most delay appears to occur during:

VMSS scale-out
GPU VM provisioning
node bootstrap and GPU initialization
model loading

Observations:

Each new replica requires provisioning an entirely new A100 node because the workload consumes all 4 GPUs on the node.
Existing Kubernetes/autoscaler tuning does not significantly reduce the total scaling time.
Model weights are currently loaded from persistent storage during pod startup.

Questions:

Is ~5–10 minute provisioning time expected for Standard_NC96ads_A100_v4 node autoscaling in AKS?
Are there any AKS/VMSS/Azure infrastructure optimizations that can significantly reduce GPU node provisioning latency?
Has anyone implemented faster autoscaling patterns for large vLLM/GPU inference workloads on AKS?
Is maintaining warm GPU capacity (minimum ready nodes) the recommended production approach for these workloads?
Would local NVMe model caching materially reduce startup latency compared to Blob/PVC-based loading?

Any recommendations or production best practices for reducing end-to-end autoscaling time for large LLM inference workloads on AKS would be greatly appreciated.

Vishnupriya S 20 Reputation points

2026-05-12T14:37:26.51+00:00

Follow-up Question:

Would adopting RayServe for LLM serving help improve autoscaling responsiveness or reduce cold-start impact for large GPU-based inference workloads (GPT-OSS-120B/vLLM) on AKS? Specifically, can RayServe provide better workload orchestration, faster replica recovery, pre-warmed worker management, or more efficient scaling behavior compared to the current KEDA + vLLM deployment model?

Answer accepted by question author

Himanshu Shekhar 6,420 Microsoft External Staff Moderator

Vishnupriya S - A 5–10 minute scale-out latency for A100 nodes (Standard_NC96ads_A100_v4) on AKS is expected. For large LLM inference workloads, pre-warmed capacity plus model startup optimizations are the recommended production pattern.

Is 5–10 min provisioning time expected?

Yes. GPU nodes in AKS go through VMSS provisioning, GPU driver setup, and node initialization, which adds overhead compared with CPU nodes.

Real-world observations (including Microsoft Q&A discussions) show ~10–11 minutes before a new GPU node becomes ready after autoscaler triggers scale-out. [Faster aut...rosoft Q&A | Learn.Microsoft.com]

Additionally, GPU SKUs such as Standard_NC96ads_A100_v4 are:

capacity-constrained in many regions, which can add allocation delays [Allocation...ds_A100_v4]
large SKUs (96 vCPU, 4 A100 GPUs), inherently slower to allocate

Can AKS/VMSS tuning significantly reduce provisioning latency?

No major reduction is possible at the infrastructure layer (design limitation).

Based on AKS design:

Cluster Autoscaler only triggers provisioning quickly; actual node creation depends on Azure Compute capacity and VMSS provisioning.
GPU nodes require driver and device plugin initialization before pods can be scheduled.

Proven production patterns for large LLM workloads (MSFT-aligned)

Pattern 1: Maintain warm GPU capacity (recommended)

Keep min nodes > 0 (often > 1 for HA)
Avoid cold-start node provisioning during traffic spikes

Why

Node autoscaling is not instant
The AKS autoscaler works toward eventual capacity, not immediate capacity

Pattern 2: Separate infra scaling vs workload scaling

Node scaling (slow, minutes) and Pod scaling (fast, seconds via KEDA/HPA) - https://learn.microsoft.com/en-us/azure/aks/best-practices-gpu

Recommendation: Use KEDA only for pod scaling on existing nodes

Pattern 3: Reduce GPU-per-pod coupling (if possible)

Currently: 1 pod = 1 full node (4 GPUs)

This forces node-per-replica scaling (the worst case).

Optimization (only if the model can be served from less than a full 4-GPU node): MIG / smaller GPU slicing (A100 supports MIG)

This allows:

faster scale‑out (no full node dependency)
better packing efficiency
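To make the packing argument concrete, here is a minimal sketch of the arithmetic. It assumes a replica cannot span nodes and that each node has 4 GPUs, as on Standard_NC96ads_A100_v4; the function name `nodes_needed` is made up for illustration, and whether your model can actually run on fewer GPUs per replica is a separate question this does not answer.

```python
import math


def nodes_needed(replicas: int, gpus_per_replica: int, gpus_per_node: int = 4) -> int:
    """Nodes required when every replica must fit entirely on one node."""
    if gpus_per_replica < 1 or gpus_per_replica > gpus_per_node:
        raise ValueError("a replica must use between 1 GPU and one full node")
    # Whole replicas that fit on a single node.
    replicas_per_node = gpus_per_node // gpus_per_replica
    return math.ceil(replicas / replicas_per_node)


print(nodes_needed(3, 4))  # 3: one node per replica, the current setup
print(nodes_needed(3, 2))  # 2
print(nodes_needed(3, 1))  # 1
```

With 4 GPUs per replica, every additional replica means a new node and therefore the full provisioning delay; with a smaller footprint, some scale-out steps land on nodes that are already running.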

Pattern 4: Node pre-provisioning strategies - Always maintain buffer nodes (idle but ready), or use scale-down mode = Deallocate (nodes are stopped instead of deleted, so their OS disks and already-pulled images are kept for a faster restart)
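One common way to keep a buffer node idle but ready is the placeholder-pod (overprovisioning) pattern described in the Cluster Autoscaler FAQ. This is a technique I am swapping in as an illustration; it is not something the AKS docs linked here prescribe. A low-priority pause pod reserves a whole node; when a real inference pod arrives it preempts the placeholder, which goes Pending and triggers the next scale-out in the background. The sketch below only generates the manifests. The name `gpu-buffer`, the priority value and the pause image tag are assumptions, and any tolerations or node selectors your GPU pool needs are omitted.

```python
import json


def buffer_manifests(spare_nodes: int, gpus_per_node: int = 4) -> list[dict]:
    """Build a PriorityClass and a placeholder Deployment that reserve whole GPU nodes."""
    priority_class = {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": "gpu-buffer"},  # hypothetical name
        # Lower than the default pod priority (0) so real workloads can preempt it.
        # -10 follows the Cluster Autoscaler FAQ example; verify against your cluster.
        "value": -10,
        "globalDefault": False,
        "description": "Placeholder pods that inference pods may preempt.",
    }
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "gpu-buffer"},
        "spec": {
            "replicas": spare_nodes,
            "selector": {"matchLabels": {"app": "gpu-buffer"}},
            "template": {
                "metadata": {"labels": {"app": "gpu-buffer"}},
                "spec": {
                    "priorityClassName": "gpu-buffer",
                    "terminationGracePeriodSeconds": 0,
                    "containers": [{
                        "name": "pause",
                        "image": "registry.k8s.io/pause:3.9",  # assumed tag
                        # Request every GPU so the placeholder occupies a full node.
                        "resources": {"limits": {"nvidia.com/gpu": gpus_per_node}},
                    }],
                },
            },
        },
    }
    return [priority_class, deployment]


# kubectl accepts JSON, so the output can be piped to `kubectl apply -f -`.
print(json.dumps({"apiVersion": "v1", "kind": "List", "items": buffer_manifests(1)}, indent=2))
```

The trade-off is the same as raising min nodes: you pay for one idle A100 node, but the next replica starts without waiting for a VM.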

NVMe/local disk caching - NC A100 v4 supports local NVMe storage [NC_A100_v4...soft Learn | Learn.Microsoft.com]

Best practice: Cache model weights locally:

avoids repeated Blob/PVC fetch
significantly reduces cold start

Recommendation hierarchy:

NVMe preload (best)
node-level cache (daemonset/init)
PVC/Blob (slowest)
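A rough sketch of the node-level cache idea, for example as something an init container could run before vLLM starts. It copies the weights from shared storage to local disk on the first start and reuses them afterwards. The function name and both paths in the usage note are hypothetical, and it assumes only one pod per node populates the cache at a time, which matches the one-replica-per-node setup in the question.

```python
import shutil
from pathlib import Path


def ensure_local_model(shared_dir: Path, cache_root: Path) -> Path:
    """Return a local copy of the model directory, copying it on a cache miss."""
    target = cache_root / shared_dir.name
    if target.is_dir():
        return target  # cache hit: a previous pod on this node already copied it

    cache_root.mkdir(parents=True, exist_ok=True)
    staging = cache_root / (shared_dir.name + ".partial")
    if staging.exists():
        shutil.rmtree(staging)  # leftover from an interrupted copy

    shutil.copytree(shared_dir, staging)
    # Rename only after the copy finished, so a half-written cache is never used.
    staging.rename(target)
    return target
```

Usage would look like `ensure_local_model(Path("/mnt/models/gpt-oss-120b"), Path("/mnt/nvme/model-cache"))`, with the returned directory passed to `vllm serve` as the model path. Keep in mind that local NVMe/temp disks are ephemeral: a brand-new node always starts with an empty cache, so this helps pod restarts and rescheduling on an existing node more than the very first start on a fresh node.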

Also use the latest AKS GPU node image or the managed GPU feature. Managed GPU node pools remove manual setup overhead and help the GPU become ready sooner.

Create a fully managed GPU node pool on Azure Kubernetes Service (AKS) - https://learn.microsoft.com/en-us/azure/aks/aks-managed-gpu-nodes?tabs=add-ubuntu-gpu-node-pool%2Cmig-single%2Cdriver-only
Use GPUs for compute-intensive workloads on Azure Kubernetes Service (AKS) - https://learn.microsoft.com/en-us/azure/aks/use-nvidia-gpu?tabs=add-ubuntu-gpu-node-pool
Use NVIDIA GPU Operator on Azure Kubernetes Service (AKS) - https://learn.microsoft.com/en-us/azure/aks/nvidia-gpu-operator
GPU best practices for Azure Kubernetes Service (AKS) - https://learn.microsoft.com/en-us/azure/aks/best-practices-gpu
Use Azure Kubernetes Service to host GPU-based workloads - https://learn.microsoft.com/en-us/azure/architecture/reference-architectures/containers/aks-gpu/gpu-aks

Vishnupriya S 20 Reputation points

2026-05-13T12:35:00.1166667+00:00

Follow-up Question:

Would adopting RayServe for LLM serving help improve autoscaling responsiveness or reduce cold-start impact for large GPU-based inference workloads (GPT-OSS-120B/vLLM) on AKS? Specifically, can RayServe provide better workload orchestration, faster replica recovery, pre-warmed worker management, or more efficient scaling behavior compared to the current KEDA + vLLM deployment model?

Answer 1

Provisioning latency of several minutes for large GPU SKUs like A100 is expected.

The documented guidance for AKS GPU autoscaling explicitly calls out that GPU node provisioning can take several minutes depending on the GPU SKU size:

“Depending on the size of your provisioned GPU SKU, node provisioning might take several minutes.”

For high-end SKUs such as Standard_NC96ads_A100_v4, a 5–10 minute end‑to‑end delay (VMSS scale‑out + node bootstrap + GPU initialization + pod scheduling) is within the expected range.
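Before optimizing, it helps to know which of those phases dominates in your cluster. The sketch below turns a handful of timestamps (which you would collect yourself from pod and node events) into per-phase durations. The milestone names are my own labels, not fields emitted by AKS or Kubernetes, and the timestamps are made-up placeholders rather than measurements.

```python
from datetime import datetime

# Ordered milestones of one scale-out. Illustrative labels only.
MILESTONES = [
    "pod_pending",        # new replica created, pod is unschedulable
    "node_registered",    # the new VM joined the cluster
    "gpu_allocatable",    # device plugin advertised nvidia.com/gpu
    "container_started",  # image pulled, vLLM process launched
    "pod_ready",          # model loaded, readiness probe passing
]


def phase_durations(timestamps: dict[str, str]) -> dict[str, float]:
    """Return the seconds spent between each pair of consecutive milestones."""
    parsed = [datetime.fromisoformat(timestamps[name]) for name in MILESTONES]
    return {
        f"{start} -> {end}": (t_end - t_start).total_seconds()
        for start, end, t_start, t_end in zip(
            MILESTONES, MILESTONES[1:], parsed, parsed[1:]
        )
    }


# Hypothetical timestamps for illustration, NOT real measurements.
example = {
    "pod_pending": "2026-05-12T14:00:00+00:00",
    "node_registered": "2026-05-12T14:05:30+00:00",
    "gpu_allocatable": "2026-05-12T14:07:00+00:00",
    "container_started": "2026-05-12T14:08:00+00:00",
    "pod_ready": "2026-05-12T14:10:00+00:00",
}

for phase, seconds in phase_durations(example).items():
    print(f"{phase}: {seconds:.0f}s")
```

If most of the time sits before `node_registered`, only warm capacity helps; if a large share sits between `container_started` and `pod_ready`, model loading is worth optimizing.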

Platform-level optimizations to reduce GPU node provisioning latency are limited.

AKS Cluster Autoscaler and VMSS will still need to allocate and boot a new GPU VM, install extensions, join the node to the cluster, and initialize GPU drivers. The guidance focuses on:

Enabling Cluster Autoscaler on GPU node pools and tuning min/max counts.
Accepting that GPU node provisioning “might take several minutes” for large SKUs.
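The last two steps (joining the cluster and GPU initialization) can be observed on the node object itself. As a small sketch, assuming you feed it the parsed output of `kubectl get node <name> -o json`, the check below reports whether a node is Ready and already advertises its GPUs. The function name is hypothetical; the `status.conditions` and `status.allocatable` fields are standard Kubernetes node fields.

```python
def gpu_node_ready(node: dict, expected_gpus: int = 4) -> bool:
    """True once the node is Ready and advertises the expected number of GPUs."""
    status = node.get("status", {})
    ready = any(
        cond.get("type") == "Ready" and cond.get("status") == "True"
        for cond in status.get("conditions", [])
    )
    # The device plugin adds nvidia.com/gpu to allocatable only after the
    # driver is working; until then the key is missing entirely.
    gpus = int(status.get("allocatable", {}).get("nvidia.com/gpu", "0"))
    return ready and gpus >= expected_gpus


# Minimal hand-written node objects for illustration.
booting = {"status": {"conditions": [{"type": "Ready", "status": "True"}],
                      "allocatable": {"cpu": "96"}}}
usable = {"status": {"conditions": [{"type": "Ready", "status": "True"}],
                     "allocatable": {"cpu": "96", "nvidia.com/gpu": "4"}}}

print(gpu_node_ready(booting))  # False: node is Ready but GPUs are not advertised yet
print(gpu_node_ready(usable))   # True
```

The gap between the node turning Ready and the GPUs showing up in allocatable is the driver/device-plugin part of the delay, which is the part a newer node image or managed GPU pool could plausibly shorten.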

There is no documented AKS/VMSS setting that materially shortens the underlying GPU VM provisioning time beyond what is already achieved by the autoscaler reacting quickly.

Node Autoprovisioning (NAP) with Karpenter is available, but it still provisions new nodes based on pending pod requirements; it optimizes SKU selection and capacity efficiency rather than fundamentally reducing boot time.

Faster autoscaling patterns for large GPU workloads focus on event-driven scaling and cost control, not instant scale-out.

Recommended patterns include:

Using KEDA for event-driven autoscaling so pods scale with actual demand (queue length, HTTP traffic, or Azure Monitor metrics). KEDA can scale workloads down to 0 replicas, which is particularly useful for sporadic GPU workloads.
Combining KEDA with Cluster Autoscaler so that:
- KEDA scales pods based on events/metrics.
- Cluster Autoscaler scales GPU node pools when pods are unschedulable.

For GPU workloads, AKS guidance emphasizes:

Proper GPU configuration and continuous validation to avoid misconfigurations that cause failures or underutilization.
Using GPU metrics (for example via NVIDIA DCGM exporter + Azure Managed Prometheus + KEDA) to autoscale based on real-time GPU utilization, which helps optimize usage and control operational costs.
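As a sketch of what the KEDA side of this could look like, the snippet below builds a `ScaledObject` with a Prometheus trigger on the DCGM GPU utilization metric. The deployment name, server address, query and threshold are placeholders I picked for illustration, and the authentication setup that Azure Managed Prometheus normally requires (a `TriggerAuthentication`) is left out, so treat this as a starting point rather than a working config.

```python
import json


def gpu_scaled_object(deployment: str, prometheus_url: str, max_replicas: int) -> dict:
    """Build a KEDA ScaledObject that scales a Deployment on average GPU utilization."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": f"{deployment}-scaler"},
        "spec": {
            "scaleTargetRef": {"name": deployment},
            # Keep one replica warm instead of scaling to zero.
            "minReplicaCount": 1,
            "maxReplicaCount": max_replicas,
            "pollingInterval": 15,  # seconds between metric checks
            "triggers": [{
                "type": "prometheus",
                "metadata": {
                    "serverAddress": prometheus_url,
                    # Metric exported by NVIDIA dcgm-exporter; the aggregation
                    # and the threshold below are illustrative choices.
                    "query": "avg(DCGM_FI_DEV_GPU_UTIL)",
                    "threshold": "80",
                },
            }],
        },
    }


# Hypothetical deployment name and Prometheus endpoint.
manifest = gpu_scaled_object("vllm-gpt-oss", "http://prometheus.example:9090", max_replicas=4)
print(json.dumps(manifest, indent=2))
```

GPU utilization is a coarse signal for LLM serving. A queue-depth style metric from the inference server may react earlier, but that is a tuning decision for your workload.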

These patterns improve responsiveness and cost efficiency but do not eliminate the several‑minute provisioning time for new GPU nodes.

Maintaining warm GPU capacity is a recommended production approach.

Cost guidance for GPU workloads on AKS explicitly notes:

GPU-enabled node pools incur idle cost even when no GPU workloads run.
To avoid high idle costs, create node pools only when needed and use features like cluster stop when not running workloads.

However, for latency‑sensitive production inference, the trade-off is usually:

Maintain a minimum number of GPU nodes (warm capacity) to handle baseline and burst traffic without waiting for new nodes to provision.
Use autoscaling (KEDA + Cluster Autoscaler or Node Autoprovisioning) to add additional capacity for larger bursts, accepting that those extra nodes may take several minutes to become ready.
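For sizing the warm pool, a back-of-the-envelope model can help frame the discussion. The sketch below assumes demand grows linearly while a new node is provisioning and that each replica sustains a fixed request rate; both are simplifications of mine, and every number in the example is a made-up placeholder.

```python
import math


def warm_replicas(baseline_rps: float, growth_rps_per_min: float,
                  provisioning_min: float, rps_per_replica: float) -> int:
    """Replicas to keep ready so demand is covered until a new node comes online."""
    if rps_per_replica <= 0:
        raise ValueError("rps_per_replica must be positive")
    # Demand at the moment a node requested now would finally be serving.
    demand_when_ready = baseline_rps + growth_rps_per_min * provisioning_min
    return max(1, math.ceil(demand_when_ready / rps_per_replica))


# Hypothetical inputs: 4 req/s baseline, +0.5 req/s per minute during a burst,
# 10 minutes to bring up a node, 5 req/s sustained per replica.
print(warm_replicas(4, 0.5, 10, 5))  # 2
```

The longer the provisioning time, the more warm replicas the same burst requires, which is why the several-minute delay pushes production setups toward a min node count above 1.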

The AKS autoscaling best practices explicitly call out KEDA’s ability to scale to 0 as useful for “periodic machine learning (ML) or GPU workloads” and dev/test scenarios, which implies that for production, keeping some baseline capacity online is often appropriate.

Local NVMe model caching can help with startup latency if model loading is a significant portion of the delay.

The AKS documentation does not give a direct comparison between Blob/PVC-based loading and local NVMe caching for model weights. It does, however, emphasize GPU cost and utilization optimization:

Underuse and overprovisioning are highlighted as key issues.
Techniques like GPU partitioning (multiple-instance GPUs) are recommended to maximize GPU usage and reduce operational costs.

Within that framework, reducing pod startup time by improving data locality (for example, caching models on local NVMe) is consistent with the general goal of improving GPU utilization and reducing idle time. If model loading is a large fraction of the 10‑minute path, local NVMe caching can materially reduce the pod startup portion, even though it cannot eliminate the underlying VM provisioning time.
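Since no published comparison is available, measuring it on your own nodes is probably the most reliable route. A minimal sketch: read one weight file sequentially from each location (the PVC/Blob mount and the local NVMe path) and compare the observed throughput. The function name is hypothetical, and results from a second read of the same file can be inflated by the OS page cache.

```python
import time
from pathlib import Path


def read_throughput_gb_s(path: Path, chunk_bytes: int = 64 * 1024 * 1024) -> float:
    """Sequentially read one file and return the observed throughput in GB/s."""
    total = 0
    start = time.perf_counter()
    with path.open("rb") as f:
        while chunk := f.read(chunk_bytes):
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / 1e9 / elapsed if elapsed > 0 else float("inf")
```

Dividing the total checkpoint size by the measured GB/s for each location gives a rough lower bound on the load time from that source, which you can then compare with the model-loading share of the 10-minute path.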

Summary of practical guidance based on the documentation:

Several minutes of provisioning time for large GPU SKUs is expected and called out in AKS GPU autoscaling guidance.
There are no documented VMSS/AKS knobs that drastically reduce GPU VM provisioning time beyond fast autoscaler reaction.
Use KEDA with GPU metrics (for example DCGM via Azure Managed Prometheus) to autoscale based on real-time GPU utilization.
Combine KEDA with Cluster Autoscaler or Node Autoprovisioning for infrastructure autoscaling.
For production, maintain a minimum number of warm GPU nodes to meet latency SLOs, and rely on autoscaling for additional burst capacity.
Consider techniques like GPU partitioning and better resource sizing (for example vertical pod autoscaler) to avoid overprovisioning and improve cost efficiency; local caching of models fits into this broader optimization strategy.
