AKS GPU Node Autoscaling Delay for vLLM LLM Workloads on A100 Nodes

Vishnupriya S 20 Reputation points
2026-05-11T15:06:30.3633333+00:00

I am trying to implement autoscaling for an AKS-based LLM inference workload using vLLM, where each replica serves a GPT OSS 120B model using tensor parallelism (tensor-parallel-size=4) across 4 A100 GPUs (Standard_NC96ads_A100_v4).

Current setup:

  • AKS Cluster Autoscaler enabled
  • KEDA-based autoscaling
  • GPU nodepool using Standard_NC96ads_A100_v4
  • Each pod requests all 4 GPUs on a node
  • Min nodes: 1
  • Max nodes: 4

Issue:
During scale-out, provisioning a new GPU node followed by pod startup takes nearly 10 minutes end-to-end.

I have already optimized Cluster Autoscaler settings:

  • scanInterval=10s
  • newPodScaleUpDelay=0s
  • expander=least-waste

Cluster Autoscaler reacts within seconds, but most delay appears to occur during:

  • VMSS scale-out
  • GPU VM provisioning
  • node bootstrap and GPU initialization
  • model loading

Observations:

  • Each new replica requires provisioning an entirely new A100 node because the workload consumes all 4 GPUs on the node.
  • Existing Kubernetes/autoscaler tuning does not significantly reduce the total scaling time.
  • Model weights are currently loaded from persistent storage during pod startup.

Questions:

  1. Is ~5–10 minute provisioning time expected for Standard_NC96ads_A100_v4 node autoscaling in AKS?
  2. Are there any AKS/VMSS/Azure infrastructure optimizations that can significantly reduce GPU node provisioning latency?
  3. Has anyone implemented faster autoscaling patterns for large vLLM/GPU inference workloads on AKS?
  4. Is maintaining warm GPU capacity (minimum ready nodes) the recommended production approach for these workloads?
  5. Would local NVMe model caching materially reduce startup latency compared to Blob/PVC-based loading?

Any recommendations or production best practices for reducing end-to-end autoscaling time for large LLM inference workloads on AKS would be greatly appreciated.

Azure Kubernetes Service

An Azure service that provides serverless Kubernetes, an integrated continuous integration and continuous delivery experience, and enterprise-grade security and governance.


Answer accepted by question author

Himanshu Shekhar 6,420 Reputation points Microsoft External Staff Moderator
2026-05-11T15:39:13.67+00:00

Vishnupriya S - A 5–10 minute scale-out latency for A100 (Standard_NC96ads_A100_v4) nodes on AKS is expected. For large LLM inference workloads, pre-warmed capacity combined with model startup optimizations is the recommended production pattern.

  1. Is 5–10 min provisioning time expected?

Yes. GPU nodes in AKS are backed by VMSS provisioning, followed by GPU driver setup and node initialization, which adds overhead beyond what CPU nodes incur.

Real-world observations (including Microsoft Q&A discussions) show ~10–11 minutes before a new GPU node becomes ready after the autoscaler triggers scale-out. [Faster aut...rosoft Q&A | Learn.Microsoft.com]

Additionally, GPU SKUs such as Standard_NC96ads_A100_v4 are:

  • capacity-constrained in many regions, which can add scheduling delays [Allocation...ds_A100_v4
  • large SKUs (96 vCPU, 4 A100 GPUs), inherently slower to allocate

  2. Can AKS/VMSS tuning significantly reduce provisioning latency?

No major reduction is possible at the infrastructure layer; this is a design limitation.

Based on AKS design:

  • Cluster Autoscaler only triggers provisioning quickly.
  • Actual node creation depends on Azure Compute capacity and VMSS provisioning.
  • GPU nodes require driver and device plugin initialization before pods can be scheduled.

  3. Proven production patterns for large LLM workloads (MSFT-aligned)

Pattern 1: Maintain GPU capacity (recommended)

  • Keep min nodes > 0 (often > 1 for HA)
  • Avoid cold-start node provisioning during traffic spikes

Why:

  • Node autoscaling is not instant
  • The AKS autoscaler guarantees eventual capacity, not immediate capacity
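
As a rough illustration of how a warm floor could be sized, the sketch below estimates how many replicas must already be running to absorb traffic growth while a new node is still provisioning. The function name and every number in it (baseline load, growth rate, per-replica throughput) are hypothetical placeholders, not measurements from this thread.

```python
import math


def warm_replicas_needed(baseline_rps: int, surge_rps_per_min: int,
                         provision_min: int, rps_per_replica: int) -> int:
    """Replicas that must already be running to cover load until a new node is ready.

    All inputs are assumptions to be replaced with measured values.
    """
    # Load keeps growing for the whole provisioning window before extra capacity arrives.
    peak_during_wait = baseline_rps + surge_rps_per_min * provision_min
    return math.ceil(peak_during_wait / rps_per_replica)


# Hypothetical example: 8 req/s baseline, growing by 1 req/s per minute,
# 10-minute node provisioning, each replica sustaining 10 req/s.
with_cold_nodes = warm_replicas_needed(8, 1, 10, 10)
with_instant_nodes = warm_replicas_needed(8, 1, 0, 10)
print(with_cold_nodes, with_instant_nodes)
```

With these placeholder numbers, the 10-minute provisioning window is what pushes the floor from one replica to two, which is the cost of treating node scale-out as part of the request path.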

Pattern 2: Separate infra scaling vs workload scaling

Treat node scaling (slow, minutes) and pod scaling (fast, seconds via KEDA/HPA) as separate layers - https://learn.microsoft.com/en-us/azure/aks/best-practices-gpu

Recommendation: Use KEDA only for pod scaling on existing nodes
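
To make the split concrete, here is a minimal sketch of the replica calculation that the Kubernetes Horizontal Pod Autoscaler (which KEDA drives) documents, followed by a check of how many of those replicas fit on nodes that are already ready. It ignores HPA tolerance and stabilization windows, and the helper names and metric values are assumptions for illustration only.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float) -> int:
    """Simplified HPA rule: ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_metric / target_metric)


def split_by_capacity(desired: int, ready_nodes: int,
                      pods_per_node: int = 1) -> tuple[int, int]:
    """Return (replicas that can start now, replicas that wait for a new node)."""
    schedulable = min(desired, ready_nodes * pods_per_node)
    return schedulable, desired - schedulable


# Hypothetical example: 1 replica, 30 queued requests per replica, target of 10.
wanted = desired_replicas(1, 30, 10)
# Two ready GPU nodes, one 4-GPU pod per node as in the question.
fast, slow = split_by_capacity(wanted, ready_nodes=2)
print(f"desired={wanted} start_in_seconds={fast} wait_for_node={slow}")
```

Only the replicas in the second bucket pay the multi-minute node provisioning cost, which is why the number of ready nodes sets how much of a burst can be absorbed quickly.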

Pattern 3: Reduce GPU-per-pod coupling (if possible)

Currently: 1 pod = 1 full node (4 GPUs)

This forces: node-per-replica scaling (worst case)

Optimization (if supported): MIG / smaller GPU slicing (A100 supports MIG)

This allows:

  • faster scale‑out (no full node dependency)
  • better packing efficiency
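
The packing effect can be sketched with simple arithmetic. The example below assumes 4 GPUs per node, as on Standard_NC96ads_A100_v4, and compares the current 4-GPU pod with a hypothetical 2-GPU pod. Whether the model actually fits on fewer GPUs or on MIG slices is a separate question that this sketch does not answer.

```python
import math


def nodes_required(replicas: int, gpus_per_pod: int, gpus_per_node: int = 4) -> int:
    """Nodes needed to host `replicas` pods when pods cannot span nodes."""
    pods_per_node = gpus_per_node // gpus_per_pod
    if pods_per_node == 0:
        raise ValueError("a single pod requests more GPUs than one node provides")
    return math.ceil(replicas / pods_per_node)


# Current layout from the question: every replica needs a whole node.
full_node_pods = [nodes_required(r, gpus_per_pod=4) for r in range(1, 5)]
# Hypothetical layout: a 2-GPU replica, so every second replica reuses a ready node.
half_node_pods = [nodes_required(r, gpus_per_pod=2) for r in range(1, 5)]
print(full_node_pods, half_node_pods)
```

In the second layout, only every other scale-out step triggers node provisioning, so half of the replica additions complete at pod-start speed.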

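Pattern 4: Node pre-provisioning strategies - Always maintain buffer nodes (idle but ready), or set the node pool's scale-down mode to Deallocate so that scaled-down nodes are stopped rather than deleted and can be restarted without provisioning a fresh VM.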

NVMe/local disk caching - NC A100 v4 supports local NVMe storage [NC_A100_v4...soft Learn | Learn.Microsoft.com]

Best practice: Cache model weights locally:

  • avoids repeated Blob/PVC fetch
  • significantly reduces cold start

Recommendation hierarchy:

  1. NVMe preload (best)
  2. node-level cache (daemonset/init)
  3. PVC/Blob (slowest)
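
A node-level cache can be as simple as "copy once, reuse afterwards". The sketch below shows that idea for an init step that runs before the inference server starts. The function name and directory layout are hypothetical, and the demo uses temporary directories in place of a real PVC mount and NVMe mount.

```python
import shutil
import tempfile
from pathlib import Path


def ensure_local_weights(source_dir: Path, cache_dir: Path) -> Path:
    """Return a local copy of the model directory, copying only on a cache miss."""
    if cache_dir.is_dir():
        return cache_dir  # cache hit: skip the slow remote read
    # Copy into a sibling ".partial" directory first so an interrupted copy
    # is never mistaken for a complete cache.
    partial = cache_dir.with_name(cache_dir.name + ".partial")
    shutil.rmtree(partial, ignore_errors=True)
    shutil.copytree(source_dir, partial)
    partial.rename(cache_dir)
    return cache_dir


# Demo with temporary directories standing in for the PVC and the NVMe disk.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "pvc" / "model"
    source.mkdir(parents=True)
    (source / "weights.bin").write_bytes(b"\x00" * 1024)

    cache = Path(tmp) / "nvme" / "model"
    first = ensure_local_weights(source, cache)
    copied_ok = (first / "weights.bin").read_bytes() == b"\x00" * 1024

    # Remove the source to show that the second call is served from the cache.
    shutil.rmtree(source)
    second = ensure_local_weights(source, cache)
    hit_ok = second == cache and (second / "weights.bin").exists()
    partial_left = (cache.parent / "model.partial").exists()

print(copied_ok, hit_ok, partial_left)
```

The server would then be pointed at the returned local path. Note that a cache on the node's local disk only helps on nodes that have already been populated, so a freshly provisioned node still pays the first copy.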

Also use the latest AKS GPU node image or the managed GPU node pool feature. Managed GPU node pools remove manual setup overhead and help the GPU become ready sooner. See:

  1. Create a fully managed GPU node pool on Azure Kubernetes Service (AKS) - https://learn.microsoft.com/en-us/azure/aks/aks-managed-gpu-nodes?tabs=add-ubuntu-gpu-node-pool%2Cmig-single%2Cdriver-only
  2. Use GPUs for compute-intensive workloads on Azure Kubernetes Service (AKS) - https://learn.microsoft.com/en-us/azure/aks/use-nvidia-gpu?tabs=add-ubuntu-gpu-node-pool
  3. Use NVIDIA GPU Operator on Azure Kubernetes Service (AKS) - https://learn.microsoft.com/en-us/azure/aks/nvidia-gpu-operator
  4. GPU best practices for Azure Kubernetes Service (AKS) - https://learn.microsoft.com/en-us/azure/aks/best-practices-gpu
  5. Use Azure Kubernetes Service to host GPU-based workloads - https://learn.microsoft.com/en-us/azure/architecture/reference-architectures/containers/aks-gpu/gpu-aks

1 additional answer

  1. AI answer

    2026-05-11T15:07:06.8633333+00:00
    1. Provisioning latency of several minutes for large GPU SKUs like A100 is expected.

    The documented guidance for AKS GPU autoscaling explicitly calls out that GPU node provisioning can take several minutes depending on the GPU SKU size:

    “Depending on the size of your provisioned GPU SKU, node provisioning might take several minutes.”

    For high-end SKUs such as Standard_NC96ads_A100_v4, a 5–10 minute end‑to‑end delay (VMSS scale‑out + node bootstrap + GPU initialization + pod scheduling) is within the expected range.
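
    Before optimizing, it can help to measure how the total splits across these phases. The sketch below computes per-phase durations from a set of milestone timestamps. The milestone names and timestamps are hypothetical and would need to be replaced with values taken from cluster events and pod logs.

```python
from datetime import datetime

# Hypothetical milestones for one scale-out event, in chronological order.
events = {
    "pod_pending": "2026-05-11T10:00:00",
    "node_registered": "2026-05-11T10:05:30",
    "gpu_allocatable": "2026-05-11T10:07:00",
    "container_started": "2026-05-11T10:07:45",
    "model_ready": "2026-05-11T10:10:15",
}


def phase_durations(milestones: dict[str, str]) -> dict[str, float]:
    """Seconds elapsed between each pair of consecutive milestones."""
    names = list(milestones)
    times = [datetime.fromisoformat(milestones[n]) for n in names]
    return {
        f"{a} -> {b}": (t1 - t0).total_seconds()
        for a, b, t0, t1 in zip(names, names[1:], times, times[1:])
    }


durations = phase_durations(events)
total_seconds = sum(durations.values())
for phase, secs in durations.items():
    print(f"{phase:40s} {secs:6.0f}s {100 * secs / total_seconds:5.1f}%")
```

    A breakdown like this shows whether the larger share sits in node provisioning (which the platform controls) or in image pull and model loading (which the workload controls).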

    2. Platform-level optimizations to reduce GPU node provisioning latency are limited.

    AKS Cluster Autoscaler and VMSS will still need to allocate and boot a new GPU VM, install extensions, join the node to the cluster, and initialize GPU drivers. The guidance focuses on:

    • Enabling Cluster Autoscaler on GPU node pools and tuning min/max counts.
    • Accepting that GPU node provisioning “might take several minutes” for large SKUs.

    There is no documented AKS/VMSS setting that materially shortens the underlying GPU VM provisioning time beyond what is already achieved by the autoscaler reacting quickly.

    Node Autoprovisioning (NAP) with Karpenter is available, but it still provisions new nodes based on pending pod requirements; it optimizes SKU selection and capacity efficiency rather than fundamentally reducing boot time.

    3. Faster autoscaling patterns for large GPU workloads focus on event-driven scaling and cost control, not instant scale-out.

    Recommended patterns include:

    • Using KEDA for event-driven autoscaling so pods scale with actual demand (queue length, HTTP traffic, or Azure Monitor metrics). KEDA can scale workloads down to 0 replicas, which is particularly useful for sporadic GPU workloads.
    • Combining KEDA with Cluster Autoscaler so that:
      • KEDA scales pods based on events/metrics.
      • Cluster Autoscaler scales GPU node pools when pods are unschedulable.

    For GPU workloads, AKS guidance emphasizes:

    • Proper GPU configuration and continuous validation to avoid misconfigurations that cause failures or underutilization.
    • Using GPU metrics (for example via NVIDIA DCGM exporter + Azure Managed Prometheus + KEDA) to autoscale based on real-time GPU utilization, which helps optimize usage and control operational costs.

    These patterns improve responsiveness and cost efficiency but do not eliminate the several‑minute provisioning time for new GPU nodes.

    4. Maintaining warm GPU capacity is a recommended production approach.

    Cost guidance for GPU workloads on AKS explicitly notes:

    • GPU-enabled node pools incur idle cost even when no GPU workloads run.
    • To avoid high idle costs, create node pools only when needed and use features like cluster stop when not running workloads.

    However, for latency‑sensitive production inference, the trade-off is usually:

    • Maintain a minimum number of GPU nodes (warm capacity) to handle baseline and burst traffic without waiting for new nodes to provision.
    • Use autoscaling (KEDA + Cluster Autoscaler or Node Autoprovisioning) to add additional capacity for larger bursts, accepting that those extra nodes may take several minutes to become ready.

    The AKS autoscaling best practices explicitly call out KEDA’s ability to scale to 0 as useful for “periodic machine learning (ML) or GPU workloads” and dev/test scenarios, which implies that for production, keeping some baseline capacity online is often appropriate.

    5. Local NVMe model caching can help with startup latency if model loading is a significant portion of the delay.

    The referenced documentation does not give a direct comparison between Blob/PVC-based loading and local NVMe caching for model weights. It does, however, emphasize GPU cost and utilization optimization:

    • Underuse and overprovisioning are highlighted as key issues.
    • Techniques like GPU partitioning (multiple-instance GPUs) are recommended to maximize GPU usage and reduce operational costs.

    Within that framework, reducing pod startup time by improving data locality (for example, caching models on local NVMe) is consistent with the general goal of improving GPU utilization and reducing idle time. If model loading is a large fraction of the 10‑minute path, local NVMe caching can materially reduce the pod startup portion, even though it cannot eliminate the underlying VM provisioning time.
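
    A back-of-the-envelope estimate can indicate whether caching is worth the effort. The sketch below divides a checkpoint size by a sustained read throughput. The checkpoint size and both throughput figures are assumed placeholder values, not benchmarks, and the result covers only the time to read the weights, not the rest of server startup.

```python
def load_seconds(weights_gb: float, throughput_gb_per_s: float) -> float:
    """Time to read the weights once at a sustained throughput."""
    return weights_gb / throughput_gb_per_s


WEIGHTS_GB = 60.0  # assumed checkpoint size; use the real size on disk

# Assumed sustained read throughputs in GB/s; measure these on the actual node.
throughputs = {"remote": 0.5, "nvme": 3.0}

estimates = {name: load_seconds(WEIGHTS_GB, tput) for name, tput in throughputs.items()}
saved_seconds = estimates["remote"] - estimates["nvme"]
print(estimates, saved_seconds)
```

    Under these assumptions the read drops from about two minutes to about twenty seconds. The real saving depends entirely on measured throughput and on how much of the pod startup is spent reading weights.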

    Summary of practical guidance based on the documentation:

    • Several minutes of provisioning time for large GPU SKUs is expected and called out in AKS GPU autoscaling guidance.
    • There are no documented VMSS/AKS knobs that drastically reduce GPU VM provisioning time beyond fast autoscaler reaction.
    • Use KEDA with GPU metrics (for example DCGM via Azure Managed Prometheus) to autoscale based on real-time GPU utilization.
    • Combine KEDA with Cluster Autoscaler or Node Autoprovisioning for infrastructure autoscaling.
    • For production, maintain a minimum number of warm GPU nodes to meet latency SLOs, and rely on autoscaling for additional burst capacity.
    • Consider techniques like GPU partitioning and better resource sizing (for example vertical pod autoscaler) to avoid overprovisioning and improve cost efficiency; local caching of models fits into this broader optimization strategy.

    AI-generated content may be incorrect. Read our transparency notes for more information.
