Optimize Azure Kubernetes Service (AKS) usage and costs

2025-02-27

This article provides guidance on how to optimize your Azure Kubernetes Service (AKS) usage and costs. It covers the following topics:

Automatic scaling
Cluster right-sizing
GPU optimizations
Multitenancy
Azure discounts

Automatic scaling

Horizontal pod autoscaling

The Horizontal Pod Autoscaler (HPA) monitors resource demand and automatically updates a workload resource, such as a deployment, to scale the number of pods to match demand. When load increases, the HPA deploys more pods. When load decreases and the number of pods is above the configured minimum, the HPA tells the workload resource to scale down.

By default, the Metrics API retrieves data from the kubelet every 60 seconds, and the HPA checks the Metrics API every 15 seconds for any needed changes. Because the underlying metrics refresh only every 60 seconds, the HPA effectively acts on new data once every 60 seconds. When you configure the HPA for a deployment, you define the minimum and maximum number of replicas that can run and the metrics that the HPA uses to determine when to scale.
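The scaling decision described above can be sketched in a few lines. The following is a minimal illustration of the replica formula documented for the Kubernetes HPA (desired replicas = ceil(current replicas × current metric / target metric)), clamped to the configured minimum and maximum. The function name, parameter names, and the 0.1 tolerance default are assumptions made for this example; the real controller also accounts for pod readiness, missing metrics, and stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas, tolerance=0.1):
    """Approximate the replica count an HPA-style formula would request."""
    ratio = current_metric / target_metric
    # Close enough to the target: keep the current replica count.
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Never go outside the configured minimum and maximum.
    return max(min_replicas, min(max_replicas, desired))


# Example: 4 pods averaging 90% CPU against a 60% target, bounds 2..10.
print(desired_replicas(4, 90, 60, min_replicas=2, max_replicas=10))
```

In this sketch, a workload running well above its target grows proportionally, while one running below its target shrinks only as far as the configured minimum.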

For more information, see Horizontal Pod Autoscaling and Autoscale pods in AKS.

Kubernetes event-driven autoscaling

The Kubernetes Event-driven Autoscaler (KEDA) applies event-driven autoscaling to your workloads. KEDA works alongside the HPA and extends its functionality without overwriting or duplicating it.

You can use the KEDA add-on for AKS to scale your applications and leverage a rich catalog of Azure KEDA scalers. For more information, see Application autoscaling with the KEDA add-on and Install the KEDA add-on for AKS.
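To make the idea of event-driven scaling concrete, the following sketch sizes a workload from the length of a message queue rather than from CPU or memory. It's a simplified illustration, not KEDA's implementation: the function name and parameters are hypothetical, and in a real cluster you would express the same intent through a KEDA scaler configuration rather than Python code.

```python
import math


def replicas_for_queue(queue_length, messages_per_replica,
                       min_replicas, max_replicas):
    """Size a workload from an event source (here, a queue backlog)."""
    if messages_per_replica <= 0:
        raise ValueError("messages_per_replica must be positive")
    # No pending events: fall back to the minimum, which can be zero.
    if queue_length <= 0:
        return min_replicas
    desired = math.ceil(queue_length / messages_per_replica)
    return max(min_replicas, min(max_replicas, desired))


# Example: 12 queued messages, each replica is expected to handle 5.
print(replicas_for_queue(12, 5, min_replicas=0, max_replicas=20))
```

Allowing a minimum of zero is what lets an idle event-driven workload release its resources entirely, which is where much of the cost saving comes from.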

Vertical pod autoscaling

The Vertical Pod Autoscaler (VPA) automatically sets resource requests and limits on containers per workload based on past usage. By right-sizing these requests, the VPA frees up CPU and memory for other pods and helps ensure effective utilization of your AKS clusters. Over time, the VPA provides recommendations for resource usage.
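The following sketch shows one way a request could be derived from past usage: take a high percentile of observed usage and add a safety margin. It's a deliberate simplification and not the VPA recommender itself, which weighs recent samples more heavily and handles CPU and memory differently. The function name, the 90th percentile, and the 15% margin are assumptions chosen for illustration.

```python
import math


def recommend_request(usage_samples, percentile=90, margin_pct=15):
    """Recommend a resource request from past usage samples.

    usage_samples: observed usage values, for example CPU in millicores.
    percentile:    whole-number percentile of usage to cover (assumed 90).
    margin_pct:    headroom added on top, in percent (assumed 15).
    """
    if not usage_samples:
        raise ValueError("need at least one usage sample")
    ordered = sorted(usage_samples)
    # Nearest-rank percentile: the smallest value that covers the
    # requested share of the samples.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1] * (100 + margin_pct) / 100


# Example: ten CPU samples in millicores with one short spike to 500m.
samples = [100, 120, 110, 130, 500, 115, 125, 105, 118, 122]
print(recommend_request(samples))
```

Using a percentile instead of the maximum keeps a single spike from inflating the request, so the pod reserves less capacity that it rarely uses.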

For more information, see Vertical pod autoscaling in Azure Kubernetes Service (AKS) and Use the Vertical Pod Autoscaler (VPA) in Azure Kubernetes Service (AKS).

Cluster right-sizing

Right-size your cluster

It's important to right-size your clusters to optimize costs and performance. You can manually resize a cluster by adding or removing nodes to meet the needs of your applications. You can also autoscale your cluster to automatically adjust the number of nodes in response to changing demands.

For more information, see Resize Azure Kubernetes Service (AKS) clusters.

Cluster autoscaling

With the cluster autoscaler, you can automatically scale node pools based on resource usage and constraints, such as scaling up to schedule pending pods or scaling down to reduce costs for unused nodes. The cluster autoscaler profile is a set of parameters that you can fine-tune to control the behavior of the cluster autoscaler.
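The two triggers mentioned above, pending pods and underused nodes, can be sketched as a small decision function. This is an illustration only: the function and parameter names are hypothetical, and the 0.5 utilization threshold is an assumed value standing in for the kind of parameter you tune in the cluster autoscaler profile. The real autoscaler also simulates scheduling, honors pod disruption budgets, and waits for configurable delays before removing nodes.

```python
def autoscaler_decision(pending_pods, node_utilizations,
                        min_nodes, max_nodes, scale_down_threshold=0.5):
    """Return "scale_up", "scale_down", or "no_change" for a node pool.

    node_utilizations: one value per node, requested / allocatable (0..1).
    """
    node_count = len(node_utilizations)
    # Pods that can't be scheduled take priority: add capacity if allowed.
    if pending_pods > 0 and node_count < max_nodes:
        return "scale_up"
    # With nothing pending, an underused node is a removal candidate.
    underused = any(u < scale_down_threshold for u in node_utilizations)
    if pending_pods == 0 and node_count > min_nodes and underused:
        return "scale_down"
    return "no_change"


# Example: nothing pending, and one of two nodes is only 20% requested.
print(autoscaler_decision(0, [0.8, 0.2], min_nodes=1, max_nodes=5))
```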

For more information, see Cluster autoscaling in Azure Kubernetes Service (AKS) overview and Use the cluster autoscaler in Azure Kubernetes Service (AKS).

Node autoprovisioning (preview)

Node autoprovisioning (NAP) (preview), based on the open-source Karpenter project, helps you provision the right infrastructure based on the pending pod resource requirements of your workloads. With efficient bin-packing, you can consolidate your workloads onto the right-sized infrastructure to reduce operating costs.
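Bin-packing is easier to reason about with a small example. The following sketch uses the classic first-fit-decreasing heuristic to place pod CPU requests onto as few equally sized nodes as it can. It isn't the algorithm that NAP or Karpenter uses, which also weighs multiple VM sizes, memory, and scheduling constraints. The function name and the single-resource model are assumptions for illustration.

```python
def first_fit_decreasing(pod_cpu_requests, node_cpu_capacity):
    """Pack pod CPU requests (millicores) onto identical nodes.

    Returns a list of nodes, each a list of the requests placed on it.
    """
    nodes = []  # requests placed on each node
    free = []   # remaining capacity of each node
    # Placing the largest pods first tends to leave fewer awkward gaps.
    for request in sorted(pod_cpu_requests, reverse=True):
        if request > node_cpu_capacity:
            raise ValueError(f"request {request} exceeds node capacity")
        for i, remaining in enumerate(free):
            if request <= remaining:
                nodes[i].append(request)
                free[i] -= request
                break
        else:
            # No existing node has room: provision a new one.
            nodes.append([request])
            free.append(node_cpu_capacity - request)
    return nodes


# Example: six pending pods packed onto nodes with 4000m allocatable CPU.
packed = first_fit_decreasing([2000, 500, 1500, 1000, 3000, 1000], 4000)
print(len(packed), packed)
```

Fewer, fuller nodes mean less paid-for capacity sitting idle, which is the cost benefit that consolidation aims for.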

For more information, see Node autoprovisioning (preview) in Azure Kubernetes Service (AKS).

GPU optimizations

GPU partitioning helps combat underutilization by splitting up or sharing GPUs across multiple workloads. The following sections cover different ways to partition and share GPUs in AKS.

Time-slicing

The NVIDIA GPU Operator enables the time-slicing of GPUs in Kubernetes clusters. With time-slicing, a system administrator can define a set of replicas for a GPU, each of which can be assigned independently to a pod to run workloads. You can apply cluster-wide default time-slicing configurations and node-specific configurations.
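The effect of defining replicas can be sketched as follows. The helper functions are hypothetical, and the dictionary layout is an assumption that mirrors the general shape of a time-slicing configuration, so check the NVIDIA documentation for the exact schema before you rely on it. The key point is that the number of schedulable GPU resources becomes the number of physical GPUs multiplied by the replica count.

```python
def time_slicing_config(replicas_per_gpu):
    """Build a dictionary shaped like a time-slicing configuration.

    The field names are an assumption for this example, not a verified
    schema.
    """
    if replicas_per_gpu < 1:
        raise ValueError("replicas_per_gpu must be at least 1")
    return {
        "version": "v1",
        "sharing": {
            "timeSlicing": {
                "resources": [
                    {"name": "nvidia.com/gpu", "replicas": replicas_per_gpu},
                ],
            },
        },
    }


def advertised_gpus(physical_gpus, replicas_per_gpu):
    """Number of GPU resources a node would expose to the scheduler."""
    if physical_gpus < 0 or replicas_per_gpu < 1:
        raise ValueError("invalid GPU or replica count")
    return physical_gpus * replicas_per_gpu


# Example: a node with 2 physical GPUs, each shared as 4 replicas.
print(advertised_gpus(2, 4))
```

Each replica is a share of GPU time rather than a dedicated slice of hardware, so pods that share a GPU this way compete for the same memory and compute.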

Screenshot of a visual chart example showing GPU time-slicing.

For more information, see Time-slicing GPUs in Kubernetes.

Multi-Process Service (MPS)

A single process might not utilize all the memory and compute bandwidth capacity available on a GPU. The Multi-Process Service (MPS) enables logical partitioning of memory and compute resources between workloads and allows kernel and memcopy operations from different processes to overlap on the GPU. MPS helps you achieve higher GPU utilization and shorter running times.

Screenshot of a visual chart example showing GPU multi-process service (MPS).

For more information, see Multi-Process Service (MPS).

Multi-instance GPUs (MIGs)

Multi-instance GPUs (MIGs) enable you to partition GPUs based on the NVIDIA Ampere and later architectures into separate and secure GPU instances for CUDA applications.

Screenshot of a visual chart example showing multi-instance GPUs (MIGs).

For more information, see GPU Operator with MIG and Create a multi-instance GPU node pool in Azure Kubernetes Service (AKS).

Multitenancy

Multitenancy refers to the sharing of infrastructure across tenants, teams, and business units. The following table outlines different ways to implement multitenancy in AKS:

Multitenancy type	Multitenancy level	Cluster pod density	Cost allocation	Ideal use case	Potential risks
Dedicated cluster	Hard multitenancy	Lower	Easiest	Complete security isolation boundaries and straightforward cost allocation	• Cluster sprawl at scale adds to management overhead costs • Lower pod density and more overprovisioned resources
Dedicated node pool	Soft multitenancy	Medium	Medium	Medium pod density	• Requires trust between tenants • Requires extra cluster configurations, like network policies, quota management, role-based access control (RBAC), etc.
Dedicated namespace	Soft multitenancy	Higher	Harder	Sharing infrastructure to maximize resource utilization	• Unsafe for hostile environments by default • Requires extra cluster configurations, like network policies, quota management, role-based access control (RBAC), etc.

Dedicated cluster

With dedicated cluster multitenancy, clusters are dedicated to a single workload or team.

Screenshot of a visual chart example showing dedicated cluster multitenancy.

The following table outlines pros and cons of using a dedicated cluster:

Pros	Cons
• Easier isolation method • Straightforward cost allocation and chargeback • Great for cases where tenants don't trust each other (often from security and resource sharing perspectives)	• High management and financial overhead • Generally low pod density and overprovisioned resources

Dedicated node pool

With dedicated node pool multitenancy, clusters are shared by many tenants, with each tenant's workloads running on a node pool dedicated to that tenant.

Screenshot of a visual chart example showing dedicated node pool multitenancy.

The following table outlines pros and cons of using a dedicated node pool:

Pros	Cons
• Medium pod density • Some shared infrastructure • Apply Azure tags to node pools dedicated to a single tenant (tags propagate to nodes and persist through upgrades)	• Requires trust between the tenants • Requires extra cluster configurations, like network policies, quota management, role-based access control (RBAC), etc.

Dedicated namespace

With dedicated namespace multitenancy, clusters are shared by many tenants, with namespaces serving as the isolation boundary.

Screenshot of a visual chart example showing dedicated namespace multitenancy.

The following table outlines pros and cons of using a dedicated namespace:

Pros	Cons
• Higher pod density • Best binpacking • Sharing infrastructure to maximize resource utilization	• Unsafe for hostile environments by default • Requires extra security measures in place if all tenants can't be trusted
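Cost allocation is harder with shared namespaces because several tenants run on the same nodes. One common approach, sketched here under simplifying assumptions, is to split the shared bill in proportion to each namespace's resource requests. The function name and the namespace names are hypothetical, and the sketch considers only CPU requests. A real allocation model would usually also account for memory, storage, idle capacity, and shared system overhead.

```python
def allocate_cost(total_cost, cpu_requests_by_namespace):
    """Split a shared cost across namespaces by their CPU requests.

    cpu_requests_by_namespace: namespace name -> requested millicores.
    """
    total_requested = sum(cpu_requests_by_namespace.values())
    if total_requested <= 0:
        raise ValueError("total CPU requests must be positive")
    return {
        namespace: total_cost * requested / total_requested
        for namespace, requested in cpu_requests_by_namespace.items()
    }


# Example: a 1000-unit monthly bill shared by two tenant namespaces.
print(allocate_cost(1000, {"team-a": 6000, "team-b": 2000}))
```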

Azure discounts

To take savings one step further, take advantage of Azure discounts such as Azure Savings Plans, Reserved Instances, and Azure Hybrid Benefits.

Azure discount type	Details
Azure Savings Plans	• One-year or three-year upfront commitment • Save up to 65% compared to pay-as-you-go • Flexible, with no SKU family or region restrictions • Best for workloads with consistent spend across resources in various SKUs and regions
Reserved Instances	• One-year or three-year upfront commitment • Save up to 72% compared to pay-as-you-go • Restricted to specific SKU families and regions • Best for stable workloads running continuously (with no unexpected SKU or region changes)
Azure Hybrid Benefits	• Bring your own on-premises Windows Server and SQL Server licenses to Azure • Use any qualifying on-premises licenses that have an active Software Assurance (SA) or qualifying subscription
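A short calculation can help you judge whether a commitment pays off. The following sketch compares pay-as-you-go cost with the cost of a committed discount and derives the break-even utilization, which is the share of hours a resource must actually run for the commitment to cost less. The hourly rate and function names are hypothetical, the percentages in the table are upper bounds ("up to"), and the sketch assumes that you pay for every hour of the commitment whether or not the resource is running.

```python
def committed_cost(payg_hourly_rate, discount_pct, hours):
    """Cost of a commitment that is billed for every hour in the period."""
    return payg_hourly_rate * hours * (100 - discount_pct) / 100


def payg_cost(payg_hourly_rate, hours, utilization_pct):
    """Pay-as-you-go cost when the resource runs only part of the time."""
    return payg_hourly_rate * hours * utilization_pct / 100


def break_even_utilization_pct(discount_pct):
    """Utilization above which the commitment is the cheaper option."""
    return 100 - discount_pct


# Example: a hypothetical 1.00-per-hour VM over a 730-hour month with the
# maximum 72% discount listed in the table.
print(committed_cost(1.0, 72, 730))
print(payg_cost(1.0, 730, 50))
print(break_even_utilization_pct(72))
```

At the maximum discounts listed in the table, the commitment costs less whenever the resource runs more than about a third of the time. Smaller discounts push the break-even point higher.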

Next steps

To learn more about cost in AKS, see the following articles: