Troubleshoot high CPU usage in AKS clusters

High CPU usage is a symptom of one or more applications or processes that require so much CPU time that the performance or usability of the machine is impacted. High CPU usage can occur in many ways, but it's mostly caused by user configuration.

When a node in an Azure Kubernetes Service (AKS) cluster experiences high CPU usage, the applications running on it can suffer degraded performance and reliability. Applications or processes can also become unstable, which may lead to issues beyond slow responses.

This article helps you identify the nodes and containers that consume high CPU and provides best practices to resolve high CPU usage.

Symptoms

The following table outlines the common symptoms of high CPU usage:

Symptom | Description
CPU starvation | CPU-intensive applications slow down other applications on the same node.
Slow state changes | Pods may take longer to get ready.
NotReady node state | A node enters the NotReady state. This issue occurs because the container with high CPU usage can make the kubelet (the node agent that reports node status) unresponsive.

Troubleshooting checklist

To resolve high CPU usage, use effective monitoring tools and apply best practices.

Step 1: Identify nodes and containers with high CPU usage

Use either of the following methods to identify nodes and containers with high CPU usage:

  • In a web browser, use the Container Insights feature of AKS in the Azure portal.

  • In a console, use the Kubernetes command-line tool (kubectl).

Container Insights is a feature within AKS that's designed to monitor the performance of container workloads. You can use Container Insights to identify the nodes, containers, or pods that drive high CPU usage.

To identify nodes, containers, or pods that drive high CPU usage, follow these steps:

  1. Navigate to the cluster from the Azure portal.

  2. Under Monitoring, select Insights.

    Screenshot of the Insights option under Monitoring.

  3. Set the appropriate Time range.

    Screenshot of a time range of six hours.

  4. Locate the nodes with high CPU usage and check if the node CPU usage is stable.

    Select Nodes. Set Metric to CPU Usage (millicores), and then set the sample to Max. Sort by the Max column to order the nodes by Max %. The nodes with the highest CPU usage appear at the top.

    In the following screenshot, the node only uses 12% of the max CPU and has been running for 16 days.

    Screenshot of the Nodes under the Monitoring selection.

  5. Once you locate the nodes with high CPU usage, select each node to view the pods running on it and their CPU usage.

    Screenshot of the insights option for pods under the Monitoring selection.

    Note

    The percentage of CPU or memory usage shown for pods is based on the resource request specified for the container. It doesn't represent the percentage of the node's CPU or memory that the pod uses. So, look at the actual CPU or memory usage rather than the percentage shown for pods.

    Once you get the list of pods with high CPU usage, you can map them to the applications that cause the spike in CPU usage.
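For the kubectl method listed earlier, `kubectl top node` and `kubectl top pod` report actual CPU usage in millicores, which is the figure the preceding note recommends comparing. The following Python sketch shows one possible way to rank that output and to contrast "percentage of request" with "percentage of node". It assumes the cluster's metrics API is available so that `kubectl top` works. The helper names, the sample text, and the 500m request and 4000m node size are hypothetical values chosen for illustration, not real cluster output.

```python
import subprocess
from typing import List, Tuple


def to_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity ('250m' or '2') to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def rank_by_cpu(top_output: str) -> List[Tuple[str, int]]:
    """Parse `kubectl top node` or `kubectl top pod` text output.

    Assumes the first column is NAME and the second is CPU(cores), which is
    the layout for `kubectl top node` and for `kubectl top pod` in a single
    namespace. Returns (name, millicores) pairs, highest CPU first.
    """
    rows = []
    for line in top_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            rows.append((fields[0], to_millicores(fields[1])))
        except ValueError:
            continue  # skip values that are not CPU quantities, e.g. '<unknown>'
    return sorted(rows, key=lambda row: row[1], reverse=True)


def percent_of(used_millicores: int, reference_millicores: int) -> float:
    """Usage as a percentage of a reference value (a request or a node size)."""
    return 100.0 * used_millicores / reference_millicores


def kubectl_top(*args: str) -> str:
    """Run `kubectl top ...` and return its text output (needs cluster access)."""
    result = subprocess.run(
        ["kubectl", "top", *args], capture_output=True, text=True, check=True
    )
    return result.stdout


# Made-up sample in the shape of `kubectl top pod` output, for illustration only.
SAMPLE_PODS = """\
NAME        CPU(cores)   MEMORY(bytes)
web-1       120m         210Mi
worker-7    1850m        512Mi
cache-0     35m          96Mi
"""

ranked = rank_by_cpu(SAMPLE_PODS)
top_name, top_cpu = ranked[0]
print(top_name, top_cpu)

# The same pod looks very different depending on the reference value
# (hypothetical 500m request, hypothetical 4000m node):
print(percent_of(top_cpu, 500))    # percentage of the container's request
print(percent_of(top_cpu, 4000))   # percentage of the node's CPU
```

With cluster access, `rank_by_cpu(kubectl_top("node"))` or `rank_by_cpu(kubectl_top("pod", "-n", "default"))` would apply the same ranking to live data. `kubectl top pod --sort-by=cpu` also sorts directly on the command line.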

Step 2: Review best practices to avoid high CPU usage

Review the following table to learn how to implement best practices for avoiding high CPU usage:

Best practice | Description
Set appropriate limits for containers | Kubernetes allows you to specify requests and limits on the resources for containers. Resource requests and limits represent the minimum and maximum amount of resources a container can use. We recommend that you set requests and limits so that each pod gets the appropriate Kubernetes Quality of Service (QoS) class.
Enable Horizontal Pod Autoscaler (HPA) | HPA adjusts the number of pod replicas based on observed metrics such as CPU utilization. Setting appropriate limits along with enabling HPA can help resolve high CPU usage.
Select higher SKU VMs | To handle high CPU workloads, use higher SKU VMs. To do this, create a new node pool that uses the larger VM size, cordon the nodes in the existing node pool to make them unschedulable, and then drain the existing node pool.
Isolate system and user workloads | We recommend that you create a separate user node pool (other than the system node pool) to run your application workloads. This can prevent overloading the system node pool and provide better performance.
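The first two rows of the table can be made concrete with a short Python sketch. It is a simplified illustration, not the Kubernetes implementation: `qos_class` applies the documented QoS rules to a list of per-container `resources` dictionaries, and `desired_replicas` applies the basic HPA scaling formula. The function names and sample values are hypothetical. Quantities are compared as written (so "500m" and "0.5" would be treated as different), and the HPA helper ignores tolerance, replica bounds, and stabilization windows.

```python
import math
from typing import Dict, List

Resources = Dict[str, Dict[str, str]]  # e.g. {"requests": {...}, "limits": {...}}


def qos_class(containers: List[Resources]) -> str:
    """Classify a pod as Guaranteed, Burstable, or BestEffort.

    Each item is the `resources` section of one container in the pod.
    """
    anything_set = False
    guaranteed = True
    for resources in containers:
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        for name in ("cpu", "memory"):
            limit = limits.get(name)
            # Kubernetes defaults a missing request to the limit when a limit is set.
            request = requests.get(name, limit)
            if request is not None or limit is not None:
                anything_set = True
            # Guaranteed needs a limit for both resources, equal to the request.
            if limit is None or request != limit:
                guaranteed = False
    if not anything_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


def desired_replicas(current_replicas: int, current_util: int, target_util: int) -> int:
    """Basic HPA formula: ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_util / target_util)


# Hypothetical container resource settings.
equal = {"requests": {"cpu": "500m", "memory": "256Mi"},
         "limits": {"cpu": "500m", "memory": "256Mi"}}
cpu_burst = {"requests": {"cpu": "250m"}, "limits": {"cpu": "500m"}}

print(qos_class([equal]))       # requests equal limits for CPU and memory
print(qos_class([cpu_burst]))   # request below limit, no memory settings
print(qos_class([{}]))          # nothing set at all

# 4 replicas averaging 90% CPU against a 50% target.
print(desired_replicas(4, 90, 50))
```

Because HPA measures CPU utilization relative to the containers' CPU requests, the two practices work together: without requests, a utilization target has nothing to be measured against.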

Contact us for help

If you have questions or need help, create a support request or ask Azure community support. You can also submit product feedback to the Azure feedback community.