Troubleshoot high CPU usage in AKS clusters

08/30/2024

High CPU usage is a symptom of one or more applications or processes that require so much CPU time that the performance or usability of the machine is impacted. High CPU usage can occur in many ways, but it's mostly caused by user configuration.

When a node in an Azure Kubernetes Service (AKS) cluster experiences high CPU usage, the applications running on it can experience degradation in performance and reliability. Applications or processes also become unstable, which may lead to issues beyond slow responses.

This article helps you identify the nodes and containers that consume high CPU and provides best practices to resolve high CPU usage.

Symptoms

The following table outlines the common symptoms of high CPU usage:

Symptom	Description
CPU starvation	CPU-intensive applications slow down other applications on the same node.
Slow state changes	Pods may take longer to get ready.
NotReady node state	A node enters the NotReady state. This issue occurs because the container with high CPU usage causes the kubelet on the node to become unresponsive.
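
One way to check for the NotReady symptom from a console is to filter the output of kubectl get nodes. The following is a minimal sketch, not part of the original guidance: the helper name not_ready_nodes is illustrative, and it assumes the default column layout in which the second column is STATUS.

```shell
# Print the name and status of every node whose STATUS is not exactly "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
# Cordoned nodes ("Ready,SchedulingDisabled") are listed too, by design.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1, $2 }'
}

# Usage against a live cluster:
#   kubectl get nodes --no-headers | not_ready_nodes
```

An empty result means that every node reports Ready. It doesn't rule out CPU starvation or slow state changes.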

Troubleshooting checklist

To resolve high CPU usage, use effective monitoring tools and apply best practices.

Step 1: Identify nodes and containers with high CPU usage

Use either of the following methods to identify nodes and containers with high CPU usage:

In a web browser, use the Container Insights feature of AKS in the Azure portal.
In a console, use the Kubernetes command-line tool (kubectl).

Browser

Container Insights is a feature within AKS that monitors the performance of container workloads. You can use Container Insights to identify nodes, containers, or pods that drive high CPU usage.

To do so, follow these steps:

Navigate to the cluster from the Azure portal.
Under Monitoring, select Insights.
Set the appropriate Time range.
Locate the nodes with high CPU usage and check if the node CPU usage is stable.

Select Nodes. Set Metric to CPU Usage (millicores), and then set the sample to Max. Sort by the Max % column so that the nodes with the highest CPU usage appear at the top.

For example, a node might use only 12% of its max CPU and have been running for 16 days.

After you locate the nodes with high CPU usage, select each node to see the pods on it and their CPU usage.

Note

The percentage of CPU or memory usage for pods is based on the CPU request specified for the container. It doesn't represent the percentage of the CPU or memory usage for the node. So, look at the actual CPU or memory usage rather than the percentage of CPU or memory usage for pods.
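
To see why the pod percentage can be misleading, it helps to work through the arithmetic. The sketch below uses made-up numbers (a pod that uses 300 millicores, requests 100 millicores, and runs on a node with 3860 millicores of allocatable CPU). The helper name cpu_percent is illustrative only.

```shell
# Express used millicores as a percentage of a base value
# (either the pod's CPU request or the node's allocatable CPU).
cpu_percent() {
  awk -v used="$1" -v base="$2" 'BEGIN { printf "%.1f\n", used * 100 / base }'
}

# Hypothetical pod: uses 300m, requests 100m, node has 3860m allocatable.
cpu_percent 300 100    # relative to the pod's request: prints 300.0
cpu_percent 300 3860   # relative to the node's allocatable CPU: prints 7.8
```

The same pod shows 300% against its own request but accounts for less than 8% of the node, which is why the actual usage value is the better signal.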

Once you get the list of pods with high CPU usage, you can map it to the applications that cause the spike in CPU usage.

Command Line

Note

This method shows CPU usage only at the moment you run the commands. It can't be used to diagnose past CPU spikes.

Use the kubectl top node command to get the CPU usage of all nodes.
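
On clusters with many nodes, it can help to filter the kubectl top node output down to the nodes above a threshold. The following is a minimal sketch under the assumption that the output uses the usual column order (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%). The helper name hot_nodes and the default threshold of 80 are illustrative choices.

```shell
# Print nodes whose CPU% column meets or exceeds a threshold (default 80).
# Reads `kubectl top node --no-headers` output on stdin.
# Rows without a numeric CPU% (for example "<unknown>") are skipped.
hot_nodes() {
  awk -v limit="${1:-80}" '{
    pct = $3
    sub(/%$/, "", pct)                     # strip the trailing percent sign
    if (pct ~ /^[0-9]+$/ && pct + 0 >= limit + 0) print $1, $3
  }'
}

# Usage against a live cluster:
#   kubectl top node --no-headers | hot_nodes 80
```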

Get the list of pods running on the node and their CPU usage by running the following command. Replace <node_name> with the actual node name.

kubectl get pods --all-namespaces -o wide | grep <node_name> | awk '{print $1" "$2}' | xargs -n2 kubectl top pods --no-headers --namespace | sort -t ' ' --key 2 --numeric --reverse
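
Because grep matches the node name anywhere in a line, an alternative is to select pods by the spec.nodeName field and then filter a single kubectl top pods call. The following is a sketch of that approach rather than a replacement for the command above. The function names are illustrative, and it assumes the four-column output of kubectl top pods --all-namespaces (NAMESPACE, NAME, CPU(cores), MEMORY(bytes)).

```shell
# Keep only the `kubectl top pods` rows whose namespace/name pair appears in a
# pod list, then sort by CPU (column 3, millicores) in descending order.
#   $1    : file with "NAMESPACE NAME" per line (the pods on the node)
#   stdin : `kubectl top pods --all-namespaces --no-headers` output
filter_top_by_pods() {
  awk 'NR == FNR { keep[$1 "/" $2] = 1; next } ($1 "/" $2) in keep' "$1" - |
    sort -k3,3nr
}

# Wrapper that queries the cluster. $1 is the node name.
pods_on_node_by_cpu() {
  ponc_list=$(mktemp)
  kubectl get pods --all-namespaces --no-headers \
    --field-selector "spec.nodeName=$1" \
    -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name > "$ponc_list"
  kubectl top pods --all-namespaces --no-headers | filter_top_by_pods "$ponc_list"
  rm -f "$ponc_list"
}

# Usage against a live cluster:
#   pods_on_node_by_cpu <node_name>
```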

Check the requests and limits for each pod on the node by using the kubectl describe node <node_name> command.

Note

The percentage of CPU or memory usage for the node is based on the allocatable resources on the node rather than the actual node capacity.
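
To compare a node's capacity with its allocatable CPU, you can read both values from the node status. The sketch below is illustrative: the function names are hypothetical, and the sample values in the comments (2 cores of capacity, 1900m allocatable) are an assumption about a small node, not output from a real cluster.

```shell
# Convert a Kubernetes CPU quantity ("2", "0.5", "1900m") to millicores.
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v c="$1" 'BEGIN { printf "%.0f\n", c * 1000 }' ;;
  esac
}

# Print "<capacity> <allocatable>" CPU for a node, for example "2 1900m".
node_cpu_capacity_vs_allocatable() {
  kubectl get node "$1" \
    -o jsonpath='{.status.capacity.cpu} {.status.allocatable.cpu}{"\n"}'
}

# Usage against a live cluster:
#   node_cpu_capacity_vs_allocatable <node_name>
#   to_millicores 1900m
```

Converting both values to millicores makes it easier to compare them with the CPU(cores) column that kubectl top node reports.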

After you identify the pods that use excessive CPU, you can identify the applications running on the pods.

Step 2: Review best practices to avoid high CPU usage

Review the following table to learn how to implement best practices for avoiding high CPU usage:

Best practice	Description
Set appropriate limits for containers	Kubernetes allows specifying requests and limits on the resources for containers. Resource requests and limits represent the minimum and maximum amount of resources that a container can use. We recommend that you set appropriate requests and limits so that each pod gets the appropriate Kubernetes Quality of Service (QoS) class.
Enable Horizontal Pod Autoscaler (HPA)	Setting appropriate requests and limits and enabling HPA lets a workload scale out to more pods when CPU usage rises, which can help resolve high CPU usage.
Select higher SKU VMs	To handle high CPU workloads, use higher SKU VMs. To do this, create a new node pool that uses the higher SKU, cordon the nodes in the existing node pool to make them unschedulable, and then drain the existing node pool so that the workloads move to the new nodes.
Isolate system and user workloads	We recommend that you create a separate user node pool (other than the system node pool) to run your workloads. This can prevent overloading the system node pool and provide better performance.
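
The practices in the table can be sketched as kubectl and Azure CLI commands. The following is a hedged example, not a definitive procedure: every function name is illustrative, and the namespace, deployment, pool names, and VM size in the usage comments are placeholders. It assumes that AKS labels each node with agentpool=<pool name>. Check kubectl autoscale --help for the CPU target flag that your kubectl version supports. By default, the sketch only prints each command so that you can review it. Set DRY_RUN=0 to run the commands.

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Set the CPU request and limit on a deployment.
# Args: namespace deployment request limit
set_cpu_bounds() {
  run kubectl -n "$1" set resources deployment "$2" \
    "--requests=cpu=$3" "--limits=cpu=$4"
}

# Create an HPA that targets average CPU utilization (percent of the request).
# Args: namespace deployment min_replicas max_replicas cpu_percent
enable_hpa() {
  run kubectl -n "$1" autoscale deployment "$2" \
    "--min=$3" "--max=$4" "--cpu-percent=$5"
}

# Add a user node pool with a larger VM size.
# Args: resource_group cluster_name pool_name vm_size
add_user_pool() {
  run az aks nodepool add --resource-group "$1" --cluster-name "$2" \
    --name "$3" --node-vm-size "$4" --mode User
}

# Cordon and then drain every node in the old pool.
# Args: old_pool_name (assumes the agentpool=<pool> node label)
cordon_and_drain_pool() {
  run kubectl cordon -l "agentpool=$1"
  run kubectl drain -l "agentpool=$1" --ignore-daemonsets --delete-emptydir-data
}

# Usage (placeholders):
#   set_cpu_bounds default my-app 250m 500m
#   enable_hpa default my-app 2 10 70
#   add_user_pool my-rg my-cluster bigpool Standard_D8s_v5
#   cordon_and_drain_pool oldpool
```

Draining evicts pods, so review the printed commands and make sure the new node pool has capacity before you set DRY_RUN=0.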


Contact us for help

If you have questions or need help, create a support request or ask Azure community support. You can also submit product feedback to the Azure feedback community.
