CPU Pressure and system node pool CPU limits >> 100%

Question

FedericoZarelli-6386 60

Hello Team,

Please give me a hand troubleshooting a CPU Pressure issue I am having on this AKS cluster:

Kubernetes version: 1.30.3

Nodepools:

System node pools
- Autoscale: True (enabled after CPU Pressure report)
- Node size: Standard_E2ds_v4
- Taints: CriticalAddonsOnly=true:NoSchedule (enabled after CPU Pressure report)
- Min nodes: 3 (increased after CPU Pressure report)

User node pools
- Autoscale: True
- Node size: Standard_E4ds_v4

For the system node pools, I am using a VM size that is smaller than recommended because of availability in my region, and I noticed that the CPU limits on these nodes are 400% and 200%.

Now:

What is the impact of having such high limits? Should I just scale horizontally until the limits are within 100%?
These CPU pressure events seem to occur regularly every week on the same day. Is there any weekly job being run by the system pool?

Thanks in advance!

Anonymous

2024-11-27T13:56:37.6866667+00:00

Hi FedericoZarelli-6386,

Welcome to the Microsoft Q&A Platform! Thank you for asking your question here.

We are reviewing your query regarding troubleshooting a CPU Pressure issue on an AKS cluster and will provide you with an update shortly.
Anonymous

2024-11-28T18:06:55.3766667+00:00

Hi FedericoZarelli-6386,
Just checking in to see whether you have had a chance to look at the comment posted to resolve the issue.

If the information is helpful, please consider clicking Upvote on the post.

Thank you.

Answer 1

Anonymous

Hi FedericoZarelli-6386,
Thank you for reaching out to us on the Microsoft Q&A forum.
CPU limits such as 400% or 200% mean that the combined CPU limits of the pods on a node are four or two times the CPU the node can actually provide. The node is overcommitted, which can cause performance issues under certain conditions.
Below are some potential impacts:

Overcommitted limits let workloads compete for CPU cycles, particularly when nodes are fully utilized. The scheduler places pods according to their CPU requests, not their limits; limits are enforced on the node itself, where a container that reaches its CPU limit is throttled, leading to degraded performance. Overcommitment goes unnoticed while pods are idle but can cause significant performance degradation during peak demand.
Scaling horizontally by adding nodes can help distribute the load, provided the autoscaler responds effectively to CPU pressure. However, scaling alone may not resolve the issue unless resource requests and limits are configured properly.
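
As a rough illustration of where a figure like 400% comes from, the sketch below sums the CPU limits of the containers on a node and divides by the node's allocatable CPU. The list of limits and the 1900m allocatable value are made-up example numbers, not values taken from this cluster; the real ones are shown by "kubectl describe node".

```python
def parse_cpu(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity ("250m", "2", "0.5") to millicores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def limits_percent(container_limits: list[str], allocatable: str) -> int:
    """Sum of container CPU limits as a percentage of allocatable CPU."""
    total = sum(parse_cpu(q) for q in container_limits)
    return round(100 * total / parse_cpu(allocatable))


# Hypothetical limits of the containers scheduled on one 2-vCPU node.
limits = ["2", "2", "1500m", "1", "500m", "600m"]

# Assumption: about 1900m of the 2 vCPUs is allocatable after reservations.
pct = limits_percent(limits, "1900m")
print(f"CPU limits: {pct}% of allocatable")
```

A value above 100% is not an error in itself. It only means that the containers cannot all run at their limit at the same time.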

If high-CPU workloads are running on the system node pools, consider moving them to a dedicated user node pool. The recurring CPU pressure events in your AKS cluster might be related to a weekly scheduled job or system activity, although AKS itself does not include predefined weekly tasks specifically tied to system node pools.

Use the command "kubectl get cronjobs -A" to identify any weekly CronJobs running in any namespace of the cluster.
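
On a cluster with many CronJobs, the weekly ones can be picked out of the JSON form of that listing. The following is a minimal sketch, assuming the usual five-field cron schedule in which the fifth field is the day of week; the sample data and job names are invented for illustration.

```python
def weekly_cronjobs(cronjob_list: dict) -> list[tuple[str, str, str]]:
    """Return (namespace, name, schedule) for CronJobs tied to specific weekdays."""
    hits = []
    for item in cronjob_list.get("items", []):
        schedule = item["spec"]["schedule"].strip()
        fields = schedule.split()
        # Five-field cron: minute hour day-of-month month day-of-week.
        pinned_weekday = len(fields) == 5 and fields[4] not in ("*", "?")
        if schedule == "@weekly" or pinned_weekday:
            meta = item["metadata"]
            hits.append((meta["namespace"], meta["name"], schedule))
    return hits


# Hypothetical sample in the shape of a CronJob list object.
sample = {
    "items": [
        {"metadata": {"namespace": "batch", "name": "weekly-report"},
         "spec": {"schedule": "0 2 * * 3"}},
        {"metadata": {"namespace": "batch", "name": "nightly-cleanup"},
         "spec": {"schedule": "0 1 * * *"}},
        {"metadata": {"namespace": "ops", "name": "reindex"},
         "spec": {"schedule": "@weekly"}},
    ]
}

for namespace, name, schedule in weekly_cronjobs(sample):
    print(f"{namespace}/{name}: {schedule}")
```

To run it against a real cluster, the output of "kubectl get cronjobs -A -o json" can be loaded with json.load and passed to weekly_cronjobs in place of the sample. CronJobs are only one possible source of weekly load, so an empty result does not rule out other scheduled activity.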

Please find the below documents for more information:

If the information is helpful, please consider clicking "Accept answer" and "Upvote" on the post.

Anonymous

2024-11-29T19:44:31.9466667+00:00

Hi FedericoZarelli-6386,
I just wanted to check whether you had a chance to review the answer. If you found it helpful, could you kindly click “Accept Answer” and upvote my post? This will help increase the visibility of this question for other members of the Microsoft Q&A community. Thank you.
FedericoZarelli-6386 60 Reputation points

2024-12-02T13:30:01.2366667+00:00
Hi Sinrud,

Thank you for your answer.

I was already isolating my own applications from the system node pools, so they should not affect each other anymore, and I also enabled the autoscaler. Nevertheless, I unfortunately got another CPU pressure event; the autoscaler doesn't seem to be able to react fast enough before the cluster crashes and restarts.

So, some additional questions:

Is there a way to control the CPU limits of system node pools?

Is my selected SKU too small for AKS?

Is it expected that CPU Pressure events result in the whole cluster restarting?

I have a self-hosted Prometheus for monitoring, but unfortunately it loses all metrics once the container restarts, so I am not even able to tell which pod is consuming all the CPU. Any recommendation here?
Anonymous

2024-12-03T17:56:13.46+00:00

Hi FedericoZarelli-6386,
Thank you for sharing information!
The Standard_E2ds_v4 VM size, with 2 vCPUs and 16 GiB of memory, is suitable for lightweight workloads but may not be ideal for system node pools in Azure Kubernetes Service, especially in clusters with higher resource demands.

System node pools host critical system pods such as CoreDNS and metrics-server, alongside per-node components such as kube-proxy and monitoring agents. For improved stability and better handling of peak loads, it is recommended to use a VM size with at least 4 vCPUs, such as Standard_E4ds_v4.

To ensure Prometheus retains metrics across container restarts, persistent storage must be configured. Without persistent storage, Prometheus stores data in ephemeral storage, which is lost when the container restarts.
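
As a sketch of what persistent storage could look like, the snippet below builds a PersistentVolumeClaim manifest and prints it as JSON, which "kubectl apply -f" accepts as well as YAML. The claim name, namespace, size, and the "managed-csi" storage class are assumptions chosen for illustration; adjust them to your cluster. If Prometheus is deployed through a Helm chart or an operator, persistence is normally enabled through that tool's own settings rather than a hand-written claim.

```python
import json


def prometheus_pvc(name: str = "prometheus-data",
                   namespace: str = "monitoring",
                   size: str = "20Gi",
                   storage_class: str = "managed-csi") -> dict:
    """Build a PersistentVolumeClaim manifest (all defaults are hypothetical)."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # One node mounts the volume read-write, which fits a single Prometheus pod.
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


manifest = prometheus_pvc()
print(json.dumps(manifest, indent=2))
```

The claim only helps once the Prometheus pod mounts it at the directory given by its --storage.tsdb.path flag, so that the time series database is written to the volume rather than to the container's ephemeral filesystem.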

If the information is helpful, please consider clicking Upvote on the post.
Anonymous

2024-12-05T11:59:27.8866667+00:00

Hi FedericoZarelli-6386,
Just checking in to see whether you have had a chance to look at the comment posted to resolve the issue.

If the information is helpful, please consider clicking Upvote on the post.

Thank you.
