Intermittent Startup Delay in AKS Pod When Using Managed Identity & Specific CPU Configurations

Question

Manakkal. Subash 0

I am running a monolithic application in Azure Kubernetes Service (AKS) as a single replica. The container image is based on Debian OS, and the AKS cluster consists of one node (D8s_v3, 8 CPUs, 32GB RAM).

The application is tightly coupled with an Azure SQL Serverless database and authenticates using Managed Identity (federation via Workload Identity). The pod also has a Persistent Volume (PV) using Azure Disk as the storage class.

Issue: Startup Delay & Restart Behavior

Pod resource configuration:

CPU Request: 2 | CPU Limit: 4

Memory Request: 8GB | Memory Limit: 10GB

When using this configuration, the application startup is delayed, and the pod restarts after 30 minutes (startup probe failure).

Observed behavior with different CPU configurations:

App starts successfully in ~6-7 minutes when:

CPU Request: 2 | CPU Limit: 2

CPU Request: 1 | CPU Limit: 2

CPU Request: 4 or 5 | CPU Limit: not set

App experiences startup delay & restarts when:

CPU Request: 3 | CPU Limit: 4

CPU Request: 4 | CPU Limit: 4, 5, or 6

No other containers are running on this pod or node.

Thread Dump Observations:

When the startup delay occurs, I see blocked or waiting threads related to Managed Identity authentication.

When the app starts fine, no such waiting or blocked threads are observed.

Questions:

Could this inconsistent startup behavior be related to CPU allocation, throttling, or scheduling in AKS?
Is there any known impact of CPU request/limit values on Managed Identity token retrieval in AKS?
Any debugging recommendations (e.g., AKS logs, Managed Identity diagnostics) to further investigate why authentication threads are blocked in certain CPU configurations?

Would appreciate any insights! Thanks in advance.

Mounika Reddy Anumandla 6,845 Reputation points Microsoft External Staff Moderator

2025-02-14T04:08:51.2133333+00:00
Hi Manakkal. Subash,

Welcome to Microsoft Q&A Platform. Thank you for posting your query here.

When you define CPU limits, Kubernetes enforces them with CFS (Completely Fair Scheduler) quotas. If the application tries to use more CPU time than its quota allows within a scheduling period, its threads are paused until the next period begins, so the application might experience unexpected pauses. This throttling can delay the MSI authentication threads.

Without a CPU limit, the pod can use as much CPU as is available on the node, which reduces startup delays. MSI token retrieval is time-sensitive. As I understand it, if the CPU is throttled, MSI requests may time out or get blocked, causing startup failures.
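To make the quota arithmetic concrete, here is a minimal Python sketch. It assumes the default CFS period of 100 ms and a simplified model in which every busy thread keeps one core fully occupied. The thread counts are hypothetical and were not measured from this application.

```python
# Default CFS enforcement period (100 ms), expressed in microseconds.
CFS_PERIOD_US = 100_000


def cfs_quota_us(cpu_limit: float, period_us: int = CFS_PERIOD_US) -> int:
    """CPU time (in microseconds) the container may consume per period."""
    return round(cpu_limit * period_us)


def throttled_us_per_period(cpu_limit: float, busy_threads: int,
                            period_us: int = CFS_PERIOD_US) -> float:
    """Wall-clock time per period during which the container is paused.

    Simplified model (an assumption): `busy_threads` threads each keep one
    core fully busy, so together they burn quota `busy_threads` times
    faster than wall-clock time passes.
    """
    if busy_threads <= 0:
        return 0.0
    wall_until_exhausted = cfs_quota_us(cpu_limit, period_us) / busy_threads
    return max(0.0, period_us - wall_until_exhausted)


if __name__ == "__main__":
    # Hypothetical (limit, busy thread count) combinations.
    for limit, threads in [(2, 2), (4, 4), (4, 8)]:
        paused_ms = throttled_us_per_period(limit, threads) / 1000
        print(f"limit={limit} threads={threads} paused={paused_ms} ms per 100 ms")
```

In this model, a limit of 4 with 8 busy threads uses up the 400 ms quota in 50 ms of wall-clock time, and every thread in the container, including any thread waiting to complete a token request, is then paused for the remaining 50 ms of each period.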

As you have mentioned:

App starts successfully in ~6-7 minutes when:

CPU Request: 2 | CPU Limit: 2 -> Since request = limit, the pod is in the Guaranteed QoS class and gets a consistent CPU allocation. The CFS quota still applies at the 2-CPU limit, but in this case the app starts normally. 2 vCPUs might still be tight for a heavy startup load, because MSI token retrieval and network calls also take CPU cycles.

CPU Request: 1 | CPU Limit: 2 -> The app is guaranteed at least 1 vCPU and can burst up to 2 vCPUs. If it only gets the 1 guaranteed vCPU during initial processing, some tasks could slow down.

CPU Request: 4 or 5 | CPU Limit: not set -> No CPU limit means no CFS quota enforcement, so there is no artificial throttling. With 4+ vCPUs, the app has sufficient CPU resources.

https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/#if-you-do-not-specify-a-cpu-limit

https://github.com/robusta-dev/alert-explanations/wiki/CPUThrottlingHigh-(Prometheus-Alert)#why-cpu-throttling-can-occur-despite-low-cpu-usage-permalink

Recommended: Reduce CPU requests and remove CPU limit to allow the pod to scale its CPU usage as required.

If the startup delay persists despite increasing CPU allocation, consider increasing the startup probe timeout to account for longer token retrieval times.
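As a rough guide for sizing the startup probe, the time a container gets before the kubelet restarts it is approximately initialDelaySeconds + failureThreshold x periodSeconds. The Python sketch below works through that arithmetic. The probe values in it are assumptions chosen to match the 30-minute restart described in the question; the actual probe settings are not given in this thread.

```python
import math


def startup_budget_seconds(failure_threshold: int, period_seconds: int,
                           initial_delay_seconds: int = 0) -> int:
    """Approximate time the container has to pass its startup probe."""
    return initial_delay_seconds + failure_threshold * period_seconds


def failure_threshold_for(budget_seconds: int, period_seconds: int,
                          initial_delay_seconds: int = 0) -> int:
    """Smallest failureThreshold that gives at least `budget_seconds`."""
    remaining = max(0, budget_seconds - initial_delay_seconds)
    return math.ceil(remaining / period_seconds)


if __name__ == "__main__":
    # Assumed example: 180 failures x 10 s = 1800 s (30 minutes).
    print(startup_budget_seconds(failure_threshold=180, period_seconds=10))
    # Threshold needed for a hypothetical 45-minute budget at the same period.
    print(failure_threshold_for(budget_seconds=45 * 60, period_seconds=10))
```

Raising the budget only hides the delay. It is worth doing alongside the throttling and token-timing checks, not instead of them.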

Enable debug logging for MSI token acquisition and check how long token retrieval takes. If retrieval itself is slow, the issue is likely an MSI authentication delay rather than CPU.
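If debug logs are not readily available, timing the token call directly gives the same signal. Below is a minimal Python sketch of that idea. `fake_get_token` is a hypothetical stand-in for whatever call your application uses to acquire a Managed Identity token; it is not a real SDK function, and the 2-second threshold is an arbitrary example.

```python
import time
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def timed_call(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run `fn` and return its result with the elapsed wall-clock seconds."""
    start = time.monotonic()
    result = fn()
    return result, time.monotonic() - start


def slow_calls(durations: List[float], threshold_seconds: float) -> List[float]:
    """Return the durations that exceeded the threshold."""
    return [d for d in durations if d > threshold_seconds]


def fake_get_token() -> str:
    """Hypothetical stand-in for the application's real token request."""
    time.sleep(0.05)  # simulate network latency
    return "example-token"


if __name__ == "__main__":
    durations = []
    for _ in range(3):
        _, elapsed = timed_call(fake_get_token)
        durations.append(elapsed)
    print("durations:", durations)
    print("over 2 s:", slow_calls(durations, threshold_seconds=2.0))
```

Comparing these durations between a CPU configuration that starts cleanly and one that stalls should show whether the token request itself is what slows down.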

AKS metrics can give insights into CPU utilization, node resource utilization, and whether there is any resource contention. https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/availability-performance/identify-high-cpu-consuming-containers-aks?tabs=browser

This article explains how to use Azure Monitor to quickly assess, investigate, and resolve detected issues:
https://learn.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-analyze

If you enabled Azure Monitor for containers when you created your cluster, the logs of your application are pushed to a Log Analytics workspace in the ContainerLog table. If Azure Monitor is not enabled, you can use kubectl to see what is written to stdout and stderr with the following command:

kubectl logs {pod-name} -n {namespace}

You can also check the Kubernetes events. If this is really the problem, you will see events saying that the probes failed:

kubectl get events -n {namespace}

Hope this helps!

Let me know if you have any further queries!
Manakkal. Subash 0 Reputation points

2025-02-14T04:28:26.42+00:00

Thanks for your input. There are other critical workloads running on the cluster. Wouldn’t removing the CPU limits jeopardize the other applications when there is contention?
Manakkal. Subash 0 Reputation points

2025-02-14T13:45:59.95+00:00

Yep. We have these numbers. Again, thank you for your valuable inputs. We will look into enabling MSI token logs

Answer 1

Mounika Reddy Anumandla 6,845 Microsoft External Staff Moderator

Hi Manakkal. Subash,

Thank you for replying with further information.

As there are other critical workloads running on the cluster, removing CPU limits entirely could jeopardize them during contention, because Kubernetes lets a container without a limit use any CPU that is available on the node.
Instead of removing the CPU limit, try setting it slightly higher. This reduces throttling and allows occasional bursts while still preventing excessive resource consumption.

Checking the throttling rate of your pods:

Log in to the pod and run cat /sys/fs/cgroup/cpu/cpu.stat. On nodes that use cgroup v2, the file is /sys/fs/cgroup/cpu.stat instead.

nr_periods: number of CFS enforcement periods that have elapsed
nr_throttled: number of those periods in which the container was throttled
throttled_time: total time the container spent throttled, in ns (reported as throttled_usec, in microseconds, on cgroup v2)

If nr_throttled is high relative to nr_periods, your pod is hitting its CPU limit frequently. You can also consider a larger node size (for example, D16s_v3), which leaves more room to distribute workloads. https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#troubleshooting
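The throttling check described above can be sketched in Python as follows. This is a minimal example that parses the text of cpu.stat and reports the share of periods that were throttled. The sample numbers are invented for illustration, and the sketch assumes the file uses the usual "name value" line format.

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse `name value` lines from cpu.stat into a dict of integers."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_summary(stats: dict) -> tuple:
    """Return (fraction of periods throttled, total throttled seconds)."""
    periods = stats.get("nr_periods", 0)
    throttled = stats.get("nr_throttled", 0)
    ratio = throttled / periods if periods else 0.0
    if "throttled_usec" in stats:
        # cgroup v2 reports microseconds.
        seconds = stats["throttled_usec"] / 1_000_000
    else:
        # cgroup v1 reports nanoseconds in throttled_time.
        seconds = stats.get("throttled_time", 0) / 1_000_000_000
    return ratio, seconds


# Invented sample in cgroup v1 format, for illustration only.
SAMPLE = """nr_periods 1000
nr_throttled 250
throttled_time 5000000000
"""

if __name__ == "__main__":
    ratio, seconds = throttle_summary(parse_cpu_stat(SAMPLE))
    print(f"throttled in {ratio:.0%} of periods, {seconds:.1f} s in total")
```

Taking two readings a minute apart during startup and comparing the differences is more informative than a single reading, because the counters accumulate over the lifetime of the container.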

Please feel free to tag me in the comments for further assistance.

Mounika Reddy Anumandla 6,845 Reputation points Microsoft External Staff Moderator

2025-02-14T16:17:39.5733333+00:00

Hi Manakkal. Subash,

I hope the information provided is helpful. Please consider accepting the answer.

Accepted answer will help other community members navigate to the appropriate solutions.

Thank you!
Mounika Reddy Anumandla 6,845 Reputation points Microsoft External Staff Moderator

2025-02-17T00:43:08.7766667+00:00

Hi Manakkal. Subash,

Please consider accepting it as the answer and giving a thumbs up under “Was it helpful”. This helps other community members with similar scenarios find the right solution.

If you have any further concerns, please do not hesitate to contact us. We are pleased to help you.

I look forward to your response and appreciate your time on this.
