Intermittent Startup Delay in AKS Pod When Using Managed Identity & Specific CPU Configurations

Manakkal. Subash
2025-02-13T03:33:40.42+00:00

I am running a monolithic application in Azure Kubernetes Service (AKS) as a single replica. The container image is based on Debian, and the AKS cluster has a single node (Standard_D8s_v3: 8 vCPUs, 32 GB RAM).

The application depends heavily on an Azure SQL Database serverless instance and authenticates to it with a managed identity, federated through AKS Workload Identity. The pod also mounts a Persistent Volume (PV) backed by an Azure Disk storage class.
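With AKS Workload Identity, the mutating webhook normally injects AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_FEDERATED_TOKEN_FILE and AZURE_AUTHORITY_HOST into pods labeled azure.workload.identity/use: "true", and mounts a projected service-account token at the path in AZURE_FEDERATED_TOKEN_FILE. A quick sanity check run inside the container can rule out a missing variable or an empty token file before looking at CPU settings. This is a minimal sketch; the function name check_workload_identity is hypothetical and not part of any Azure SDK.

```python
import os
from pathlib import Path
from typing import List, Mapping

# Variables the Workload Identity webhook is expected to inject into the pod.
REQUIRED_VARS = (
    "AZURE_CLIENT_ID",
    "AZURE_TENANT_ID",
    "AZURE_FEDERATED_TOKEN_FILE",
    "AZURE_AUTHORITY_HOST",
)


def check_workload_identity(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return a list of problems; an empty list means the basics look fine."""
    problems = [f"missing env var {name}" for name in REQUIRED_VARS if not env.get(name)]
    token_file = env.get("AZURE_FEDERATED_TOKEN_FILE")
    if token_file:
        path = Path(token_file)
        if not path.is_file():
            problems.append(f"token file not found: {token_file}")
        elif not path.read_text().strip():
            problems.append(f"token file is empty: {token_file}")
    return problems


if __name__ == "__main__":
    issues = check_workload_identity()
    print("OK" if not issues else "\n".join(issues))
```

An empty result only shows that the webhook wiring is present; it says nothing about how long the token exchange takes.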

Issue: Startup Delay & Restart Behavior

Pod resource configuration:

CPU Request: 2 | CPU Limit: 4

Memory Request: 8GB | Memory Limit: 10GB

With this configuration, application startup stalls and the pod is restarted after 30 minutes when its startup probe fails.
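For a startup probe, the kubelet allows roughly initialDelaySeconds + failureThreshold × periodSeconds before it restarts the container, so a 30-minute window comes from the probe settings rather than from AKS itself. The probe values below are hypothetical (the question does not show them); they are simply one combination that yields 30 minutes.

```python
from dataclasses import dataclass


@dataclass
class StartupProbe:
    # Field names mirror the Kubernetes probe fields; defaults match Kubernetes defaults.
    initial_delay_seconds: int = 0
    period_seconds: int = 10
    failure_threshold: int = 3


def max_startup_window_seconds(probe: StartupProbe) -> int:
    """Approximate time the kubelet waits for startup before restarting the container."""
    return probe.initial_delay_seconds + probe.failure_threshold * probe.period_seconds


# Hypothetical settings that would produce the ~30-minute window described above.
probe = StartupProbe(period_seconds=10, failure_threshold=180)
print(max_startup_window_seconds(probe) / 60, "minutes")
```

Lengthening this window would only hide the stall; it is useful mainly to confirm that the restart is the probe's doing.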

Observed behavior with different CPU configurations:

App starts successfully in ~6-7 minutes when:

CPU Request: 2 | CPU Limit: 2

CPU Request: 1 | CPU Limit: 2

CPU Request: 4 or 5 | CPU Limit: not set

App experiences startup delay & restarts when:

CPU Request: 3 | CPU Limit: 4

CPU Request: 4 | CPU Limit: 4, 5, or 6
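To compare these configurations at the level the Linux scheduler actually sees, it can help to translate each request/limit pair into cgroup values. The sketch below approximates the kubelet's conversion: the request becomes cpu.shares (request in millicores × 1024 / 1000, minimum 2), and the limit becomes a CFS quota per 100 ms period, shown in cgroup v2 cpu.max format. It assumes the default 100 ms CFS period and omits the cgroup v2 cpu.weight conversion; it does not by itself explain the pattern above.

```python
from fractions import Fraction
from typing import Dict, Optional, Union

CFS_PERIOD_US = 100_000  # default CFS period used by the kubelet (100 ms)
MIN_SHARES = 2           # smallest cpu.shares value the kernel accepts


def parse_cpu(quantity: str) -> Fraction:
    """Convert a Kubernetes CPU quantity ("2", "0.5", "500m") to cores."""
    if quantity.endswith("m"):
        return Fraction(int(quantity[:-1]), 1000)
    return Fraction(quantity)


def cgroup_cpu_settings(request: str, limit: Optional[str]) -> Dict[str, Union[int, str]]:
    """Approximate the cgroup CPU values derived from a container's request and limit."""
    millicores = int(parse_cpu(request) * 1000)
    shares = max(MIN_SHARES, millicores * 1024 // 1000)  # cgroup v1 cpu.shares
    if limit is None:
        cpu_max = f"max {CFS_PERIOD_US}"  # no CFS quota, so no throttling
    else:
        quota = int(parse_cpu(limit) * CFS_PERIOD_US)
        cpu_max = f"{quota} {CFS_PERIOD_US}"  # cgroup v2 cpu.max: "<quota> <period>"
    return {"cpu.shares": shares, "cpu.max": cpu_max}


# Request/limit pairs from the question.
observed = [("2", "2"), ("1", "2"), ("4", None), ("5", None),
            ("3", "4"), ("4", "4"), ("4", "5"), ("4", "6")]
for req, lim in observed:
    print(f"request={req:>2} limit={str(lim):>4} -> {cgroup_cpu_settings(req, lim)}")
```

The configurations without a limit are the only ones with no CFS quota at all, which is why measuring throttling directly is a reasonable next step.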

No other containers are running on this pod or node.

Thread Dump Observations:

When the startup delay occurs, I see blocked or waiting threads related to Managed Identity authentication.

When the app starts fine, no such waiting or blocked threads are observed.

Questions:

  1. Could this inconsistent startup behavior be related to CPU allocation, throttling, or scheduling in AKS?
  2. Is there any known impact of CPU request/limit values on Managed Identity token retrieval in AKS?
  3. Any debugging recommendations (e.g., AKS logs, Managed Identity diagnostics) to further investigate why authentication threads are blocked in certain CPU configurations?
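For question 3, one way to separate the Microsoft Entra ID token exchange from the application's own thread pools is to time that exchange directly from inside the pod. The sketch below sends the client-credentials request with the projected service-account token as a federated client assertion, requesting the Azure SQL scope. It assumes the Workload Identity environment variables are present and that the pod has outbound access to the authority host; it bypasses the Azure SDK the application uses, so it tests the network and identity path, not SDK internals. The function names are hypothetical.

```python
import json
import time
import urllib.parse
import urllib.request
from pathlib import Path
from typing import Mapping, Tuple

SQL_SCOPE = "https://database.windows.net/.default"  # scope for Azure SQL access tokens
JWT_BEARER = "urn:ietf:params:oauth:client-assertion-type:jwt-bearer"


def build_token_request(env: Mapping[str, str], scope: str = SQL_SCOPE) -> urllib.request.Request:
    """Build the request that exchanges the projected token for an Entra access token."""
    authority = env["AZURE_AUTHORITY_HOST"].rstrip("/")
    url = f"{authority}/{env['AZURE_TENANT_ID']}/oauth2/v2.0/token"
    assertion = Path(env["AZURE_FEDERATED_TOKEN_FILE"]).read_text().strip()
    body = urllib.parse.urlencode({
        "client_id": env["AZURE_CLIENT_ID"],
        "scope": scope,
        "grant_type": "client_credentials",
        "client_assertion_type": JWT_BEARER,
        "client_assertion": assertion,
    }).encode()
    # urllib sets Content-Type: application/x-www-form-urlencoded for a bytes body.
    return urllib.request.Request(url, data=body, method="POST")


def timed_token_fetch(env: Mapping[str, str], timeout: float = 30.0) -> Tuple[float, int]:
    """Send the request; return (elapsed seconds, HTTP status). Requires network access."""
    request = build_token_request(env)
    start = time.monotonic()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        json.load(response)  # read the full body; the token itself is never printed
        return time.monotonic() - start, response.status
```

Calling timed_token_fetch(os.environ) in each CPU configuration shows whether the exchange itself is slow. A fast result here while the application still stalls would point back at the application's threads rather than at Entra ID or the network.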

Would appreciate any insights! Thanks in advance.

Azure SQL Database
Azure Kubernetes Service
Windows for business | Windows Client for IT Pros | Directory services | Active Directory
Microsoft Security | Microsoft Entra | Microsoft Entra ID

1 answer

  1. Mounika Reddy Anumandla, Microsoft External Staff Moderator
    2025-02-14T10:22:13.8433333+00:00

    Hi Manakkal. Subash,

    Thank you for the additional information.

    Because other critical workloads run on the cluster, removing CPU limits entirely could affect them: Kubernetes does not cap CPU usage for a container without a limit, so it can use any CPU on the node that is not already in use. (Under contention, each container's CPU request still determines its relative share.)
    Instead of removing CPU limits, try setting a slightly higher limit. This reduces CPU throttling and allows occasional bursts while still preventing excessive resource consumption.
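    Why the limit matters even when average usage is low can be seen with an idealized model of CFS bandwidth control: a limit of L cores grants L × 100 ms of CPU time per 100 ms period, and if N threads are busy in parallel they drain that quota in (L × 100 ms) / N, after which the whole container waits for the next period. The sketch assumes every thread is CPU-bound, the node has at least N idle cores, and the default 100 ms period; real scheduling is less uniform, so treat the numbers as illustrative.

```python
from typing import Tuple


def cfs_period_breakdown(limit_cores: float, busy_threads: int,
                         period_ms: float = 100.0) -> Tuple[float, float]:
    """Idealized model: (ms running, ms throttled) per CFS period when
    busy_threads CPU-bound threads run in parallel under a CPU limit."""
    quota_ms = limit_cores * period_ms    # CPU time the container may use per period
    demand_ms = busy_threads * period_ms  # CPU time the threads would use if unthrottled
    if demand_ms <= quota_ms:
        return period_ms, 0.0
    run_ms = quota_ms / busy_threads      # all threads drain the shared quota together
    return run_ms, period_ms - run_ms


# Hypothetical: 8 busy threads (one per vCPU on a D8s_v3 node) under different limits.
for limit in (4, 5, 6):
    running, throttled = cfs_period_breakdown(limit, busy_threads=8)
    print(f"limit={limit}: run {running:.1f} ms, throttled {throttled:.1f} ms per period")
```

    Under this model, raising the limit shortens the throttled slice in each period, and it disappears only once the limit covers all concurrently busy threads.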

    To check whether your pod is being throttled:

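    Open a shell in the container (for example, kubectl exec -it <pod-name> -- sh) and run cat /sys/fs/cgroup/cpu/cpu.stat on cgroup v1 nodes, or cat /sys/fs/cgroup/cpu.stat on cgroup v2 nodes (the default on newer AKS node images). Compare nr_throttled with nr_periods; throttled_time (cgroup v1, nanoseconds) or throttled_usec (cgroup v2, microseconds) gives the total time the container spent throttled.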
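    The counters in cpu.stat are cumulative since the container's cgroup was created, so it is most informative to read them a few times during startup and compare. The sketch below parses either cgroup version's format and reports the fraction of periods that were throttled plus the total throttled time in seconds. The file locations are the standard in-container paths; the helper names are hypothetical.

```python
from pathlib import Path
from typing import Dict, Optional

# cgroup v2 exposes a unified cpu.stat; cgroup v1 keeps it under the cpu controller.
CANDIDATE_PATHS = ("/sys/fs/cgroup/cpu.stat", "/sys/fs/cgroup/cpu/cpu.stat")


def parse_cpu_stat(text: str) -> Dict[str, int]:
    """Parse 'key value' lines from cpu.stat into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttling_summary(stats: Dict[str, int]) -> Dict[str, float]:
    periods = stats.get("nr_periods", 0)
    throttled = stats.get("nr_throttled", 0)
    if "throttled_usec" in stats:  # cgroup v2 reports microseconds
        throttled_s = stats["throttled_usec"] / 1e6
    else:                          # cgroup v1 reports nanoseconds
        throttled_s = stats.get("throttled_time", 0) / 1e9
    ratio = throttled / periods if periods else 0.0
    return {"throttled_ratio": ratio, "throttled_seconds": throttled_s}


def read_cpu_stat() -> Optional[Dict[str, int]]:
    """Return parsed cpu.stat from the first path that exists, or None."""
    for candidate in CANDIDATE_PATHS:
        path = Path(candidate)
        if path.exists():
            return parse_cpu_stat(path.read_text())
    return None


if __name__ == "__main__":
    current = read_cpu_stat()
    print(throttling_summary(current) if current else "cpu.stat not found")
```

    A throttled ratio that climbs steadily during the stalled startups, but stays near zero in the configurations that start normally, would support the throttling explanation.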

    Please feel free to tag me in the comments for further assistance.
