Jobs Getting Suspended in Azure Container Apps (KEDA Queue-based Trigger)

Arthur Kerst 5 Reputation points
2025-03-04T12:51:51.4533333+00:00

Some of our Azure Container App Jobs are being suspended unexpectedly. The logs show "Suspending Scale Job: jobname". We suspect this is related to the scaling of the job executions. The jobs are triggered via KEDA based on messages in a Service Bus queue. However, some jobs are suspended/stopped even though the queue still has messages, and no new job executions are started.

Additionally, the following messages appear intermittently:

0/3 nodes are available: 1 node(s) had untolerated taint {virtual-kubelet.io/provider: legion}, 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.

We need help understanding why jobs are suspended and why these node-related log messages appear.

Relevant Configuration Details:

The job and the workload profile it runs on are specified as:

resource containerAppJob 'Microsoft.App/jobs@2024-08-02-preview' = {
  name: jobname
  location: location
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    environmentId: managedEnvironment.id
    workloadProfileName: 'pt-D8-8-32'
    template: {
      containers: [
        {
          name: 'containername'
          image: 'name.azurecr.io/image:latest'
          imageType: 'ContainerImage'
          command: [
            'python'
          ]
          args: [
            '-m'
            'src.core_processing.process'
          ]
          resources: {
            cpu: json('8')
            memory: '32Gi'
          }
        }
      ]
    }
    configuration: {
      registries: [
        {
          server: 'name.azurecr.io'
          identity: 'system'
        }
      ]
      triggerType: 'Event'
      replicaTimeout: 3600
      replicaRetryLimit: 0
      eventTriggerConfig: {
        replicaCompletionCount: 1
        parallelism: 1
        scale: {
          minExecutions: 0
          maxExecutions: 2
          pollingInterval: 5
          rules: [
            {
              name: 'message-start-job'
              type: 'azure-servicebus'
              metadata: {
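                // messageCount is the target queue length per execution (the KEDA scaling target);
                // activationMessageCount is the queue depth above which scaling from zero is activated.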
                activationMessageCount: '0'
                messageCount: '1'
                namespace: serviceBusNamespaceName
                queueName: serviceBusQueueProcess
              } 
              auth: []
              identity: 'system'
            }
          ]
        }
      }
    } 
  }
}

The workload profile entry on the managed environment (in its workloadProfiles array) is:

  {
    workloadProfileType: 'D8'
    name: 'pt-D8-8-32'
    enableFips: false
    minimumCount: 0
    maximumCount: 4
  }

What I Need Help With:

  • Why are jobs getting suspended despite messages in the queue?
  • Are the node-related errors causing job suspensions?
  • Is there any configuration adjustment needed to prevent suspensions and improve scaling?

2 answers

  1. Khadeer Ali 5,990 Reputation points Microsoft External Staff Moderator
    2025-03-04T15:25:46.98+00:00

    @Arthur Kerst ,

    Thanks for reaching out. Jobs in Azure Container Apps can get suspended for various reasons, especially when using KEDA for scaling based on Service Bus Queue messages. Here are some possible reasons and related log messages:

    1. Insufficient Resources: If there aren't enough resources (CPU, memory) to schedule a new execution, the job pod cannot be placed and the execution may not start. The 0/3 nodes are available message means none of the environment's nodes can currently take the pod: one node carries a taint the pod doesn't tolerate, and the other two don't match the pod's node affinity/selector.
    2. Scaling Configuration: If your scaling settings aren't optimal, KEDA may scale the job down to zero even though pending messages haven't been processed yet.
    3. Job Execution Limits: With maxExecutions: 2 and one message per execution, new executions only start while fewer than two are running; messages beyond that simply wait in the queue.
    4. Node Pool Configuration: In Container Apps this corresponds to the workload profile's minimumCount/maximumCount. If the profile cannot provision a node quickly enough, executions may fail to schedule; see the sketch after this list.
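
    For the node pool point in item 4, one illustrative adjustment is raising the workload profile's minimum node count so a D8 node stays warm and new executions don't wait on provisioning. This is a minimal sketch only, assuming your environment resource is declared roughly like this; the symbolic name, parameters, and API version are assumptions, and minimumCount: 1 means paying for an always-on node:

    resource managedEnvironment 'Microsoft.App/managedEnvironments@2024-08-02-preview' = {
      name: environmentName
      location: location
      properties: {
        workloadProfiles: [
          {
            workloadProfileType: 'D8'
            name: 'pt-D8-8-32'
            enableFips: false
            minimumCount: 1 // keep one node warm so executions can schedule immediately (assumes the extra cost is acceptable)
            maximumCount: 4
          }
        ]
      }
    }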

    To avoid suspensions and improve scaling, consider these tweaks (an illustrative Bicep sketch follows the list):

    • Increase the minimum count (of the workload profile, or minExecutions on the job) so capacity is always ready to process messages.
    • Check the resource requests and limits for your jobs to ensure they're suitable for the workload.
    • Monitor the health and capacity of the environment's nodes to ensure they can handle the load.
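
    A minimal sketch of the job's scale block with those tweaks applied, slotting into the eventTriggerConfig you already have; the values are illustrative starting points rather than a confirmed fix:

        scale: {
          minExecutions: 0
          maxExecutions: 4 // allow more concurrent executions while the workload profile has spare nodes
          pollingInterval: 30 // poll the queue less aggressively than every 5 seconds
          rules: [
            {
              name: 'message-start-job'
              type: 'azure-servicebus'
              metadata: {
                activationMessageCount: '0' // any message activates scaling from zero
                messageCount: '1' // target one execution per queued message
                namespace: serviceBusNamespaceName
                queueName: serviceBusQueueProcess
              }
              auth: []
              identity: 'system'
            }
          ]
        }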

  2. Arslan Amanat 0 Reputation points
    2025-04-14T19:54:26.2166667+00:00

    Hey, I am having the same issue using GitHub Actions with the Azure CLI. I have observed that when I create the Azure Container App Job via the portal (GUI), it creates the connection successfully and listens to the Azure Service Bus queue: whenever there is a message in the queue, it starts a job execution. But when I update the container with a new image using GitHub Actions, the queueLength=1 setting somehow stops working, and the job execution only starts once there are 5 messages in the queue.
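
    A note on the "5 messages" behaviour: 5 is the default messageCount of KEDA's azure-servicebus scaler, so this looks as if the scale rule metadata is being dropped or reset by the update path (an assumption, not verified here). One way to rule that out is to redeploy the full job definition with the count pinned explicitly instead of only patching the image; a minimal sketch of the rule, reusing the names from the question's Bicep:

            {
              name: 'message-start-job'
              type: 'azure-servicebus'
              metadata: {
                activationMessageCount: '0' // any message activates scaling from zero
                messageCount: '1' // set explicitly so the KEDA default of 5 never applies
                namespace: serviceBusNamespaceName
                queueName: serviceBusQueueProcess
              }
              auth: []
              identity: 'system'
            }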

