
Evicted By Legion

Bharathkanth R 20 Reputation points
2026-03-11T07:24:12.1466667+00:00

I am running Azure Container App Jobs that are triggered by an Azure Blob Queue. Recently, some job executions have started failing in the middle of execution with the following system logs:

Error Logs:

  • Pod - prod-xxx-xxx-small-job-96l9n-lp9wb has a failed container with name: small-etl-job

Exit Code: 1

Reason: Evicted By Legion

Message: Job has reached the specified backoff limit

These failures are occurring unexpectedly during runtime, even though the jobs had been running successfully earlier.

I checked the resource metrics, and everything appears normal—there are no abnormal spikes in CPU or memory usage. Each job execution runs under the Consumption workload profile with the following configuration:

4 vCPU

8 GB RAM

Since the resource utilization looks normal, it is unclear why the pods are being evicted by Legion, which is causing the job executions to fail after reaching the backoff limit.

Azure Container Apps

An Azure service that provides a general-purpose, serverless container platform.


Answer accepted by question author
  1. Siddhesh Desai 5,130 Reputation points Microsoft External Staff Moderator
    2026-03-11T08:06:43.3333333+00:00

    Hi @Bharathkanth R

    Thank you for reaching out to Microsoft Q&A.

    We have received the below response from the backend engineering team:

    Our investigation confirmed that the job failures were caused by a Legion infrastructure upgrade (base VM image rollout) on 2026-03-10, which evicted running pods on affected hosts. The job recovered fully after the upgrade completed, with 245 successful executions and zero eviction-related failures in the following 3 days.

    However, each eviction immediately resulted in a permanent job failure because the job is configured with replicaRetryLimit: 0 (no retries). With this setting, any single pod failure — whether caused by platform maintenance, transient network issues, or infrastructure upgrades — immediately fails the entire job execution with BackoffLimitExceeded, with no opportunity to retry on a healthy host.
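    To make the failure mode concrete, here is a toy model (not Azure's actual scheduler — the function and its parameters are illustrative assumptions) of how the retry limit governs the outcome: with a limit of 0, a single transient eviction exhausts the budget immediately, while a limit of 3 absorbs it.

    ```go
    package main

    import "fmt"

    // runJob is a toy model of Container Apps job retry semantics.
    // Each attempt may be evicted by a transient infrastructure event;
    // the job fails permanently only once every allowed attempt has failed.
    func runJob(retryLimit, transientFailures int) string {
        for attempt := 0; attempt <= retryLimit; attempt++ {
            if attempt < transientFailures {
                // This attempt's pod was evicted (e.g. a host VM upgrade).
                continue
            }
            return "Succeeded"
        }
        return "BackoffLimitExceeded"
    }

    func main() {
        // replicaRetryLimit: 0 — one eviction fails the job outright.
        fmt.Println(runJob(0, 1)) // BackoffLimitExceeded
        // replicaRetryLimit: 3 — the same eviction is absorbed by a retry.
        fmt.Println(runJob(3, 1)) // Succeeded
    }
    ```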

    Current Configuration:

    • replicaRetryLimit: 0 (confirmed from the customer's Pulumi deployment on 2026-02-05 and Portal PATCH on 2026-03-09)
    • This value was never explicitly set by the customer — it defaulted to 0 because the property was omitted in the Pulumi IaC template.

    Recommendation:

    Set replicaRetryLimit to at least 1 (recommended: 2 or 3) to allow the platform to automatically retry failed job pods on a different, healthy host. This provides resilience against:

    • Platform infrastructure upgrades (VHD rollouts, host reimages)
    • Transient host failures or nested VM unavailability
    • Network blips during pod placement

    How to Update:

    Azure CLI:

    az containerapp job update \
      --name prod-verixai-cormetrix-small-job \
      --resource-group prod-verixai-cormetrix-rg \
      --replica-retry-limit 3
    

    Pulumi IaC (Go): Add the ReplicaRetryLimit property to the job configuration:

    Configuration: &app.JobConfigurationArgs{
        TriggerType: pulumi.String("Event"),
        ReplicaTimeout: pulumi.Int(54000),
        ReplicaRetryLimit: pulumi.Int(3), // Add this line
        ...
    }
    

    ARM API:

    "configuration": {
        "replicaRetryLimit": 3,
        ...
    }
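    After applying any of the updates above, the effective value can be confirmed with a quick query (assuming the same job and resource group names used in the CLI example):

    ```shell
    # Print the configured retry limit for the job (JMESPath query).
    az containerapp job show \
      --name prod-verixai-cormetrix-small-job \
      --resource-group prod-verixai-cormetrix-rg \
      --query "properties.configuration.replicaRetryLimit"
    ```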
    

    This change does not affect the behavior of successful executions. It only allows the platform to retry pods that fail due to transient infrastructure events, significantly improving job reliability during platform maintenance windows.

    If the resolution was helpful, kindly take a moment to click Yes for "Was this answer helpful?". And if you have any further query, do let us know.

    1 person found this answer helpful.

0 additional answers
