Thank you for reaching out to Microsoft Q&A.
We have received the following response from the backend engineering team:
Our investigation confirmed that the job failures were caused by a Legion infrastructure upgrade (base VM image rollout) on 2026-03-10, which evicted running pods on affected hosts. The job recovered fully after the upgrade completed, with 245 successful executions and zero eviction-related failures in the following 3 days.
However, each eviction immediately resulted in a permanent job failure because the job is configured with replicaRetryLimit: 0 (no retries). With this setting, any single pod failure (whether caused by platform maintenance, transient network issues, or an infrastructure upgrade) immediately fails the entire job execution with BackoffLimitExceeded, with no opportunity to retry on a healthy host.
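For context, the terminal status of recent executions can be checked from the CLI. The job and resource-group names below are taken from this case; the --query path is a best-effort sketch of the execution list payload:

```shell
# List recent executions of the job with their terminal status
az containerapp job execution list \
  --name prod-verixai-cormetrix-small-job \
  --resource-group prod-verixai-cormetrix-rg \
  --query "[].{execution:name, status:properties.status}" \
  --output table
```

Executions that failed due to the eviction will show a Failed status here.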
Current Configuration:
- replicaRetryLimit: 0 (confirmed from the customer's Pulumi deployment on 2026-02-05 and the Portal PATCH on 2026-03-09). This value was never explicitly set by the customer; it defaulted to 0 because the property was omitted in the Pulumi IaC template.
Recommendation:
Set replicaRetryLimit to at least 1 (recommended: 2 or 3) to allow the platform to automatically retry failed job pods on a different, healthy host. This provides resilience against:
- Platform infrastructure upgrades (VHD rollouts, host reimages)
- Transient host failures or nested VM unavailability
- Network blips during pod placement
How to Update:
Azure CLI:
az containerapp job update \
--name prod-verixai-cormetrix-small-job \
--resource-group prod-verixai-cormetrix-rg \
--replica-retry-limit 3
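To confirm the update took effect, the current value can be read back with the same job and resource-group names:

```shell
# Read back the configured retry limit after the update
az containerapp job show \
  --name prod-verixai-cormetrix-small-job \
  --resource-group prod-verixai-cormetrix-rg \
  --query "properties.configuration.replicaRetryLimit" \
  --output tsv
```

This should print 3 once the update completes.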
Pulumi IaC (Go): Add the ReplicaRetryLimit property to the job configuration:
Configuration: &app.JobConfigurationArgs{
    TriggerType:       pulumi.String("Event"),
    ReplicaTimeout:    pulumi.Int(54000),
    ReplicaRetryLimit: pulumi.Int(3), // Add this line
    ...
}
ARM API:
"configuration": {
"replicaRetryLimit": 3,
...
}
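If you prefer to apply the change directly against the ARM API, a PATCH carrying only the configuration fragment is sufficient. Below is a sketch using az rest; the api-version shown is an assumption, so substitute your subscription ID and an API version supported in your environment:

```shell
# PATCH only the replicaRetryLimit property on the job resource
az rest --method patch \
  --url "https://management.azure.com/subscriptions/<subscription-id>/resourceGroups/prod-verixai-cormetrix-rg/providers/Microsoft.App/jobs/prod-verixai-cormetrix-small-job?api-version=2024-03-01" \
  --body '{"properties": {"configuration": {"replicaRetryLimit": 3}}}'
```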
This change does not affect the behavior of successful executions. It only allows the platform to retry pods that fail due to transient infrastructure events, significantly improving job reliability during platform maintenance windows.
If the resolution was helpful, kindly click "Yes" on "Was this answer helpful?". And if you have any further queries, do let us know.