Brian Bertrand, thank you for the detailed context. Based on our review, this behavior is expected for Azure Batch pools that reference Marketplace images with the latest version tag, especially in long‑running production systems.
Your Batch pools are using microsoft-dsvm:ubuntu-hpc:2404:latest.
This image was updated around Jan 29, which introduced changes to the base OS footprint and preinstalled packages.
As a result:
- Some nodes entered the Unusable state due to OS disk exhaustion (resolved by increasing the OS disk to 128 GB).
- Newer nodes no longer consistently include runtime dependencies (e.g., Python, GDAL), causing task failures with exit codes 127 / 255.
This is by design: Azure Marketplace images are serviced and updated automatically, and dependency immutability is not guaranteed when using latest.
Your workload relied on implicit availability of system libraries from the Marketplace image. When the image was updated, those assumptions no longer held, leading to non‑deterministic node behavior during scale‑out.
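One way to surface this kind of drift early is to check for the expected tools when a node starts, instead of discovering the gap when a task fails (exit code 127 is the shell's "command not found"). The following is only a minimal sketch: the tool list, the function names, and the idea of running it from the pool's start task are assumptions for illustration, not something taken from your current setup.

```python
import shutil
import sys

# Hypothetical list of executables the workload expects to find on PATH.
REQUIRED_TOOLS = ["python3", "gdalinfo"]


def missing_tools(tools, which=shutil.which):
    """Return the subset of `tools` that cannot be resolved on PATH."""
    return [tool for tool in tools if which(tool) is None]


def main():
    """Report missing tools and return a process exit code (0 = all present)."""
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing required tools: " + ", ".join(missing), file=sys.stderr)
        return 1
    print("All required tools found.")
    return 0
```

If a script like this ends with `sys.exit(main())` and runs as a start task that the pool waits on, a node missing Python or GDAL would fail at startup with a clear message rather than accepting tasks and failing them later.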
Recommended way to permanently stop this:
For production Batch workloads, Microsoft recommends one of the following supported patterns:
- Use a custom image via Azure Compute Gallery (Recommended)
  - Create a VM from a known‑good Ubuntu‑HPC image.
  - Install and validate all required dependencies (Python, GDAL, etc.).
  - Capture it into Azure Compute Gallery and point the Batch pool to a specific image version.
  - This guarantees runtime stability and prevents breaking changes from Marketplace updates.
- Containerize the workload
  - Run Batch tasks inside Docker/Singularity containers. This fully decouples your application runtime from the host OS and avoids image drift issues.
- Avoid relying on latest Marketplace images
  - Pinning a Marketplace image version can be used temporarily, but it is not recommended long‑term, as older versions may be retired without notice.
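To make the difference between a floating and a pinned image reference concrete, here is a small sketch. It checks whether a Marketplace URN still uses latest, and builds the resource ID of a specific Azure Compute Gallery image version in the form a Batch pool's image reference expects. The subscription ID, resource group, gallery, image name, and version shown in the usage lines are hypothetical placeholders, and the helper names are made up for this example.

```python
def is_pinned(urn):
    """Return True if a 'publisher:offer:sku:version' URN names an explicit version."""
    parts = urn.split(":")
    if len(parts) != 4:
        raise ValueError("expected publisher:offer:sku:version, got: " + urn)
    return parts[3].lower() != "latest"


def gallery_image_version_id(subscription_id, resource_group, gallery, image, version):
    """Build the ARM resource ID of one Azure Compute Gallery image version."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Compute/galleries/{gallery}"
        f"/images/{image}/versions/{version}"
    )


def custom_image_reference(image_version_id):
    """Image reference fragment for a pool that uses a custom image.

    The key name follows the Batch service REST API; verify it against the
    API version you deploy with.
    """
    return {"virtualMachineImageId": image_version_id}


# The pool in this thread floats with Marketplace updates:
print(is_pinned("microsoft-dsvm:ubuntu-hpc:2404:latest"))

# Hypothetical placeholders; substitute your own values.
image_id = gallery_image_version_id(
    "00000000-0000-0000-0000-000000000000",
    "my-resource-group",
    "myGallery",
    "ubuntu-hpc-gdal",
    "1.0.0",
)
print(custom_image_reference(image_id))
```

Because the gallery resource ID ends in an explicit version, new nodes added during scale‑out are built from the same image until you publish a new version and update the pool yourself.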
Use Azure Batch to run container workloads - https://docs.azure.cn/en-us/batch/batch-docker-container-workloads
Use the Azure Compute Gallery to create a custom image pool - https://learn.microsoft.com/en-us/azure/batch/batch-sig-images