
Azure Batch Nodes Starting With Programs/Packages Missing

Brian Bertrand 21 Reputation points
2026-03-05T14:02:54.4866667+00:00

Hi there,

I've recently run into a problem where our Batch system (which has been running for 5+ years) is randomly spawning nodes that our tasks cannot run on. I receive errors such as exit codes 255 and 127. Sometimes Python is missing, sometimes it's a GDAL library, etc.

Prior to this issue, around the end of January, the nodes began spawning in an unusable state because they ran out of disk space. Increasing the OS disk to 128 GB seemed to fix this. Looking at the node image offer, it was indeed updated on Jan 29th, so clearly something changed.

I've tried figuring out how to set my pool to use the old version, but it seems rather convoluted.

I have no way to reach out to Microsoft about this issue, as their new support system seems to be AI-only for Batch issues. Unfortunately it isn't very helpful.

Has anyone else run into this issue? Any suggestions to stop this from happening?

Some pool info (latest version):
Publisher microsoft-dsvm
Offer Ubuntu-hpc
Sku 2404
Version 22.04.2026021901

Azure Batch

An Azure service that provides cloud-scale job scheduling and compute management.


2 answers

  1. Brendan-8792 0 Reputation points
    2026-03-12T18:06:35.8766667+00:00

    I am also experiencing this same issue. We are using the same Ubuntu HPC base image and run containerized workloads using custom Docker images that contain all the dependencies we need to run (interestingly we also use Python and GDAL). We serve the images from Azure Container Registry. They are updated fairly regularly via CI/CD and used in long-lived pools where the image configuration is set in tasks (which is used preferentially to the pool image configuration).


  2. Himanshu Shekhar 5,225 Reputation points Microsoft External Staff Moderator
    2026-03-05T14:24:05.79+00:00

    Brian Bertrand, thank you for the detailed context. Based on our review, this behavior is expected with Azure Batch pools that reference Marketplace images using the latest version, especially for long‑running production systems.

    Your Batch pools are using microsoft-dsvm:ubuntu-hpc:2404:latest.

    This image was updated around Jan 29, which introduced changes to the base OS footprint and preinstalled packages.

    As a result:

    Some nodes entered Unusable state due to OS disk exhaustion (resolved by increasing OS disk to 128 GB).

    Newer nodes no longer consistently include runtime dependencies (e.g., Python, GDAL), causing task failures with exit codes 127 / 255.

    This is by design: Azure Marketplace images are serviced and updated automatically, and dependency immutability is not guaranteed when using latest.

    Your workload relied on implicit availability of system libraries from the Marketplace image. When the image was updated, those assumptions no longer held, leading to non‑deterministic node behavior during scale‑out.
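A node that is missing a dependency can be detected before any task lands on it. The following is a minimal sketch, not an official Batch feature: a preflight check that could be called from a pool start task so that an incomplete node fails early instead of failing tasks later. Exit code 127 is the shell convention for "command not found", which is what this check looks for. The command list (`python3`, `gdalinfo`) and the function names are assumptions for illustration; substitute whatever your tasks actually invoke.

```python
import shutil
import sys

# Hypothetical list of executables the tasks depend on; adjust to the real workload.
REQUIRED_COMMANDS = ["python3", "gdalinfo"]


def missing_commands(commands):
    """Return the commands from `commands` that cannot be found on PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


def preflight(commands=REQUIRED_COMMANDS):
    """Return 0 if every command is present, 1 otherwise (suitable as an exit code)."""
    missing = missing_commands(commands)
    if missing:
        # Report on stderr so the message ends up in the start task's error output.
        print("missing on this node: " + ", ".join(missing), file=sys.stderr)
        return 1
    return 0

# In a start task script, finish with: sys.exit(preflight())
```

If the start task exits non-zero, the node does not accept tasks (depending on how the pool's start task is configured to wait for success), which turns a random task failure into a visible node-level failure.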

    Recommended ways to permanently stop this: for production Batch workloads, Microsoft recommends one of the following supported patterns:

    1. Use a custom image via Azure Compute Gallery (recommended).
        1. Create a VM from a known‑good Ubuntu‑HPC image.
        2. Install and validate all required dependencies (Python, GDAL, etc.).
        3. Capture it into Azure Compute Gallery and point the Batch pool to a specific image version.
       This guarantees runtime stability and prevents breaking changes from Marketplace updates.
    2. Containerize the workload. Run Batch tasks inside Docker/Singularity containers. This fully decouples your application runtime from the host OS and avoids image drift issues.
    3. Avoid relying on latest Marketplace images. Pinning a Marketplace image version can be used temporarily, but it is not recommended long‑term, as older versions may be retired without notice.
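To make the difference between the image options concrete, here is a rough sketch of the `imageReference` fragment a pool definition would carry in each case. It only builds plain dictionaries and does not call Azure. The field names (`publisher`, `offer`, `sku`, `version`, `virtualMachineImageId`) follow the Batch REST API as I understand it, and the helper function names, subscription, resource group, gallery, and image names are all hypothetical placeholders; please verify against the current API reference before using them.

```python
import json


def marketplace_image_reference(publisher, offer, sku, version="latest"):
    """Marketplace image; pass an explicit version string to pin it temporarily."""
    return {"publisher": publisher, "offer": offer, "sku": sku, "version": version}


def gallery_image_reference(subscription_id, resource_group, gallery,
                            image_definition, image_version):
    """Custom image version stored in Azure Compute Gallery (all names are placeholders)."""
    image_id = (
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Compute/galleries/{gallery}"
        f"/images/{image_definition}/versions/{image_version}"
    )
    return {"virtualMachineImageId": image_id}


# Floating reference: new nodes pick up whatever the publisher ships next.
floating = marketplace_image_reference("microsoft-dsvm", "ubuntu-hpc", "2404")

# Temporary pin, using the version string quoted in the question.
pinned = marketplace_image_reference(
    "microsoft-dsvm", "ubuntu-hpc", "2404", version="22.04.2026021901"
)

# Custom image: the pool always gets the validated image version.
custom = gallery_image_reference(
    "<subscription-id>", "my-rg", "myGallery", "ubuntu-hpc-gdal", "1.0.0"
)

print(json.dumps({"imageReference": custom}, indent=2))
```

Only the gallery form keeps the dependency set under your control; the pinned Marketplace form stops drift only for as long as that version remains available.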

    Use Azure Batch to run container workloads - https://docs.azure.cn/en-us/batch/batch-docker-container-workloads

    Use the Azure Compute Gallery to create a custom image pool - https://learn.microsoft.com/en-us/azure/batch/batch-sig-images

