Azure Batch Docker container doesn't have access to GPU when running task

Jørgen Bøndergaard Iversen 0 Reputation points
2025-11-12T08:25:43.6266667+00:00

I have set up a pool in my Azure Batch account with the following settings:

Operating system/image: microsoft-dsvm ubuntu-hpc 2204

VM Size: Standard_NV12s_v3

Extensions: NvidiaGpuDriverLinux, version 1.4, publisher Microsoft.HpcCompute

The pool is set to scale down to zero nodes when idle and to scale up to one node when a task is pending or running.
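
For context, a 0-to-1 autoscale setup can be expressed roughly like this via the Azure CLI. This is a sketch only, adapted from the documented pending-tasks formula pattern; the pool id is a placeholder and it assumes az batch account login has already been run:

# Sketch only: enable 0-to-1 autoscaling based on pending tasks.
# "gpu-pool" is a placeholder pool id; assumes `az batch account login` has been run.
az batch pool autoscale enable \
  --pool-id gpu-pool \
  --auto-scale-evaluation-interval PT5M \
  --auto-scale-formula 'maxNodes = 1;
pendingPercent = $PendingTasks.GetSamplePercent(180 * TimeInterval_Second);
pending = pendingPercent < 70 ? maxNodes : max($PendingTasks.GetSample(180 * TimeInterval_Second));
$TargetDedicatedNodes = min(maxNodes, pending);
$NodeDeallocationOption = taskcompletion;'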

I have a Docker image based on nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04 that runs some PyTorch code on the GPU (CUDA).

When a task that runs this Docker image starts, it fails because it can't find the GPU. However, if I spin up a node, ssh into it, and run the Docker image from the shell, it succeeds.
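
For illustration, the kind of manual check that succeeds over ssh looks roughly like this (a sketch, not the exact commands; --gpus all is passed explicitly here):

# Host driver check:
nvidia-smi

# GPU visibility from a container based on the same CUDA base image:
sudo docker run --rm --gpus all nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04 nvidia-smi

# PyTorch's view of the GPU, assuming python3 and torch are available in the image:
sudo docker run --rm --gpus all \
  crvusdxprodne001.azurecr.io/pipeline-peg-rna-azure-container-instance:latest \
  python3 -c "import torch; print(torch.cuda.is_available())"
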
The task settings are:

Elevation level: Task autouser, Admin

Container: crvusdxprodne001.azurecr.io/pipeline-peg-rna-azure-container-instance:latest (based on the nvidia image)
Container run options: --rm -v /mnt/batch/tasks/fsmounts/io:/mnt/io
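
For completeness, my understanding is that these settings map onto a Batch task definition roughly like the following. This is a sketch: the job id and command line are placeholders, and it assumes az batch account login has been run.

# Sketch of how the task settings above map onto the Batch task JSON
# ("pipeline-job" and the commandLine are placeholders).
cat > task.json <<'EOF'
{
  "id": "pipeline-task",
  "commandLine": "python3 /app/run.py",
  "containerSettings": {
    "imageName": "crvusdxprodne001.azurecr.io/pipeline-peg-rna-azure-container-instance:latest",
    "containerRunOptions": "--rm -v /mnt/batch/tasks/fsmounts/io:/mnt/io"
  },
  "userIdentity": {
    "autoUser": { "scope": "task", "elevationLevel": "admin" }
  }
}
EOF
az batch task create --job-id pipeline-job --json-file task.json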

I suspect the NVIDIA drivers are not ready yet when the task starts executing. However, I have tried the following start task on the node, so I would expect the drivers to be ready on the host by the time the task runs:

MAX_WAIT=300   # seconds
INTERVAL=5
ELAPSED=0
RESTART_DOCKER=0

until nvidia-smi &>/dev/null; do
  if [ $ELAPSED -ge $MAX_WAIT ]; then
    echo "Timed out waiting for NVIDIA driver after ${MAX_WAIT}s"
    exit 1
  fi
  echo "Waiting for NVIDIA driver... (${ELAPSED}s elapsed)"
  sleep $INTERVAL
  ELAPSED=$((ELAPSED + INTERVAL))
done

echo "NVIDIA driver ready!"

echo "=== Checking native.cgroupdriver=cgroupfs ==="
if ! sudo grep -q "native.cgroupdriver=cgroupfs" /etc/docker/daemon.json; then
  echo "-> Adding native.cgroupdriver=cgroupfs"
  tmp=$(mktemp)
  sudo jq '. + { "exec-opts": ["native.cgroupdriver=cgroupfs"] }' /etc/docker/daemon.json > "$tmp"
  sudo mv -f "$tmp" /etc/docker/daemon.json
  RESTART_DOCKER=1
else
  echo "Already set"
fi
echo
echo "Current 'exec-opts' in /etc/docker/daemon.json:"
sudo jq '.["exec-opts"]' /etc/docker/daemon.json
echo
echo "=== Checking NVIDIA container config ==="
if ! sudo grep -q "^no-cgroups = false" /etc/nvidia-container-runtime/config.toml; then
  echo "-> Disabling no-cgroups in NVIDIA config"
  sudo sed -i 's/#no-cgroups = false/no-cgroups = false/' /etc/nvidia-container-runtime/config.toml
  RESTART_DOCKER=1
else
  echo "no-cgroups already false"
fi
echo
echo "Lines around 'no-cgroups' in /etc/nvidia-container-runtime/config.toml:"
sudo grep -C2 "no-cgroups" /etc/nvidia-container-runtime/config.toml
echo
# Restart Docker if changes were made
if [ "$RESTART_DOCKER" == "1" ]; then
  echo "=== Restarting Docker ==="
  sudo systemctl restart docker
  sleep 5
fi
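
One additional check that could be appended to the end of the start task, to confirm the whole Docker + NVIDIA runtime path works on the host before any task runs (a sketch; it assumes the public CUDA image can be pulled on the node):

echo "=== Verifying GPU access from inside a container ==="
# Sketch: fail the start task if a test container cannot see the GPU.
if ! sudo docker run --rm --gpus all nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04 nvidia-smi; then
  echo "Container could not access the GPU"
  exit 1
fi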

What could explain this behaviour?


1 answer

  1. Deepanshu katara 17,955 Reputation points MVP Moderator
    2025-11-12T10:52:25.2566667+00:00

    Hello, welcome to Microsoft Q&A.

    1. When you SSH into the node, the NVIDIA driver extension has already finished installing and the nvidia-container-runtime is properly configured.
    2. In Azure Batch, when the pool scales up from zero, the start task and the container task can overlap. If the container task starts before:
      • the NVIDIA GPU driver extension finishes installing, or
      • Docker is restarted with the correct exec-opts and NVIDIA runtime settings,
      then the container will not detect the GPU (nvidia-smi fails inside the container).

    Recommendations:

    • Use waitForSuccess on the start task in your pool configuration so that no tasks begin until the start task has completed successfully (see the sketch after this list).
    • Increase the wait time: 300 seconds may be too short, as the NVIDIA extension can take 5 to 10 minutes to initialize.
    • Add the GPU flag to the container run options: --gpus all --rm -v /mnt/batch/tasks/fsmounts/io:/mnt/io
    • Verify the NVIDIA runtime: ensure /etc/docker/daemon.json includes "default-runtime": "nvidia" and "runtimes": { "nvidia": { "path": "nvidia-container-runtime", "runtimeArgs": [] } }.
    • Consider a pre-provisioned pool: instead of scaling from zero, keep one node warm to avoid driver installation delays.
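
    As a rough sketch of the waitForSuccess point, the startTask section of the pool definition would look something like this (field names follow the Batch REST schema; the start script name is a placeholder, and the rest of the pool definition, e.g. virtualMachineConfiguration and scale settings, is omitted for brevity). Merge this into your full pool JSON before creating the pool with az batch pool create --json-file:

    {
      "startTask": {
        "commandLine": "/bin/bash -c 'bash wait-for-gpu.sh'",
        "userIdentity": {
          "autoUser": { "scope": "pool", "elevationLevel": "admin" }
        },
        "waitForSuccess": true,
        "maxTaskRetryCount": 2
      }
    }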

    Best solution: Microsoft recommends using Azure Batch container pools configured for GPU with pre-installed drivers, rather than relying on driver extensions at runtime.

    Link: https://learn.microsoft.com/en-us/azure/container-instances/container-instances-gpu

    Please check this and let me know if it helps.

    Thanks,

    Deepanshu

