Azure Batch Docker container doesn't have access to GPU when running task

Jørgen Bøndergaard Iversen 0 Reputation points
2025-11-12T08:25:43.6266667+00:00

I have set up a pool in my Azure Batch account with the following settings:

Operating system/image: microsoft-dsvm ubuntu-hpc 2204

VM Size: Standard_NV12s_v3

Extensions: NvidiaGpuDriverLinux, version 1.4, publisher Microsoft.HpcCompute

The pool is set to scale down to zero nodes when idle and to scale up to one node when a task is pending or running.
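
For context, a 0-to-1 autoscale setup can be expressed roughly like this via the Azure CLI. This is a sketch only, adapted from the documented pending-tasks formula pattern; the pool id is a placeholder and it assumes az batch account login has already been run:

# Sketch only: enable 0-to-1 autoscaling based on pending tasks.
# "gpu-pool" is a placeholder pool id; assumes `az batch account login` has been run.
az batch pool autoscale enable \
  --pool-id gpu-pool \
  --auto-scale-evaluation-interval PT5M \
  --auto-scale-formula 'maxNodes = 1;
pendingPercent = $PendingTasks.GetSamplePercent(180 * TimeInterval_Second);
pending = pendingPercent < 70 ? maxNodes : max($PendingTasks.GetSample(180 * TimeInterval_Second));
$TargetDedicatedNodes = min(maxNodes, pending);
$NodeDeallocationOption = taskcompletion;'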

I have a Docker image based on nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04 that runs some PyTorch code on the GPU (CUDA).

When a task that runs this Docker image starts, it fails because it can't find the GPU. However, if I spin up a node, ssh into it, and run the Docker image from the shell, it succeeds.
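
For illustration, the kind of manual check that succeeds over ssh looks roughly like this (a sketch, not the exact commands; --gpus all is passed explicitly here):

# Host driver check:
nvidia-smi

# GPU visibility from a container based on the same CUDA base image:
sudo docker run --rm --gpus all nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04 nvidia-smi

# PyTorch's view of the GPU, assuming python3 and torch are available in the image:
sudo docker run --rm --gpus all \
  crvusdxprodne001.azurecr.io/pipeline-peg-rna-azure-container-instance:latest \
  python3 -c "import torch; print(torch.cuda.is_available())"
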
The task settings are:

Elevation level: Task autouser, Admin

Container: crvusdxprodne001.azurecr.io/pipeline-peg-rna-azure-container-instance:latest (based on the nvidia image)
Container run options: --rm -v /mnt/batch/tasks/fsmounts/io:/mnt/io
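
For completeness, my understanding is that these settings map onto a Batch task definition roughly like the following. This is a sketch: the job id and command line are placeholders, and it assumes az batch account login has been run.

# Sketch of how the task settings above map onto the Batch task JSON
# ("pipeline-job" and the commandLine are placeholders).
cat > task.json <<'EOF'
{
  "id": "pipeline-task",
  "commandLine": "python3 /app/run.py",
  "containerSettings": {
    "imageName": "crvusdxprodne001.azurecr.io/pipeline-peg-rna-azure-container-instance:latest",
    "containerRunOptions": "--rm -v /mnt/batch/tasks/fsmounts/io:/mnt/io"
  },
  "userIdentity": {
    "autoUser": { "scope": "task", "elevationLevel": "admin" }
  }
}
EOF
az batch task create --job-id pipeline-job --json-file task.json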

I suspect the NVIDIA drivers are not ready yet when the task starts executing. However, I have tried the following start task on the node, so I would expect the drivers to be ready on the host by the time the task runs:

MAX_WAIT=300   # seconds
INTERVAL=5
ELAPSED=0
RESTART_DOCKER=0

until nvidia-smi &>/dev/null; do
  if [ $ELAPSED -ge $MAX_WAIT ]; then
    echo "Timed out waiting for NVIDIA driver after ${MAX_WAIT}s"
    exit 1
  fi
  echo "Waiting for NVIDIA driver... (${ELAPSED}s elapsed)"
  sleep $INTERVAL
  ELAPSED=$((ELAPSED + INTERVAL))
done

echo "NVIDIA driver ready!"

echo "=== Checking native.cgroupdriver=cgroupfs ==="
if ! sudo grep -q "native.cgroupdriver=cgroupfs" /etc/docker/daemon.json; then
  echo "-> Adding native.cgroupdriver=cgroupfs"
  tmp=$(mktemp)
  sudo jq '. + { "exec-opts": ["native.cgroupdriver=cgroupfs"] }' /etc/docker/daemon.json > "$tmp"
  sudo mv -f "$tmp" /etc/docker/daemon.json
  RESTART_DOCKER=1
else
  echo "Already set"
fi
echo
echo "Current 'exec-opts' in /etc/docker/daemon.json:"
sudo jq '.["exec-opts"]' /etc/docker/daemon.json
echo
echo "=== Checking NVIDIA container config ==="
if ! sudo grep -q "^no-cgroups = false" /etc/nvidia-container-runtime/config.toml; then
  echo "-> Disabling no-cgroups in NVIDIA config"
  sudo sed -i 's/#no-cgroups = false/no-cgroups = false/' /etc/nvidia-container-runtime/config.toml
  RESTART_DOCKER=1
else
  echo "no-cgroups already false"
fi
echo
echo "Lines around 'no-cgroups' in /etc/nvidia-container-runtime/config.toml:"
sudo grep -C2 "no-cgroups" /etc/nvidia-container-runtime/config.toml
echo
# Restart Docker if changes were made
if [ "$RESTART_DOCKER" == "1" ]; then
  echo "=== Restarting Docker ==="
  sudo systemctl restart docker
  sleep 5
fi
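
One additional check that could be appended to the end of the start task, to confirm the whole Docker + NVIDIA runtime path works on the host before any task runs (a sketch; it assumes the public CUDA image can be pulled on the node):

echo "=== Verifying GPU access from inside a container ==="
# Sketch: fail the start task if a test container cannot see the GPU.
if ! sudo docker run --rm --gpus all nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04 nvidia-smi; then
  echo "Container could not access the GPU"
  exit 1
fi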

What could explain this behaviour?


1 answer

  1. Deepanshu katara 17,955 Reputation points MVP Moderator
    2025-11-12T10:52:25.2566667+00:00

    Hello, welcome to Microsoft Q&A.

    1. When you SSH into the node, the NVIDIA driver extension has already finished installing and the nvidia-container-runtime is properly configured.
    2. In Azure Batch, when the pool scales up from zero, the start task and the container task can overlap. If the container task starts before:
      • the NVIDIA GPU driver extension finishes installing, or
      • Docker is restarted with the correct exec-opts and NVIDIA runtime settings,
      then the container will not detect the GPU (nvidia-smi fails inside the container).

    Recommendations:

    • Use waitForSuccess on the start task in your pool configuration so that no tasks begin until the start task has completed successfully (see the sketch after this list).
    • Increase the wait time: 300 seconds may be too short, as the NVIDIA extension can take 5 to 10 minutes to initialize.
    • Add the GPU flag to the container run options: --gpus all --rm -v /mnt/batch/tasks/fsmounts/io:/mnt/io
    • Verify the NVIDIA runtime: ensure /etc/docker/daemon.json includes "default-runtime": "nvidia" and "runtimes": { "nvidia": { "path": "nvidia-container-runtime", "runtimeArgs": [] } }.
    • Consider a pre-provisioned pool: instead of scaling from zero, keep one node warm to avoid driver installation delays.
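
    As a rough sketch of the waitForSuccess point, the startTask section of the pool definition would look something like this (field names follow the Batch REST schema; the start script name is a placeholder, and the rest of the pool definition, e.g. virtualMachineConfiguration and scale settings, is omitted for brevity). Merge this into your full pool JSON before creating the pool with az batch pool create --json-file:

    {
      "startTask": {
        "commandLine": "/bin/bash -c 'bash wait-for-gpu.sh'",
        "userIdentity": {
          "autoUser": { "scope": "pool", "elevationLevel": "admin" }
        },
        "waitForSuccess": true,
        "maxTaskRetryCount": 2
      }
    }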

    Best solution: Microsoft recommends using Azure Batch container pools configured for GPU with pre-installed drivers, rather than relying on driver extensions at runtime.

    Link: https://learn.microsoft.com/en-us/azure/container-instances/container-instances-gpu

    Please check this and let me know if it helps.

    Thanks,

    Deepanshu

