Hello, and welcome to Microsoft Q&A!
Root Cause
- By the time you SSH into the node, the NVIDIA driver extension has already finished installing and `nvidia-container-runtime` is properly configured, which is why manual checks look fine.
- In Azure Batch, when the pool scales from zero, the start task and the container task can overlap. If the container task starts before:
  - the NVIDIA GPU driver extension finishes installing, and
  - Docker is restarted with the correct `exec-opts` and NVIDIA runtime settings,

  then the container will not detect the GPU (`nvidia-smi` fails inside the container).
Recommended Fixes
- Use `waitForSuccess` on the start task in your pool configuration so that no tasks are scheduled on a node until the start task has completed successfully (a pool configuration sketch is included under Best Solution below).
- Increase the wait/timeout used by the start task: 300 seconds may be too short, as the NVIDIA extension can take 5 to 10 minutes to finish installing (a start-task helper sketch that polls for readiness follows this list).
- Add the GPU flag to the container run options: `--gpus all --rm -v /mnt/batch/tasks/fsmounts/io:/mnt/io`.
- Verify the NVIDIA runtime: ensure `/etc/docker/daemon.json` includes `"default-runtime": "nvidia"` and `"runtimes": { "nvidia": { "path": "nvidia-container-runtime", "runtimeArgs": [] } }`.
- Consider a pre-provisioned pool: instead of scaling from zero, keep one node warm to avoid driver installation delays.
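To make the start task genuinely block until the GPU stack is usable, one option is a small polling helper that the start task runs as its last step. This is only a sketch under assumptions: the file name `wait_for_gpu.py`, the 900-second timeout, and the 15-second poll interval are illustrative values, not taken from your current configuration.

```python
#!/usr/bin/env python3
"""Start-task helper (sketch): exit 0 only once the NVIDIA driver and the
Docker NVIDIA runtime are actually usable, so container tasks never start
against a half-provisioned node.

Assumptions: the script name (wait_for_gpu.py), the 900 s timeout and the
15 s poll interval are illustrative only.
"""
import json
import subprocess
import sys
import time

TIMEOUT_SECONDS = 900          # NVIDIA extension can take 5-10 minutes
POLL_INTERVAL_SECONDS = 15
DAEMON_JSON = "/etc/docker/daemon.json"


def driver_ready() -> bool:
    """True once nvidia-smi runs successfully on the host."""
    try:
        return subprocess.run(["nvidia-smi"], capture_output=True).returncode == 0
    except FileNotFoundError:  # driver not installed yet
        return False


def daemon_json_ready() -> bool:
    """True once daemon.json declares the nvidia runtime as the default."""
    try:
        with open(DAEMON_JSON) as f:
            cfg = json.load(f)
    except (OSError, json.JSONDecodeError):
        return False
    return cfg.get("default-runtime") == "nvidia" and "nvidia" in cfg.get("runtimes", {})


def docker_restarted_with_nvidia() -> bool:
    """True once the *running* Docker daemon reports nvidia as its default runtime."""
    try:
        out = subprocess.run(
            ["docker", "info", "--format", "{{.DefaultRuntime}}"],
            capture_output=True, text=True,
        )
    except FileNotFoundError:
        return False
    return out.returncode == 0 and out.stdout.strip() == "nvidia"


def main() -> int:
    deadline = time.monotonic() + TIMEOUT_SECONDS
    while time.monotonic() < deadline:
        if driver_ready() and daemon_json_ready() and docker_restarted_with_nvidia():
            print("GPU driver and NVIDIA container runtime are ready")
            return 0
        time.sleep(POLL_INTERVAL_SECONDS)
    print("Timed out waiting for GPU readiness", file=sys.stderr)
    return 1  # non-zero exit fails the start task, so no tasks are scheduled


if __name__ == "__main__":
    sys.exit(main())
```

Because the helper only exits successfully when all three checks pass, combining it with `waitForSuccess` means a container task can no longer race the driver installation.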
Best Solution
- Microsoft recommends using Azure Batch container pools configured for GPU with pre-installed drivers rather than relying on the driver extension at runtime.
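For reference, here is a minimal sketch of such a pool with the Python `azure-batch` SDK, pulling the points above together: `waitForSuccess` on the start task, one warm dedicated node, a container-enabled VM configuration, and `--gpus all` in the container run options. The account URL/key, image reference, VM size, registry/image names, and the `wait_for_gpu.py` path are placeholders/assumptions, and exact parameter names can differ slightly between SDK versions.

```python
"""Sketch: GPU container pool with a blocking start task (azure-batch SDK).

Everything in angle brackets, the VM size, the marketplace image, and the
wait_for_gpu.py path are placeholders/assumptions.
"""
from azure.batch import BatchServiceClient
from azure.batch import models as batchmodels
from azure.batch.batch_auth import SharedKeyCredentials

credentials = SharedKeyCredentials("<batch-account>", "<batch-account-key>")
client = BatchServiceClient(
    credentials, batch_url="https://<batch-account>.<region>.batch.azure.com"
)

pool = batchmodels.PoolAddParameter(
    id="gpu-container-pool",
    vm_size="Standard_NC6s_v3",           # any N-series (GPU) size
    # One warm node instead of scaling from zero, so driver installation
    # never races the first container task.
    target_dedicated_nodes=1,
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        image_reference=batchmodels.ImageReference(
            publisher="microsoft-azure-batch",
            offer="ubuntu-server-container",
            sku="20-04-lts",
            version="latest",
        ),
        node_agent_sku_id="batch.node.ubuntu 20.04",
        container_configuration=batchmodels.ContainerConfiguration(
            type="dockerCompatible",       # required by recent SDK versions
            container_image_names=["<registry>/<gpu-image>:<tag>"],
        ),
    ),
    start_task=batchmodels.StartTask(
        # Deliver wait_for_gpu.py via resource_files or bake it into a custom image.
        command_line="/bin/bash -c 'python3 /opt/wait_for_gpu.py'",
        wait_for_success=True,             # waitForSuccess in the REST API
        max_task_retry_count=3,
        user_identity=batchmodels.UserIdentity(
            auto_user=batchmodels.AutoUserSpecification(
                scope=batchmodels.AutoUserScope.pool,
                elevation_level=batchmodels.ElevationLevel.admin,
            )
        ),
    ),
)
client.pool.add(pool)

# Container tasks then pass the GPU through explicitly:
container_settings = batchmodels.TaskContainerSettings(
    image_name="<registry>/<gpu-image>:<tag>",
    container_run_options="--gpus all --rm -v /mnt/batch/tasks/fsmounts/io:/mnt/io",
)
```

Keeping `target_dedicated_nodes` at 1 (or using an autoscale formula with a floor of one node) trades the cost of one always-on VM for never paying the driver-install delay on scale-out.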
Link --> https://learn.microsoft.com/en-us/azure/container-instances/container-instances-gpu
Please check this and let us know if it helps.
Thanks
Deepanshu