Hello Minghui Song,
I'm glad you were able to find a workaround for the Docker GPU issue, and thank you for sharing it! It's helpful for others in the community who might run into the same situation. Since the Microsoft Q&A community has a policy that "The question author cannot accept their own answer. They can only accept answers by others", I'll post your workaround along with the steps that helped resolve the issue. This way, you can mark it as "Accepted" and help guide others who land on this thread.
Problem Summary: You successfully installed the NVIDIA driver for your A100 GPU (such as version 535), but running this Docker command:
docker run --rm --gpus all nvidia/cuda:12.2.0-base nvidia-smi
failed with the error:
Failed to initialize NVML: Unknown Error
However, you confirmed that the GPU works inside the container only when manually mapping the device:
--device=/dev/nvidia0:/dev/nvidia0
This indicates that the NVIDIA driver itself is working, but the NVIDIA container runtime for Docker was not properly configured.
How to fix it?
To resolve the issue and use the standard --gpus all flag, you need to ensure the NVIDIA Container Toolkit is properly installed and configured. First, confirm that the host GPU is recognized by running nvidia-smi on your host VM. If this fails, the GPU driver may not be installed correctly; rebooting the VM often helps if the driver was installed recently.
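If nvidia-smi fails on the host, a couple of quick checks (just an illustrative sketch, not something from the original thread) can tell you whether the kernel module is loaded and the device nodes exist:
lsmod | grep nvidia
ls -l /dev/nvidia*
If the module is missing or no /dev/nvidia* nodes are present, fix the driver installation before touching the container setup.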
Next, install the NVIDIA Container Toolkit, which allows Docker to communicate with your GPU:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
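As an optional sanity check after the install (illustrative only), you can confirm the toolkit binaries are available:
nvidia-ctk --version
nvidia-container-cli --version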
Then use the nvidia-ctk tool to configure NVIDIA as Docker's runtime:
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
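To confirm the change took effect (an optional check, assuming a default Docker setup), verify that Docker now lists an nvidia runtime; the configure command writes the entry into /etc/docker/daemon.json:
docker info | grep -i runtimes
cat /etc/docker/daemon.json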
Now test with the official CUDA container:
docker run --rm --gpus all nvidia/cuda:12.2.0-base nvidia-smi
If everything is configured correctly, you should see your A100 GPU listed in the output.
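Once --gpus all works, the same flag can also target a single GPU if you ever need to pin a container to one device (illustrative example using device index 0):
docker run --rm --gpus '"device=0"' nvidia/cuda:12.2.0-base nvidia-smi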
I am sure you have already tried this step, but I am including it here for anyone in the community who may have missed it. If issues persist, try reinstalling the container toolkit:
sudo apt remove -y nvidia-container-toolkit
sudo apt install -y nvidia-container-toolkit
sudo systemctl restart docker
Note: if you're using DKMS-based drivers, make sure Secure Boot is disabled on the Azure VM. You can check the toolkit status using nvidia-container-cli --load-kmods info
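To check the Secure Boot state from inside the VM (assuming the mokutil package is installed), you can run:
mokutil --sb-state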
And finally, the workaround you tried, manually exposing the device, will work, but only as a temporary measure:
docker run --rm --device=/dev/nvidia0:/dev/nvidia0 nvidia/cuda:12.2.0-base nvidia-smi
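For completeness, if you ever have to fall back on manual mapping again, CUDA workloads usually need the control and UVM device nodes as well, not just /dev/nvidia0. A rough sketch (paths assume a single-GPU VM, and the driver's user-space libraries may still need to be made available inside the container):
docker run --rm \
  --device=/dev/nvidia0:/dev/nvidia0 \
  --device=/dev/nvidiactl:/dev/nvidiactl \
  --device=/dev/nvidia-uvm:/dev/nvidia-uvm \
  nvidia/cuda:12.2.0-base nvidia-smi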
Thanks again for raising this and for sharing the workaround. I hope this summarized guide is helpful. Please feel free to add any points to enhance this answer.