Hello Minghui Song,
I'm glad you were able to find a workaround for the Docker GPU issue, and thank you for sharing it! It's helpful for others in the community who might run into the same situation. Since the Microsoft Q&A community has a policy that "The question author cannot accept their own answer. They can only accept answers by others", I'll post your workaround along with the steps that helped resolve the issue. This way, you can mark it as "Accepted" and help guide others who land on this thread.
Problem Summary: You successfully installed the NVIDIA driver for your A100 GPU (such as version 535), but running this Docker command:
docker run --rm --gpus all nvidia/cuda:12.2.0-base nvidia-smi
failed with the error:
Failed to initialize NVML: Unknown Error
However, you confirmed that the GPU works inside the container only when manually mapping the device:
--device=/dev/nvidia0:/dev/nvidia0
This indicates that the NVIDIA driver itself is working, but the NVIDIA container runtime for Docker was not properly configured.
How to fix it?
To resolve the issue and use the standard --gpus all flag, you need to ensure the NVIDIA Container Toolkit is properly installed and configured. First, confirm that the host GPU is recognized by running nvidia-smi on your host VM. If this fails, the GPU driver may not be installed correctly; rebooting the VM often helps if the driver was installed recently.
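If nvidia-smi fails on the host, a couple of quick checks (just an illustrative sketch, not something from the original thread) can tell you whether the kernel module is loaded and the device nodes exist:
lsmod | grep nvidia
ls -l /dev/nvidia*
If the module is missing or no /dev/nvidia* nodes are present, fix the driver installation before touching the container setup.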
Next, install the NVIDIA Container Toolkit, which allows Docker to communicate with your GPU:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
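As an optional sanity check after the install (illustrative only), you can confirm the toolkit binaries are available:
nvidia-ctk --version
nvidia-container-cli --version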
Then use the nvidia-ctk tool to configure NVIDIA as Docker's runtime:
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
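To confirm the change took effect (an optional check, assuming a default Docker setup), verify that Docker now lists an nvidia runtime; the configure command writes the entry into /etc/docker/daemon.json:
docker info | grep -i runtimes
cat /etc/docker/daemon.json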
Now test with the official CUDA container:
docker run --rm --gpus all nvidia/cuda:12.2.0-base nvidia-smi
If everything is configured correctly, you should see your A100 GPU listed in the output.
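Once --gpus all works, the same flag can also target a single GPU if you ever need to pin a container to one device (illustrative example using device index 0):
docker run --rm --gpus '"device=0"' nvidia/cuda:12.2.0-base nvidia-smi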
I am sure you have already tried this step, but I am including it here for anyone in the community who may have missed it. If issues persist, try reinstalling the container toolkit:
sudo apt remove -y nvidia-container-toolkit
sudo apt install -y nvidia-container-toolkit
sudo systemctl restart docker
Note: if you're using DKMS-based drivers, make sure Secure Boot is disabled on the Azure VM. You can check the toolkit status using nvidia-container-cli --load-kmods info
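To check the Secure Boot state from inside the VM (assuming the mokutil package is installed), you can run:
mokutil --sb-state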
And finally, the workaround you tried, manually exposing the device, will work, but only as a temporary measure:
docker run --rm --device=/dev/nvidia0:/dev/nvidia0 nvidia/cuda:12.2.0-base nvidia-smi
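For completeness, if you ever have to fall back on manual mapping again, CUDA workloads usually need the control and UVM device nodes as well, not just /dev/nvidia0. A rough sketch (paths assume a single-GPU VM, and the driver's user-space libraries may still need to be made available inside the container):
docker run --rm \
  --device=/dev/nvidia0:/dev/nvidia0 \
  --device=/dev/nvidiactl:/dev/nvidiactl \
  --device=/dev/nvidia-uvm:/dev/nvidia-uvm \
  nvidia/cuda:12.2.0-base nvidia-smi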
Thanks again for raising this and for sharing the workaround. I hope this summarized guide is helpful. Please feel free to add any points to enhance this answer.