Urgent, need help: Docker cannot use A100 GPU!

Minghui Song 45 Reputation points Microsoft Employee
2025-03-21T10:55:00.66+00:00

I'm using A100 on Azure, and I finally successfully installed the GPU driver, ref: https://learn.microsoft.com/en-us/answers/questions/2237112/cannot-install-gpu-driver-535-a100-80g-gpus?comment=answer-2014527&page=1#comment-1981851

However, I cannot use my A100 when I launch a Docker container!

Urgent, need help.

Azure Virtual Machines
An Azure service that is used to provision Windows and Linux virtual machines.

Accepted answer
  1. Arko 4,150 Reputation points Microsoft External Staff Moderator
    2025-04-07T07:31:59.9866667+00:00

    Hello Minghui Song,

    I'm glad that you were able to find a workaround for the Docker GPU issue. Thank you for sharing it! It's helpful for others in the community who might run into the same situation. Since the Microsoft Q&A community has a policy that "The question author cannot accept their own answer. They can only accept answers by others", I'll post your workaround along with the steps that helped resolve the issue. This way, you can mark it as "Accepted" and help guide others who land on this thread.

    Problem summary: You successfully installed the NVIDIA driver for your A100 GPU (such as version 535), but running this Docker command: docker run --rm --gpus all nvidia/cuda:12.2.0-base nvidia-smi failed with the error: Failed to initialize NVML: Unknown Error

    However, you confirmed that the GPU works inside the container only when the device is mapped manually with --device=/dev/nvidia0:/dev/nvidia0

    This indicates that the NVIDIA driver is working, but the NVIDIA Docker runtime wasn’t properly configured.

    How to fix it?

    To resolve the issue and use the standard --gpus all flag, make sure the NVIDIA Container Toolkit is properly installed and configured. First, confirm that the host recognizes the GPU by running nvidia-smi on your host VM. If this fails, the GPU driver might not be installed correctly. If the driver was installed recently, rebooting the VM can help.
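
As a rough illustration of that host-side check, the sketch below wraps it in a small function. The function name check_host_gpu is hypothetical, and the sketch only assumes that nvidia-smi is on the PATH once the driver is installed and that the driver creates device nodes under /dev.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether the host driver looks usable.
check_host_gpu() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found: the driver is probably not installed"
    return 1
  fi
  # -L lists the GPUs the driver can see, one per line.
  if ! nvidia-smi -L; then
    echo "nvidia-smi failed: check the driver install, then consider a reboot"
    return 1
  fi
  # Show the device nodes the driver created, if any.
  ls -l /dev/nvidia* 2>/dev/null || echo "no /dev/nvidia* device nodes found"
}

check_host_gpu || true
```

If the function reports a problem on the host, fixing the driver comes first; the Docker steps will not help until nvidia-smi works outside a container.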

    Install the NVIDIA Container Toolkit. This allows Docker to communicate with your GPU.

    
    # Detect the distribution ID and version (for example ubuntu22.04)
    distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
    # Add the NVIDIA repository signing key
    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
    # Register the repository, pinned to that key
    curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
      | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
      | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    # Install the toolkit
    sudo apt update
    sudo apt install -y nvidia-container-toolkit
    
    

    Use the nvidia-ctk tool to configure NVIDIA as a Docker runtime:

    
    sudo nvidia-ctk runtime configure --runtime=docker
    sudo systemctl restart docker
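
One way to confirm that the configure step took effect is to look for an nvidia runtime entry in Docker's daemon configuration. The sketch below assumes the default path /etc/docker/daemon.json; the helper name has_nvidia_runtime is hypothetical, and the check is a plain text match rather than a full JSON parse.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the given daemon.json mentions an "nvidia" runtime.
has_nvidia_runtime() {
  local config="${1:-/etc/docker/daemon.json}"
  [ -r "$config" ] && grep -q '"nvidia"' "$config"
}

if has_nvidia_runtime; then
  echo "nvidia runtime found in /etc/docker/daemon.json"
else
  echo "nvidia runtime not found; re-run: sudo nvidia-ctk runtime configure --runtime=docker"
fi
```

After restarting Docker, the Runtimes line in the output of docker info should also list nvidia.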
    
    

    Now test with the official CUDA container:

    
    docker run --rm --gpus all nvidia/cuda:12.2.0-base nvidia-smi
    
    

    If everything is configured correctly, you should see your A100 GPU listed in the output.
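
For scripted checks, the result of that test can be classified from its text. This is only a sketch: classify_smi_output is a hypothetical helper, and it assumes the GPU's product name in the nvidia-smi output contains the string A100.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify the text printed by the test container.
classify_smi_output() {
  local output="$1"
  case "$output" in
    *"Failed to initialize NVML"*) echo "nvml-error" ;;   # the error from this thread
    *A100*)                        echo "gpu-visible" ;;  # assumes the product name contains "A100"
    *)                             echo "unknown" ;;
  esac
}

# Example use (needs Docker and the image from the command above):
# classify_smi_output "$(docker run --rm --gpus all nvidia/cuda:12.2.0-base nvidia-smi 2>&1)"
```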

    You have probably tried these steps already, but I am including them here for anyone in the community who may have missed them.

    If issues persist, try reinstalling the container toolkit:

    
    sudo apt remove -y nvidia-container-toolkit
    sudo apt install -y nvidia-container-toolkit
    sudo systemctl restart docker
    
    

    Note: If you're using DKMS-based drivers, make sure Secure Boot is disabled on the Azure VM, since Secure Boot can block unsigned kernel modules from loading. You can check the toolkit status using nvidia-container-cli --load-kmods info
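
To see whether Secure Boot is currently enabled from inside the VM, something like the following may help. It assumes the mokutil package is installed (it is not always present by default); secure_boot_state is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Secure Boot state if mokutil is available.
secure_boot_state() {
  if command -v mokutil >/dev/null 2>&1; then
    # Reports whether Secure Boot is enabled or disabled on EFI systems.
    mokutil --sb-state 2>&1 || true
  else
    echo "mokutil not installed; check the VM's security settings in Azure instead"
  fi
}

secure_boot_state
```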

    Finally, the workaround you tried, manually exposing the device to the container, works as a temporary measure:

    
    docker run --rm --device=/dev/nvidia0:/dev/nvidia0 nvidia/cuda:12.2.0-base nvidia-smi
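
On VMs with more than one GPU, or when a workload needs additional driver nodes, the same workaround can be generalized by building one --device flag per node. This is a sketch, not a recommended setup: nvidia_device_flags is a hypothetical helper, and which nodes a given workload actually needs (for example /dev/nvidiactl or /dev/nvidia-uvm) is an assumption to verify on your VM.

```shell
#!/usr/bin/env bash
# Hypothetical helper: emit one --device flag per NVIDIA node in a directory.
nvidia_device_flags() {
  local dev_dir="${1:-/dev}"
  local node
  for node in "$dev_dir"/nvidia*; do
    # Skip the unexpanded pattern and subdirectories such as /dev/nvidia-caps.
    [ -e "$node" ] && [ ! -d "$node" ] || continue
    printf -- '--device=%s:%s ' "$node" "$node"
  done
}

# Print the command rather than running it, so it can be reviewed first.
echo docker run --rm $(nvidia_device_flags) nvidia/cuda:12.2.0-base nvidia-smi
```

Once --gpus all works, these manual flags are no longer needed.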
    
    

    Thanks again for raising this and for sharing the workaround. I hope this summary is helpful. Please feel free to add any points to enhance this answer.
