Nvidia drivers not working on DSVM

Sebastian Lienert 5 Reputation points
2023-11-16T08:07:00.7066667+00:00

Hi,

I'm trying to set up a VM with CUDA installed and figured I would go with the DSVM image, since according to the specification it should work out of the box.

However, when I connect to my VM (NC6s v3) and execute nvidia-smi, I get:

$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

I have tried running apt upgrade and reinstalling cuda-drivers, to no avail. I also created a new VM this morning and it has exactly the same problem.

If I try installing the "NvidiaGpuDriverLinux" Extension it also fails.

Any advice is appreciated. My next approach would be to start from a clean, non-DSVM image and install the drivers myself.

Azure Data Science Virtual Machines
Azure Virtual Machine images that are pre-installed, configured, and tested with several commonly used tools for data analytics, machine learning, and artificial intelligence training.

2 answers

  1. YutongTie-MSFT 53,211 Reputation points
    2023-11-16T19:36:44.41+00:00

    @Sebastian Lienert

    Thanks for reaching out to us. There could be several reasons why you're experiencing this issue. Here are a few troubleshooting steps you can follow:

    1. Check the VM size: Not all VMs in Azure support GPU acceleration. Make sure that you're using a VM size that supports GPUs. NC6s v3 should support GPUs, so this shouldn't be an issue.
    2. Check the CUDA version: It's possible that the CUDA version installed on your DSVM is not compatible with the GPU on your VM. You can check the CUDA version with the command nvcc --version. The CUDA version should be compatible with the NVIDIA driver version.
    3. Reinstall the NVIDIA driver: You mentioned that you have tried reinstalling the CUDA drivers. You can try reinstalling the NVIDIA drivers as well. Here's how:
      • Uninstall the current driver: sudo apt-get remove --purge 'nvidia-*' (quote the pattern so the shell does not expand it against files in the current directory)
      • Update the package lists: sudo apt-get update
      • Install the NVIDIA driver: sudo apt-get install nvidia-driver-xxx (replace xxx with the version you want)
    4. Check the NVIDIA Kernel Module: Sometimes, the NVIDIA kernel module is not loaded correctly, which can cause issues. You can check if the NVIDIA kernel module is loaded with the command lsmod | grep nvidia. If it's not loaded, you can load it with the command sudo modprobe nvidia.
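    5. Check for any system updates: A kernel update can leave the NVIDIA kernel module unbuilt for the running kernel, which stops the driver from loading. Make sure your system is up to date, then reboot so that the kernel and the driver packages match.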
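
The checks in steps 2 to 4 can be gathered into one small script. This is only a sketch, assuming a Debian/Ubuntu-based DSVM with bash; the function names nvidia_module_loaded and gpu_diagnostics are hypothetical helpers made up for illustration, not part of any NVIDIA or Azure tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the core nvidia kernel module is listed.
# Reads /proc/modules by default (the same data lsmod prints); the optional
# argument only exists so the function can be pointed at a sample file.
nvidia_module_loaded() {
    local modules_file="${1:-/proc/modules}"
    grep -q '^nvidia ' "$modules_file" 2>/dev/null
}

# Hypothetical helper: prints each finding and keeps going when a tool is missing.
gpu_diagnostics() {
    echo "Kernel: $(uname -r)"

    if nvidia_module_loaded; then
        echo "nvidia module: loaded"
    else
        echo "nvidia module: not loaded"
    fi

    if command -v nvcc >/dev/null 2>&1; then
        # The last line of nvcc --version carries the build identifier.
        nvcc --version | tail -n 1
    else
        echo "nvcc: not on PATH"
    fi

    if command -v dpkg >/dev/null 2>&1; then
        # Show installed (state "ii") NVIDIA driver packages, if any.
        dpkg -l 'nvidia-driver-*' 2>/dev/null | grep '^ii' \
            || echo "no nvidia-driver package installed"
    fi
}
```

After sourcing the file, run gpu_diagnostics on the VM. If the module shows as not loaded while a driver package is installed, the output of sudo modprobe nvidia and of sudo dmesg | grep -i nvidia usually says why the load was refused.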

    If none of these steps work, then you might want to consider starting from a clean, non-DSVM image and installing the drivers yourself. Make sure to follow the official NVIDIA installation guides to ensure that the drivers are installed correctly.

    I hope this helps.

    Regards,

    Yutong


  2. TOMOIAGA Ciprian 1 Reputation point
    2024-01-12T18:13:40.6666667+00:00

    @YutongTie-MSFT Thanks for the advice. Running sudo modprobe nvidia fails with modprobe: ERROR: could not insert 'nvidia': Operation not permitted. This is probably because SecureBoot is enabled and the driver taints the kernel (it is not signed, or something like that). To be honest, the reason I use a DSVM is for everything to work out of the box. Instead, I am first greeted with conda: command not found, and I need to press Ctrl+C to get back to bash. And then the Nvidia drivers fail. If I wanted these headaches, I'd get a vanilla VM.
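
The Secure Boot guess above can be checked rather than assumed. The following is a minimal sketch, assuming mokutil and modinfo are available on the image; classify_sb_state and check_secure_boot_and_signature are hypothetical helper names used only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: maps the text printed by `mokutil --sb-state` to a
# short verdict. The text is passed in as an argument so the logic can be
# tried on a machine without mokutil.
classify_sb_state() {
    case "$1" in
        *"SecureBoot enabled"*)  echo "enabled" ;;
        *"SecureBoot disabled"*) echo "disabled" ;;
        *)                       echo "unknown" ;;
    esac
}

# Hypothetical helper: runs the real checks on the VM.
check_secure_boot_and_signature() {
    local state="unknown"
    if command -v mokutil >/dev/null 2>&1; then
        state=$(classify_sb_state "$(mokutil --sb-state 2>&1)")
    fi
    echo "Secure Boot: $state"

    # modinfo -F signer prints the signer field; empty output means no
    # signer was found (unsigned module, or module not installed).
    local signer
    signer=$(modinfo -F signer nvidia 2>/dev/null)
    echo "nvidia module signer: ${signer:-none found}"
}
```

If Secure Boot is reported as enabled and no signer is found, that is consistent with the "Operation not permitted" error from modprobe. In that case the likely options are signing the module with an enrolled key or creating the VM with Secure Boot turned off.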
