NC6 DSVM Ubuntu 18.04 Gen 1 - NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver

Suchi 131 Reputation points
2021-12-10T02:39:09.58+00:00

I created an NC6 GPU VM with the Ubuntu DSVM Gen 1 image. When I run nvidia-smi from an SSH terminal, I get this error:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

This worked fine until a few weeks ago; I have only noticed this behavior recently. I have created new VMs many times, and the issue occurs every time. Can you please help? Has anything changed?

Azure Data Science Virtual Machines
Azure Virtual Machine images that are pre-installed, configured, and tested with several commonly used tools for data analytics, machine learning, and artificial intelligence training.

Accepted answer
  1. Suchi 131 Reputation points
    2021-12-10T04:55:43.293+00:00

    I found the solution, after wasting more than a day on this. I really wish the Azure team would bundle the correct driver versions with DSVMs.

    The NVIDIA driver bundled with the Ubuntu 18.04 Gen 1 DSVM is nvidia-495, which is not supported by Ubuntu 18.04. I had to try various installs, refreshes, network settings, and so on before getting to this point. Finally, I found in syslog that Ubuntu 18.04 was ignoring driver 495, so the GPU was never loaded.

    I then had to go through a lot of steps to remove 495 cleanly and install 470, which is supported by Ubuntu 18.04. After that, it worked.
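The version mismatch described here can be sketched as a small check. The following Python snippet is only an illustration: it assumes, as reported in this thread, that 470.xx is the last driver branch that works with the Tesla K80 in an NC6. The names `K80_LAST_SUPPORTED_BRANCH`, `driver_branch`, and `supports_k80` are hypothetical and not part of any NVIDIA or Azure tool.

```python
# Hypothetical helper, not part of any NVIDIA or Azure tooling.
# Assumption: 470.xx is the last driver branch that supports the Tesla K80,
# as reported in this thread.
K80_LAST_SUPPORTED_BRANCH = 470


def driver_branch(version: str) -> int:
    """Return the major branch number from a version such as '470.182.03-1'."""
    return int(version.strip().split(".")[0])


def supports_k80(version: str) -> bool:
    """True if the driver branch is at or below the assumed K80 legacy branch."""
    return driver_branch(version) <= K80_LAST_SUPPORTED_BRANCH


if __name__ == "__main__":
    # Version strings taken from the posts in this thread.
    for v in ("470.182.03-1", "525.105.17", "530.30.02-1"):
        status = "should load the K80" if supports_k80(v) else "is expected to ignore the K80"
        print(f"{v}: {status}")
```

A check like this only compares version numbers. It does not confirm that the kernel module actually loaded, so running nvidia-smi after the downgrade is still the real test.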

    The Microsoft DSVM page also states that K80 (NC6) machines ship with 470 drivers. In practice, however, I found that the image comes with 495, which was the root cause of this issue.

    Can someone from the Microsoft Azure team please update the DSVM image so that it comes preloaded with NVIDIA driver 470 instead of 495?

    2 people found this answer helpful.

3 additional answers

  1. Vijay P 11 Reputation points
    2021-12-14T15:34:50.363+00:00

    I can confirm I have the same problem with an NC6 DSVM. I am running Ubuntu 20.04 with the NVIDIA 495 drivers, which are supported and installed by default, and nvidia-smi fails on my machine too. Any solution or workaround would be helpful.

    2 people found this answer helpful.

  2. Luis Molina Martinez 126 Reputation points
    2023-01-17T08:22:45.1333333+00:00

    I'm having the same issue, and it also seems related to the iotedge defender. I'm using the iotedge framework in a DSVM and regularly get crash reports about it. When that happens, nvidia-smi stops working, and when I tried to update the drivers I got an error from this module and the installation failed.

    1 person found this answer helpful.

  3. Sébastien Perin 36 Reputation points
    2023-04-26T10:20:53.35+00:00

    Same issue here after manually installing the CUDA driver on an NC6 Ubuntu 20.04 VM following the documentation. The latest installed driver was cuda-drivers=530.30.02-1. However, dmesg says:

    [   53.252539] NVRM: The NVIDIA Tesla K80 GPU installed in this system is
                   NVRM:  supported through the NVIDIA 470.xx Legacy drivers. Please
                   NVRM:  visit http://www.nvidia.com/object/unix.html for more
                   NVRM:  information.  The 525.105.17 NVIDIA driver will ignore
                   NVRM:  this GPU.  Continuing probe...
    
    

    So I downgraded to cuda-drivers=470.182.03-1 and it works.
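For anyone who wants to detect this condition automatically, here is a minimal Python sketch that pulls the legacy branch and the loaded driver version out of the NVRM message quoted above. The function name `parse_nvrm_legacy_warning` and the regular expressions are assumptions based only on the wording of that one message; other driver versions may phrase it differently.

```python
import re

# The NVRM message quoted in this answer, used as sample input.
SAMPLE_DMESG = """\
[   53.252539] NVRM: The NVIDIA Tesla K80 GPU installed in this system is
               NVRM:  supported through the NVIDIA 470.xx Legacy drivers. Please
               NVRM:  visit http://www.nvidia.com/object/unix.html for more
               NVRM:  information.  The 525.105.17 NVIDIA driver will ignore
               NVRM:  this GPU.  Continuing probe...
"""


def parse_nvrm_legacy_warning(dmesg_text):
    """Return the legacy branch and loaded driver version, or None if absent.

    Hypothetical helper: the patterns match the wording of the message above
    and are not taken from any NVIDIA specification.
    """
    parts = []
    for line in dmesg_text.splitlines():
        # Strip the kernel timestamp and the "NVRM:" prefix from each line.
        line = re.sub(r"^\s*\[\s*\d+\.\d+\]\s*", "", line)
        line = re.sub(r"^\s*NVRM:\s*", "", line)
        if line.strip():
            parts.append(line.strip())
    # Join the wrapped lines into one sentence stream with single spaces.
    message = re.sub(r"\s+", " ", " ".join(parts))

    legacy = re.search(r"supported through the NVIDIA (\d+)\.xx Legacy drivers", message)
    loaded = re.search(r"The (\d+(?:\.\d+)+) NVIDIA driver will ignore this GPU", message)
    if not (legacy and loaded):
        return None
    return {"legacy_branch": int(legacy.group(1)), "loaded_driver": loaded.group(1)}


if __name__ == "__main__":
    print(parse_nvrm_legacy_warning(SAMPLE_DMESG))
```

In practice the input would come from the output of dmesg or from syslog rather than a hard-coded string. The legacy branch it reports is the driver series to downgrade to.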
