NVIDIA graphics card disconnects from Ubuntu OS after restart

Simranjeet Singh 6 Reputation points

Hi,

I deployed a VM from a Marketplace image with PyTorch and CUDA preinstalled on Ubuntu. On first setup everything works: the torch package detects the NVIDIA GPU and my code runs fine.
However, when I restart the VM, the connection between the OS and the GPU seems to break and the VM can no longer detect the graphics card. The only solution so far is to redeploy the VM and start from scratch.
Has anyone faced a similar issue, or can anyone help me solve this problem?

Thanks

kobulloc-MSFT 23,646 Reputation points Microsoft Employee

2021-10-11T17:40:21.113+00:00

Hello, @Simranjeet Singh !

What Marketplace image and VM SKU are you using?

Simranjeet Singh 6 Reputation points

2021-10-13T12:29:01.297+00:00

Thanks @Anonymous for getting back on this.

I was using the AISE PyTorch GPU Notebook image (Marketplace link) with the VM SKU Standard_NC6s_v2.

I know the image ships a somewhat old version of PyTorch by default, but I updated it and it seemed to work for me.

On first setup everything works fine, but as soon as I shut down the VM and start it again the next day, I get the attached errors. From what I found online, my understanding is that the OS can't communicate with the GPU. Restarting the system fixed this for some people but not for others. Updating the NVIDIA drivers didn't help either.

Simranjeet Singh 6 Reputation points

2021-10-13T12:31:06.887+00:00

Attaching the output of the lspci command. The NVIDIA entry returns access denied.
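
For readers following along, the symptom described here touches three layers: whether the device shows up on the PCI bus, whether the NVIDIA driver responds, and whether PyTorch can reach CUDA. A minimal Python sketch that checks each layer separately might look like the following. The helper names (`run_quiet`, `nvidia_pci_lines`, `gpu_report`) are hypothetical, and the sketch assumes `lspci` and `nvidia-smi` are on the PATH; it is not taken from the image or from this thread.

```python
import importlib.util
import shutil
import subprocess


def run_quiet(cmd):
    """Run a command; return (exit_code, output), or (None, reason) if it cannot run."""
    if shutil.which(cmd[0]) is None:
        return None, f"{cmd[0]} not found on PATH"
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return None, str(exc)
    return proc.returncode, (proc.stdout + proc.stderr).strip()


def nvidia_pci_lines(lspci_output):
    """Return the lspci lines that mention an NVIDIA device."""
    return [line for line in lspci_output.splitlines() if "nvidia" in line.lower()]


def gpu_report():
    """Collect one result per layer: PCI bus, driver, PyTorch."""
    report = {}

    # Layer 1: is the card visible on the PCI bus at all?
    code, out = run_quiet(["lspci"])
    report["pci_devices"] = nvidia_pci_lines(out) if code == 0 else []

    # Layer 2: does the driver answer? `nvidia-smi -L` lists the GPUs it can see.
    code, out = run_quiet(["nvidia-smi", "-L"])
    report["driver_ok"] = code == 0
    report["driver_output"] = out

    # Layer 3: can PyTorch reach CUDA? None means torch is not installed.
    if importlib.util.find_spec("torch") is None:
        report["torch_cuda"] = None
    else:
        import torch
        report["torch_cuda"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

Running this once right after deployment and once after the failing restart would show which layer changes. If the PCI entry is still listed but the driver check fails, that points at the driver or kernel module rather than at the hardware assignment.
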
kobulloc-MSFT 23,646 Reputation points Microsoft Employee

2021-10-14T03:25:04.747+00:00

Hello again, @Simranjeet Singh !

Unfortunately, I'm unable to use some third-party images like the Jetware image you linked, but it sounds like a configuration change happens on reboot (updates?). Can you get any information from the VM's logs about what might have happened after the reboot?

Simranjeet Singh 6 Reputation points

2021-10-14T14:21:37.897+00:00

Hi @Anonymous ,

I ran journalctl -b and dmesg on my VM to check the boot logs but didn't find anything related to OS or driver updates.
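
Boot logs are long, so GPU-related lines are easy to miss when scanning by eye. As a rough sketch, the saved output of `journalctl -b` or `dmesg` could be filtered for driver messages with a few lines of Python. The keyword list is an assumption (`NVRM` is the prefix the NVIDIA kernel module uses in kernel messages, `Xid` marks its error reports, `nouveau` is the open-source driver that can conflict with it), and the function and file names are hypothetical.

```python
import re
import sys

# Assumed keyword set; extend it for the driver stack actually in use.
GPU_LOG_PATTERN = re.compile(r"nvidia|nvrm|nouveau|\bxid\b", re.IGNORECASE)


def gpu_log_lines(log_text):
    """Return only the log lines that mention the GPU driver stack."""
    return [line for line in log_text.splitlines() if GPU_LOG_PATTERN.search(line)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Expects the path of a saved log file as the first argument.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for line in gpu_log_lines(handle.read()):
            print(line)
```

For example, save the log with `journalctl -b > boot.log` and run `python3 gpu_log_filter.py boot.log`. Comparing the filtered output from a working boot with that from a failing boot shows whether the driver module loaded at all.
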

Any other hints?

Thanks

1 answer

kobulloc-MSFT 23,646 Reputation points Microsoft Employee

2021-10-13T01:10:33.637+00:00

Hello, @Simranjeet Singh !

I haven't been able to reproduce the issue you are describing. This is the setup I used:

Image: NVIDIA GPU-Optimized PyTorch Image - v21.06.0 - Gen2 (Azure Marketplace link)

VM SKU: Standard_NV12s_v3

After restarting the VM, I'm still able to see the GPU:

I would try using the same image and see if you still encounter this issue.