NVIDIA Driver not detected

Question

Nicholas Broad 10

I am using the "mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.8-cudnn8-ubuntu22.04" image to train transformers in an Azure ML pipeline. A couple of weeks ago I set it up to run on V100s and it worked fine. Now I'm trying to run it on T4 and A10 GPUs, and I get the following warning:

WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.

I then tried some of the ACPT PyTorch images, but those also failed:

Failed to execute command group with error Docker responded with status code 500: {"message":"failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'\nnvidia-container-cli: initialization error: nvml error: driver not loaded: unknown"}

It seems I need to install the drivers manually, which defeats the purpose of using the images that Azure recommends.

I'll try following these steps, but I wanted to post because I expect my approach to fail.

romungi-MSFT 48,911 Reputation points Microsoft Employee Moderator

2023-06-28T10:09:53.76+00:00

@Nicholas Broad Searching for this error suggests that it could be a platform issue rather than an issue with the image itself. Please see this issue, where a user suggests in a recent comment running an upgrade on the machine:

sudo apt update && sudo apt upgrade

As mentioned, it also seems to work on a different VM configuration. If the issue persists, you can raise an issue with the Azure ML containers repo; the Dockerfile for your configuration is here for reference.

Poda Csanad 0 Reputation points

2023-12-18T20:07:05.2733333+00:00

@romungi-MSFT To be honest, I'm facing the same issue even after running apt update and apt upgrade. This is on mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.8-cudnn8-ubuntu22.04 with an A100.
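
For anyone trying to tell a host-side problem from an image problem, as discussed in the comments above, a rough check from inside the job may help. The sketch below is only an illustration: the function name nvidia_driver_status is made up for this example, and it assumes a Linux node where a loaded NVIDIA kernel module exposes /proc/driver/nvidia/version and where nvidia-smi, if present, is on the PATH. Whether that /proc entry is visible inside a given container can depend on the runtime, so treat the result as a hint rather than a verdict.

```python
import os
import shutil
import subprocess


def nvidia_driver_status(proc_path="/proc/driver/nvidia/version"):
    """Return a small dict describing whether an NVIDIA driver looks visible here.

    Hypothetical helper for illustration; not part of any Azure ML or NVIDIA API.
    """
    status = {
        # Assumption: the NVIDIA kernel module exposes this file once it is loaded.
        "kernel_module_loaded": os.path.exists(proc_path),
        # Location of nvidia-smi on PATH, or None if it is not available.
        "nvidia_smi_path": shutil.which("nvidia-smi"),
        # True only if `nvidia-smi -L` (list GPUs) exits successfully.
        "nvidia_smi_ok": False,
    }
    if status["nvidia_smi_path"]:
        try:
            result = subprocess.run(
                [status["nvidia_smi_path"], "-L"],
                capture_output=True,
                text=True,
                timeout=30,
            )
            status["nvidia_smi_ok"] = result.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            # Leave nvidia_smi_ok as False if the tool cannot be run.
            pass
    return status


if __name__ == "__main__":
    print(nvidia_driver_status())
```

If kernel_module_loaded comes back False on the compute node, that would be consistent with the "nvml error: driver not loaded" message in the question and would point at the node rather than at the image.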

Answer 1

Nicholas Broad 10

I decided to switch to the NVIDIA NGC PyTorch container (23.04), and it worked just fine.
