NVIDIA Driver not detected

Nicholas Broad 10 Reputation points
2023-06-28T03:07:07.6366667+00:00

I am using the "mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.8-cudnn8-ubuntu22.04" image to train transformers in an Azure ML pipeline. A couple of weeks ago I set it up to run on V100s and it worked fine. Now I'm trying to run it on T4 and A10 GPUs, and I get the following error:

WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.

I then tried some of the ACPT (Azure Container for PyTorch) images, but those also failed, this time with:

Failed to execute command group with error Docker responded with status code 500: {"message":"failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'\nnvidia-container-cli: initialization error: nvml error: driver not loaded: unknown"}
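Both messages point at the host driver not being visible to the container, so a quick check from inside the job can narrow things down before changing images. Below is a minimal sketch, assuming Python is available in the image; the helper name `gpu_driver_status` is hypothetical and not part of any Azure or NVIDIA tooling. It only reports what it can see and does not install or repair anything.

```python
import shutil
import subprocess


def gpu_driver_status():
    """Report whether the NVIDIA driver is visible from this process.

    Hypothetical helper for illustration only.
    """
    status = {
        "nvidia_smi_found": False,  # is the nvidia-smi binary on PATH?
        "driver_ok": False,         # did nvidia-smi run successfully?
        "torch_cuda": None,         # None means PyTorch is not installed
        "detail": "",               # raw output or error text
    }

    smi = shutil.which("nvidia-smi")
    if smi is not None:
        status["nvidia_smi_found"] = True
        try:
            result = subprocess.run(
                [smi, "--query-gpu=name,driver_version", "--format=csv,noheader"],
                capture_output=True,
                text=True,
                timeout=30,
            )
            status["driver_ok"] = result.returncode == 0
            status["detail"] = (result.stdout or result.stderr).strip()
        except (OSError, subprocess.TimeoutExpired) as exc:
            status["detail"] = str(exc)

    try:
        import torch  # optional: only checked if PyTorch is present

        status["torch_cuda"] = torch.cuda.is_available()
    except ImportError:
        pass

    return status


if __name__ == "__main__":
    print(gpu_driver_status())
```

Running this as the first step of the pipeline job shows whether `nvidia-smi` is missing entirely, present but failing, or working while PyTorch still cannot see a device.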

It seems I need to install the drivers manually, which defeats the purpose of using the images Azure recommends.

I'll try following these steps, but I wanted to post because I expect my approach to fail.

Azure Machine Learning

1 answer

  1. Nicholas Broad 10 Reputation points
    2023-06-28T21:47:42.6533333+00:00

    I decided to switch to the NVIDIA NGC PyTorch container (23.04), and it worked just fine.
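For anyone wanting to try the same switch, here is a rough sketch of how the NGC image could be referenced from an Azure ML environment. It assumes the Azure ML Python SDK v2 (`azure-ai-ml`) is installed, and that the 23.04 release corresponds to the tag `nvcr.io/nvidia/pytorch:23.04-py3`; the answer does not state the exact tag or how the environment was defined, so the tag, the environment name, and the helper `build_environment` are all assumptions for illustration.

```python
# Assumed tag for the NGC PyTorch 23.04 release; the answer gives only "23.04".
NGC_IMAGE = "nvcr.io/nvidia/pytorch:23.04-py3"


def build_environment(name="ngc-pytorch-23-04", image=NGC_IMAGE):
    """Build an Azure ML environment that points at the NGC image.

    Hypothetical helper; requires the azure-ai-ml package (SDK v2).
    """
    # Imported here so the module loads even without the SDK installed.
    from azure.ai.ml.entities import Environment

    return Environment(
        name=name,
        image=image,
        description="NVIDIA NGC PyTorch container used as the job image",
    )


if __name__ == "__main__":
    env = build_environment()
    print(env.name, env.image)
```

The returned environment can then be passed to a pipeline step or command job in place of the curated image.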

