I decided to switch to the NVIDIA NGC PyTorch container (23.04), and it worked just fine.
NVIDIA Driver not detected
I am using the "mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.8-cudnn8-ubuntu22.04" image to train transformers in an Azure ML pipeline. A couple of weeks ago I set it up to run on V100s and it ran fine. Now I'm trying to run it on T4 and A10 GPUs, and I am getting the following error:
WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
I then tried some of the ACPT (Azure Container for PyTorch) images, but those also failed, this time with:
Failed to execute command group with error Docker responded with status code 500: {"message":"failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'\nnvidia-container-cli: initialization error: nvml error: driver not loaded: unknown"}
It seems I need to install the drivers manually, which defeats the purpose of using the images that Azure recommends.
I'll try following these steps, but I wanted to post because I am anticipating that my approach will fail.
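For anyone hitting the same warning, here is a minimal sketch of a check that could run at the start of the training script to confirm whether the driver is visible from inside the container. The function name `check_nvidia_driver` and the shape of the returned dict are my own choices for illustration, not part of any Azure ML or NVIDIA API. It only helps with the first case (the container starts but prints the warning). With the second error the container never starts, so nothing inside it can run.

```python
import shutil
import subprocess


def check_nvidia_driver(timeout_s: float = 10.0) -> dict:
    """Report whether the NVIDIA driver is reachable from this process.

    Hypothetical helper for illustration; the returned keys are assumptions.
    """
    report = {"nvidia_smi_found": False, "driver_ok": False, "detail": ""}

    # nvidia-smi is normally injected into the container by the NVIDIA runtime.
    smi = shutil.which("nvidia-smi")
    if smi is None:
        report["detail"] = "nvidia-smi not found on PATH"
        return report
    report["nvidia_smi_found"] = True

    try:
        result = subprocess.run(
            [smi, "--query-gpu=name,driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        report["detail"] = f"nvidia-smi could not run: {exc}"
        return report

    # A non-zero exit code usually means the driver is not loaded on the host.
    report["driver_ok"] = result.returncode == 0
    if report["driver_ok"]:
        report["detail"] = result.stdout.strip()
    else:
        report["detail"] = (result.stderr or result.stdout).strip()
    return report


if __name__ == "__main__":
    print(check_nvidia_driver())
    try:
        import torch  # optional: only checked if PyTorch is installed
        print("torch.cuda.is_available():", torch.cuda.is_available())
    except ImportError:
        print("PyTorch is not installed in this environment")
```

If `driver_ok` comes back false on the T4/A10 nodes but true on the V100 nodes with the same image, that would point at the compute nodes rather than the image.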