PyTorch cannot detect GPU when using an AML Compute Cluster with a GPU

Claudia Vanea
2021-04-06T15:01:37.227+00:00

Hi,

I've been trying to train a PyTorch model on an Azure ML compute cluster (STANDARD_NV6), but I can't get the code to detect and use the GPU device; torch.cuda.is_available() always returns False.
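For reference, this is the kind of check that fails inside the training script (a minimal sketch; the extra prints are just diagnostics):

    import torch

    print(torch.__version__)           # installed PyTorch build
    print(torch.version.cuda)          # CUDA version the build was compiled against (None for a CPU-only build)
    print(torch.cuda.is_available())   # always prints False on the cluster
    print(torch.cuda.device_count())   # 0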

I'm using a custom environment and have tried a few different Docker images from the Microsoft Container Registry as base images. For example, I've tried the "mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.2-cudnn7-ubuntu18.04" base image.
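My environment definition looks roughly like this (simplified sketch; the environment name and package list are placeholders):

    from azureml.core import Environment
    from azureml.core.conda_dependencies import CondaDependencies

    # Custom environment built on one of the GPU base images from mcr.microsoft.com
    env = Environment(name="pytorch-gpu-env")
    env.docker.base_image = (
        "mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.2-cudnn7-ubuntu18.04"
    )
    env.python.conda_dependencies = CondaDependencies.create(
        pip_packages=["torch", "torchvision"]
    )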

In the build log, I can see that the correct dependencies are installed each time, but the code still doesn't detect a GPU. I tried forcing Docker to use the GPU with docker_arguments = ["--gpus", "all"] (see the run config sketch after the log below), but this causes the build to fail with this error:

AzureMLCompute job failed.
FailedStartingContainer: Unable to start docker container
    FailedContainerStart: Unable to start docker container
    err: warning: your kernel does not support swap limit capabilities or the cgroup is not mounted. memory limited without swap.
docker: error response from daemon: oci runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: file creation failed: /mnt/docker/overlay2/66b78fe178db5d08ca4db26528f1a6de00aba65b528a6568649b1abcbea22348/merged/run/nvidia-persistenced/socket: no such device or address: unknown.

    Reason: warning: your kernel does not support swap limit capabilities or the cgroup is not mounted. memory limited without swap.
docker: error response from daemon: oci runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: file creation failed: /mnt/docker/overlay2/66b78fe178db5d08ca4db26528f1a6de00aba65b528a6568649b1abcbea22348/merged/run/nvidia-persistenced/socket: no such device or address: unknown.

    Info: Failed to prepare an environment for the job execution: Job environment preparation failed on 10.0.0.5 with err exit status 1.
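For completeness, the Docker arguments were passed roughly like this (a sketch; the script and compute target names are placeholders):

    from azureml.core import ScriptRunConfig
    from azureml.core.runconfig import DockerConfiguration

    # Passing --gpus all through to Docker; this is what triggers the failure above
    docker_config = DockerConfiguration(use_docker=True, arguments=["--gpus", "all"])
    run_config = ScriptRunConfig(
        source_directory=".",
        script="train.py",
        compute_target="gpu-cluster",
        environment=env,                      # the environment defined above
        docker_runtime_config=docker_config,
    )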

It feels like I've missed some obvious step somewhere...

Thanks for any help!


1 answer

  1. Ramr-msft
    2021-04-07T10:11:49.583+00:00

    @Claudia Vanea Thanks for the question. This usually points to a driver issue. Can you please add more details about the PyTorch version that you are using? We have seen cases, especially with PyTorch, where it doesn't install correctly against the latest CUDA drivers. Can you please also try installing the latest NVIDIA drivers?
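    For example, something along these lines pins the PyTorch build to the CUDA version of the base image (the version numbers below are only illustrative, assuming the CUDA 10.2 image from the question):

        from azureml.core.conda_dependencies import CondaDependencies

        # Install PyTorch from the pytorch conda channel with a cudatoolkit that
        # matches the base image's CUDA version (10.2 here); versions are illustrative.
        conda_deps = CondaDependencies.create(
            conda_packages=["pytorch=1.8.1", "torchvision=0.9.1", "cudatoolkit=10.2"]
        )
        conda_deps.add_channel("pytorch")

        # env is the Environment object from the question
        env.python.conda_dependencies = conda_deps

    After rebuilding the environment, you can re-run the torch.cuda checks from the question to confirm the GPU is visible.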

