PyTorch cannot detect GPU when using an AML Compute Cluster with a GPU

Claudia Vanea
2021-04-06T15:01:37.227+00:00

Hi,

I've been trying to train a PyTorch model on an Azure ML compute cluster (STANDARD_NV6), but I can't get the code to detect and use the GPU device; torch.cuda.is_available() always returns False.
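For reference, this is the kind of check that fails inside the training script (a minimal sketch; the extra prints are just diagnostics):

    import torch

    print(torch.__version__)           # installed PyTorch build
    print(torch.version.cuda)          # CUDA version the build was compiled against (None for a CPU-only build)
    print(torch.cuda.is_available())   # always prints False on the cluster
    print(torch.cuda.device_count())   # 0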

I'm using a custom environment and have tried a few different Docker images from the Microsoft Container Registry as base images. For example, I've tried the "mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.2-cudnn7-ubuntu18.04" base image.
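My environment definition looks roughly like this (simplified sketch; the environment name and package list are placeholders):

    from azureml.core import Environment
    from azureml.core.conda_dependencies import CondaDependencies

    # Custom environment built on one of the GPU base images from mcr.microsoft.com
    env = Environment(name="pytorch-gpu-env")
    env.docker.base_image = (
        "mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.2-cudnn7-ubuntu18.04"
    )
    env.python.conda_dependencies = CondaDependencies.create(
        pip_packages=["torch", "torchvision"]
    )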

In the build log, I can see that the correct dependencies are installed each time, but the code still doesn't detect a GPU. I tried forcing Docker to use the GPU with docker_arguments = ["--gpus", "all"] (see the run config sketch after the log below), but this causes the build to fail with this error:

AzureMLCompute job failed.
FailedStartingContainer: Unable to start docker container
    FailedContainerStart: Unable to start docker container
    err: warning: your kernel does not support swap limit capabilities or the cgroup is not mounted. memory limited without swap.
docker: error response from daemon: oci runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: file creation failed: /mnt/docker/overlay2/66b78fe178db5d08ca4db26528f1a6de00aba65b528a6568649b1abcbea22348/merged/run/nvidia-persistenced/socket: no such device or address: unknown.

    Reason: warning: your kernel does not support swap limit capabilities or the cgroup is not mounted. memory limited without swap.
docker: error response from daemon: oci runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: file creation failed: /mnt/docker/overlay2/66b78fe178db5d08ca4db26528f1a6de00aba65b528a6568649b1abcbea22348/merged/run/nvidia-persistenced/socket: no such device or address: unknown.

    Info: Failed to prepare an environment for the job execution: Job environment preparation failed on 10.0.0.5 with err exit status 1.
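For completeness, the Docker arguments were passed roughly like this (a sketch; the script and compute target names are placeholders):

    from azureml.core import ScriptRunConfig
    from azureml.core.runconfig import DockerConfiguration

    # Passing --gpus all through to Docker; this is what triggers the failure above
    docker_config = DockerConfiguration(use_docker=True, arguments=["--gpus", "all"])
    run_config = ScriptRunConfig(
        source_directory=".",
        script="train.py",
        compute_target="gpu-cluster",
        environment=env,                      # the environment defined above
        docker_runtime_config=docker_config,
    )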

It feels like I've missed some obvious step somewhere...

Thanks for any help!


1 answer

  1. Ramr-msft
    2021-04-07T10:11:49.583+00:00

    @Claudia Vanea Thanks for the question. This usually points to a driver issue. Can you please add more details about the PyTorch version that you are using? We have seen cases, especially with PyTorch, where it doesn't install correctly against the latest CUDA drivers. Can you please also try installing the latest NVIDIA drivers?
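    For example, something along these lines pins the PyTorch build to the CUDA version of the base image (the version numbers below are only illustrative, assuming the CUDA 10.2 image from the question):

        from azureml.core.conda_dependencies import CondaDependencies

        # Install PyTorch from the pytorch conda channel with a cudatoolkit that
        # matches the base image's CUDA version (10.2 here); versions are illustrative.
        conda_deps = CondaDependencies.create(
            conda_packages=["pytorch=1.8.1", "torchvision=0.9.1", "cudatoolkit=10.2"]
        )
        conda_deps.add_channel("pytorch")

        # env is the Environment object from the question
        env.python.conda_dependencies = conda_deps

    After rebuilding the environment, you can re-run the torch.cuda checks from the question to confirm the GPU is visible.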

