@Claudia Vanea Thanks for the question. This symptom usually points to driver issues. Could you please add more details about the PyTorch version you are using? We often see cases where PyTorch doesn't install correctly against the latest CUDA drivers. Could you also try installing the latest NVIDIA drivers?
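For example, a quick check inside the job will show whether the installed wheel was actually built with CUDA support (the version strings in the comments are only examples):

```python
import torch

# Which PyTorch build is installed and which CUDA version it was compiled against.
# A CPU-only wheel reports torch.version.cuda as None even when the node has a GPU and driver.
print(torch.__version__)          # e.g. "1.7.1" or "1.7.1+cpu"
print(torch.version.cuda)         # e.g. "10.2", or None for a CPU-only build
print(torch.cuda.is_available())
```

If torch.version.cuda comes back as None, pinning a wheel built for the CUDA version in your base image (for example a +cu102 build from https://download.pytorch.org/whl/torch_stable.html) is the usual fix for that part of the problem; running nvidia-smi on the node checks the driver side.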
PyTorch cannot detect GPU when using an AML Compute Cluster with a GPU
Hi,
I've been trying to train a PyTorch model on an Azure ML compute cluster (STANDARD_NV6), but I cannot get the code to detect and use the GPU device; torch.cuda.is_available() always returns False.
I'm using a custom environment and have tried a few different images from the Microsoft Container Registry as Docker base images. For example, I've tried the "mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.2-cudnn7-ubuntu18.04" base image.
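For reference, here is a minimal sketch of how I'm defining the environment with the v1 SDK; the environment name and package list are simplified, not my exact code:

```python
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

# Custom environment built on one of the AzureML CUDA base images
env = Environment(name="pytorch-gpu-env")
env.docker.base_image = (
    "mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.2-cudnn7-ubuntu18.04"
)

# Python dependencies installed on top of the base image
env.python.conda_dependencies = CondaDependencies.create(
    pip_packages=["torch", "torchvision", "azureml-defaults"]
)
```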
In the build log, I can see that the correct dependencies are installed each time, but the code still doesn't detect a GPU. I tried forcing Docker to use the GPU with docker_arguments = ["--gpus", "all"] (roughly as in the sketch after the error output below), but this causes the build to fail with this error:
AzureMLCompute job failed.
FailedStartingContainer: Unable to start docker container
FailedContainerStart: Unable to start docker container
err: warning: your kernel does not support swap limit capabilities or the cgroup is not mounted. memory limited without swap.
docker: error response from daemon: oci runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: file creation failed: /mnt/docker/overlay2/66b78fe178db5d08ca4db26528f1a6de00aba65b528a6568649b1abcbea22348/merged/run/nvidia-persistenced/socket: no such device or address: unknown.
Reason: warning: your kernel does not support swap limit capabilities or the cgroup is not mounted. memory limited without swap.
docker: error response from daemon: oci runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: file creation failed: /mnt/docker/overlay2/66b78fe178db5d08ca4db26528f1a6de00aba65b528a6568649b1abcbea22348/merged/run/nvidia-persistenced/socket: no such device or address: unknown.
Info: Failed to prepare an environment for the job execution: Job environment preparation failed on 10.0.0.5 with err exit status 1.
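This is roughly how I'm passing the Docker arguments mentioned above, assuming the v1 SDK's DockerConfiguration; the script and cluster names are placeholders:

```python
from azureml.core import ScriptRunConfig
from azureml.core.runconfig import DockerConfiguration

# Attempt to force the container runtime to expose the GPU
docker_config = DockerConfiguration(use_docker=True, arguments=["--gpus", "all"])

src = ScriptRunConfig(
    source_directory=".",
    script="train.py",             # placeholder
    compute_target="gpu-cluster",  # placeholder
    environment=env,
    docker_runtime_config=docker_config,
)
```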
It feels like I've missed some obvious step somewhere...
Thanks for any help!