Unable to access GPU from Azure ML Component on an Azure ML Compute

Mark Friel 0 Reputation points
2024-01-11T10:39:04.39+00:00

I have had a draft pipeline in Azure Machine Learning Studio for quite a while, which contains 4 components. This pipeline was linked to an Azure ML compute that had a GPU; I believe it was a Tesla K80, but I am not entirely sure. In September, the virtual machine family that the compute belonged to was deprecated, so I provisioned a new compute instance, a Standard_NC4as_T4_v3, which has a Tesla T4.

The issue is that the model training component cannot detect the GPU on the machine. The environment for this component has not changed since it ran on the previous compute, where it was able to detect the GPU and run as expected. I have also verified that the NVIDIA GPU drivers are installed on the machine by running:

torch.cuda.is_available()

in a Jupyter notebook on the machine. I am using PyTorch to train the model, and as far as I can tell from my research, the versions I am using are compatible with the drivers and CUDA toolkit on the machine.
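Because the notebook check runs in the compute's own Python environment while the component runs in its own environment, the same check may be worth repeating from inside the training component. The following is only a minimal sketch of such a diagnostic: the function name `collect_gpu_diagnostics` and the dictionary keys are hypothetical, and it assumes nothing about how the component is configured.

```python
import importlib
import os
import shutil


def collect_gpu_diagnostics():
    """Gather facts about GPU visibility from inside the running process."""
    info = {
        # If this is set to an empty string, CUDA applications see no GPUs.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        # Path to nvidia-smi, or None if it is not on PATH in this environment.
        "nvidia_smi_path": shutil.which("nvidia-smi"),
    }
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed in this environment at all.
        info["torch_version"] = None
        return info

    info["torch_version"] = torch.__version__
    # None here means a CPU-only build of PyTorch was installed.
    info["torch_built_with_cuda"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    info["device_count"] = torch.cuda.device_count()
    return info


if __name__ == "__main__":
    for key, value in collect_gpu_diagnostics().items():
        print(f"{key}: {value}")
```

Calling this at the start of the training script and comparing the logged values with the notebook's values should show whether the two environments differ, for example in the PyTorch build or in the visible devices.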

The PyTorch packages are installed through conda; the relevant package versions are below:

- pytorch::pytorch==2.0.1
- pytorch::torchaudio==2.0.2
- pytorch::torchvision==0.15.2
- pytorch::torchtext==0.15.2
- pytorch::pytorch-cuda==11.8

Below are the details of the GPU on the machine, obtained by running nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.199.02   Driver Version: 470.199.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000001:00:00.0 Off |                    0 |
| N/A   34C    P0    27W /  70W |      0MiB / 15109MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
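One detail that can be read off this output is that the driver reports CUDA 11.4, while the conda environment pins pytorch-cuda 11.8. The sketch below only extracts and compares those two numbers; the helper names are hypothetical, and whether this particular driver can actually run the 11.8 build is an assumption to check against NVIDIA's compatibility documentation, not something the code decides.

```python
import re

# Header line copied from the nvidia-smi output above.
HEADER = "| NVIDIA-SMI 470.199.02   Driver Version: 470.199.02   CUDA Version: 11.4     |"
# CUDA version pinned in the conda environment (pytorch::pytorch-cuda==11.8).
PYTORCH_CUDA = "11.8"


def parse_nvidia_smi_header(line):
    """Return (driver_version, cuda_version) from an nvidia-smi header line."""
    match = re.search(r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", line)
    if match is None:
        raise ValueError("not an nvidia-smi header line")
    return match.group(1), match.group(2)


def version_tuple(text):
    """Turn a dotted version such as '11.4' into a comparable tuple of ints."""
    return tuple(int(part) for part in text.split("."))


driver, driver_cuda = parse_nvidia_smi_header(HEADER)
if version_tuple(driver_cuda) < version_tuple(PYTORCH_CUDA):
    print(
        f"Driver {driver} reports CUDA {driver_cuda}, "
        f"which is older than the pinned pytorch-cuda {PYTORCH_CUDA}."
    )
else:
    print(f"Driver {driver} reports CUDA {driver_cuda}, at or above {PYTORCH_CUDA}.")
```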

Any ideas on what the issue may be would be much appreciated.

Thank you
