Cannot get GPU tensorflow to work on Azure ML Compute Instance

Question

I am experimenting with constructing some DNNs in a notebook running in Azure Machine Learning Studio. To speed up model training in tensorflow/keras, I want to use the GPU of my compute instance. However, when I import tensorflow in my notebook, I get the following messages:

I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.

I am running my notebook on a STANDARD_NC4AS_T4_V3 compute instance, which does have a GPU (also confirmed by running the nvidia-smi command in the terminal, showing CUDA version 11.4).

I am using the vanilla Python 3.8 - Pytorch and Tensorflow environment that comes with the compute instance. I have not attempted to install any additional packages in the environment. I have also tried the other environments that are available by default on the compute instance, with the same result.

The installed version of tensorflow is 2.11 and the available version of CUDA is 11.4, which should be compatible as far as I can tell.

Please advise how I can enable GPU training on such a compute instance.

Is there a way to rebuild tensorflow with the right compiler flags (and what are they)?

Answer

After some back-and-forth with Microsoft Support, the core issue was identified as a mismatch between the CUDA drivers, the CUDA toolkit and TensorFlow on the image shipped with the compute instances. I was told that within a week or two of writing this post (24/01/2024) a new image should be pushed that solves the issue. Picking up the new image may require spinning up a fresh compute instance, though.
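The import log in the question shows TensorFlow failing to open three shared libraries. As a rough way to check the same thing outside TensorFlow, the sketch below tries to load each of them with Python's ctypes. The library names are copied from that log; the helper name `probe_libraries` is my own invention. A successful load only shows that the file is on the loader path, not that the driver, toolkit and TensorFlow versions are compatible with each other.

```python
import ctypes

# Shared libraries that TensorFlow reported as missing in the question's log.
LIBRARIES = ["libcudart.so.11.0", "libnvinfer.so.7", "libnvinfer_plugin.so.7"]


def probe_libraries(names=LIBRARIES):
    """Return {library name: True if the dynamic loader can open it}."""
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)  # same kind of lookup as a dlopen by name
            results[name] = True
        except OSError:
            results[name] = False
    return results


if __name__ == "__main__":
    for name, found in probe_libraries().items():
        print(f"{name}: {'found' if found else 'NOT found'}")
```

Going by the TF-TRT warning in the log, the two libnvinfer libraries are only needed for TensorRT, so libcudart is the one that matters for plain GPU training.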

In the meantime, the following can be used as a workaround:

Open a terminal on your GPU compute instance and run the following:

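$ conda create --name <env-name> -c conda-forge tensorflow-gpu=2.10 ipykernel=6

$ conda activate <env-name>

$ python -m ipykernel install --user --name <env-name> --display-name "<display-name>"

Here <env-name> and <display-name> are placeholders for names of your choice.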

Then open your notebook and use your newly created kernel to run your code. This enabled me to run my notebook training using the GPU.
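To confirm that the new kernel actually sees the GPU, something like the following can be run in a notebook cell. It is a minimal sketch: `gpu_report` is a hypothetical helper name of mine, and it returns a stub result if TensorFlow cannot be found in the selected kernel.

```python
import importlib.util


def gpu_report():
    """Summarise what the TensorFlow in the current kernel can see."""
    if importlib.util.find_spec("tensorflow") is None:
        return {"tensorflow_installed": False}

    import tensorflow as tf

    build = tf.sysconfig.get_build_info()  # dict describing how TF was built
    return {
        "tensorflow_installed": True,
        "tensorflow_version": tf.__version__,
        "built_with_cuda": tf.test.is_built_with_cuda(),
        "cuda_version": build.get("cuda_version"),    # None on CPU-only builds
        "cudnn_version": build.get("cudnn_version"),  # None on CPU-only builds
        "gpus": [d.name for d in tf.config.list_physical_devices("GPU")],
    }


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

An empty `gpus` list means TensorFlow still cannot use the GPU from that kernel, in which case the import log is the first place to look.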

Answer

@aot I have found a previous workaround that was shared for a similar issue that could help in this case. Try using the following steps:

Install the same version of CUDA (11.4): https://developer.nvidia.com/cuda-11-4-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_local
If you get the following error:

     The following packages have unmet dependencies:

     cuda : Depends: cuda-11-4 (>= 11.4.0) but it is not going to be installed

     E: Unable to correct problems, you have held broken packages.

then do the following:

    sudo apt-get install aptitude
    sudo aptitude install cuda-11-4     
    When aptitude prompts you, answer N first and then Y.

Create the conda env and perform the following:

conda create --name tf python=3.8

If you still see the issue after the workaround, it is probably easier to report it through an Azure support case, including details of the instance and the region being used. This would help the service team fix the image used by this instance. Thanks!!

