Cannot get GPU tensorflow to work on Azure ML Compute Instance

aot 66 Reputation points
2024-01-10T13:19:58.3333333+00:00

I am experimenting with building some DNNs in a notebook running in Azure Machine Learning Studio. To speed up model training in tensorflow/keras, I want to use the GPU of my compute instance. However, when I import tensorflow in my notebook, I get the following messages:

I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
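
The warnings above are failed attempts to open shared libraries. A minimal sketch of how the same check could be reproduced outside TensorFlow, using only the Python standard library, is shown below. The library names are copied from the log; `probe_libraries` is a hypothetical helper name, not part of any Azure or TensorFlow tooling.

```python
import ctypes

# Library names copied verbatim from the warnings in the log above.
LIBRARIES = ["libcudart.so.11.0", "libnvinfer.so.7", "libnvinfer_plugin.so.7"]


def probe_libraries(names):
    """Return {name: bool}: True if the dynamic loader can open the library."""
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)  # raises OSError if the library cannot be opened
            results[name] = True
        except OSError:
            results[name] = False
    return results


if __name__ == "__main__":
    for name, found in probe_libraries(LIBRARIES).items():
        print(f"{name}: {'found' if found else 'NOT found'}")
```

If `libcudart.so.11.0` is reported as not found when run from the notebook kernel's environment, the CUDA runtime is not on that environment's library search path, which matches the first warning. The two `libnvinfer` libraries belong to TensorRT and, per the log's own wording, only matter for TF-TRT.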

I am running my notebook on a STANDARD_NC4AS_T4_V3 compute instance, which does have a GPU (also confirmed by running the nvidia-smi command in the terminal, showing CUDA version 11.4).

I am using the vanilla Python 3.8 - Pytorch and Tensorflow environment that comes with the compute instance, and I have not installed any additional packages in it. I have also tried the other environments that are available by default on the compute instance.

The installed tensorflow version is 2.11 and the available CUDA version is 11.4, which should be compatible as far as I can tell.
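
One caveat when comparing versions: the CUDA version printed by nvidia-smi is the highest version the installed driver supports, not necessarily the toolkit that is installed or the one TensorFlow was built against. The sketch below shows one way to ask TensorFlow itself, assuming a TensorFlow release that provides `tf.sysconfig.get_build_info()`; `tensorflow_build_info` is a hypothetical helper name.

```python
def tensorflow_build_info():
    """Return TensorFlow's build-time CUDA/cuDNN details, or None if TensorFlow is not importable."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    info = tf.sysconfig.get_build_info()
    # .get() is used because the available keys can differ between builds (e.g. CPU-only).
    return {
        "tensorflow": tf.__version__,
        "is_cuda_build": info.get("is_cuda_build"),
        "cuda_version": info.get("cuda_version"),
        "cudnn_version": info.get("cudnn_version"),
    }


if __name__ == "__main__":
    print(tensorflow_build_info())
```

Comparing the reported `cuda_version` with the CUDA runtime libraries actually present in the environment gives a more direct compatibility check than the nvidia-smi output alone.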

Please advise how I can enable GPU training on such a compute instance.

Is there a way to rebuild tensorflow with the right compiler flags (and what are they)?

Azure Machine Learning

2 answers

  1. aot 66 Reputation points
    2024-01-24T08:22:47.48+00:00

    After some back-and-forth with Microsoft Support, the core issue was identified as a mismatch between the CUDA drivers, the CUDA toolkit and Tensorflow on the image shipped with the compute instances. I was told that within a week or two of this post (24/01/2024) a new image should be pushed that solves the issue. This may require spinning up a fresh compute instance, though.

    In the meantime, the following can be used as a workaround:

    Open a terminal on your GPU compute instance and run the following:

    $ conda create --name <conda_env_name> -c conda-forge tensorflow-gpu=2.10 ipykernel=6

    $ conda activate <conda_env_name>

    $ python -m ipykernel install --user --name <conda_env_name> --display-name "<jupyter_kernel_display_name>"

    Then open your notebook and use your newly created kernel to run your code. This enabled me to run my notebook training using the GPU.
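
To confirm that the new kernel actually sees the GPU, a quick check along these lines can be run in the first notebook cell. This is a minimal sketch, not part of the original workaround; `visible_gpus` is a hypothetical helper name, and it returns an empty list when TensorFlow is not importable.

```python
def visible_gpus():
    """Return the names of the GPUs TensorFlow can see (empty list if none or if TensorFlow is missing)."""
    try:
        import tensorflow as tf
    except ImportError:
        return []
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus:
        print("TensorFlow sees these GPUs:", gpus)
    else:
        print("No GPU visible to TensorFlow in this kernel.")
```

An empty result in the new kernel would suggest the notebook is still attached to the old environment or that the CUDA libraries were not resolved.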

    5 people found this answer helpful.

  2. romungi-MSFT 45,731 Reputation points Microsoft Employee
    2024-01-11T04:35:56.6833333+00:00

    @aot I have found a previous workaround that was shared for a similar issue that could help in this case. Try using the following steps:

    1. Install CUDA 11.4, the same version that is reported on the instance: https://developer.nvidia.com/cuda-11-4-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_local
    2. If you get the following error:

         The following packages have unmet dependencies:
         cuda : Depends: cuda-11-4 (>= 11.4.0) but it is not going to be installed
         E: Unable to correct problems, you have held broken packages.

    then do the following:

        sudo apt-get install aptitude
        sudo aptitude install cuda-11-4

    At the aptitude prompts, answer N first and then Y.

    Create the conda env and perform the following:

        conda create --name tf python=3.8

    If you still see the issue after the workaround, it might be easier to report it through an Azure support case, with details of the instance and the region being used. This would help the service team fix the image used by this instance. Thanks!
