failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED

Mia Hu 1

Hi, I am trying to train a model on an Azure ML (AML) A100 cluster.

I have trained the same model on my own GPU server before, with tensorflow_gpu-1.15.5, Python 3.7, GCC 7.5.0, cuDNN 7.6.5, and CUDA 10.0.

I used a Dockerfile to recreate the same environment, so I am sure it has tensorflow_gpu-1.15.5, Python 3.7, cuDNN 7.6.5, and CUDA 10.0. The only thing I am not sure about is GCC 7.5.0.

However, I keep getting this error message:

start training
2021-09-08 18:18:21.911226: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-09-08 18:58:09.545333: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
2021-09-08 18:58:09.545451: I tensorflow/stream_executor/stream.cc:1990] [stream=0x55ae14908d10,impl=0x55ae14907470] did not wait for [stream=0x55ae14908a90,impl=0x55ae149074a0]
2021-09-08 18:58:09.545478: F tensorflow/core/common_runtime/gpu/gpu_util.cc:342] CPU->GPU Memcpy failed
2021-09-08 18:58:09.545528: I tensorflow/stream_executor/stream.cc:4938] [stream=0x55ae14908d10,impl=0x55ae14907470] did not memcpy host-to-device; source: 0x7fe9f749b000
2021-09-08 18:58:09.545529: I tensorflow/stream_executor/stream.cc:4938] [stream=0x55ae14908d10,impl=0x55ae14907470] did not memcpy host-to-device; source: 0x7fe9f74a1600
bash: line 1: 96 Aborted (core dumped) python $AZ_BATCHAI_JOB_TEMP/azureml/hydranet_prod_base_tf_1_15_5_1631122651_ba72eb11/azureml-setup/context_manager_injector.py "-i" "ProjectPythonPath:context_managers.ProjectPythonPath" "-i" "Dataset:context_managers.Datasets" "-i" "RunHistory:context_managers.RunHistory" "-i" "TrackUserError:context_managers.TrackUserError" "-i" "UserExceptions:context_managers.UserExceptions" "main_aml.py" "--note" "hydranet_prod_base_tf_1_15_5" "--mount_path" "DatasetConsumptionConfig:data_folder" "--conf" "conf/aml.conf" "--job" "train" "--in_train_feat_path" "account_clean_sq_distribution.10.null.feat.jsonl|opportunity_clean_sq_distribution.10.null.feat.jsonl|contact_clean_sq_distribution.10.null.feat.jsonl|lead_clean_sq_distribution.10.null.feat.jsonl|customerprofile_clean_sq_distribution.10.null.feat.jsonl|train.prioritysetexact.account.noise-level-0.feat.jsonl|train.prioritysetexact.opportunity.noise-level-0.feat.jsonl|train.prioritysetexact.contact.noise-level-0.feat.jsonl|train.prioritysetexact.lead.noise-level-0.feat.jsonl" "--in_dev_feat_path" "measurement.contact.noise-level-0.20201202.feat.jsonl" "--in_wikisql_train_feat_path" "wikisql.train.noise-level-0.20201202.feat.jsonl" "--out_subset_feat_dir" "outputs\subset_feat_dir" "--in_bert_root_path" "DatasetConsumptionConfig:base_model_folder"
2021/09/08 18:58:11 Skipping parsing control script error. Reason: Error json file doesn't exist. This most likely means that no errors were written to the file. File path: /mnt/batch/tasks/workitems/6670d3e3-260e-46c2-bdb4-8fb42942abe0/job-1/hydranet_prod_base_t_82b9b21d-3ac5-4f55-a692-5b84119e9daa/wd/runTaskLetTask_error.json
2021/09/08 18:58:11 Wrapper cmd failed with err: exit status 134
2021/09/08 18:58:11 Attempt 1 of http call to http://10.0.0.19:16384/sendlogstoartifacts/status
2021/09/08 18:58:11 Send process info logs to master server succeeded
2021/09/08 18:58:11 mpirun version string: {
mpirun (Open MPI) 3.1.2
Report bugs to http://www.open-mpi.org/community/help/
}
2021/09/08 18:58:11 Not exporting to RunHistory as the exporter is either stopped or there is no data.
Stopped: false
OriginalData: 2
FilteredData: 0.
2021/09/08 18:58:11 Process Exiting with Code: 134
2021/09/08 18:58:11 All App Insights Logs was sent successfully or the close timeout of 10 was reached
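
A note on reading the log above: the exit status 134 reported by the wrapper is consistent with the "Aborted (core dumped)" line, since shells report a process killed by a signal as 128 plus the signal number, and SIGABRT is 6 on Linux. The helper name `describe_exit_status` below is only an illustration and is not part of the job; this is a minimal sketch of that decoding.

```python
import signal

def describe_exit_status(status):
    """Return the signal name behind a shell-style exit status above 128, else None."""
    if status > 128:
        try:
            return signal.Signals(status - 128).name
        except ValueError:
            # Not a signal number known on this platform.
            return None
    return None

# 134 - 128 = 6, which is SIGABRT on Linux.
print(describe_exit_status(134))
```

So the exit code only says that the process aborted (here after the fatal "CPU->GPU Memcpy failed" line); the earlier cuBLAS error is the more informative one.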

I searched for the message in the title, and the commonly suggested fix of adding

import tensorflow as tf

config = tf.compat.v1.ConfigProto()
# Cap this process at 90% of GPU memory and allocate it on demand.
config.gpu_options.per_process_gpu_memory_fraction = 0.9
config.gpu_options.allow_growth = True
config.gpu_options.polling_inactive_delay_msecs = 10
session = tf.compat.v1.Session(config=config)

did not work.

My last guess is that this setup does not work on RTX 3xxx cards, but I am not sure which kind of GPU Azure is actually using.
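
One way to settle the GPU question from inside the job is to ask TensorFlow what it sees. The sketch below assumes TensorFlow 1.15's `device_lib.list_local_devices()` and assumes that each GPU's `physical_device_desc` is a comma-separated string such as "device: 0, name: ..., pci bus id: ..., compute capability: ..."; the helper names `parse_device_desc` and `list_gpus` are made up for illustration. As far as I know, an A100 reports compute capability 8.0 (the same Ampere generation as RTX 3xxx), which is worth comparing against the CUDA version the TensorFlow build targets.

```python
def parse_device_desc(desc):
    """Split a 'key: value, key: value' device description into a dict."""
    fields = {}
    for part in desc.split(", "):
        key, sep, value = part.partition(": ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def list_gpus():
    """Return one dict per GPU that TensorFlow can see."""
    # Imported lazily so the parser above works without TensorFlow installed.
    from tensorflow.python.client import device_lib
    return [
        parse_device_desc(device.physical_device_desc)
        for device in device_lib.list_local_devices()
        if device.device_type == "GPU"
    ]

if __name__ == "__main__":
    try:
        for gpu in list_gpus():
            print(gpu.get("name"), gpu.get("compute capability"))
    except ImportError:
        print("TensorFlow is not installed in this environment")
```

Running this at the start of `main_aml.py` (or as a separate short job in the same environment) would print the card name and compute capability for each visible GPU.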

Could anyone help? Thank you!!

romungi-MSFT 48,906 Reputation points Microsoft Employee Moderator

2021-09-09T06:07:37.513+00:00
@Mia Hu I believe the earlier training you refer to was performed on your local GPU machine, and the error in question occurs while running the training experiment on Azure ML compute. Are you using GPU-enabled compute for this experiment, such as the Standard_NC6 size?

For the Docker configuration, are you using the standard GPU image that is available from Azure ML? This is the configuration most commonly used for GPU-based training or inference.

from azureml.core.environment import Environment, DEFAULT_GPU_IMAGE

myenv = Environment.from_conda_specification(name="myenv", file_path="myenv.yml")
myenv.docker.base_image = DEFAULT_GPU_IMAGE

Defining the required packages in the myenv.yml including tensorflow-gpu should suffice.

name: project_environment
dependencies:
  # The python interpreter version.
  # Currently Azure ML only supports 3.5.2 and later.
  - python=3.6.2
  - pip:
    # You must list azureml-defaults as a pip dependency
    - azureml-defaults>=1.0.45
  - numpy
  - tensorflow-gpu=1.12
channels:
  - conda-forge

Mia Hu 1 Reputation point

2021-09-09T17:53:19.777+00:00

@romungi-MSFT Thanks for your reply.

Yes, the previous training runs were on our local GPU machines, and the one with the problem is on Azure ML.
It is an AML GPU cluster that contains 8 machines with 8 A100 cards per machine.
For Docker, there is no standard GPU image available from Azure ML that meets my needs. I saw a TensorFlow 1.15.5 CPU image on Azure, but I need the GPU version.
I write Dockerfiles, upload them to Azure, and build them there. This is one of the Dockerfiles that builds successfully, but my experiment using this environment still failed.
Mia Hu 1 Reputation point

2021-09-09T17:54:37.487+00:00

Dockerfile here:

FROM mcr.microsoft.com/azureml/base-gpu:openmpi3.1.2-cuda10.0-cudnn7-ubuntu18.04

ENV AZUREML_CONDA_ENVIRONMENT_PATH /azureml-envs/tensorflow-1.15

ENV CUDNN_VERSION 7.6.5.32

RUN apt-get update && apt-get install -y --no-install-recommends \
libcudnn7=$CUDNN_VERSION-1+cuda10.0 \
libcudnn7-dev=$CUDNN_VERSION-1+cuda10.0 && \
apt-mark hold libcudnn7 && \
rm -rf /var/lib/apt/lists/*

# Create conda environment

RUN conda create -p $AZUREML_CONDA_ENVIRONMENT_PATH \
python=3.7 \
pip=20.2.4 \
cudatoolkit \
-c anaconda -c conda-forge

# Prepend path to AzureML conda environment

ENV PATH $AZUREML_CONDA_ENVIRONMENT_PATH/bin:$PATH

# Install pip dependencies

RUN HOROVOD_WITH_TENSORFLOW=1 \
pip install 'matplotlib>=3.3,<3.4' \
'psutil>=5.8,<5.9' \
'tqdm>=4.59,<4.60' \
'pandas>=1.1,<1.2' \
'scipy>=1.5,<1.6' \
'numpy>=1.10,<1.20' \
'azureml-core==1.30.0' \
'azureml-defaults==1.30.0' \
'azureml-mlflow==1.30.0' \
'azureml-telemetry==1.30.0' \
'onnxruntime-gpu>=1.7,<1.8' \
'tensorflow-gpu==1.15.5' \
'future==0.17.1' \
'azureml' \
'jsonlines==2.0.0' \
'gast==0.2.2' \
'onnx==1.7'

# This is needed for mpi to locate libpython

ENV LD_LIBRARY_PATH $AZUREML_CONDA_ENVIRONMENT_PATH/lib:$LD_LIBRARY_PATH
