failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED

Mia Hu 1 Reputation point
2021-09-08T19:22:06.847+00:00

Hi, I am trying to train a model on an Azure AML A100 instance.

I have trained the same model on my own GPU server before, with tensorflow_gpu-1.15.5, Python 3.7, GCC 7.5.0, cuDNN 7.6.5, and CUDA 10.0.

I used a Dockerfile to recreate the same environment, so I am sure it has tensorflow_gpu-1.15.5, Python 3.7, cuDNN 7.6.5, and CUDA 10.0. The only thing I am not sure about is GCC 7.5.0.

However, I keep getting the following error message:

start training
2021-09-08 18:18:21.911226: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-09-08 18:58:09.545333: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
2021-09-08 18:58:09.545451: I tensorflow/stream_executor/stream.cc:1990] [stream=0x55ae14908d10,impl=0x55ae14907470] did not wait for [stream=0x55ae14908a90,impl=0x55ae149074a0]
2021-09-08 18:58:09.545478: F tensorflow/core/common_runtime/gpu/gpu_util.cc:342] CPU->GPU Memcpy failed
2021-09-08 18:58:09.545528: I tensorflow/stream_executor/stream.cc:4938] [stream=0x55ae14908d10,impl=0x55ae14907470] did not memcpy host-to-device; source: 0x7fe9f749b000
2021-09-08 18:58:09.545529: I tensorflow/stream_executor/stream.cc:4938] [stream=0x55ae14908d10,impl=0x55ae14907470] did not memcpy host-to-device; source: 0x7fe9f74a1600
bash: line 1: 96 Aborted (core dumped) python $AZ_BATCHAI_JOB_TEMP/azureml/hydranet_prod_base_tf_1_15_5_1631122651_ba72eb11/azureml-setup/context_manager_injector.py "-i" "ProjectPythonPath:context_managers.ProjectPythonPath" "-i" "Dataset:context_managers.Datasets" "-i" "RunHistory:context_managers.RunHistory" "-i" "TrackUserError:context_managers.TrackUserError" "-i" "UserExceptions:context_managers.UserExceptions" "main_aml.py" "--note" "hydranet_prod_base_tf_1_15_5" "--mount_path" "DatasetConsumptionConfig:data_folder" "--conf" "conf/aml.conf" "--job" "train" "--in_train_feat_path" "account_clean_sq_distribution.10.null.feat.jsonl|opportunity_clean_sq_distribution.10.null.feat.jsonl|contact_clean_sq_distribution.10.null.feat.jsonl|lead_clean_sq_distribution.10.null.feat.jsonl|customerprofile_clean_sq_distribution.10.null.feat.jsonl|train.prioritysetexact.account.noise-level-0.feat.jsonl|train.prioritysetexact.opportunity.noise-level-0.feat.jsonl|train.prioritysetexact.contact.noise-level-0.feat.jsonl|train.prioritysetexact.lead.noise-level-0.feat.jsonl" "--in_dev_feat_path" "measurement.contact.noise-level-0.20201202.feat.jsonl" "--in_wikisql_train_feat_path" "wikisql.train.noise-level-0.20201202.feat.jsonl" "--out_subset_feat_dir" "outputs\subset_feat_dir" "--in_bert_root_path" "DatasetConsumptionConfig:base_model_folder"
2021/09/08 18:58:11 Skipping parsing control script error. Reason: Error json file doesn't exist. This most likely means that no errors were written to the file. File path: /mnt/batch/tasks/workitems/6670d3e3-260e-46c2-bdb4-8fb42942abe0/job-1/hydranet_prod_base_t_82b9b21d-3ac5-4f55-a692-5b84119e9daa/wd/runTaskLetTask_error.json
2021/09/08 18:58:11 Wrapper cmd failed with err: exit status 134
2021/09/08 18:58:11 Attempt 1 of http call to http://10.0.0.19:16384/sendlogstoartifacts/status
2021/09/08 18:58:11 Send process info logs to master server succeeded
2021/09/08 18:58:11 mpirun version string: {
mpirun (Open MPI) 3.1.2
Report bugs to http://www.open-mpi.org/community/help/
}
2021/09/08 18:58:11 Not exporting to RunHistory as the exporter is either stopped or there is no data.
Stopped: false
OriginalData: 2
FilteredData: 0.
2021/09/08 18:58:11 Process Exiting with Code: 134
2021/09/08 18:58:11 All App Insights Logs was sent successfully or the close timeout of 10 was reached

I searched for the error message in the title, and the commonly suggested fix of adding

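import tensorflow as tf

config = tf.compat.v1.ConfigProto()
# Cap this process at 90% of each visible GPU's memory.
config.gpu_options.per_process_gpu_memory_fraction = 0.9
# Allocate GPU memory on demand instead of reserving it all up front.
config.gpu_options.allow_growth = True
config.gpu_options.polling_inactive_delay_msecs = 10
session = tf.compat.v1.Session(config=config)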

did not work.

My last guess is that this setup does not work on RTX 30-series GPUs. However, I am not sure which GPU the Azure compute node is actually using.
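One way to find out which GPU the job sees is to ask TensorFlow itself from inside the training script. The sketch below is only an illustration: the helper names (`parse_device_desc`, `is_supported_by_cuda_10_0`, `list_gpus`) are made up for this post, and it assumes the TF 1.x device description string has the usual `name: ..., compute capability: X.Y` form. It also assumes that CUDA 10.0 targets GPUs up to compute capability 7.5 (Turing), while Ampere cards such as the A100 (8.0) and the RTX 30 series (8.6) need CUDA 11 or newer.

```python
import re

# Assumption: CUDA 10.0 targets GPUs up to compute capability 7.5 (Turing).
# Ampere GPUs (A100 = 8.0, RTX 30-series = 8.6) need CUDA 11 or newer.
MAX_CC_CUDA_10_0 = (7, 5)


def parse_device_desc(desc):
    """Extract (gpu_name, (major, minor)) from a TF 1.x physical_device_desc.

    Returns None if the string does not contain both fields.
    """
    name = re.search(r"name:\s*([^,]+)", desc)
    cc = re.search(r"compute capability:\s*(\d+)\.(\d+)", desc)
    if not name or not cc:
        return None
    return name.group(1).strip(), (int(cc.group(1)), int(cc.group(2)))


def is_supported_by_cuda_10_0(compute_capability):
    """True if the (major, minor) tuple is at or below the CUDA 10.0 limit."""
    return compute_capability <= MAX_CC_CUDA_10_0


def list_gpus():
    """Return [(gpu_name, (major, minor)), ...] for GPUs visible to TensorFlow."""
    # Imported here so the parsing helpers can be used without TensorFlow.
    from tensorflow.python.client import device_lib

    gpus = []
    for dev in device_lib.list_local_devices():
        if dev.device_type == "GPU":
            parsed = parse_device_desc(dev.physical_device_desc)
            if parsed is not None:
                gpus.append(parsed)
    return gpus
```

Calling `list_gpus()` at the start of `main_aml.py` and logging each name together with `is_supported_by_cuda_10_0(cc)` should show whether the node has an Ampere GPU. If it does, a mismatch between the GPU and the CUDA 10.0 / TensorFlow 1.15 build would be a plausible cause of the cuBLAS failure, although the log by itself does not prove it.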

Could anyone help? Thank you!!

Azure Machine Learning