failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
Hi, I am trying to train a model on an Azure AML A100 instance.
I have trained the same model on my own GPU server before with tensorflow_gpu-1.15.5, Python 3.7, GCC 7.5.0, cuDNN 7.6.5, and CUDA 10.0.
I used a Dockerfile to recreate the same environment, so I am sure it has tensorflow_gpu-1.15.5, Python 3.7, cuDNN 7.6.5, and CUDA 10.0. The only thing I am not sure about is GCC 7.5.0.
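For reference, the versions inside the container (including the uncertain GCC one) could be confirmed with a small script along these lines. This is only a sketch: the helper names `tool_version` and `environment_report` are made up for illustration, and it assumes `gcc` is on the PATH if it is installed at all.

```python
import shutil
import subprocess
import sys


def tool_version(cmd):
    """Return the first line of `<cmd> --version`, or None if cmd is not on PATH."""
    if shutil.which(cmd) is None:
        return None
    result = subprocess.run(
        [cmd, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


def environment_report():
    """Collect Python, GCC, and (if importable) TensorFlow version details."""
    report = {
        "python": "%d.%d.%d" % tuple(sys.version_info[:3]),
        "gcc": tool_version("gcc"),
    }
    try:
        import tensorflow as tf  # only present inside the training image

        report["tensorflow"] = tf.__version__
        report["built_with_cuda"] = tf.test.is_built_with_cuda()
    except ImportError:
        report["tensorflow"] = None
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(key, "=", value)
```

Running it at the start of the AML job would put the values in the job log next to the TensorFlow output.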
However, I keep getting this error message:
start training
2021-09-08 18:18:21.911226: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-09-08 18:58:09.545333: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
2021-09-08 18:58:09.545451: I tensorflow/stream_executor/stream.cc:1990] [stream=0x55ae14908d10,impl=0x55ae14907470] did not wait for [stream=0x55ae14908a90,impl=0x55ae149074a0]
2021-09-08 18:58:09.545478: F tensorflow/core/common_runtime/gpu/gpu_util.cc:342] CPU->GPU Memcpy failed
2021-09-08 18:58:09.545528: I tensorflow/stream_executor/stream.cc:4938] [stream=0x55ae14908d10,impl=0x55ae14907470] did not memcpy host-to-device; source: 0x7fe9f749b000
2021-09-08 18:58:09.545529: I tensorflow/stream_executor/stream.cc:4938] [stream=0x55ae14908d10,impl=0x55ae14907470] did not memcpy host-to-device; source: 0x7fe9f74a1600
bash: line 1: 96 Aborted (core dumped) python $AZ_BATCHAI_JOB_TEMP/azureml/hydranet_prod_base_tf_1_15_5_1631122651_ba72eb11/azureml-setup/context_manager_injector.py "-i" "ProjectPythonPath:context_managers.ProjectPythonPath" "-i" "Dataset:context_managers.Datasets" "-i" "RunHistory:context_managers.RunHistory" "-i" "TrackUserError:context_managers.TrackUserError" "-i" "UserExceptions:context_managers.UserExceptions" "main_aml.py" "--note" "hydranet_prod_base_tf_1_15_5" "--mount_path" "DatasetConsumptionConfig:data_folder" "--conf" "conf/aml.conf" "--job" "train" "--in_train_feat_path" "account_clean_sq_distribution.10.null.feat.jsonl|opportunity_clean_sq_distribution.10.null.feat.jsonl|contact_clean_sq_distribution.10.null.feat.jsonl|lead_clean_sq_distribution.10.null.feat.jsonl|customerprofile_clean_sq_distribution.10.null.feat.jsonl|train.prioritysetexact.account.noise-level-0.feat.jsonl|train.prioritysetexact.opportunity.noise-level-0.feat.jsonl|train.prioritysetexact.contact.noise-level-0.feat.jsonl|train.prioritysetexact.lead.noise-level-0.feat.jsonl" "--in_dev_feat_path" "measurement.contact.noise-level-0.20201202.feat.jsonl" "--in_wikisql_train_feat_path" "wikisql.train.noise-level-0.20201202.feat.jsonl" "--out_subset_feat_dir" "outputs\subset_feat_dir" "--in_bert_root_path" "DatasetConsumptionConfig:base_model_folder"
2021/09/08 18:58:11 Skipping parsing control script error. Reason: Error json file doesn't exist. This most likely means that no errors were written to the file. File path: /mnt/batch/tasks/workitems/6670d3e3-260e-46c2-bdb4-8fb42942abe0/job-1/hydranet_prod_base_t_82b9b21d-3ac5-4f55-a692-5b84119e9daa/wd/runTaskLetTask_error.json
2021/09/08 18:58:11 Wrapper cmd failed with err: exit status 134
2021/09/08 18:58:11 Attempt 1 of http call to http://10.0.0.19:16384/sendlogstoartifacts/status
2021/09/08 18:58:11 Send process info logs to master server succeeded
2021/09/08 18:58:11 mpirun version string: {
mpirun (Open MPI) 3.1.2
Report bugs to http://www.open-mpi.org/community/help/
}
2021/09/08 18:58:11 Not exporting to RunHistory as the exporter is either stopped or there is no data.
Stopped: false
OriginalData: 2
FilteredData: 0.
2021/09/08 18:58:11 Process Exiting with Code: 134
2021/09/08 18:58:11 All App Insights Logs was sent successfully or the close timeout of 10 was reached
I searched for the message in the title, and the commonly suggested fix of adding
import tensorflow as tf

# Cap TensorFlow at 90% of GPU memory and allocate it on demand
config = tf.compat.v1.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.9
config.gpu_options.allow_growth = True
config.gpu_options.polling_inactive_delay_msecs = 10
session = tf.compat.v1.Session(config=config)
did not work.
One last idea is that this setup does not work on RTX 30xx-series GPUs. However, I am not sure exactly what kind of GPU Azure is using.
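One way to find out which GPU the node has is to log the name reported by `nvidia-smi` at the start of the job. The sketch below assumes `nvidia-smi` is available inside the container. The function names and the name-matching pattern are hypothetical and only cover the cards mentioned in this post (A100 and RTX 30xx), both of which are Ampere parts. Ampere support was added in CUDA 11, so a CUDA 10.0 build may well be the problem if the check comes back true.

```python
import re
import shutil
import subprocess

# Heuristic only: matches the GPU names mentioned in this post.
AMPERE_NAME_PATTERN = re.compile(r"\bA100\b|RTX 30\d0")


def query_gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def looks_like_ampere(gpu_name):
    """Guess from the name alone whether a GPU is an A100 or an RTX 30xx card."""
    return AMPERE_NAME_PATTERN.search(gpu_name) is not None


if __name__ == "__main__":
    for name in query_gpu_names():
        print(name, "-> Ampere (needs CUDA 11+)?", looks_like_ampere(name))
```

The name match is deliberately narrow. A more robust check would compare the GPU's compute capability against what the installed CUDA toolkit supports.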
Could anyone help? Thank you!!