Submitted script failed with a non-zero exit code; see the driver log file for details.

Question

Nishiyama4477 1

Hello, I'm trying to do hyperparameter tuning of a model with Azure Machine Learning, following this guide: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters.

This time I created a dataset from blob storage and used TensorFlow for the model.

I ran the code and got the following error:

AzureMLCompute job failed.
JobFailed: Submitted script failed with a non-zero exit code; see the driver log file for details.
Reason: Job failed with non-zero exit Code

So I looked at the driver log file, and I guess the error happened around TensorFlow because of the message "Could not load dynamic library 'libcudart.so.11.0'".
I searched on Google but I still cannot understand it. Does a TensorFlow model need a GPU, or is it something else? Could someone help?

This is my driver log:

2021/05/16 10:51:45 Starting App Insight Logger for task: runTaskLet
2021/05/16 10:51:45 Attempt 1 of http call to http://10.0.0.5:16384/sendlogstoartifacts/info
2021/05/16 10:51:45 Attempt 1 of http call to http://10.0.0.5:16384/sendlogstoartifacts/status
[2021-05-16T10:51:45.699995] Entering context manager injector.
[context_manager_injector.py] Command line Options: Namespace(inject=['ProjectPythonPath:context_managers.ProjectPythonPath', 'Dataset:context_managers.Datasets', 'RunHistory:context_managers.RunHistory', 'TrackUserError:context_managers.TrackUserError'], invocation=['train.py', '--data-folder', 'DatasetConsumptionConfig:input__c9585223', '--Learning_rate', '0.051451410113531125', '--batchsize', '16', '--epochs', '150', '--monitor', 'val_loss', '--optimizer', 'Adam'])
Script type = None
[2021-05-16T10:51:47.320428] Entering Run History Context Manager.
[2021-05-16T10:51:48.009123] Current directory: /mnt/batch/tasks/shared/LS_root/jobs/koichi_2/azureml/hd_15c1b55e-af9d-4a09-a51a-77021483d144_1/mounts/workspaceblobstore/azureml/HD_15c1b55e-af9d-4a09-a51a-77021483d144_1
[2021-05-16T10:51:48.009468] Preparing to call script [train.py] with arguments:['--data-folder', '$input__c9585223', '--Learning_rate', '0.051451410113531125', '--batchsize', '16', '--epochs', '150', '--monitor', 'val_loss', '--optimizer', 'Adam']
[2021-05-16T10:51:48.009538] After variable expansion, calling script [train.py] with arguments:['--data-folder', '/mnt/batch/tasks/shared/LS_root/jobs/koichi_2/azureml/hd_15c1b55e-af9d-4a09-a51a-77021483d144_1/wd/tmpjipv9560', '--Learning_rate', '0.051451410113531125', '--batchsize', '16', '--epochs', '150', '--monitor', 'val_loss', '--optimizer', 'Adam']

2021-05-16 10:51:48.499303: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/intel64/lib:/opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/mic/lib:/opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/intel64/lib:/opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/mic/lib:/azureml-envs/azureml_9a5f179879cdab6df3327a8de34708df/lib:
2021-05-16 10:51:48.499444: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021/05/16 10:51:50 Not exporting to RunHistory as the exporter is either stopped or there is no data.
Stopped: false
OriginalData: 1
FilteredData: 0.
Data folder: /mnt/batch/tasks/shared/LS_root/jobs/koichi_2/azureml/hd_15c1b55e-af9d-4a09-a51a-77021483d144_1/wd/tmpjipv9560

[2021-05-16T10:52:00.314862] The experiment failed. Finalizing run...
Cleaning up all outstanding Run operations, waiting 900.0 seconds
1 items cleaning up...
Cleanup took 0.07348918914794922 seconds
Traceback (most recent call last):
File "train.py", line 66, in <module>
category = int((filename.split('.')[0]).split('_')[1])
IndexError: list index out of range

[2021-05-16T10:52:00.547629] Finished context manager injector with Exception.
2021/05/16 10:52:02 Skipping parsing control script error. Reason: Error json file doesn't exist. This most likely means that no errors were written to the file. File path: /mnt/batch/tasks/workitems/0e6791a9-a28d-438a-b11f-36b36ecd8da0/job-1/hd_15c1b55e-af9d-4a0_5be5fa54-845a-471b-8be4-e1ea80f84819/wd/runTaskLetTask_error.json
2021/05/16 10:52:02 Failed to run the wrapper cmd with err: exit status 1
2021/05/16 10:52:02 Attempt 1 of http call to http://10.0.0.5:16384/sendlogstoartifacts/status
2021/05/16 10:52:02 mpirun version string: {
Intel(R) MPI Library for Linux* OS, Version 2018 Update 3 Build 20180411 (id: 18329)
Copyright 2003-2018 Intel Corporation.
}
2021/05/16 10:52:02 MPI publisher: intel ; version: 2018
2021/05/16 10:52:02 Not exporting to RunHistory as the exporter is either stopped or there is no data.
Stopped: false
OriginalData: 2
FilteredData: 0.
2021/05/16 10:52:02 Process Exiting with Code: 1
2021/05/16 10:52:02 All App Insights Logs was send successfully
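
A note on the log above: the `libcudart.so.11.0` line is logged as a warning (`W`), and the next line says it can be ignored when no GPU is set up. The run actually stops at the traceback, where line 66 of `train.py` raises `IndexError: list index out of range` while parsing a filename. One plausible cause, which the log does not confirm, is that the data folder contains a file whose name has no underscore. A minimal sketch of a more defensive parser, assuming filenames shaped like `cat_3.jpg` (the helper name `parse_category` and the sample filenames are hypothetical):

```python
from typing import Optional


def parse_category(filename: str) -> Optional[int]:
    """Return the integer between the first underscore and the first dot.

    Mirrors the expression from train.py line 66, but returns None
    instead of raising when the filename does not match the pattern.
    """
    stem = filename.split('.')[0]   # text before the first dot
    parts = stem.split('_')         # e.g. "cat_3" -> ["cat", "3"]
    if len(parts) < 2:
        return None                 # no underscore: the IndexError case
    try:
        return int(parts[1])
    except ValueError:
        return None                 # second field is not a number


if __name__ == "__main__":
    # Hypothetical filenames, used only to illustrate the behaviour.
    for name in ["cat_3.jpg", "README.txt", ".amlignore", "dog_x.png"]:
        print(name, "->", parse_category(name))
```

In the training loop, files for which this returns `None` could be skipped or logged, which would also show which filename in the mounted folder triggers the failure.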

Ramr-msft 17,826 Reputation points

2021-05-17T13:21:29.48+00:00

@Nishiyama4477 Thanks for the question. Could you please share the TensorFlow version and the base image you are using?

Ramr-msft 17,826 Reputation points

2021-05-25T07:12:23.25+00:00

@Nishiyama4477 Just checking in: is there any update on these details?
