Submitted script failed with a non-zero exit code; see the driver log file for details.

Nishiyama4477 1 Reputation point
2021-05-16T15:15:05.71+00:00

Hello, I'm trying to tune a model's hyperparameters with Azure Machine Learning, following this guide: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters.

This time I created a dataset from Blob storage and used TensorFlow for the model.

I ran the code and got the following error:

AzureMLCompute job failed.
JobFailed: Submitted script failed with a non-zero exit code; see the driver log file for details.
Reason: Job failed with non-zero exit Code

So I looked at the driver log file, and I guess the error happened around TensorFlow because of the message "Could not load dynamic library 'libcudart.so.11.0'".
I searched on Google, but I still don't understand. Does a TensorFlow model need a GPU, or is it something else? Could someone help?

This is my driver log:

2021/05/16 10:51:45 Starting App Insight Logger for task: runTaskLet
2021/05/16 10:51:45 Attempt 1 of http call to http://10.0.0.5:16384/sendlogstoartifacts/info
2021/05/16 10:51:45 Attempt 1 of http call to http://10.0.0.5:16384/sendlogstoartifacts/status
[2021-05-16T10:51:45.699995] Entering context manager injector.
[context_manager_injector.py] Command line Options: Namespace(inject=['ProjectPythonPath:context_managers.ProjectPythonPath', 'Dataset:context_managers.Datasets', 'RunHistory:context_managers.RunHistory', 'TrackUserError:context_managers.TrackUserError'], invocation=['train.py', '--data-folder', 'DatasetConsumptionConfig:input__c9585223', '--Learning_rate', '0.051451410113531125', '--batchsize', '16', '--epochs', '150', '--monitor', 'val_loss', '--optimizer', 'Adam'])
Script type = None
[2021-05-16T10:51:47.320428] Entering Run History Context Manager.
[2021-05-16T10:51:48.009123] Current directory: /mnt/batch/tasks/shared/LS_root/jobs/koichi_2/azureml/hd_15c1b55e-af9d-4a09-a51a-77021483d144_1/mounts/workspaceblobstore/azureml/HD_15c1b55e-af9d-4a09-a51a-77021483d144_1
[2021-05-16T10:51:48.009468] Preparing to call script [train.py] with arguments:['--data-folder', '$input__c9585223', '--Learning_rate', '0.051451410113531125', '--batchsize', '16', '--epochs', '150', '--monitor', 'val_loss', '--optimizer', 'Adam']
[2021-05-16T10:51:48.009538] After variable expansion, calling script [train.py] with arguments:['--data-folder', '/mnt/batch/tasks/shared/LS_root/jobs/koichi_2/azureml/hd_15c1b55e-af9d-4a09-a51a-77021483d144_1/wd/tmpjipv9560', '--Learning_rate', '0.051451410113531125', '--batchsize', '16', '--epochs', '150', '--monitor', 'val_loss', '--optimizer', 'Adam']

2021-05-16 10:51:48.499303: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/intel64/lib:/opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/mic/lib:/opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/intel64/lib:/opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/mic/lib:/azureml-envs/azureml_9a5f179879cdab6df3327a8de34708df/lib:
2021-05-16 10:51:48.499444: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021/05/16 10:51:50 Not exporting to RunHistory as the exporter is either stopped or there is no data.
Stopped: false
OriginalData: 1
FilteredData: 0.
Data folder: /mnt/batch/tasks/shared/LS_root/jobs/koichi_2/azureml/hd_15c1b55e-af9d-4a09-a51a-77021483d144_1/wd/tmpjipv9560

[2021-05-16T10:52:00.314862] The experiment failed. Finalizing run...
Cleaning up all outstanding Run operations, waiting 900.0 seconds
1 items cleaning up...
Cleanup took 0.07348918914794922 seconds
Traceback (most recent call last):
  File "train.py", line 66, in <module>
    category = int((filename.split('.')[0]).split('_')[1])
IndexError: list index out of range

[2021-05-16T10:52:00.547629] Finished context manager injector with Exception.
2021/05/16 10:52:02 Skipping parsing control script error. Reason: Error json file doesn't exist. This most likely means that no errors were written to the file. File path: /mnt/batch/tasks/workitems/0e6791a9-a28d-438a-b11f-36b36ecd8da0/job-1/hd_15c1b55e-af9d-4a0_5be5fa54-845a-471b-8be4-e1ea80f84819/wd/runTaskLetTask_error.json
2021/05/16 10:52:02 Failed to run the wrapper cmd with err: exit status 1
2021/05/16 10:52:02 Attempt 1 of http call to http://10.0.0.5:16384/sendlogstoartifacts/status
2021/05/16 10:52:02 mpirun version string: {
Intel(R) MPI Library for Linux* OS, Version 2018 Update 3 Build 20180411 (id: 18329)
Copyright 2003-2018 Intel Corporation.
}
2021/05/16 10:52:02 MPI publisher: intel ; version: 2018
2021/05/16 10:52:02 Not exporting to RunHistory as the exporter is either stopped or there is no data.
Stopped: false
OriginalData: 2
FilteredData: 0.
2021/05/16 10:52:02 Process Exiting with Code: 1
2021/05/16 10:52:02 All App Insights Logs was send successfully
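
A note on reading the log above: the libcudart line is logged at warning level (W), and the next line says it can be ignored when no GPU is set up. The exception that actually ends the run is the traceback from train.py line 66, where `filename.split('.')[0].split('_')[1]` raises IndexError. That happens when a filename in the data folder has no underscore before its first dot. Below is a minimal sketch of a more defensive parse. It assumes the labelled files are named like `<name>_<category>.<ext>`. The helper name `parse_category` and the sample filenames are hypothetical and are not taken from the original train.py.

```python
from typing import Optional


def parse_category(filename: str) -> Optional[int]:
    """Return the integer category encoded in '<name>_<category>.<ext>'.

    Returns None when the filename does not follow that pattern,
    instead of raising IndexError or ValueError.
    """
    stem = filename.split('.')[0]   # text before the first dot
    parts = stem.split('_')         # expected: [name, category, ...]
    if len(parts) < 2:
        return None                 # no underscore: the case that raised IndexError
    try:
        return int(parts[1])
    except ValueError:
        return None                 # second field is not a number


# Hypothetical filenames, used only to illustrate the behaviour.
filenames = ["cat_0.jpg", "dog_1.jpg", ".amlignore", "README.txt", "img_x.png"]

labelled = {name: parse_category(name) for name in filenames}
skipped = [name for name, category in labelled.items() if category is None]

for name in skipped:
    print(f"Skipping file with unexpected name: {name}")
```

Printing the skipped names inside the training script would show in the driver log which files in the mounted data folder do not match the expected naming pattern.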

Azure Machine Learning
