I'm using Microsoft Azure Machine Learning to train a CNN. The model is stored in this GitHub repository: https://github.com/rodekruis/caladrius/tree/handle_imbalance/caladrius. This code already works; I'm just trying to run the model (both training and testing) myself in my own Azure Machine Learning environment. I have the data needed for training, and by following 'Tutorial: use your own data' (https://learn.microsoft.com/en-us/azure/machine-learning/tutorial-1st-experiment-bring-data) I uploaded the data to a datastore of type 'Azure Blob Storage'. Next, I want to run the model so that it starts training. To do so, the 'run.py' file in the GitHub repository has to be run, and I created the following control script to run it in the Azure Machine Learning environment:
from azureml.core import Run, Workspace, Datastore, Dataset, Experiment, ScriptRunConfig, Environment
from azure.identity import DefaultAzureCredential
from azureml.data.datapath import DataPath
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig
from azureml.data import OutputFileDatasetConfig

if __name__ == "__main__":
    run = Run.get_context()
    credential = DefaultAzureCredential()
    ws = Workspace.from_config()

    datastore = Datastore.get(ws, 'xview')
    dataset_small = Dataset.File.from_files(path=(datastore, '/test_small/**'))
    # print(dataset_small.to_path())
    # data_path = DataPath(datastore=datastore, path_on_datastore='test_small/')
    checkpoint = OutputFileDatasetConfig(destination=(datastore, '/test_small/runs/'))

    experiment = Experiment(workspace=ws, name='thesis-sanne')
    config = ScriptRunConfig(source_directory='',
                             script='run.py',
                             compute_target='standardK80GPU',
                             arguments=['--data-path', dataset_small.as_named_input('input').as_mount(),
                                        '--output-type', 'classification',
                                        '--run-name', 'test1',
                                        '--checkpoint-path', checkpoint,
                                        ])

    env = Environment.from_conda_specification(name='caladriusenv', file_path='caladriusenv.yml')
    config.run_config.environment = env

    run = experiment.submit(config)
    aml_url = run.get_portal_url()
    print(aml_url)
When I submit this run, it fails with the following error:
UserError: [Errno 2] No such file or directory: '9e3238da-8b8f-441a-a71a-2f6911f58f33/train/labels.txt'
See also in this picture: Error message from run
The exact error message given in the 70_driver_log.txt file is:
[2021-05-16T08:26:21.977916] The experiment failed. Finalizing run...
2021-05-16 08:26:21,978 main INFO Exiting context: TrackUserError
[2021-05-16T08:26:21.984462] Writing error with error_code UserError and error_hierarchy UserError/FileNotFoundError to hosttool error file located at /mnt/batch/tasks/workitems/073df44a-2932-4bfb-a484-64e6382d81a1/job-1/thesis-sanne_1621153_5c453238-c8de-474e-b42c-44f791fbccea/wd/runTaskLetTask_error.json
Starting the daemon thread to refresh tokens in background for process with pid = 85
2021-05-16 08:26:22,061 main INFO Exiting context: RunHistory
2021-05-16 08:26:22,061 main INFO Exiting context: Dataset
2021-05-16 08:26:22,062 main INFO Exiting context: ProjectPythonPath
Traceback (most recent call last):
File "run.py", line 59, in <module>
main()
File "run.py", line 46, in main
run_report, datasets, args.number_of_epochs, args.selection_metric
File "/mnt/batch/tasks/shared/LS_root/jobs/azureaccount/azureml/thesis-sanne_1621153514_fc6d0c0e/mounts/workspaceblobstore/azureml/thesis-sanne_1621153514_fc6d0c0e/model/trainer.py", line 395, in train
train_set, train_loader = datasets.load("train")
File "/mnt/batch/tasks/shared/LS_root/jobs/azureaccount/azureml/thesis-sanne_1621153514_fc6d0c0e/mounts/workspaceblobstore/azureml/thesis-sanne_1621153514_fc6d0c0e/model/data.py", line 144, in load
augment_type=self.augment_type,
File "/mnt/batch/tasks/shared/LS_root/jobs/azureaccount/azureml/thesis-sanne_1621153514_fc6d0c0e/mounts/workspaceblobstore/azureml/thesis-sanne_1621153514_fc6d0c0e/model/data.py", line 75, in init
os.path.join(self.directory, self.labels_filename) #self.labels_filename) # "labels.txt")
FileNotFoundError: [Errno 2] No such file or directory: '9e3238da-8b8f-441a-a71a-2f6911f58f33/train/labels.txt'
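What stands out in the traceback is that the failing path is relative: it starts with the dataset's GUID folder instead of '/'. If data.py joins a relative directory with the file name, Python resolves it against the working directory of the process, not against wherever the dataset was mounted. A small standalone sketch of that failure mode (the mount_root value is made up for illustration):

```python
import os

# The failing path from the traceback is relative: it starts with the
# dataset's GUID folder name rather than "/".
rel_path = "9e3238da-8b8f-441a-a71a-2f6911f58f33/train/labels.txt"

# A relative path is resolved against the current working directory of
# the process, not against wherever Azure ML mounted the dataset:
resolved = os.path.abspath(rel_path)
print(resolved)  # lands under os.getcwd(), where the file doesn't exist

# Joining against the absolute mount root instead (this path is made up
# for illustration) would produce a path under the mount:
mount_root = "/mnt/azureml/mounts/input"
print(os.path.join(mount_root, "train", "labels.txt"))
```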
However, the dataset itself is found correctly and is listed under 'Input datasets' in the picture, and it includes the 'train/labels.txt' file I'm trying to open. I verified this in the script by printing:
print(dataset_small.to_path())
The output of this indeed included the file '/train/labels.txt'.
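To verify the same thing from inside the remote job, I could also add a small diagnostic at the top of run.py; check_data_path is my own hypothetical helper for debugging, not part of the Caladrius code:

```python
import os
import sys

def check_data_path(data_path):
    # Print the --data-path argument exactly as the script receives it,
    # plus whether it exists on the compute target and what it contains.
    print("data path argument:", data_path)
    print("is a directory:", os.path.isdir(data_path))
    if os.path.isdir(data_path):
        for entry in sorted(os.listdir(data_path)):
            print("  ", entry)
    # True only if the expected 'train' subfolder is reachable
    return os.path.isdir(os.path.join(data_path, "train"))

if __name__ == "__main__" and len(sys.argv) > 1:
    check_data_path(sys.argv[1])
```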
I think my problem is that the script does find the dataset correctly, but the data path it uses to open the files is wrong. To solve this, I've already tried the following:
- Instead of '--data-path', dataset_small.as_named_input('input').as_mount() I used:
  - DataPath(datastore=datastore, path_on_datastore='test_small/'); this doesn't work because a DataPath object is not JSON serializable
  - DataReference(datastore, path_on_datastore='./test_small/', mode='mount'); this doesn't work because a DataReference object is not JSON serializable
  - DatasetConsumptionConfig('dataset', dataset_small, mode='direct', path_on_compute=None); this is essentially what I already do with Dataset.File.from_files(), so it also doesn't work
  - DataPathComputeBinding(mode='mount', path_on_compute=None, overwrite=False); this doesn't work because a DataPathComputeBinding object is not JSON serializable
  - datastore_paths = [(blob_datastore, 'test_small')]; this doesn't work because this object is not JSON serializable either
- Instead of accessing the datastore through the Azure Machine Learning environment, I tried accessing the data container with the data directly from my Azure Storage account/container. However, the same problem occurred.
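As a last-resort workaround I also considered having the training code locate labels.txt itself instead of trusting the joined path; find_labels_file below is a hypothetical helper of my own, not something from the repository:

```python
import os

def find_labels_file(root, filename="labels.txt"):
    # Walk the whole data root and return the first absolute path to
    # `filename`, or None if it is not present anywhere under `root`.
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            return os.path.join(dirpath, filename)
    return None
```

That would at least tell me whether the mounted files are reachable from inside run.py at all.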
In short: Azure Machine Learning does find the dataset I want to use, and it includes everything needed, but the model is unable to open the data files. It reports that they don't exist, while I'm sure they do; the environment seems to know where to find them but not how to access or open them properly.
It seems like an easy-to-fix problem, but I've tried every approach I could think of or found proposed online, and nothing has worked yet. Hopefully someone here can help me; thank you so much in advance! If extra information is needed to clarify things, I'd be glad to provide it.