Azure Machine Learning: ExperimentExecutionException while submitting a distributed training run

Valentin Laurent 106 Reputation points
2021-05-01T18:16:21.027+00:00

Hi, here are the details of my issue.
I want to execute a distributed training run with the TensorFlow framework and Horovod.
To do this, I've configured an environment called "tf_env" as follows:

from azureml.core import Environment

# Create the environment: the dependencies are listed in the .yml file
tf_env = Environment.from_conda_specification(name="tensorflow_environment", file_path="experiments/package-list.yml")

# Register the environment
tf_env.register(workspace=ws)

# Specify a GPU base image
tf_env.docker.enabled = True
tf_env.docker.base_image = 'mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04'

Here, "package-list.yml" contains all the dependencies that "train_script.py" requires.
I've defined my ScriptRunConfig as follows:

arguments = [
    (... other arguments ...)
    "--ds",  images_ds.as_mount()
]

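from azureml.core import ScriptRunConfig
from azureml.core.runconfig import MpiConfiguration

# Run train_script.py on two nodes using MPI (Horovod)
src = ScriptRunConfig(
    source_directory="experiments",
    script='train_script.py',
    arguments=arguments,
    compute_target=compute_target,
    environment=tf_env,
    distributed_job_config=MpiConfiguration(node_count=2)
)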

Then, when I submit the run:

run = best_model_experiment.submit(config=src)

... it raises this error, which I don't understand:

ExperimentExecutionException: ExperimentExecutionException:
    Message: {
    "error_details": {
        "componentName": "execution",
        "correlation": {
            "operation": "***",
            "request": "***"
        },
        "environment": "westeurope",
        "error": {
            "code": "UserError",
            "message": "Error when parsing request; unable to deserialize request body"
        },
        "location": "westeurope",
        "time": "***"
    },
    "status_code": 400,
    "url": "https://westeurope.experiments.azureml.net/execution/v1.0/subscriptions/***/resourceGroups/***/providers/Microsoft.MachineLearningServices/workspaces/***/experiments/experiment/snapshotrun?runId=experiment***"
}
    InnerException None
    ErrorResponse 
{
    "error": {
        "message": "{\n    \"error_details\": {\n        \"componentName\": \"execution\",\n        \"correlation\": {\n            \"operation\": \"***\",\n            \"request\": \"***\"\n        },\n        \"environment\": \"westeurope\",\n        \"error\": {\n            \"code\": \"UserError\",\n            \"message\": \"Error when parsing request; unable to deserialize request body\"\n        },\n        \"location\": \"westeurope\",\n        \"time\": \"***\"\n    },\n    \"status_code\": 400,\n    \"url\": \"https://westeurope.experiments.azureml.net/execution/v1.0/subscriptions/***/resourceGroups/***/providers/Microsoft.MachineLearningServices/workspaces/***/experiments/experiment/snapshotrun?runId=experiment_***\"\n}"
    }
}

Could you please help me decipher this error?
Thank you.

Answer accepted by question author

Valentin Laurent 106 Reputation points
2021-05-03T06:48:41.707+00:00

Issue solved! I had passed a list as one of the arguments intended for argparse, so the request body couldn't be deserialized.
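
For anyone hitting the same 400 error, here is a minimal sketch of one way to avoid it. It assumes the cause was a nested list among the elements of the arguments list. The helper name flatten_arguments and the sample values are hypothetical and are not part of the Azure ML SDK. The helper expands nested lists and tuples in place and leaves every other element untouched.

```python
from typing import Any, Iterable, List


def flatten_arguments(arguments: Iterable[Any]) -> List[Any]:
    """Return a flat copy of `arguments`, expanding nested lists and tuples."""
    flat: List[Any] = []
    for item in arguments:
        if isinstance(item, (list, tuple)):
            # Expand nested sequences recursively so no element is itself a list.
            flat.extend(flatten_arguments(item))
        else:
            # Strings, numbers, and other objects are kept unchanged.
            flat.append(item)
    return flat


# Hypothetical argument list in which one value is itself a list.
nested = ["--epochs", 10, "--layers", [64, 32], "--ds", "dataset-placeholder"]
flat = flatten_arguments(nested)
print(flat)
```

The flattened list can then be passed as arguments to ScriptRunConfig. On the script side, an option that receives several values this way would need to be declared with nargs="+" in argparse so that it collects them back into a list.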
