How to fail an Azure ML run?

Fabien Campagne 41 Reputation points
2020-12-10T14:04:57.12+00:00

We use Azure ML to run large automated tests of our code on CUDA. This mostly works well, but we cannot figure out how to fail a job so that the failure

  1. shows in the UI as Failed (see snapshot),
  2. gets propagated back to the submitting client (our testing code), so that we can fail the test when the Run has reached the Failed state.

Here's what we tried:

  • Exit the run process with a non-zero status.
  • Use the Run instance to send the non-zero exit code and a reason from the VM.
  • Try to detect the Failed state or the failure reason from the submitting client.
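For reference, the first approach (exiting with a non-zero status) can be sketched as below. This is a minimal illustration, not the actual test harness: `run_child` and `main` are hypothetical names, and it assumes the test command is launched as a child process of the script that Azure ML runs.

```python
import subprocess
import sys


def run_child(cmd):
    """Run cmd (a list of arguments) as a child process and return its exit status."""
    completed = subprocess.run(cmd)
    return completed.returncode


def main(cmd):
    """Hypothetical entry point: run the command and exit with the same status."""
    returncode = run_child(cmd)
    if returncode != 0:
        print(f"Child process failed with return code={returncode}", file=sys.stderr)
    # sys.exit raises SystemExit, so the interpreter exits with the child's status.
    sys.exit(returncode)
```

Calling `main([sys.executable, "-c", "import sys; sys.exit(3)"])` would end the script with status 3. Whether Azure ML then marks the run as Failed is exactly the open question here.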

When we call the following method:
    import sys

    from azureml.core.run import Run


    def report_error(returncode: int):
        # allow_offline=False raises if we are not running inside an Azure ML run.
        run = Run.get_context(allow_offline=False)
        print(f"Failing the run with return code={returncode}")
        run.fail(f"A process returned a non-zero status code {returncode}", error_code=returncode)
        sys.exit(returncode)

We can see the exit code in the UI at the top of a failed run, but the run is still marked as Completed.
As a result, we are unable to determine that the job failed from the submitting client.
On the submitting client, after:

    result = run.wait_for_completion(show_output=True,
                                     raise_on_error=True)

we tried:

    if result['status'] != 'Completed' or (result['details'] is not None and
            'A process returned a non-zero status code' in result['details']):
        run.fail(error_details=result['details'], error_code=1)
        exit(1)

Yet the return value of this process, as communicated to the test client, is zero.
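The client-side check we are aiming for can be sketched as follows. This is a minimal sketch under stated assumptions: it relies only on the run object exposing `get_status()`, which `azureml.core.Run` does; `exit_code_for_run` and `FakeRun` are hypothetical names, and `FakeRun` stands in for a real run so the logic can be exercised offline.

```python
def exit_code_for_run(run) -> int:
    """Map a run's final status to a process exit code for the test client."""
    status = run.get_status()
    # Assumption: anything other than "Completed" (e.g. "Failed", "Canceled")
    # should fail the test.
    return 0 if status == "Completed" else 1


class FakeRun:
    """Hypothetical stand-in for azureml.core.Run, used only for illustration."""

    def __init__(self, status: str):
        self._status = status

    def get_status(self) -> str:
        return self._status
```

With a real run, the client would call `sys.exit(exit_code_for_run(run))` after `wait_for_completion`. This only helps if the service actually reports a status other than Completed, which is not what we observe.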

Is this a timing issue in obtaining the result details?

What could we do to make sure such jobs actually show as Failed in the UI?

Snapshot: 46960-run-27-microsoft-azure-machine-learning.png

Azure Machine Learning