How to fail an Azure ML run?

Fabien Campagne 41

We use Azure ML for large tests, so that we can test our code on CUDA in an automated manner. Things mostly work well, but one thing we cannot figure out is how to fail a job such that the job failure:

1. shows in the UI as Failed (see snapshot),
2. gets propagated back to the submitting client (our testing code), so that we can fail the test when the Run has reached the Failed state.

Here's what we tried:

1. Exit the run process with a non-zero status.
2. Use the Run instance to send the non-zero exit code and a reason from the VM.
3. Try to detect the Failed state or the reason from the submitting client.

When we call the following method:
    def report_error(returncode: int):
        from azureml.core.run import Run
        run = Run.get_context(allow_offline=False)
        print(f"Failing the run with return code={returncode}")
        run.fail(f"A process returned a non-zero status code {returncode}", error_code=returncode)
        exit(returncode)

We can see the exit code in the UI at the top of a failed run, but the run is still marked as Completed.
As a result, we are unable to determine that the job failed from the submitting client.
After calling:

    run.wait_for_completion(show_output=True,
                            raise_on_error=True)

we tried:

    if result['status'] != 'Completed' or (result['details'] is not None and
            'A process returned a non-zero status code' in result['details']):
        run.fail(error_details=result['details'], error_code=1)
        exit(1)

Yet the return value of this process, as communicated to the test client, is zero.

Is this a timing issue in obtaining the result details?

What could we do to make sure such jobs actually show as Failed in the UI?
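A minimal sketch of the client-side check described above, assuming the submitting client still holds the `Run` object it submitted. It relies only on `Run.get_status()` and `Run.get_details()` from azureml-core. The helper names `run_failed` and `exit_code_for` are hypothetical, and searching the serialized details for the failure message is an assumption, since the thread does not confirm which key of the details carries the message passed to `run.fail`.

```python
import json

# Message prefix used by report_error() in the question.
FAILURE_MARKER = "A process returned a non-zero status code"


def run_failed(status, details):
    """Return True if the run should be treated as failed by the test client.

    status  -- the string returned by Run.get_status(), e.g. "Completed"
    details -- the dict returned by Run.get_details(), or None
    """
    if status != "Completed":
        return True
    if details is None:
        return False
    # Assumption: the failure reason appears somewhere in the details,
    # so search the whole serialized dict instead of one specific key.
    return FAILURE_MARKER in json.dumps(details, default=str)


def exit_code_for(run):
    """Map a finished run to a process exit code for the test client."""
    return 1 if run_failed(run.get_status(), run.get_details()) else 0
```

In the test client this could be called once `run.wait_for_completion(...)` has returned, for example with `sys.exit(exit_code_for(run))`, so that the client's own exit status does not depend on the run being marked as Failed.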

romungi-MSFT 43,696 Reputation points Microsoft Employee

2020-12-11T07:51:38.993+00:00

@Fabien Campagne Could you please clarify whether these runs are submitted from the UI and you are trying to fail them from the SDK while they are running?
Or, in the second scenario, are you waiting for a job submitted from the SDK to complete, and then updating its details and error code so that it shows up as failed in the UI?
Fabien Campagne 41 Reputation points

2020-12-12T03:27:38.68+00:00

Second case, trying to fail them from the job running on the VM.

This is the code used on the VM to attempt to fail the job:

    def report_error(returncode: int):
        from azureml.core.run import Run
        run = Run.get_context(allow_offline=False)
        print(f"Failing the run with return code={returncode}")
        run.fail(f"A process returned a non-zero status code {returncode}", error_code=returncode)
        exit(returncode)
The reason is updated, but the job does not fail.
Mallik, Sourav 0 Reputation points

2024-06-10T08:34:44.05+00:00

I don't see any resolution to this issue, so I am guessing this is a bug in the Azure ML UI? Is there any workaround for this issue?
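One thing that may be worth trying, sketched here under assumptions: make sure the non-zero status leaves the entry script itself through `sys.exit`, rather than through the `exit` helper, which is installed by the `site` module and intended mainly for interactive use. The names `run_step`, `main`, and the `report` callback are hypothetical. Nothing in this thread confirms that this changes how Azure ML marks the run.

```python
import subprocess
import sys


def run_step(cmd):
    """Run a child process and return its exit code."""
    completed = subprocess.run(cmd)
    return completed.returncode


def main(cmd, report=None):
    """Run cmd and exit the entry script with the child's status on failure.

    report -- optional callback taking the return code, e.g. a function that
              calls run.fail(...) as in the thread (hypothetical hook).
    """
    returncode = run_step(cmd)
    if returncode != 0:
        if report is not None:
            report(returncode)
        # sys.exit raises SystemExit, so the interpreter terminates with
        # this status as long as nothing upstream catches the exception.
        sys.exit(returncode)
```

With this layout, the `run.fail(...)` call from `report_error` could be passed in as `report`, with the final `exit(returncode)` left to `main`.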
