AzureML Pipeline step gets stuck on Finalizing status, eventually gets marked as failed.

Jaimin Bhoi
2022-07-05T14:03:56.827+00:00

My Python script finishes its work (verified in the driver log file), but the pipeline step then spends a long time in the Finalizing state and is eventually marked as failed.

Logs from the job post log file:

[2022-07-05T13:44:27.063561] Entering job release
[2022-07-05T13:44:28.612901] Starting job release
[2022-07-05T13:44:28.613665] Logging experiment finalizing status in history service.
Starting the daemon thread to refresh tokens in background for process with pid = 330
[2022-07-05T13:44:28.614116] job release stage : upload_datastore starting...
[2022-07-05T13:44:28.616222] job release stage : start importing azureml.history._tracking in run_history_release.
[2022-07-05T13:44:28.616341] job release stage : execute_job_release starting...
[2022-07-05T13:44:28.616970] Entering context manager injector.
[2022-07-05T13:44:28.626569] job release stage : copy_batchai_cached_logs starting...
[2022-07-05T13:44:28.626790] job release stage : copy_batchai_cached_logs completed...
[2022-07-05T13:44:28.631233] job release stage : upload_datastore completed...
[2022-07-05T13:44:28.894740] job release stage : execute_job_release completed...
Failed to set run status: Finalizing
<urlopen error timed out>

Retrying...
Failed to set run status: Finalizing
<urlopen error timed out>

Retrying...
[2022-07-05T13:44:41.757229] job release stage : send_run_telemetry starting...
[2022-07-05T13:44:41.776454] get vm size and vm region successfully.
[2022-07-05T13:44:41.783878] get compute meta data successfully.
Failed to upload compute record artifact, error_details=<urlopen error [Errno 110] Connection timed out>
[2022-07-05T13:45:13.220949] job release stage : send_run_telemetry completed...
[2022-07-05T13:45:13.221247] Job release is complete

Attached error log screenshots for reference.

217719-error-logs.png
217775-driver-logs.png
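
The repeated `<urlopen error timed out>` and `[Errno 110] Connection timed out` lines in the log above suggest that the compute node could not reach some outbound endpoint during job release. As a rough way to test that from the compute node, here is a minimal sketch using only the Python standard library. The helper name `check_endpoint` is hypothetical, and the hosts to test are an assumption: the log does not show which endpoint timed out, so they would need to be filled in for the workspace in question.

```python
import socket


def check_endpoint(host, port=443, timeout=5.0):
    """Try a plain TCP connection to host:port.

    Returns (reachable, detail). This only checks that a TCP connection
    can be opened; it does not authenticate or call any Azure ML API.
    """
    try:
        # create_connection resolves the host name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # OSError covers timeouts, refused connections, and DNS failures.
        return False, f"{type(exc).__name__}: {exc}"
```

Calling `check_endpoint(host)` at the end of the step script, once for each host the workspace depends on, and printing the result would show in the driver log whether outbound connections on port 443 succeed from inside the job. A timeout here would point to a network restriction (for example a firewall or virtual network rule) rather than a problem in the script itself.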

Azure Machine Learning