How to make Azure ML pipeline re-run when a step fails? Either re-running the entire pipeline or just that 1 step would be fine

Puifai Santisakultarm 0 Reputation points
2025-06-20T09:48:48.27+00:00

I have an ML pipeline that contains multiple steps: several data preparation steps, model inference, then saving the output. The pipeline runs automatically as a scheduled job. Without going into details, sometimes one of the data preparation steps fails, which causes the entire pipeline job to fail. I currently work around this by going to Azure ML Studio and triggering the job again manually, after which it completes fine. But I would like the job to automatically retry the ML pipeline if it fails (due to one of the steps failing).

Is there a way to do this? A retry policy seems to be something one can set on an activity in an Azure Data Factory pipeline; I am looking for the same option in Azure ML, if it exists.

Azure Machine Learning

2 answers

  1. Chiugo Okpala 1,905 Reputation points MVP
    2025-06-20T20:45:51.2366667+00:00

    @Puifai Santisakultarm welcome to the Microsoft Q&A community.

    Azure Machine Learning pipelines do not currently offer a native "retry on failure" setting for individual steps.

    However, you can try the following options:

    1. Use Azure Logic Apps or Azure Functions

    You can set up a Logic App or Function to monitor pipeline run statuses. If a run fails, it can automatically trigger a re-run of the pipeline via its endpoint. This approach is lightweight and GUI-friendly:

    Create a PipelineEndpoint for your ML pipeline.

    Configure a Logic App to listen for failed runs.

    On failure, the Logic App (or Function) can call the endpoint to re-trigger the pipeline; a sketch of that call is shown below.
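
    As a minimal sketch, assuming the v1 Python SDK (azureml-core) and a published PipelineEndpoint, this is roughly the call an Azure Function could make (a Logic App would use the REST equivalent). The endpoint and experiment names are placeholders.

        # Minimal sketch: re-submit a published pipeline endpoint (v1 SDK, azureml-core).
        # "my-pipeline-endpoint" and "my-experiment" are placeholder names.
        from azureml.core import Workspace
        from azureml.pipeline.core import PipelineEndpoint

        ws = Workspace.from_config()  # or Workspace.get(...) inside an Azure Function

        endpoint = PipelineEndpoint.get(workspace=ws, name="my-pipeline-endpoint")
        run = endpoint.submit(experiment_name="my-experiment")  # re-trigger the pipeline
        print(f"Re-submitted pipeline run: {run.id}")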

    2. Custom Retry Logic in Your Script

    If the failure is intermittent (e.g., network hiccups), you can wrap the flaky logic inside your PythonScriptStep's script with retry logic using try/except and a loop, as in the sketch below.
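
    A minimal sketch of such a wrapper follows; the function name, retry counts, and prepare_data are illustrative placeholders, not an Azure ML API.

        import time

        def run_with_retries(fn, max_attempts=3, backoff_seconds=30):
            """Call fn(); on failure, wait and retry, up to max_attempts times."""
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn()
                except Exception as exc:
                    if attempt == max_attempts:
                        raise  # let the step (and the pipeline) fail after the last attempt
                    print(f"Attempt {attempt} failed: {exc}; retrying in {backoff_seconds}s")
                    time.sleep(backoff_seconds)
                    backoff_seconds *= 2  # simple exponential backoff

        # Example usage inside the step's script (prepare_data is a placeholder):
        # run_with_retries(lambda: prepare_data(input_path, output_path))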

    3. Split the Pipeline into Smaller Pipelines

    Break your pipeline into modular sub-pipelines. This way, if a step fails, you only need to re-run the failed sub-pipeline rather than the entire workflow. You can orchestrate these using Logic Apps or a custom controller script like the one sketched below.
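
    A minimal sketch of such a controller, assuming the v2 SDK (azure-ai-ml) and using placeholder names for the workspace details and for your own @dsl.pipeline builder functions:

        import time

        from azure.identity import DefaultAzureCredential
        from azure.ai.ml import MLClient

        ml_client = MLClient(
            DefaultAzureCredential(),
            subscription_id="<subscription-id>",
            resource_group_name="<resource-group>",
            workspace_name="<workspace>",
        )

        def submit_and_wait(pipeline_job, poll_seconds=60):
            """Submit a pipeline job and poll until it reaches a terminal status."""
            job = ml_client.jobs.create_or_update(pipeline_job)
            while True:
                status = ml_client.jobs.get(job.name).status
                if status in ("Completed", "Failed", "Canceled"):
                    return status
                time.sleep(poll_seconds)

        # build_dataprep_pipeline and build_inference_pipeline are placeholders
        # for your own @dsl.pipeline builder functions.
        for build in (build_dataprep_pipeline, build_inference_pipeline):
            status = submit_and_wait(build())
            if status != "Completed":
                status = submit_and_wait(build())  # retry only this sub-pipeline
            if status != "Completed":
                raise RuntimeError(f"Sub-pipeline {build.__name__} failed twice")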

    4. Track Step Outputs and Use allow_reuse=True

    If your steps produce outputs that are cached, Azure ML can skip re-running the successful steps when you manually re-trigger the pipeline. This doesn't retry automatically, but it minimizes redundant computation; a sketch is shown below.
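
    A minimal sketch, assuming the v1 SDK (azureml-core) where allow_reuse is a PythonScriptStep parameter; names and paths are placeholders. (If you are on the v2 SDK, azure-ai-ml, the rough equivalents are a component's is_deterministic flag and the pipeline's settings.force_rerun.)

        from azureml.pipeline.steps import PythonScriptStep

        # Placeholder step definition; reuse only kicks in when the step's
        # inputs, parameters, and source code are unchanged between runs.
        prep_step = PythonScriptStep(
            name="prep_data",
            script_name="prep.py",
            source_directory="./src",
            compute_target="cpu-cluster",
            allow_reuse=True,  # reuse the cached output instead of re-running
        )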

    N.B.: I generated the above answer using Copilot as an AI tool, and I have validated and updated the AI output.

    I hope this helps. Let me know if you have any further questions or need additional assistance.

    Also, if this answers your query, please click "Upvote" and "Accept the answer", which might be beneficial to other community members reading this thread.



  2. Puifai Santisakultarm 0 Reputation points
    2025-06-23T07:54:56.57+00:00
    Regarding option 2, "Custom Retry Logic in Your Script": "If the failure is intermittent (e.g., network hiccups), you can wrap the logic in your PythonScriptStep with retry logic using try/except and a loop."

    Would you please elaborate on how to do this, since my script sets up the pipeline job definitions that run on schedule? Perhaps an example of this?

        # Dataprep
        # (assuming the v2 SDK import: from azure.ai.ml import dsl)
        @property
        def prep_data(self):
            @dsl.pipeline(some_stuff)
            def dataprep_pipeline(some_stuff):
                # do some stuff
                return {"output_file": output_file}

            return dataprep_pipeline
    
