How to make Azure ML pipeline re-run when a step fails? Either re-running the entire pipeline or just that 1 step would be fine

Puifai Santisakultarm 0 Reputation points
2025-06-20T09:48:48.27+00:00

I have an ML pipeline that contains multiple steps: several data preparation steps, model inference, then saving the output. The pipeline runs automatically as a scheduled job. Without going into details, sometimes one of the data preparation steps fails, which causes the entire pipeline job to fail. I currently work around this by going to Azure ML Studio and triggering the job again manually, after which it completes fine. But I would like the job to automatically retry the ML pipeline if it fails (due to one of the steps failing).

Is there a way to do this? A retry policy seems to be something one can set on an activity in an Azure Data Factory pipeline; I am looking for the same option in Azure ML, if it exists.

Azure Machine Learning

2 answers

  1. Chiugo Okpala 1,905 Reputation points MVP
    2025-06-20T20:45:51.2366667+00:00

    @Puifai Santisakultarm welcome to the Microsoft Q&A community.

    Azure Machine Learning pipelines do not currently offer a native "retry on failure" setting for individual steps.

    However, you can try the following options:

    1. Use Azure Logic Apps or Azure Functions

    You can set up a Logic App or Function to monitor pipeline run statuses. If a run fails, it can automatically trigger a re-run of the pipeline via its endpoint. This approach is lightweight and GUI-friendly:

    Create a PipelineEndpoint for your ML pipeline.

    Configure a Logic App to listen for failed runs.

    On failure, the Logic App (or Function) can call the endpoint to re-trigger the pipeline; a sketch of that call is shown below.
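
    As a minimal sketch, assuming the v1 Python SDK (azureml-core) and a published PipelineEndpoint, this is roughly the call an Azure Function could make (a Logic App would use the REST equivalent). The endpoint and experiment names are placeholders.

        # Minimal sketch: re-submit a published pipeline endpoint (v1 SDK, azureml-core).
        # "my-pipeline-endpoint" and "my-experiment" are placeholder names.
        from azureml.core import Workspace
        from azureml.pipeline.core import PipelineEndpoint

        ws = Workspace.from_config()  # or Workspace.get(...) inside an Azure Function

        endpoint = PipelineEndpoint.get(workspace=ws, name="my-pipeline-endpoint")
        run = endpoint.submit(experiment_name="my-experiment")  # re-trigger the pipeline
        print(f"Re-submitted pipeline run: {run.id}")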

    2. Custom Retry Logic in Your Script

    If the failure is intermittent (e.g., network hiccups), you can wrap the flaky logic inside your PythonScriptStep's script with retry logic using try/except and a loop, as in the sketch below.
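
    A minimal sketch of such a wrapper follows; the function name, retry counts, and prepare_data are illustrative placeholders, not an Azure ML API.

        import time

        def run_with_retries(fn, max_attempts=3, backoff_seconds=30):
            """Call fn(); on failure, wait and retry, up to max_attempts times."""
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn()
                except Exception as exc:
                    if attempt == max_attempts:
                        raise  # let the step (and the pipeline) fail after the last attempt
                    print(f"Attempt {attempt} failed: {exc}; retrying in {backoff_seconds}s")
                    time.sleep(backoff_seconds)
                    backoff_seconds *= 2  # simple exponential backoff

        # Example usage inside the step's script (prepare_data is a placeholder):
        # run_with_retries(lambda: prepare_data(input_path, output_path))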

    3. Split the Pipeline into Smaller Pipelines

    Break your pipeline into modular sub-pipelines. This way, if a step fails, you only need to re-run the failed sub-pipeline rather than the entire workflow. You can orchestrate these using Logic Apps or a custom controller script like the one sketched below.
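
    A minimal sketch of such a controller, assuming the v2 SDK (azure-ai-ml) and using placeholder names for the workspace details and for your own @dsl.pipeline builder functions:

        import time

        from azure.identity import DefaultAzureCredential
        from azure.ai.ml import MLClient

        ml_client = MLClient(
            DefaultAzureCredential(),
            subscription_id="<subscription-id>",
            resource_group_name="<resource-group>",
            workspace_name="<workspace>",
        )

        def submit_and_wait(pipeline_job, poll_seconds=60):
            """Submit a pipeline job and poll until it reaches a terminal status."""
            job = ml_client.jobs.create_or_update(pipeline_job)
            while True:
                status = ml_client.jobs.get(job.name).status
                if status in ("Completed", "Failed", "Canceled"):
                    return status
                time.sleep(poll_seconds)

        # build_dataprep_pipeline and build_inference_pipeline are placeholders
        # for your own @dsl.pipeline builder functions.
        for build in (build_dataprep_pipeline, build_inference_pipeline):
            status = submit_and_wait(build())
            if status != "Completed":
                status = submit_and_wait(build())  # retry only this sub-pipeline
            if status != "Completed":
                raise RuntimeError(f"Sub-pipeline {build.__name__} failed twice")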

    4. Track Step Outputs and Use allow_reuse=True

    If your steps produce outputs that are cached, Azure ML can skip re-running the successful steps when you manually re-trigger the pipeline. This doesn't retry automatically, but it minimizes redundant computation; a sketch is shown below.
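
    A minimal sketch, assuming the v1 SDK (azureml-core) where allow_reuse is a PythonScriptStep parameter; names and paths are placeholders. (If you are on the v2 SDK, azure-ai-ml, the rough equivalents are a component's is_deterministic flag and the pipeline's settings.force_rerun.)

        from azureml.pipeline.steps import PythonScriptStep

        # Placeholder step definition; reuse only kicks in when the step's
        # inputs, parameters, and source code are unchanged between runs.
        prep_step = PythonScriptStep(
            name="prep_data",
            script_name="prep.py",
            source_directory="./src",
            compute_target="cpu-cluster",
            allow_reuse=True,  # reuse the cached output instead of re-running
        )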

    N.B.: I generated the above answer using Copilot as an AI tool, and I have validated and updated the AI output.

    I hope this helps. Let me know if you have any further questions or need additional assistance.

    Also, if this answers your query, please click "Upvote" and "Accept the answer", which might be beneficial to other community members reading this thread.



  2. Puifai Santisakultarm 0 Reputation points
    2025-06-23T07:54:56.57+00:00
    Regarding option 2, "Custom Retry Logic in Your Script": "If the failure is intermittent (e.g., network hiccups), you can wrap the logic in your PythonScriptStep with retry logic using try/except and a loop."

    Would you please elaborate on how to do this, since my script sets up the pipeline job definitions that run on schedule? Perhaps an example of this?

        # Dataprep
        # (assuming the v2 SDK import: from azure.ai.ml import dsl)
        @property
        def prep_data(self):
            @dsl.pipeline(some_stuff)
            def dataprep_pipeline(some_stuff):
                # do some stuff
                return {"output_file": output_file}

            return dataprep_pipeline
    
