Behavior of 'allow_reuse' in a published pipeline referencing a versioned dataset / Newer datasets are ignored by the pipeline?

ThierryL 141 Reputation points
2022-03-22T05:24:24.217+00:00

Hello,

I have built an Azure ML pipeline which will be called from a Data Factory pipeline at a specified interval.
The ML pipeline contains PythonScriptSteps which take various datasets as an input. All datasets are registered and versioned in the same Azure ML workspace.
The Data Factory pipeline regularly updates the datasets to a new version and then runs the registered ML pipeline. The expected behavior is that the ML pipeline reads the latest version of each dataset every time Data Factory runs it.

Even with the parameter 'allow_reuse' of my PythonScriptStep objects set to 'True', I expect the steps to be executed whenever the input datasets have been updated to a new version, since reuse should only apply when the inputs are unchanged. The problem is that they are not executed: in the Experiment details, each step has its 'Reuse' flag set to 'Yes' and an execution duration of 0 seconds.

When I look at the registered pipeline under 'Azure ML Studio -> Pipelines -> Pipeline endpoints' and click on the input datasets, I can see that they have a 'version' parameter with an old version of the dataset 'hardcoded' in it (see picture below). It seems that once an ML pipeline has been published, it will always use the same dataset version on every run, even if the dataset has been updated in between. Is this the normal behavior?

Here is a simplified version of my pipeline code.

# Imports needed for the pipeline definition
from azureml.core import Workspace, Dataset, Experiment
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

# Get a reference to the workspace
ws = Workspace.from_config()

# Get the registered dataset (resolves to the latest version at definition time)
dataset1 = Dataset.get_by_name(ws, 'dataset1')

# Pipeline's first step
script1 = PythonScriptStep(
    name="Script 1",
    script_name="script.py",
    source_directory="./",
    inputs=[dataset1.as_named_input('dataset1')],  # Passes the dataset path to the script
    compute_target=compute_target,
    runconfig=aml_run_config,
    allow_reuse=True
)

# Build the pipeline
pipeline = Pipeline(workspace=ws, steps=[script1])

# Submit the pipeline to be run
pipeline_run = Experiment(ws, 'experiment').submit(pipeline)
pipeline_run.wait_for_completion()

# Publish the pipeline so it can be triggered from Data Factory
published_pipeline = pipeline.publish(name="pipeline1")
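
For reference, the SDK appears to offer a way around the pinned version: exposing the dataset as a pipeline parameter so the caller can pass a newer dataset at submit time. Below is a minimal sketch, assuming the v1 SDK's PipelineParameter and DatasetConsumptionConfig classes; the parameter name 'input_ds_param' is purely illustrative.

# Minimal sketch (assumption): make the dataset a pipeline parameter so a newer
# version can be passed in at submit time instead of being pinned at publish time.
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig
from azureml.pipeline.core import PipelineParameter

# 'input_ds_param' is an illustrative name; the default value is only a fallback
ds_param = PipelineParameter(name="input_ds_param", default_value=dataset1)
ds_consumption = DatasetConsumptionConfig("dataset1", ds_param)

# The step would then consume the parameter instead of the pinned dataset:
#   inputs=[ds_consumption]
# and a caller could override it per run:
#   Experiment(ws, 'experiment').submit(
#       pipeline, pipeline_parameters={"input_ds_param": newer_dataset})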

[Attachment: 185369-pipeline.jpg, showing the published pipeline's dataset input with a hardcoded 'version' parameter]


2 answers

  1. ThierryL 141 Reputation points
    2022-03-22T23:40:34.007+00:00

    I created and published a simple pipeline with the code below:

    from azureml.core import Workspace, Dataset  
    from azureml.core.runconfig import RunConfiguration  
    from azureml.pipeline.steps import PythonScriptStep  
    from azureml.core.compute import ComputeTarget  
    from azureml.pipeline.core import Pipeline  
      
    ws = Workspace.from_config()  
      
    compute_target = ComputeTarget(workspace=ws, name='DS3-v2-standard-cpu')  
    compute_target.wait_for_completion(show_output=True)  
      
    aml_run_config = RunConfiguration()  
    aml_run_config.target = compute_target  
      
    da_rolled = Dataset.get_by_name(ws, 'da_rolled', version = 'latest')  
      
    step1 = PythonScriptStep(  
        name="Step1",  
        script_name="test.py",  
        source_directory="./",  
        inputs=[da_rolled.as_named_input('da_rolled')],  
        compute_target=compute_target,  
        runconfig=aml_run_config,  
        allow_reuse=False  
    )  
      
    pipeline = Pipeline(workspace=ws, steps=[step1])
      
    published_pipeline = pipeline.publish(name = "TestPipeline")  
    

    You can see in the image below that the version of the dataset (23, which is the latest version at the time the pipeline is published) is hardcoded in the pipeline definition.

    [Attachment: 185806-pipeline.jpg, showing the pipeline definition with dataset version 23 hardcoded]

    And this is my dataset.

    [Attachment: 185891-dataset.jpg, showing the registered dataset at version 23]

    Now if I run my Data Factory pipeline to update the dataset to a new version (which will make it version 24), the version in the pipeline definition will still be 23.
    It seems like I need to republish the pipeline every time the dataset is updated.
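
    If republishing really is required, here is a small sketch of how it could be kept invisible to Data Factory, assuming the v1 SDK's PipelineEndpoint class (the endpoint name 'TestPipelineEndpoint' is made up): republish the rebuilt pipeline and add it as the endpoint's new default version, so the URL that Data Factory calls never changes.

    # Sketch (assumption): keep a stable endpoint while swapping in republished
    # pipelines that reference the newest dataset version.
    from azureml.pipeline.core import PipelineEndpoint

    # Rebuild and republish the pipeline exactly as above, then:
    published_pipeline = pipeline.publish(name="TestPipeline")

    # First time only: create the endpoint
    # endpoint = PipelineEndpoint.publish(workspace=ws, name="TestPipelineEndpoint",
    #                                     pipeline=published_pipeline,
    #                                     description="Stable endpoint for ADF")

    # Subsequent updates: point the existing endpoint at the new pipeline
    endpoint = PipelineEndpoint.get(workspace=ws, name="TestPipelineEndpoint")
    endpoint.add_default_version(published_pipeline)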


  2. annu.shokeen@team.telstra.com 6 Reputation points
    2023-03-15T01:01:39.89+00:00

    Hi @David Lazaridis
    The pipeline needs to be submitted before it is published. When you submit it, you can specify regenerate_outputs=True:

    from azureml.core import Experiment

    test_pipeline_run1 = Experiment(workspace, 'ExperimentName').submit(test_pipeline, regenerate_outputs=True)
    test_pipeline_run1.wait_for_completion()

    After this, the pipeline should be published.

    regenerate_outputs=True will make sure that the output is regenerated every time the pipeline is run. The allow_reuse=False flag on the PythonScriptStep objects will make sure that the new version of the dataset is used.

    These two factors together have made my pipelines work on the new dataset every time. I hope it helps.
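
    A condensed sketch of how the two settings fit together (names such as 'test.py' and 'ExperimentName' are placeholders, and compute_target is assumed to exist, as in the earlier posts):

    # Sketch (assumption): combine allow_reuse=False on the step with
    # regenerate_outputs=True at submit time, then publish.
    from azureml.core import Experiment, Workspace
    from azureml.pipeline.core import Pipeline
    from azureml.pipeline.steps import PythonScriptStep

    ws = Workspace.from_config()

    step = PythonScriptStep(
        name="Step1",
        script_name="test.py",           # placeholder script
        source_directory="./",
        compute_target=compute_target,   # assumed to exist
        allow_reuse=False                # never reuse a cached step result
    )

    pipeline = Pipeline(workspace=ws, steps=[step])

    # Force outputs to be regenerated on this submission
    run = Experiment(ws, 'ExperimentName').submit(pipeline, regenerate_outputs=True)
    run.wait_for_completion()

    # Publish only after the pipeline has been submitted once
    published = pipeline.publish(name="TestPipeline")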