Passing data between AzureML pipeline steps with OutputFileDatasetConfig: difference between 'inputs/outputs' and 'arguments'?

asked 2022-06-02T07:57:23.45+00:00
ThierryL 136 Reputation points

Hello,

I have been successfully building and operating machine learning pipelines with Azure SDK, but there is something I fail to fully understand, and I'm wondering if my code can be simplified in some way.

Let's say I have a simple pipeline with two steps: the first step processes data located at 'training_data_path' in Blob storage and then saves it to the same location, and the second step reads that processed data to do something else. So my code is as follows:

from azureml.core import Workspace
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
def_data_store = ws.get_default_datastore()
training_data_path = (def_data_store, 'training_data')

step_1_config = OutputFileDatasetConfig(destination=training_data_path)
step_2_config = OutputFileDatasetConfig(destination=training_data_path)

step_1 = PythonScriptStep(
    name="Step 1",
    script_name="step_1.py",
    source_directory="./",
    outputs=[step_1_config],
    arguments=["--training-data-path", step_1_config],
    compute_target=compute_target,
    runconfig=aml_run_config,
    allow_reuse=False
)

step_2 = PythonScriptStep(
    name="Step 2",
    script_name="step_2.py",
    source_directory="./",
    inputs=[step_1_config.as_input('training_data')],
    arguments=["--training-data-path", step_2_config],
    compute_target=compute_target,
    runconfig=aml_run_config,
    allow_reuse=False
)

I have two questions about that:

1) Even though the path to the data is the same in each step, it seems like I have to create a separate OutputFileDatasetConfig object for each step. So if my pipeline has 10 steps, I will create step_1_config, step_2_config, step_3_config... Isn't there a way to reuse the same OutputFileDatasetConfig object for multiple steps?

2) As far as I know, in step 2 I could delete the 'inputs' parameter and modify the 'arguments' parameter as follows, and the result would be the same.

step_2 = PythonScriptStep(
    name="Step 2",
    script_name="step_2.py",
    source_directory="./",
    arguments=["--training-data-path", step_1_config.as_input('training_data')],
    compute_target=compute_target,
    runconfig=aml_run_config,
    allow_reuse=False
)

My question is: is there any difference between specifying the input with both the 'inputs' and 'arguments' parameters vs. using only the 'arguments' parameter?

Thanks.


Accepted answer
answered 2022-06-03T13:02:24.427+00:00
romungi-MSFT 27,006 Reputation points · Microsoft Employee

@ThierryL-3166 I think the recommendation to use separate OutputFileDatasetConfig objects for different steps is to avoid concurrent writes to a single object. As stated in a note in the documentation:

    Concurrent writes to a OutputFileDatasetConfig will fail. Do not attempt to use a single OutputFileDatasetConfig concurrently. Do not share a single OutputFileDatasetConfig in a multiprocessing situation, such as when using distributed training.  
    

If your steps do not run in parallel, you can try reusing a single object and check whether it works for your scenario.
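For the strictly sequential case described in the question, one pattern that avoids creating a second OutputFileDatasetConfig is to declare the output once in step 1 and let step 2 consume that same object through as_input(). This is a minimal sketch only, reusing the names from the question (def_data_store, compute_target, aml_run_config and the step scripts are assumed to exist); it also assumes step 2 only reads, never writes, so the concurrent-write caveat above does not apply:

```python
# Sketch: one config object, written by step 1, read by step 2.
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import PythonScriptStep

processed = OutputFileDatasetConfig(destination=(def_data_store, 'training_data'))

step_1 = PythonScriptStep(
    name="Step 1",
    script_name="step_1.py",
    source_directory="./",
    arguments=["--training-data-path", processed],  # consumed here as an output
    compute_target=compute_target,
    runconfig=aml_run_config,
    allow_reuse=False
)

step_2 = PythonScriptStep(
    name="Step 2",
    script_name="step_2.py",
    source_directory="./",
    # Same object, now consumed as an input; this also makes step 2
    # depend on step 1, so the steps cannot run concurrently.
    arguments=["--training-data-path", processed.as_input('training_data')],
    compute_target=compute_target,
    runconfig=aml_run_config,
    allow_reuse=False
)
```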

With respect to using inputs or arguments: if you pass a dataset through 'arguments', its value is delivered to the step's script as a command-line argument, so you need an argparse parser in the script to retrieve it. If you pass it through 'inputs', the same value is instead made available through the run context (Run.get_context().input_datasets) in the script. The "access datasets within script" section of the documentation provides an example for a train and a test dataset, where train is passed with arguments and test with inputs:

    smaller_dataset = iris_dataset.take_sample(0.1, seed=seed)  # keep 10% of the data
    train, test = smaller_dataset.random_split(percentage=0.8, seed=seed)

    # In pipeline definition script:
    # Code for demonstration only: it would be very confusing to split
    # datasets between `arguments` and `inputs`
    train_step = PythonScriptStep(
        name="train_data",
        script_name="train.py",
        compute_target=cluster,
        arguments=['--training-folder', train.as_named_input('train').as_download()],
        inputs=[test.as_named_input('test').as_download()]
    )

    # In the step's script (train.py):
    import argparse
    from azureml.core import Run

    parser = argparse.ArgumentParser()
    parser.add_argument('--training-folder', type=str, dest='train_folder',
                        help='training data folder mounting point')
    args = parser.parse_args()
    training_data_folder = args.train_folder

    testing_data_folder = Run.get_context().input_datasets['test']
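    To make the 'arguments' path concrete, here is a small stand-alone sketch of the parsing a step script does. The folder value is hypothetical and supplied explicitly; in a real run, AzureML substitutes the dataset's mount/download path for the config object when it builds the command line:

    ```python
    import argparse

    # Same parser shape a step script would use to receive the path
    # that the pipeline passes via `arguments`.
    parser = argparse.ArgumentParser()
    parser.add_argument('--training-folder', type=str, dest='train_folder',
                        help='training data folder mounting point')

    # Simulate the command line AzureML would build, using a made-up path.
    args = parser.parse_args(['--training-folder', '/mnt/azureml/train'])
    print(args.train_folder)  # -> /mnt/azureml/train
    ```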
    
