DataTransferStep Class

Creates an Azure ML Pipeline step that transfers data between storage options.

DataTransferStep supports common storage types such as Azure Blob Storage and Azure Data Lake as sources and sinks. For more information, see the Remarks section.

For an example of using DataTransferStep, see the notebook at https://aka.ms/pl-data-trans.

Inheritance
azureml.pipeline.core._data_transfer_step_base._DataTransferStepBase
DataTransferStep

Constructor

DataTransferStep(name, source_data_reference=None, destination_data_reference=None, compute_target=None, source_reference_type=None, destination_reference_type=None, allow_reuse=True)

Parameters

name
str
Required

[Required] The name of the step.

source_data_reference
Union[InputPortBinding, DataReference, PortDataReference, PipelineData]
default value: None

[Required] An input connection that serves as the source of the data transfer operation.

destination_data_reference
Union[InputPortBinding, PipelineOutputAbstractDataset, DataReference]
default value: None

[Required] An output connection that serves as the destination of the data transfer operation.

compute_target
DataFactoryCompute, str
default value: None

[Required] An Azure Data Factory to use for transferring data.

source_reference_type
str
default value: None

An optional string specifying the type of source_data_reference. Possible values include 'file' and 'directory'. When not specified, the type of the existing path is used. Use this parameter to differentiate between a file and a directory of the same name.

destination_reference_type
str
default value: None

An optional string specifying the type of destination_data_reference. Possible values include 'file' and 'directory'. When not specified, Azure ML uses the type of the existing path, the source reference type, or 'directory', in that order.

allow_reuse
bool
default value: True

Indicates whether the step should reuse previous results when re-run with the same settings. Reuse is enabled by default. If step arguments remain unchanged, the output from the previous run of this step is reused. When reusing the step, instead of transferring data again, the results from the previous run are immediately made available to any subsequent steps. If you use Azure Machine Learning datasets as inputs, reuse is determined by whether the dataset's definition has changed, not by whether the underlying data has changed.
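
The following is a minimal construction sketch. The datastore names, paths, and the Data Factory compute name ("adf_compute") are assumptions for illustration; substitute the resources registered and attached in your own workspace.


   from azureml.core import Workspace, Datastore
   from azureml.core.compute import DataFactoryCompute
   from azureml.data.data_reference import DataReference
   from azureml.pipeline.steps import DataTransferStep

   ws = Workspace.from_config()

   # Source: a file on a registered blob datastore (name and path are assumptions)
   blob_datastore = Datastore.get(ws, "workspaceblobstore")
   source_ref = DataReference(datastore=blob_datastore,
                              data_reference_name="input_data",
                              path_on_datastore="raw/input.csv")

   # Destination: a path on an Azure Data Lake datastore (name and path are assumptions)
   adls_datastore = Datastore.get(ws, "adls_datastore")
   destination_ref = DataReference(datastore=adls_datastore,
                                   data_reference_name="copied_data",
                                   path_on_datastore="curated/input.csv")

   # An Azure Data Factory compute already attached to the workspace (name is an assumption)
   data_factory = DataFactoryCompute(ws, "adf_compute")

   data_transfer_step = DataTransferStep(name="copy data",
                                         source_data_reference=source_ref,
                                         destination_data_reference=destination_ref,
                                         source_reference_type="file",
                                         compute_target=data_factory)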

Remarks

This step supports the following storage types as sources and sinks except where noted:

  • Azure Blob Storage

  • Azure Data Lake Storage Gen1 and Gen2

  • Azure SQL Database

  • Azure Database for PostgreSQL

  • Azure Database for MySQL

For Azure SQL Database, you must use service principal authentication. For more information, see Service Principal Authentication. For an example of using service principal authentication for Azure SQL Database, see https://aka.ms/pl-data-trans.
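
As a sketch of that requirement, an Azure SQL Database datastore can be registered with service principal credentials before it is used as a source or sink; the workspace object ws, datastore name, server, database, and credential values below are placeholders.


   from azureml.core import Datastore

   # Register an Azure SQL Database datastore using service principal
   # authentication (all names and credential values are placeholders).
   sql_datastore = Datastore.register_azure_sql_database(
       workspace=ws,
       datastore_name="adftestsql",
       server_name="my-sql-server",
       database_name="my-database",
       tenant_id="<tenant-id>",
       client_id="<service-principal-client-id>",
       client_secret="<service-principal-secret>")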

To establish data dependency between steps, use the get_output method to get a PipelineData object that represents the output of this data transfer step and can be used as input for later steps in the pipeline.


   data_transfer_step = DataTransferStep(name="copy data", ...)

   # Use output of data_transfer_step as input of another step in pipeline
   # This will make training_step wait for data_transfer_step to complete
   training_input = data_transfer_step.get_output()
   training_step = PythonScriptStep(script_name="train.py",
                           arguments=["--model", training_input],
                           inputs=[training_input],
                           compute_target=aml_compute,
                           source_directory=source_directory)
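
As a sketch of how these steps come together, the pipeline below is built from both steps and submitted as an experiment run; the workspace object ws and the experiment name are assumptions.


   from azureml.core import Experiment
   from azureml.pipeline.core import Pipeline

   # Build a pipeline containing both steps and submit it as an experiment run
   pipeline = Pipeline(workspace=ws, steps=[data_transfer_step, training_step])
   pipeline_run = Experiment(ws, "data-transfer-demo").submit(pipeline)
   pipeline_run.wait_for_completion()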

To create an InputPortBinding with a specific name, you can combine the output of get_output() with the as_input or as_mount methods of PipelineData.


   data_transfer_step = DataTransferStep(name="copy data", ...)
   training_input = data_transfer_step.get_output().as_input("my_input_name")

Methods

create_node

Create a node from the DataTransfer step and add it to the given graph.

This method is not intended to be used directly. When a pipeline is instantiated with this step, Azure ML automatically passes the required parameters through this method so that the step can be added to a pipeline graph that represents the workflow.

get_output

Get the output of the step as PipelineData.

create_node

Create a node from the DataTransfer step and add it to the given graph.

This method is not intended to be used directly. When a pipeline is instantiated with this step, Azure ML automatically passes the required parameters through this method so that the step can be added to a pipeline graph that represents the workflow.

create_node(graph, default_datastore, context)

Parameters

graph
Graph
Required

The graph object to add the node to.

default_datastore
Union[AbstractAzureStorageDatastore, AzureDataLakeDatastore]
Required

The default datastore.

context
azureml.pipeline.core._GraphContext
Required

The graph context.

Returns

The created node.

Return type

Node

get_output

Get the output of the step as PipelineData.

get_output()

Returns

The output of the step.

Return type

PipelineData

Remarks

To establish data dependency between steps, use the get_output method to get a PipelineData object that represents the output of this data transfer step and can be used as input for later steps in the pipeline.


   data_transfer_step = DataTransferStep(name="copy data", ...)

   # Use output of data_transfer_step as input of another step in pipeline
   # This will make training_step wait for data_transfer_step to complete
   training_input = data_transfer_step.get_output()
   training_step = PythonScriptStep(script_name="train.py",
                           arguments=["--model", training_input],
                           inputs=[training_input],
                           compute_target=aml_compute,
                           source_directory=source_directory)

To create an InputPortBinding with a specific name, you can combine the get_output() call with the as_input or as_mount helper methods of PipelineData.


   data_transfer_step = DataTransferStep(name="copy data", ...)

   training_input = data_transfer_step.get_output().as_input("my_input_name")