AdlaStep Class

Creates an Azure ML Pipeline step to run a U-SQL script with Azure Data Lake Analytics.

For an example of using AdlaStep, see the notebook https://aka.ms/pl-adla.

Inheritance
azureml.pipeline.core._adla_step_base._AdlaStepBase
AdlaStep

Constructor

AdlaStep(script_name, name=None, inputs=None, outputs=None, params=None, degree_of_parallelism=None, priority=None, runtime_version=None, compute_target=None, source_directory=None, allow_reuse=True, version=None, hash_paths=None)

Parameters

script_name
str
Required

[Required] The name of a U-SQL script, relative to source_directory.

name
str
default value: None

The name of the step. If unspecified, script_name is used.

inputs
list[Union[InputPortBinding, DataReference, PortDataReference, PipelineData]]
default value: None

A list of input port bindings.

outputs
list[Union[PipelineData, PipelineOutputAbstractDataset, OutputPortBinding]]
default value: None

A list of output port bindings.

params
dict
default value: None

A dictionary of name-value pairs.

degree_of_parallelism
int
default value: None

The degree of parallelism to use for this job. This must be greater than 0; if set to a value less than 0, it defaults to 1.

priority
int
default value: None

The priority value to use for the current job. Lower numbers have a higher priority. By default, a job has a priority of 1000. The value you specify must be greater than 0.

runtime_version
str
default value: None

The runtime version of the Data Lake Analytics engine.

compute_target
AdlaCompute, str
default value: None

[Required] The ADLA compute to use for this job.

source_directory
str
default value: None

A folder that contains the script, assemblies, and other dependencies.

allow_reuse
bool
default value: True

Indicates whether the step should reuse previous results when re-run with the same settings. Reuse is enabled by default. If the step contents (scripts/dependencies) as well as inputs and parameters remain unchanged, the output from the previous run of this step is reused. When reusing the step, instead of submitting the job to compute, the results from the previous run are immediately made available to any subsequent steps. If you use Azure Machine Learning datasets as inputs, reuse is determined by whether the dataset's definition has changed, not by whether the underlying data has changed.

version
str
default value: None

Optional version tag to denote a change in functionality for the step.

hash_paths
list
default value: None

DEPRECATED: no longer needed.

A list of paths to hash when checking for changes to the step contents. If no changes are detected, the pipeline reuses the step contents from a previous run. By default, the contents of source_directory are hashed, except for files listed in .amlignore or .gitignore.
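The hashing-based reuse described for allow_reuse and hash_paths can be sketched in isolation. This is a simplified illustration of the fingerprinting idea, not the SDK's actual implementation: a step is eligible for reuse only when the combined hash of its contents and parameters is unchanged.

```python
import hashlib
import json

def step_fingerprint(script_text: str, params: dict) -> str:
    # Simplified illustration: key reuse on a hash of the step's
    # script contents and parameters. The real service also hashes
    # the files under source_directory, honoring .amlignore/.gitignore.
    payload = json.dumps({"script": script_text, "params": params},
                         sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Identical contents and parameters produce the same fingerprint,
# so a re-run can reuse the previous step's outputs; any edit to
# the script or the params invalidates the match.
unchanged = step_fingerprint("SELECT 1;", {"x": "1"})
rerun = step_fingerprint("SELECT 1;", {"x": "1"})
edited = step_fingerprint("SELECT 2;", {"x": "1"})
```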

Remarks

You can use @@name@@ syntax in your script to refer to inputs, outputs, and params.

  • If name is the name of an input or output port binding, any occurrences of @@name@@ in the script are replaced with the actual data path of the corresponding port binding.

  • If name matches a key in the params dict, any occurrences of @@name@@ are replaced with the corresponding value in the dict.
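To make the substitution concrete, here is a small standalone sketch of the @@name@@ replacement semantics. The substitute_placeholders helper is hypothetical; the ADLA service performs this replacement at job submission, not your code:

```python
import re

def substitute_placeholders(script: str, bindings: dict) -> str:
    # Replace each @@name@@ token whose name appears in bindings
    # (port binding names or params keys); unknown names are left
    # untouched. Illustrative only -- not part of the azureml SDK.
    def repl(match):
        name = match.group(1)
        return bindings.get(name, match.group(0))
    return re.sub(r"@@(\w+)@@", repl, script)

# A U-SQL fragment referring to an output port binding by name:
usql = 'OUTPUT @rows TO "@@employee_output@@" USING Outputters.Csv();'
substituted = substitute_placeholders(usql, {"employee_output": "/out/names.csv"})
# substituted now contains the actual data path of the bound output
```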

AdlaStep works only with data stored in the default Data Lake Storage of the Data Lake Analytics account. If the data is in a non-default storage, use a DataTransferStep to copy the data to the default storage. You can find the default storage by opening your Data Lake Analytics account in the Azure portal and then navigating to 'Data sources' item under Settings in the left pane.

The following example shows how to use AdlaStep in an Azure Machine Learning Pipeline.


   from azureml.pipeline.steps import AdlaStep

   adla_step = AdlaStep(
       name='extract_employee_names',
       script_name='sample_script.usql',
       source_directory=sample_folder,
       inputs=[sample_input],
       outputs=[sample_output],
       compute_target=adla_compute)

The full sample is available at https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-use-adla-as-compute-target.ipynb

Methods

create_node

Create a node from the AdlaStep step and add it to the specified graph.

This method is not intended to be used directly. When a pipeline is instantiated with this step, Azure ML automatically passes the parameters required through this method so that the step can be added to a pipeline graph that represents the workflow.


create_node(graph, default_datastore, context)

Parameters

graph
Graph
Required

The graph object.

default_datastore
Union[AbstractAzureStorageDatastore, AzureDataLakeDatastore]
Required

The default datastore.

context
<xref:azureml.pipeline.core._GraphContext>
Required

The graph context.

Returns

The node object.

Return type

Node