SynapseSparkStep Class

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Creates an Azure ML Synapse step that submits and executes a Python script.

Create an Azure ML Pipeline step that runs a Spark job on a Synapse Spark pool.

Inheritance
azureml.pipeline.core._synapse_spark_step_base._SynapseSparkStepBase
SynapseSparkStep

Constructor

SynapseSparkStep(file, source_directory, compute_target, driver_memory, driver_cores, executor_memory, executor_cores, num_executors, name=None, app_name=None, environment=None, arguments=None, inputs=None, outputs=None, conf=None, py_files=None, jars=None, files=None, allow_reuse=True, version=None)

Parameters

file
str
Required

The name of a Synapse script relative to source_directory.

source_directory
str
Required

A folder that contains the Python script, Conda environment, and other resources used in the step.

compute_target
SynapseCompute or str
Required

The compute target to use; see the attach sketch after this parameter list.

driver_memory
str
Required

Amount of memory to use for the driver process.

driver_cores
int
Required

Number of cores to use for the driver process.

executor_memory
str
Required

Amount of memory to use per executor process.

executor_cores
int
Required

Number of cores to use for each executor.

num_executors
int
Required

Number of executors to launch for this session.

name
str
Required

The name of the step. If unspecified, file is used.

app_name
str
Required

The app name used to submit the Apache Spark job.

environment
Environment
Required

The AML environment that will be used in this SynapseSparkStep.

arguments
list
Required

Command line arguments for the Synapse script file.

inputs
list[DatasetConsumptionConfig]
Required

A list of inputs.

outputs
list[HDFSOutputDatasetConfig]
Required

A list of outputs.

conf
dict
Required

Spark configuration properties.

py_files
list
Required

Python files to be used in this session; a parameter of the Livy API.

jars
list
Required

JAR files to be used in this session; a parameter of the Livy API.

files
list
Required

Files to be used in this session; a parameter of the Livy API.

allow_reuse
bool
Required

Indicates if the step should reuse previous results when re-run with the same settings.

version
str
Required

An optional version tag to denote a change in functionality for the step.
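
The compute_target parameter accepts either the name of an attached Synapse compute or a SynapseCompute object. The following is a minimal sketch of attaching a Synapse Spark pool as an Azure ML compute target, assuming a Synapse workspace has already been linked to the Azure ML workspace; the linked service name, pool name, and compute name are illustrative placeholders.

   from azureml.core import LinkedService, Workspace
   from azureml.core.compute import ComputeTarget, SynapseCompute

   ws = Workspace.from_config()

   # Retrieve a previously registered linked service for the Synapse workspace
   # ("synapse_link" is a placeholder name).
   linked_service = LinkedService.get(ws, "synapse_link")

   # Attach a Spark pool from the linked Synapse workspace as a compute target;
   # "spark_pool" and "synapse" are placeholder names.
   attach_config = SynapseCompute.attach_configuration(linked_service=linked_service,
                                                       type="SynapseSpark",
                                                       pool_name="spark_pool")
   synapse_compute = ComputeTarget.attach(ws, "synapse", attach_config)
   synapse_compute.wait_for_completion()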

Remarks

A SynapseSparkStep is a basic, built-in step to run a Python Spark job on a Synapse Spark pool. It takes a main file name and other optional parameters such as arguments for the script, compute target, inputs, and outputs.

The best practice for working with SynapseSparkStep is to use a separate folder for scripts and any dependent files associated with the step, and specify that folder with the source_directory parameter. Following this best practice has two benefits. First, it helps reduce the size of the snapshot created for the step because only what is needed for the step is snapshotted. Second, the step's output from a previous run can be reused if there are no changes to the source_directory that would trigger a re-upload of the snapshot.


   from azureml.core import Dataset, Workspace
   from azureml.data import HDFSOutputDatasetConfig
   from azureml.pipeline.steps import SynapseSparkStep

   ws = Workspace.from_config()

   # Get the input dataset
   input_ds = Dataset.get_by_name(ws, "weather_ds").as_named_input("weather_ds")

   # Register the pipeline output as a dataset
   output_ds = HDFSOutputDatasetConfig("synapse_step_output",
                                       destination=(ws.datastores['datastore'], "dir")
                                       ).register_on_complete(name="registered_dataset")

   step_1 = SynapseSparkStep(
       name="synapse_step",
       file="pyspark_job.py",
       source_directory="./script",
       inputs=[input_ds],
       outputs=[output_ds],
       compute_target="synapse",
       driver_memory="7g",
       driver_cores=4,
       executor_memory="7g",
       executor_cores=2,
       num_executors=1,
       conf={})

SynapseSparkStep only supports DatasetConsumptionConfig as input and HDFSOutputDatasetConfig as output.
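
Continuing the example above, the configured step can be added to a pipeline and submitted as an experiment run. This is a minimal sketch; the experiment name used here is illustrative.

   from azureml.core import Experiment
   from azureml.pipeline.core import Pipeline

   # Build a pipeline containing the Synapse step and submit it as an experiment run.
   pipeline = Pipeline(workspace=ws, steps=[step_1])
   pipeline_run = Experiment(ws, "synapse_spark_experiment").submit(pipeline)
   pipeline_run.wait_for_completion(show_output=True)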

Methods

create_node

Create a node for the Synapse script step.

This method is not intended to be used directly. When a pipeline is instantiated with this step, Azure ML automatically passes the parameters required through this method so that the step can be added to a pipeline graph that represents the workflow.

create_node(graph, default_datastore, context)

Parameters

graph
Graph
Required

The graph object to add the node to.

default_datastore
Union[AbstractAzureStorageDatastore, AzureDataLakeDatastore]
Required

The default datastore.

context
azureml.pipeline.core._GraphContext
Required

The graph context.

Returns

The created node.

Return type

Node