SynapseSparkStep Class
Note
This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Creates an Azure ML Synapse step that submits and executes a Python script.
Create an Azure ML Pipeline step that runs a Spark job on a Synapse Spark pool.
- Inheritance
  - azureml.pipeline.core._synapse_spark_step_base._SynapseSparkStepBase
    - SynapseSparkStep
Constructor
SynapseSparkStep(file, source_directory, compute_target, driver_memory, driver_cores, executor_memory, executor_cores, num_executors, name=None, app_name=None, environment=None, arguments=None, inputs=None, outputs=None, conf=None, py_files=None, jars=None, files=None, allow_reuse=True, version=None)
Parameters
- source_directory
- str
A folder that contains the Python script, conda environment definition, and other resources used in the step.
- allow_reuse
- bool
Indicates whether the step should reuse previous results when re-run with the same settings.
Remarks
A SynapseSparkStep is a basic, built-in step to run a Python Spark job on a Synapse Spark pool. It takes a main file name and other optional parameters such as arguments for the script, compute target, inputs, and outputs.
The best practice for working with SynapseSparkStep is to use a separate folder for scripts and any dependent files associated with the step, and specify that folder with the source_directory parameter. Following this best practice has two benefits. First, it helps reduce the size of the snapshot created for the step because only what is needed for the step is snapshotted. Second, the step's output from a previous run can be reused if there are no changes to the source_directory that would trigger a re-upload of the snapshot.
from azureml.core import Dataset, Workspace
from azureml.data import HDFSOutputDatasetConfig
from azureml.pipeline.steps import SynapseSparkStep

ws = Workspace.from_config()

# get input dataset
input_ds = Dataset.get_by_name(ws, "weather_ds").as_named_input("weather_ds")

# register pipeline output as dataset
output_ds = HDFSOutputDatasetConfig("synapse_step_output",
                                    destination=(ws.datastores['datastore'], "dir")
                                    ).register_on_complete(name="registered_dataset")

step_1 = SynapseSparkStep(
    name="synapse_step",
    file="pyspark_job.py",
    source_directory="./script",
    inputs=[input_ds],
    outputs=[output_ds],
    compute_target="synapse",
    driver_memory="7g",
    driver_cores=4,
    executor_memory="7g",
    executor_cores=2,
    num_executors=1,
    conf={})
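Once the step is defined, it can be added to a pipeline and submitted as an experiment run. A minimal sketch, assuming the ws workspace object from above and an experiment name chosen for illustration:

from azureml.core import Experiment
from azureml.pipeline.core import Pipeline

# Build a pipeline containing the Synapse step and submit it as an experiment run.
pipeline = Pipeline(workspace=ws, steps=[step_1])
pipeline_run = Experiment(ws, "synapse_pipeline_sample").submit(pipeline)
pipeline_run.wait_for_completion()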
SynapseSparkStep only supports DatasetConsumptionConfig as input and HDFSOutputDatasetConfig as output.
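For context, a hypothetical sketch of what the driver script (pyspark_job.py above) might look like. It assumes the resolved input and output paths are forwarded to the script through the step's arguments parameter (for example, arguments=["--input", input_ds, "--output", output_ds]); the argument names, the parquet format, and the temperature column are illustrative assumptions, not part of the example above.

# pyspark_job.py -- illustrative sketch, not the SDK's reference script
import argparse
from pyspark.sql import SparkSession

parser = argparse.ArgumentParser()
parser.add_argument("--input")   # resolved HDFS path of the input dataset (assumed wiring)
parser.add_argument("--output")  # resolved HDFS path of the output location (assumed wiring)
args = parser.parse_args()

spark = SparkSession.builder.getOrCreate()

# Read the input, apply a trivial transformation, and write the result.
df = spark.read.parquet(args.input)
df.filter(df["temperature"] > 0).write.mode("overwrite").parquet(args.output)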
Methods
create_node
Create a node for the Synapse script step. This method is not intended to be used directly. When a pipeline is instantiated with this step, Azure ML automatically passes the required parameters through this method so that the step can be added to a pipeline graph that represents the workflow.
create_node
Create a node for the Synapse script step.
This method is not intended to be used directly. When a pipeline is instantiated with this step, Azure ML automatically passes the required parameters through this method so that the step can be added to a pipeline graph that represents the workflow.
create_node(graph, default_datastore, context)
Parameters
- graph
- Graph
The graph object to add the node to.
- default_datastore
- Union[AbstractAzureStorageDatastore, AzureDataLakeDatastore]
The default datastore.
- context
- azureml.pipeline.core._GraphContext
The graph context.
Returns
The created node.
Return type
- Node