DatabricksStep Class
Creates an Azure ML Pipeline step to add a DataBricks notebook, Python script, or JAR as a node.
For an example of using DatabricksStep, see the notebook https://aka.ms/pl-databricks.
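The linked notebook has a full walkthrough; the snippet below is only a minimal construction sketch. The compute name, notebook path, cluster sizing, and experiment name are placeholder assumptions you would replace with your own.

```python
from azureml.core import Workspace
from azureml.core.compute import DatabricksCompute
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import DatabricksStep

ws = Workspace.from_config()

# "databricks-compute" is assumed to be a Databricks workspace already attached
# to this Azure ML workspace as a compute target.
databricks_compute = DatabricksCompute(workspace=ws, name="databricks-compute")

# Run a notebook that already exists in the Databricks workspace on a new job cluster.
notebook_step = DatabricksStep(
    name="run-notebook",
    notebook_path="/Users/someone@example.com/my_notebook",  # placeholder path
    notebook_params={"myparam": "testparam"},
    spark_version="10.4.x-scala2.12",
    node_type="Standard_D3_v2",
    num_workers=2,
    compute_target=databricks_compute,
    allow_reuse=True,
)

pipeline = Pipeline(workspace=ws, steps=[notebook_step])
run = pipeline.submit(experiment_name="databricks-step-demo")  # placeholder experiment name
```

Later sketches on this page reuse ws, databricks_compute, and the DatabricksStep import from this snippet.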
Inheritance
azureml.pipeline.core._databricks_step_base._DatabricksStepBase -> DatabricksStep
Constructor
DatabricksStep(name, inputs=None, outputs=None, existing_cluster_id=None, spark_version=None, node_type=None, instance_pool_id=None, num_workers=None, min_workers=None, max_workers=None, spark_env_variables=None, spark_conf=None, init_scripts=None, cluster_log_dbfs_path=None, notebook_path=None, notebook_params=None, python_script_path=None, python_script_params=None, main_class_name=None, jar_params=None, python_script_name=None, source_directory=None, hash_paths=None, run_name=None, timeout_seconds=None, runconfig=None, maven_libraries=None, pypi_libraries=None, egg_libraries=None, jar_libraries=None, rcran_libraries=None, compute_target=None, allow_reuse=True, version=None, permit_cluster_restart=None)
Parameters
- inputs
- list[Union[InputPortBinding, DataReference, PortDataReference, PipelineData]]
A list of input connections for data consumed by this step. Fetch this inside the notebook using dbutils.widgets.get("input_name"). Can be DataReference or PipelineData. DataReference represents an existing piece of data on a datastore; essentially this is a path on a datastore. DatabricksStep supports datastores that encapsulate DBFS, Azure Blob storage, or ADLS v1. PipelineData represents intermediate data produced by another step in a pipeline.
- outputs
- list[Union[OutputPortBinding, PipelineOutputDataset, PipelineData]]
A list of output port definitions for outputs produced by this step. Fetch this inside the notebook using dbutils.widgets.get("output_name"). Should be PipelineData.
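Inside a notebook-based step, the wired-in input and output paths arrive as widgets. A sketch (the widget names input_name and output_name are assumptions; they must match the data_reference_name/PipelineData name you used when defining the step):

```python
# Runs inside the Databricks notebook executed by the step.
# dbutils and spark are provided by the Databricks runtime.
input_path = dbutils.widgets.get("input_name")    # must match the input's name
output_path = dbutils.widgets.get("output_name")  # must match the PipelineData name

df = spark.read.csv(input_path, header=True)      # read the step input
df.write.mode("overwrite").parquet(output_path)   # write the step output
```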
- existing_cluster_id
- str
A cluster ID of an existing interactive cluster on the Databricks workspace. If you are passing this parameter, you cannot pass any of the following parameters which are used to create a new cluster:
- spark_version
- node_type
- instance_pool_id
- num_workers
- min_workers
- max_workers
- spark_env_variables
- spark_conf
Note: To create a new job cluster, you need to pass the above parameters. You can pass these parameters directly, or as part of the RunConfiguration object using the runconfig parameter. Passing these parameters both directly and through RunConfiguration results in an error.
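A sketch of the two mutually exclusive ways to choose compute for the step (reusing ws, databricks_compute, and the imports from the first sketch; the cluster ID and sizing values are placeholders):

```python
# Option 1: run on an existing interactive cluster; none of the new-cluster
# parameters (spark_version, node_type, num_workers, ...) may be passed.
step_on_existing = DatabricksStep(
    name="on-existing-cluster",
    notebook_path="/Users/someone@example.com/my_notebook",
    existing_cluster_id="0422-183004-abcd1234",   # placeholder cluster ID
    compute_target=databricks_compute,
)

# Option 2: create a new job cluster by passing the sizing parameters instead.
step_on_new_cluster = DatabricksStep(
    name="on-new-cluster",
    notebook_path="/Users/someone@example.com/my_notebook",
    spark_version="10.4.x-scala2.12",
    node_type="Standard_D3_v2",
    num_workers=2,
    spark_env_variables={"PYSPARK_PYTHON": "/databricks/python3/bin/python3"},
    compute_target=databricks_compute,
)
```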
- spark_version
- str
The version of spark for the Databricks run cluster, for example: "10.4.x-scala2.12".
For more information, see the description for the existing_cluster_id
parameter.
- node_type
- str
[Required] The Azure VM node type for the Databricks run cluster,
for example: "Standard_D3_v2". Specify either node_type or instance_pool_id.
For more information, see the description for the existing_cluster_id
parameter.
- instance_pool_id
- str
[Required] The instance pool ID to which the cluster needs to be attached.
Specify either node_type or instance_pool_id.
For more information, see the description for the existing_cluster_id
parameter.
- num_workers
- int
[Required] The static number of workers for the Databricks run cluster.
You must specify either num_workers or both min_workers and max_workers.
For more information, see the description for the existing_cluster_id
parameter.
- min_workers
- int
[Required] The minimum number of workers to use for auto-scaling the Databricks run cluster.
You must specify either num_workers or both min_workers and max_workers.
For more information, see the description for the existing_cluster_id
parameter.
- max_workers
- int
[Required] The maximum number of workers to use for auto-scaling the Databricks run cluster.
You must specify either num_workers or both min_workers and max_workers.
For more information, see the description for the existing_cluster_id
parameter.
- spark_env_variables
- dict
The spark environment variables for the Databricks run cluster.
For more information, see the description for the existing_cluster_id
parameter.
- spark_conf
- dict
The spark configuration for the Databricks run cluster.
For more information, see the description for the existing_cluster_id
parameter.
- cluster_log_dbfs_path
- str
The DBFS path where cluster logs are to be delivered.
- notebook_path
- str
[Required] The path to the notebook in the Databricks instance. This class allows four ways of specifying the code to execute on the Databricks cluster.
To execute a notebook that is present in the Databricks workspace, use: notebook_path=notebook_path, notebook_params={'myparam': 'testparam'}
To execute a Python script that is present in DBFS, use: python_script_path=python_script_dbfs_path, python_script_params=['arg1', 'arg2']
To execute a JAR that is present in DBFS, use: main_class_name=main_jar_class_name, jar_params=['arg1', 'arg2'], jar_libraries=[JarLibrary(jar_library_dbfs_path)]
To execute a Python script that is present on your local machine, use: python_script_name=python_script_name, source_directory=source_directory
Specify exactly one of notebook_path, python_script_path, python_script_name, or main_class_name.
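Sketches of the other three variants (DBFS Python script, DBFS JAR, and local Python script), again reusing databricks_compute from the first sketch; every path, class name, and argument below is a placeholder:

```python
from azureml.core.runconfig import JarLibrary

# A Python script that has already been uploaded to DBFS.
dbfs_script_step = DatabricksStep(
    name="dbfs-script",
    python_script_path="dbfs:/scripts/train.py",
    python_script_params=["--epochs", "10"],
    existing_cluster_id="0422-183004-abcd1234",
    compute_target=databricks_compute,
)

# A JAR that has already been uploaded to DBFS.
jar_step = DatabricksStep(
    name="jar-step",
    main_class_name="com.example.Main",
    jar_params=["arg1", "arg2"],
    jar_libraries=[JarLibrary("dbfs:/jars/my_job.jar")],
    existing_cluster_id="0422-183004-abcd1234",
    compute_target=databricks_compute,
)

# A Python script on the local machine; source_directory is copied to DBFS at submit time.
local_script_step = DatabricksStep(
    name="local-script",
    python_script_name="train.py",
    source_directory="./scripts",
    existing_cluster_id="0422-183004-abcd1234",
    compute_target=databricks_compute,
)
```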
- notebook_params
- dict[str, Union[str, PipelineParameter]]
A dictionary of parameters to pass to the notebook. notebook_params
are available as widgets. You can fetch the values from these widgets inside your notebook
using dbutils.widgets.get("myparam").
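notebook_params values can also be PipelineParameter objects so they can be overridden at submission time; a sketch, with placeholder parameter names, reusing objects from the first sketch:

```python
from azureml.pipeline.core import PipelineParameter

run_date = PipelineParameter(name="run_date", default_value="2024-01-01")

param_notebook_step = DatabricksStep(
    name="notebook-with-params",
    notebook_path="/Users/someone@example.com/my_notebook",
    notebook_params={"myparam": "testparam", "run_date": run_date},
    existing_cluster_id="0422-183004-abcd1234",
    compute_target=databricks_compute,
)
# Inside the notebook: dbutils.widgets.get("run_date") returns the value submitted for the run.
```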
- python_script_path
- str
[Required] The path to the Python script in DBFS.
Specify exactly one of notebook_path, python_script_path, python_script_name, or main_class_name.
- python_script_params
- list[Union[str, PipelineParameter]]
Parameters for the Python script.
- main_class_name
- str
[Required] The name of the entry point in a JAR module.
Specify exactly one of notebook_path, python_script_path, python_script_name, or main_class_name.
- python_script_name
- str
[Required] The name of a Python script relative to source_directory.
If the script takes inputs and outputs, those will be passed to the script as parameters.
If python_script_name is specified, then source_directory must be too.
Specify exactly one of notebook_path, python_script_path, python_script_name, or main_class_name.
If you specify a DataReference object as input with data_reference_name=input1 and a PipelineData object as output with name=output1, then the inputs and outputs will be passed to the script as parameters. This is how they will look, and you will need to parse the arguments in your script to access the paths of each input and output: "-input1","wasbs://test@storagename.blob.core.windows.net/test","-output1", "wasbs://test@storagename.blob.core.windows.net/b3e26de1-87a4-494d-a20f-1988d22b81a2/output1"
In addition, the following parameters will be available within the script:
- AZUREML_RUN_TOKEN: The AML token for authenticating with Azure Machine Learning.
- AZUREML_RUN_TOKEN_EXPIRY: The AML token expiry time.
- AZUREML_RUN_ID: Azure Machine Learning Run ID for this run.
- AZUREML_ARM_SUBSCRIPTION: Azure subscription for your AML workspace.
- AZUREML_ARM_RESOURCEGROUP: Azure resource group for your Azure Machine Learning workspace.
- AZUREML_ARM_WORKSPACE_NAME: Name of your Azure Machine Learning workspace.
- AZUREML_ARM_PROJECT_NAME: Name of your Azure Machine Learning experiment.
- AZUREML_SERVICE_ENDPOINT: The endpoint URL for AML services.
- AZUREML_WORKSPACE_ID: ID of your Azure Machine Learning workspace.
- AZUREML_EXPERIMENT_ID: ID of your Azure Machine Learning experiment.
- AZUREML_SCRIPT_DIRECTORY_NAME: Directory path in DBFS where source_directory has been copied.
(This parameter is only populated when python_script_name is used. See more details below.)
When you execute a Python script from your local machine on Databricks using the DatabricksStep parameters source_directory and python_script_name, your source_directory is copied over to DBFS and the directory path on DBFS is passed as a parameter to your script when it begins execution. This parameter is labelled as --AZUREML_SCRIPT_DIRECTORY_NAME. You need to prefix it with the string "dbfs:/" or "/dbfs/" to access the directory in DBFS, as sketched below.
- source_directory
- str
The folder that contains the script and other files.
If python_script_name is specified, then source_directory must be too.
- hash_paths
- [str]
DEPRECATED: no longer needed.
A list of paths to hash when checking for changes to the step contents. If there are no changes detected, the pipeline will reuse the step contents from a previous run. By default, the contents of source_directory are hashed, except for files listed in .amlignore or .gitignore.
- runconfig
- RunConfiguration
The runconfig to use.
Note: You can pass as many libraries as you like as dependencies to your job using the following parameters: maven_libraries, pypi_libraries, egg_libraries, jar_libraries, or rcran_libraries. Either pass these libraries directly with their corresponding parameters or as part of the RunConfiguration object using the runconfig parameter, but not both.
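A sketch of passing library dependencies directly on the step (reusing objects from the first sketch; package names and Maven coordinates are placeholders, and the keyword names for the library classes should be checked with the help() calls documented below):

```python
from azureml.core.runconfig import MavenLibrary, PyPiLibrary

step_with_libs = DatabricksStep(
    name="step-with-libraries",
    notebook_path="/Users/someone@example.com/my_notebook",
    existing_cluster_id="0422-183004-abcd1234",
    compute_target=databricks_compute,
    # Passed directly here; alternatively attach the same libraries to a
    # RunConfiguration and pass that via runconfig= -- but not both.
    pypi_libraries=[PyPiLibrary(package="scikit-learn==1.0.2")],
    maven_libraries=[MavenLibrary(coordinates="org.example:my-lib:1.0.0")],
)
```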
- maven_libraries
- list[MavenLibrary]
Maven libraries to use for the Databricks run.
For more information on the specification of Maven libraries, see help(azureml.core.runconfig.MavenLibrary).
- pypi_libraries
- list[PyPiLibrary]
PyPi libraries to use for the Databricks run.
For more information on the specification of PyPi libraries, see help(azureml.core.runconfig.PyPiLibrary).
- egg_libraries
- list[EggLibrary]
Egg libraries to use for the Databricks run.
For more information on the specification of Egg libraries, see help(azureml.core.runconfig.EggLibrary).
- jar_libraries
- list[JarLibrary]
Jar libraries to use for the Databricks run.
For more information on the specification of Jar libraries, see help(azureml.core.runconfig.JarLibrary).
- rcran_libraries
- list[RCranLibrary]
RCran libraries to use for the Databricks run.
For more information on the specification of RCran libraries, see help(azureml.core.runconfig.RCranLibrary).
- compute_target
- str, DatabricksCompute
[Required] An Azure Databricks compute. Before you can use DatabricksStep to execute your scripts or notebooks on an Azure Databricks workspace, you need to add the Azure Databricks workspace as a compute target to your Azure Machine Learning workspace.
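A sketch of attaching a Databricks workspace as a compute target before using it in a DatabricksStep (reusing ws from the first sketch; resource group, workspace name, and the access token are placeholders, and in practice the token should come from a secret store, not source code):

```python
from azureml.core.compute import ComputeTarget, DatabricksCompute
from azureml.exceptions import ComputeTargetException

compute_name = "databricks-compute"  # the name the compute will have in Azure ML

try:
    databricks_compute = DatabricksCompute(workspace=ws, name=compute_name)
    print("Found existing Databricks compute target.")
except ComputeTargetException:
    attach_config = DatabricksCompute.attach_configuration(
        resource_group="my-resource-group",                  # placeholder
        workspace_name="my-databricks-workspace",            # placeholder
        access_token="<databricks-personal-access-token>",   # placeholder; use a secret store
    )
    databricks_compute = ComputeTarget.attach(ws, compute_name, attach_config)
    databricks_compute.wait_for_completion(show_output=True)
```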
- allow_reuse
- bool
Indicates whether the step should reuse previous results when re-run with the same settings. Reuse is enabled by default. If the step contents (scripts/dependencies) as well as inputs and parameters remain unchanged, the output from the previous run of this step is reused. When reusing the step, instead of submitting the job to compute, the results from the previous run are immediately made available to any subsequent steps. If you use Azure Machine Learning datasets as inputs, reuse is determined by whether the dataset's definition has changed, not by whether the underlying data has changed.
- version
- str
An optional version tag to denote a change in functionality for the step.
- permit_cluster_restart
- bool
If existing_cluster_id is specified, this parameter determines whether the cluster can be restarted on behalf of the user.
Methods
create_node
Create a node from the Databricks step and add it to the specified graph.
This method is not intended to be used directly. When a pipeline is instantiated with this step, Azure ML automatically passes the parameters required through this method so that the step can be added to a pipeline graph that represents the workflow.
create_node(graph, default_datastore, context)
Parameters
- graph
- Graph
The graph object to add the node to.
- default_datastore
- Union[AbstractAzureStorageDatastore, AzureDataLakeDatastore]
The default datastore.
- context
- _GraphContext
The graph context.
Returns
The created node.
Return type
Node