Adding input and output parameters to a DatabricksStep

Alexander Pakakis
2022-06-08T17:42:34.713+00:00

@romungi-MSFT

Hello,

We are using AMLS to create and register a pipeline that runs on a pre-defined Databricks cluster.
Our Databricks notebook, which should be executed in the DatabricksStep, lives in the AMLS workspace.

We want to save a file into a Blob storage container. Therefore, we have added the "outputs" and "notebook_params" parameters to the DatabricksStep:
[Screenshot 209539-image.png: the DatabricksStep definition with the outputs and notebook_params parameters]

We would like to know how to retrieve the output folder path within the Databricks notebook named "basic_DatabricksStep_script.py".
With a PythonScriptStep this worked with the following commands:

import argparse

# The output folder path is handed to the script as a named command-line argument.
parser = argparse.ArgumentParser()
parser.add_argument('--output', type=str, dest='output', default='output',
                    help='given output data folder name')
args = parser.parse_args()
output_data_folder_path = args.output

How will this work with a DatabricksStep?

We are aware of this notebook, but we need additional support to solve our issue.
It would be great if you could provide example code and also show us how to add the input parameter to the DatabricksStep so that we can read Datasets that are registered in AMLS.
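
For reference, here is a minimal sketch of how the paths could be read inside basic_DatabricksStep_script.py. It assumes the DatabricksStep forwards the input DataReference and the output PipelineData as Databricks widgets named "input" and "output" (the pattern shown in the sample notebook); those widget names are assumptions and must match the data_reference_name and the PipelineData name used in the step definition.

# Minimal sketch; dbutils and spark are provided by the Databricks runtime.
input_folder_path = dbutils.widgets.get("input")    # path behind the DataReference
output_folder_path = dbutils.widgets.get("output")  # path behind the PipelineData

print("input folder: ", input_folder_path)
print("output folder:", output_folder_path)

# Example: read from the input folder and write a result into the output folder.
df = spark.read.csv(input_folder_path, header=True)
df.write.csv(output_folder_path + "/result.csv")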

Thank you in advance for your efforts!

With best regards
Alex


1 answer

  1. Alexander Pakakis
    2022-06-10T09:15:46.96+00:00

    @romungi-MSFT

    Thank you for the answer.

    We are still not able to save a file into a Blob storage container.
    It is important that we use a DatabricksStep, but it does not matter whether the notebook (the Python script basic_DatabricksStep_script.py) lives in AMLS or in the Databricks workspace.

    This is how we try to save a file into the Blob storage container:

    Here is the Python script that should be executed in the DatabricksStep:

    %%writefile $source_directory/basic_DatabricksStep_script.py

    # Read the input and output folder paths that the DatabricksStep passes in as widgets.
    i = dbutils.widgets.get("input")
    print("Param 'input':")
    print(i)

    o = dbutils.widgets.get("output")
    print("Param 'output':")
    print(o)

    # Write a small DataFrame to the output folder.
    data = [('value1', 'value2')]
    df2 = spark.createDataFrame(data)

    z = o + "/output.txt"
    df2.write.csv(z)
    

    This is how we define the DatabricksStep:

    from azureml.core import Datastore
    from azureml.data.data_reference import DataReference
    from azureml.pipeline.core import PipelineData
    from azureml.pipeline.steps import DatabricksStep

    def_blob_store = Datastore(ws, "input_datastore")
    step_1_input = DataReference(datastore=def_blob_store,
                                 path_on_datastore="dbtest",
                                 data_reference_name="input")

    output_data_folder_name = "output"
    output_data_folder = PipelineData(output_data_folder_name,
                                      datastore=Datastore.get(ws, "output_datastore"))

    dbNbWithExistingClusterStep = DatabricksStep(
        name="DBFSReferenceWithExisting",
        run_name="DBFS_Reference_With_Existing",
        source_directory=source_directory,
        python_script_name="basic_DatabricksStep_script.py",
        inputs=[step_1_input],
        outputs=[output_data_folder],
        compute_target=databricks_compute,
        existing_cluster_id="XXXXXX",
        allow_reuse=True,
        permit_cluster_restart=True
    )
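
    For completeness, here is a minimal sketch of how the step could be assembled into a pipeline and submitted; ws and dbNbWithExistingClusterStep come from the code above, and the experiment name "databricksstep-test" is just a placeholder. Calling validate() before submitting often reveals why a pipeline fails to build.

    from azureml.core import Experiment
    from azureml.pipeline.core import Pipeline

    # Build the pipeline from the single DatabricksStep and validate it before submitting.
    pipeline = Pipeline(workspace=ws, steps=[dbNbWithExistingClusterStep])
    pipeline.validate()  # surfaces graph and parameter errors before a run is created

    # Submit the pipeline as an experiment run (the experiment name is arbitrary).
    pipeline_run = Experiment(ws, "databricksstep-test").submit(pipeline)
    pipeline_run.wait_for_completion()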
    

    Here is a picture to make clearer what we want to achieve:
    [Screenshot 210165-image.png: illustration of the intended setup]

    Currently, AMLS does not build our pipeline, even though we followed the examples in the official GitHub notebook for the DatabricksStep class.
    Can you help us make our pipeline work, please?

    Thank you in advance for your support!

    With best regards,
    Alex

