Adding input and output parameters to a DatabricksStep

Alexander Pakakis
2022-06-08T17:42:34.713+00:00

@romungi-MSFT

Hello,

We are using AMLS to create and register a pipeline that runs on a pre-defined Databricks cluster.
Our Databricks notebook, which should be executed in the DatabricksStep, lives in the AMLS workspace.

We want to save a file into a Blob storage container. Therefore, we have added the "outputs" and "notebook_params" parameters to the DatabricksStep:
[Screenshot 209539-image.png: the DatabricksStep definition with the outputs and notebook_params parameters]

We would like to know how to retrieve the output folder path within the Databricks notebook named "basic_DatabricksStep_script.py".
With a PythonScriptStep this worked with the following commands:

import argparse

# The output folder path is handed to the script as a named command-line argument.
parser = argparse.ArgumentParser()
parser.add_argument('--output', type=str, dest='output', default='output',
                    help='given output data folder name')
args = parser.parse_args()
output_data_folder_path = args.output

How will this work with a DatabricksStep?

We are aware of this notebook, but we need additional support to solve our issue.
It would be great if you could provide example code and also show us how to add the input parameter to the DatabricksStep so that we can read Datasets that are registered in AMLS.
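
For reference, here is a minimal sketch of how the paths could be read inside basic_DatabricksStep_script.py. It assumes the DatabricksStep forwards the input DataReference and the output PipelineData as Databricks widgets named "input" and "output" (the pattern shown in the sample notebook); those widget names are assumptions and must match the data_reference_name and the PipelineData name used in the step definition.

# Minimal sketch; dbutils and spark are provided by the Databricks runtime.
input_folder_path = dbutils.widgets.get("input")    # path behind the DataReference
output_folder_path = dbutils.widgets.get("output")  # path behind the PipelineData

print("input folder: ", input_folder_path)
print("output folder:", output_folder_path)

# Example: read from the input folder and write a result into the output folder.
df = spark.read.csv(input_folder_path, header=True)
df.write.csv(output_folder_path + "/result.csv")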

Thank you in advance for your efforts!

With best regards
Alex


1 answer

  1. Alexander Pakakis
    2022-06-10T09:15:46.96+00:00

    @romungi-MSFT

    Thank you for the answer.

    We are still not able to save a file into a Blob storage container.
    It is important that we use a DatabricksStep, but it does not matter whether the notebook (the Python script basic_DatabricksStep_script.py) lives in AMLS or in the Databricks workspace.

    This is how we try to save a file into the Blob storage container:

    Here is the Python script that should be executed in the DatabricksStep:

    %%writefile $source_directory/basic_DatabricksStep_script.py

    # Read the input and output folder paths that the DatabricksStep passes in as widgets.
    i = dbutils.widgets.get("input")
    print("Param 'input':")
    print(i)

    o = dbutils.widgets.get("output")
    print("Param 'output':")
    print(o)

    # Write a small DataFrame to the output folder.
    data = [('value1', 'value2')]
    df2 = spark.createDataFrame(data)

    z = o + "/output.txt"
    df2.write.csv(z)
    

    This is how we define the DatabricksStep:

    from azureml.core import Datastore
    from azureml.data.data_reference import DataReference
    from azureml.pipeline.core import PipelineData
    from azureml.pipeline.steps import DatabricksStep

    def_blob_store = Datastore(ws, "input_datastore")
    step_1_input = DataReference(datastore=def_blob_store,
                                 path_on_datastore="dbtest",
                                 data_reference_name="input")

    output_data_folder_name = "output"
    output_data_folder = PipelineData(output_data_folder_name,
                                      datastore=Datastore.get(ws, "output_datastore"))

    dbNbWithExistingClusterStep = DatabricksStep(
        name="DBFSReferenceWithExisting",
        run_name="DBFS_Reference_With_Existing",
        source_directory=source_directory,
        python_script_name="basic_DatabricksStep_script.py",
        inputs=[step_1_input],
        outputs=[output_data_folder],
        compute_target=databricks_compute,
        existing_cluster_id="XXXXXX",
        allow_reuse=True,
        permit_cluster_restart=True
    )
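
    For completeness, here is a minimal sketch of how the step could be assembled into a pipeline and submitted; ws and dbNbWithExistingClusterStep come from the code above, and the experiment name "databricksstep-test" is just a placeholder. Calling validate() before submitting often reveals why a pipeline fails to build.

    from azureml.core import Experiment
    from azureml.pipeline.core import Pipeline

    # Build the pipeline from the single DatabricksStep and validate it before submitting.
    pipeline = Pipeline(workspace=ws, steps=[dbNbWithExistingClusterStep])
    pipeline.validate()  # surfaces graph and parameter errors before a run is created

    # Submit the pipeline as an experiment run (the experiment name is arbitrary).
    pipeline_run = Experiment(ws, "databricksstep-test").submit(pipeline)
    pipeline_run.wait_for_completion()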
    

    Here is a picture to make clearer what we want to achieve:
    [Screenshot 210165-image.png: illustration of the intended setup]

    Currently, AMLS does not build our pipeline, even though we followed the examples in the official GitHub notebook for the DatabricksStep class.
    Can you help us make our pipeline work, please?

    Thank you in advance for your support!

    With best regards,
    Alex

