Using AML Pipeline with PythonScriptStep
- Install Required Libraries: Make sure you have the necessary libraries installed:

```bash
pip install azureml-core azureml-pipeline-steps databricks-api
```
- Set Up Authentication: Authenticate to both Azure ML and Databricks:

```python
from azureml.core import Workspace
from databricks_api import DatabricksAPI

# Azure ML authentication
ws = Workspace.from_config()

# Databricks authentication
DATABRICKS_URL = 'https://<your-databricks-instance>'
DATABRICKS_TOKEN = '<your-databricks-token>'
db = DatabricksAPI(host=DATABRICKS_URL, token=DATABRICKS_TOKEN)
```
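Hardcoding the personal access token works for a quick test, but a safer sketch is to read it from the environment (the variable name `DATABRICKS_TOKEN` is just a convention here, not something the library requires):

```python
import os

from databricks_api import DatabricksAPI

# Read the PAT from the environment so it never lands in source control.
DATABRICKS_URL = 'https://<your-databricks-instance>'
DATABRICKS_TOKEN = os.environ['DATABRICKS_TOKEN']  # raises KeyError if unset

db = DatabricksAPI(host=DATABRICKS_URL, token=DATABRICKS_TOKEN)
```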
- Create a Python Script to Trigger Databricks Notebooks: Create a Python script (`run_databricks.py`) that triggers the execution of your Databricks notebook. Note that the runs/submit API needs a cluster spec alongside the notebook task, and it returns a dict rather than a bare run ID:

```python
import sys

from databricks_api import DatabricksAPI

DATABRICKS_URL = 'https://<your-databricks-instance>'
DATABRICKS_TOKEN = '<your-databricks-token>'
db = DatabricksAPI(host=DATABRICKS_URL, token=DATABRICKS_TOKEN)

def run_databricks_notebook(notebook_path):
    # Submit a one-time run against an existing cluster; the response
    # is a dict containing the new run's ID.
    run = db.jobs.submit_run(
        run_name="aml-triggered-run",
        existing_cluster_id="<your-cluster-id>",
        notebook_task={"notebook_path": notebook_path},
    )
    return run["run_id"]

if __name__ == "__main__":
    notebook_path = sys.argv[1]
    run_id = run_databricks_notebook(notebook_path)
    print(f"Triggered Databricks notebook run with ID: {run_id}")
```
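runs/submit is asynchronous, so this script exits as soon as the run is queued. If you want the pipeline step to block until the notebook finishes, a rough polling sketch using the same library's `jobs.get_run` wrapper (it wraps GET /api/2.0/jobs/runs/get; the 30-second interval is an arbitrary choice):

```python
import time

def wait_for_run(db, run_id, poll_seconds=30):
    # Poll the run until Databricks reports a terminal life-cycle state.
    while True:
        run = db.jobs.get_run(run_id=run_id)
        state = run["state"]
        if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
            return state.get("result_state")  # e.g. "SUCCESS" or "FAILED"
        time.sleep(poll_seconds)
```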
- Define the AML Pipeline: Define a pipeline in Azure ML that uses `PythonScriptStep` to run the Databricks notebook:

```python
from azureml.core import Experiment
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

notebook_path = '/Users/<your-user>/notebook'

step = PythonScriptStep(
    script_name="run_databricks.py",
    arguments=[notebook_path],
    compute_target='your-compute-cluster',
    source_directory='.',
    allow_reuse=False,
)

pipeline = Pipeline(workspace=ws, steps=[step])
experiment = Experiment(workspace=ws, name='databricks-integration')
pipeline_run = experiment.submit(pipeline)
```
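`experiment.submit` returns immediately. To stream the step's logs from the submitting process and block until the pipeline finishes, the standard SDK call is `wait_for_completion`, continuing from the `pipeline_run` above:

```python
# Stream step logs and block until the pipeline run reaches a terminal state.
pipeline_run.wait_for_completion(show_output=True)
```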
Using Azure ML Databricks Linked Service (Preview Feature)
- Create Databricks Linked Service: Create a Databricks linked service in Azure Machine Learning by attaching your Databricks workspace as a compute target (a minimal attach sketch appears after this list). This feature is currently in preview, and the API might change in the future.
- Define the Databricks Job: Define a job in AML that points to the Databricks notebook. `DatabricksStep` takes `existing_cluster_id` and `compute_target` (there are no `cluster_id` or `databricks_compute` parameters):

```python
from azureml.core import Experiment, Workspace
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import DatabricksStep

ws = Workspace.from_config()

databricks_step = DatabricksStep(
    name="run-notebook",
    notebook_path="/Users/<your-user>/notebook",
    run_name="DatabricksNotebookRun",
    existing_cluster_id="cluster-id",
    compute_target="your-databricks-compute",
)

pipeline = Pipeline(workspace=ws, steps=[databricks_step])
experiment = Experiment(workspace=ws, name="databricks-integration")
pipeline_run = experiment.submit(pipeline)
```
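The `compute_target` name above must already exist in the workspace. A minimal sketch of attaching a Databricks workspace as AML compute with `DatabricksCompute.attach_configuration` (the resource group, workspace name, and token values are placeholders you would replace):

```python
from azureml.core import Workspace
from azureml.core.compute import ComputeTarget, DatabricksCompute

ws = Workspace.from_config()

# Attach an existing Azure Databricks workspace as an AML compute target.
attach_config = DatabricksCompute.attach_configuration(
    resource_group="<databricks-resource-group>",
    workspace_name="<databricks-workspace-name>",
    access_token="<your-databricks-token>",
)
databricks_compute = ComputeTarget.attach(ws, "your-databricks-compute", attach_config)
databricks_compute.wait_for_completion(show_output=True)
```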
Using Azure ML and Databricks Jobs API
- Submit a Databricks Job via Azure ML: Use Azure ML to submit a Databricks job through the Jobs API. This involves building a run payload and posting it to the runs/submit endpoint. Note that this endpoint takes `run_name` rather than the `name` field used by jobs/create:

```python
import requests

from azureml.core import Workspace

ws = Workspace.from_config()

DATABRICKS_URL = 'https://<your-databricks-instance>'
DATABRICKS_TOKEN = '<your-databricks-token>'

# Payload for the one-time run endpoint (POST /api/2.0/jobs/runs/submit).
job_payload = {
    "run_name": "My Databricks Job",
    "existing_cluster_id": "cluster-id",
    "notebook_task": {
        "notebook_path": "/Users/<your-user>/notebook"
    }
}

response = requests.post(
    f"{DATABRICKS_URL}/api/2.0/jobs/runs/submit",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    json=job_payload,
)
response.raise_for_status()  # fail fast on auth or payload errors

run_id = response.json().get("run_id")
print(f"Databricks job submitted with run ID: {run_id}")
```
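As with the SDK route, runs/submit only queues the run. A sketch of checking its status afterwards via the runs/get endpoint, reusing `run_id` from the submit call above:

```python
import requests

DATABRICKS_URL = 'https://<your-databricks-instance>'
DATABRICKS_TOKEN = '<your-databricks-token>'

def get_run_state(run_id):
    # GET /api/2.0/jobs/runs/get returns the run's life-cycle and result state.
    response = requests.get(
        f"{DATABRICKS_URL}/api/2.0/jobs/runs/get",
        headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
        params={"run_id": run_id},
    )
    response.raise_for_status()
    return response.json()["state"]

state = get_run_state(run_id)  # run_id from the submit call above
print(state.get("life_cycle_state"), state.get("result_state"))
```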