Create your first workflow with an Azure Databricks job
Article
This article demonstrates an Azure Databricks job that orchestrates tasks to read and process a sample dataset. In this quickstart, you:
Create a new notebook and add code to retrieve a sample dataset containing popular baby names by year.
Save the sample dataset to Unity Catalog.
Create a new notebook and add code to read the dataset from Unity Catalog, filter it by year, and display the results.
Create a new job and configure two tasks using the notebooks.
Run the job and view the results.
Requirements
If your workspace is Unity Catalog-enabled and Serverless Jobs is enabled, by default, the job runs on Serverless compute. You do not need cluster creation permission to run your job with Serverless compute.
You must have a volume in Unity Catalog. This article uses a volume named my-volume in a schema named default within a catalog named main. Also, you must have the following permissions in Unity Catalog:
READ VOLUME and WRITE VOLUME, or ALL PRIVILEGES, for the my-volume volume.
USE SCHEMA or ALL PRIVILEGES for the default schema.
USE CATALOG or ALL PRIVILEGES for the main catalog.
To create a notebook to retrieve the sample dataset and save it to Unity Catalog:
Go to your Azure Databricks landing page and click New in the sidebar and select Notebook. Databricks creates and opens a new, blank notebook in your default folder. The default language is the language you most recently used, and the notebook is automatically attached to the compute resource that you most recently used.
To create a notebook to read and present the data for filtering:
Go to your Azure Databricks landing page and click New in the sidebar and select Notebook. Databricks creates and opens a new, blank notebook in your default folder. The default language is the language you most recently used, and the notebook is automatically attached to the compute resource that you most recently used.
Copy the following Python code and paste it into the first cell of the notebook.
babynames = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/Volumes/main/default/my-volume/babynames.csv")
babynames.createOrReplaceTempView("babynames_table")
years = spark.sql("select distinct(Year) from babynames_table").toPandas()['Year'].tolist()
years.sort()
dbutils.widgets.dropdown("year", "2014", [str(x) for x in years])
display(babynames.filter(babynames.Year == dbutils.widgets.get("year")))
Create a job
Click Workflows in the sidebar.
Click .
The Tasks tab displays with the create task dialog.
Replace Add a name for your job… with your job name.
In the Task name field, enter a name for the task; for example, retrieve-baby-names.
In the Type drop-down menu, select Notebook.
Use the file browser to find the first notebook you created, click the notebook name, and click Confirm.
Click Create task.
Click below the task you just created to add another task.
In the Task name field, enter a name for the task; for example, filter-baby-names.
In the Type drop-down menu, select Notebook.
Use the file browser to find the second notebook you created, click the notebook name, and click Confirm.
Click Add under Parameters. In the Key field, enter year. In the Value field, enter 2014.
Click Create task.
Run the job
To run the job immediately, click in the upper right corner. You can also run the job by clicking the Runs tab and clicking Run now in the Active Runs table.
View run details
Click the Runs tab and click the link for the run in the Active Runs table or in the Completed Runs (past 60 days) table.
Click either task to see the output and details. For example, click the filter-baby-names task to view the output and run details for the filter task:
Run with different parameters
To re-run the job and filter baby names for a different year:
Click next to Run now and select Run now with different parameters or click Run now with different parameters in the Active Runs table.
Manage data ingestion and preparation, model training and deployment, and machine learning solution monitoring with Python, Azure Machine Learning and MLflow.