Implement data processing and analysis workflows with Jobs

03/25/2024

You can use an Azure Databricks job to orchestrate your data processing, machine learning, or data analytics pipelines on the Databricks platform. Azure Databricks Jobs supports several workload types, including notebooks, scripts, Delta Live Tables pipelines, Databricks SQL queries, and dbt projects. The following articles guide you in using the features and options of Azure Databricks Jobs to implement your data pipelines.

Transform, analyze, and visualize your data with an Azure Databricks job

You can use a job to create a data pipeline that ingests, transforms, analyzes, and visualizes data. The example in Use Databricks SQL in an Azure Databricks job builds a pipeline that:

Uses a Python script to fetch data using a REST API.
Uses Delta Live Tables to ingest and transform the fetched data and save the transformed data to Delta Lake.
Uses the Jobs integration with Databricks SQL to analyze the transformed data and create graphs to visualize the results.
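
The three steps above map onto a job with three chained tasks. The following is a minimal sketch, not the linked example itself, of a request body for the Jobs API 2.1 create-job endpoint, written as a Python dictionary. The job name, task keys, script path, and the pipeline, query, and warehouse IDs are hypothetical placeholders, and compute settings are omitted for brevity.

```python
import json

# Hypothetical placeholders: replace with values from your own workspace.
SCRIPT_PATH = "<path-to-fetch-script>.py"
PIPELINE_ID = "<pipeline-id>"
QUERY_ID = "<query-id>"
WAREHOUSE_ID = "<sql-warehouse-id>"


def build_pipeline_job() -> dict:
    """Return a create-job payload with three tasks that run in sequence."""
    return {
        "name": "fetch-transform-analyze",  # hypothetical job name
        "tasks": [
            {
                # Step 1: Python script that fetches data from a REST API.
                "task_key": "fetch_data",
                "spark_python_task": {"python_file": SCRIPT_PATH},
            },
            {
                # Step 2: Delta Live Tables pipeline, runs after the fetch.
                "task_key": "transform_data",
                "depends_on": [{"task_key": "fetch_data"}],
                "pipeline_task": {"pipeline_id": PIPELINE_ID},
            },
            {
                # Step 3: Databricks SQL query, runs after the transformation.
                "task_key": "analyze_data",
                "depends_on": [{"task_key": "transform_data"}],
                "sql_task": {
                    "query": {"query_id": QUERY_ID},
                    "warehouse_id": WAREHOUSE_ID,
                },
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_pipeline_job(), indent=2))
```

The `depends_on` entries are what turn three independent tasks into a pipeline: each task starts only after the task it names has succeeded.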

Use dbt transformations in a job

Use the dbt task type if you transform data with a dbt Core project and want to integrate that project into an Azure Databricks job, or if you want to create new dbt transformations and run them in a job. See Use dbt transformations in an Azure Databricks job.

Use a Python package in a job

Python wheel files are a standard way to package and distribute the files required to run a Python application. To run Python code packaged as a wheel file in a job, use the Python wheel task type. See Use a Python wheel file in an Azure Databricks job.
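
The sketch below shows the two halves of a Python wheel task under stated assumptions: an entry-point function that the wheel would expose, and a Jobs API 2.1 task definition that calls it. The package name `my_package`, the `--input-table` argument, and the table name are hypothetical, and the sketch assumes task parameters arrive as command-line arguments.

```python
import argparse
import json


def main(argv=None):
    """Hypothetical entry point that the wheel exposes to the job."""
    parser = argparse.ArgumentParser(description="Example wheel entry point")
    parser.add_argument("--input-table", required=True)
    # With argv=None, argparse reads sys.argv[1:], which is where the
    # task's "parameters" list is assumed to arrive.
    args = parser.parse_args(argv)
    print(f"Processing {args.input_table}")
    return args.input_table


def build_wheel_task(wheel_path: str) -> dict:
    """Return a task definition that runs main() from the installed wheel."""
    return {
        "task_key": "run_wheel",  # hypothetical task key
        "python_wheel_task": {
            "package_name": "my_package",
            "entry_point": "main",
            "parameters": ["--input-table", "samples.raw_events"],
        },
        # The wheel itself is attached to the task as a library.
        "libraries": [{"whl": wheel_path}],
    }


if __name__ == "__main__":
    print(json.dumps(build_wheel_task("<path-to-wheel>.whl"), indent=2))
```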

Use code packaged in a JAR

Libraries and applications implemented in a JVM language such as Java or Scala are commonly packaged in a Java archive (JAR) file. Azure Databricks Jobs supports code packaged in a JAR with the JAR task type. See Use a JAR in an Azure Databricks job.
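
For comparison with the other task types, a JAR task in a Jobs API 2.1 payload could be sketched as follows. The task key, class name, and JAR path are hypothetical placeholders, and compute settings are left out.

```python
import json


def build_jar_task(jar_path: str, main_class: str, parameters=None) -> dict:
    """Return a task definition that runs the main method of a class in a JAR."""
    return {
        "task_key": "run_jar",  # hypothetical task key
        "spark_jar_task": {
            # Fully qualified name of the class whose main method is run.
            "main_class_name": main_class,
            # Passed to the main method as its string arguments.
            "parameters": list(parameters or []),
        },
        # The JAR is attached to the task as a library.
        "libraries": [{"jar": jar_path}],
    }


if __name__ == "__main__":
    task = build_jar_task("<path-to-jar>.jar", "com.example.Main", ["--date", "2024-01-01"])
    print(json.dumps(task, indent=2))
```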

Use notebooks or Python code maintained in a central repository

A common way to manage version control and collaboration for production artifacts is to use a central repository such as GitHub. Azure Databricks Jobs supports creating and running jobs that use notebooks or Python code from a remote Git repository, such as one hosted on GitHub, or from a Databricks Git folder. See Use version-controlled source code in an Azure Databricks job.
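
A job that pulls its notebook from a remote repository can be sketched in a Jobs API 2.1 payload like this. The repository URL, branch, job name, and notebook path are hypothetical; the sketch assumes a GitHub-hosted repository and omits compute settings.

```python
import json


def build_git_job(git_url: str, branch: str = "main") -> dict:
    """Return a create-job payload whose notebook is read from a Git repository."""
    return {
        "name": "job-from-git",  # hypothetical job name
        # Job-level source: the repository is checked out for each run.
        "git_source": {
            "git_url": git_url,
            "git_provider": "gitHub",
            "git_branch": branch,
        },
        "tasks": [
            {
                "task_key": "run_notebook",
                "notebook_task": {
                    # Path relative to the repository root, without extension.
                    "notebook_path": "notebooks/etl",
                    "source": "GIT",
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_git_job("https://github.com/<org>/<repo>"), indent=2))
```

Pinning the job to a branch (or, alternatively, a tag or commit) means each run uses the version of the code that is in the repository at that reference, rather than a copy edited in the workspace.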

Orchestrate your jobs with Apache Airflow

Databricks recommends using Azure Databricks Jobs to orchestrate your workflows. However, Apache Airflow is also commonly used as a workflow orchestration system and provides native support for Azure Databricks Jobs. While Azure Databricks Jobs provides a UI for building your workflows visually, Airflow uses Python files to define and deploy your data pipelines. For an example of creating and running a job with Airflow, see Orchestrate Azure Databricks jobs with Apache Airflow.
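
To give a sense of what an Airflow Python file looks like, here is a minimal sketch of a DAG that triggers an existing job. It assumes Airflow 2.4 or later with the `apache-airflow-providers-databricks` package installed and an Airflow connection named `databricks_default`; the job ID and DAG ID are hypothetical.

```python
from datetime import datetime

# Hypothetical placeholders: the ID of an existing job and the name of the
# Airflow connection that stores the workspace URL and credentials.
JOB_ID = 12345
CONN_ID = "databricks_default"


def build_dag():
    """Build a DAG with one task that triggers an existing Azure Databricks job."""
    # Imported inside the function so this module can be loaded for inspection
    # even where Airflow is not installed.
    from airflow import DAG
    from airflow.providers.databricks.operators.databricks import (
        DatabricksRunNowOperator,
    )

    with DAG(
        dag_id="trigger_databricks_job",
        start_date=datetime(2024, 1, 1),
        schedule=None,  # no schedule: runs only when triggered
        catchup=False,
    ) as dag:
        DatabricksRunNowOperator(
            task_id="run_job",
            databricks_conn_id=CONN_ID,
            job_id=JOB_ID,
        )
    return dag
```

In an actual DAG file, you would assign the result at module level (for example, `dag = build_dag()`) so that the Airflow scheduler can discover it.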

Run a job using a service principal

You can run your jobs as a service account by using a Microsoft Entra ID (formerly Azure Active Directory) application and service principal. Running a job as a service account instead of as an individual user lets you control access to the job, ensure that the job has the necessary permissions, and prevent issues if a job owner is removed from a workspace. For a tutorial on creating and using a service principal to run an Azure Databricks job, see Run a job with a Microsoft Entra ID service principal.
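
In a Jobs API 2.1 payload, the run-as identity can be expressed with the `run_as` field of the job settings. The helper below is a minimal sketch; the function name, the sample job settings, and the application ID placeholder are hypothetical.

```python
import copy
import json


def with_service_principal(job_settings: dict, application_id: str) -> dict:
    """Return a copy of the job settings that runs as the given service principal."""
    updated = copy.deepcopy(job_settings)  # leave the caller's dict untouched
    # The service principal is identified by its application (client) ID.
    updated["run_as"] = {"service_principal_name": application_id}
    return updated


if __name__ == "__main__":
    settings = {"name": "nightly-etl", "tasks": []}  # hypothetical job settings
    print(json.dumps(with_service_principal(settings, "<application-id>"), indent=2))
```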