Schedule and orchestrate workflows

Artikkeli
11/05/2024

Databricks Workflows has tools that allow you to schedule and orchestrate data processing tasks on Azure Databricks. You use Databricks Workflows to configure Databricks Jobs.

This article introduces concepts and choices related to managing production workloads using Databricks Jobs.

What are Databricks Jobs?

A job is the primary unit for scheduling and orchestrating production workloads on Azure Databricks. Jobs consist of one or more tasks. Together, tasks and jobs allow you to configure and deploy the following:

Custom logic, including Spark, SQL, OSS Python, ML, and arbitrary code.
Compute resources with custom environments and libraries.
Schedules and triggers for running workloads.
Conditional logic for control flow between tasks.

Jobs provide a procedural approach to defining relationships between tasks. Delta Live Tables pipelines provide a declarative approach to defining relationships between datasets and transformations. You can include Delta Live Tables pipelines as a task in a job. See Delta Live Tables pipeline task for jobs.

Jobs can vary in complexity from a single task running a Databricks notebook to thousands of tasks running with conditional logic and dependencies.

How can I configure and run Jobs?

You can create and run a job using the Jobs UI, the Databricks CLI, or by invoking the Jobs API. Using the UI or API, you can repair and re-run a failed or canceled job. You can monitor job run results using the UI, CLI, API, and notifications (for example, email, webhook destination, or Slack notifications).

If you prefer an infrastructure-as-code (IaC) approach to configuring and orchestrating your Jobs, use Databricks Asset Bundles (DABs). Bundles can contain YAML definitions of jobs and tasks, are managed using the Databricks CLI, and can be shared and run in different target workspaces (such as development, staging, and production). To learn about using DABs to configure and orchestrate your jobs, see Databricks Asset Bundles.

To learn about using the Databricks CLI, see What is the Databricks CLI?. To learn about using the Jobs API, see the Jobs API.

What is the minimum configuration needed for a job?

All jobs on Azure Databricks require the following:

Source code (such as a Databricks notebook) that contains logic to be run.
A compute resource to run the logic. The compute resource can be serverless compute, classic jobs compute, or all-purpose compute. See Configure compute for jobs.
A specified schedule for when the job should be run. Optionally, you can omit setting a schedule and trigger the job manually.
A unique name.

Note

If you develop your code in Databricks notebooks, you can use the Schedule button to configure that notebook as a job. See Create and manage scheduled notebook jobs.

What is a task?

A task represents a unit of logic to be run as a step in a job. Tasks can range in complexity and can include the following:

A notebook
A JAR
SQL queries
A DLT pipeline
Another job
Control flow tasks

You can control the execution order of tasks by specifying dependencies between them. You can configure tasks to run in sequence or parallel.

Jobs interact with state information and metadata of tasks, but task scope is isolated. You can use task values to share context between scheduled tasks. See Use task values to pass information between tasks.

What control flow options are available for jobs?

When configuring jobs and tasks in jobs, you can customize settings that control how the entire job and individual tasks run. These options are:

Triggers
Retries
Run if conditional tasks
If/else conditional tasks
For each tasks
Duration thresholds
Concurrency settings

Trigger types

You must specify a trigger type when you configure a job. You can choose from the following trigger types:

You can also choose to trigger your job manually, but this is mainly reserved for specific use cases such as:

You use an external orchestration tool to trigger jobs using REST API calls.
You have a job that runs rarely and requires manual intervention for validation or resolving data quality issues.
You are running a workload that only needs to be run once or a few times, such as a migration.

See Trigger types for Databricks Jobs.

Retries

Retries specify how many times a particular task should be re-run if the task fails with an error message. Errors are often transient and resolved through restart. Some features on Azure Databricks, such as schema evolution with Structured Streaming, assume that you run jobs with retries to reset the environment and allow a workflow to proceed.

If you specify retries for a task, the task restarts up to the specified number of times if it encounters an error. Not all job configurations support task retries. See Set a retry policy.

When running in continuous trigger mode, Databricks automatically retries with exponential backoff. See How are failures handled for continuous jobs?.

Run if conditional tasks

You can use the Run if task type to specify conditionals for later tasks based on the outcome of other tasks. You add tasks to your job and specify upstream-dependent tasks. Based on the status of those tasks, you can configure one or more downstream tasks to run. Jobs support the following dependencies:

All succeeded
At least one succeeded
None failed
All done
At least one failed
All failed

See Configure task dependencies

If/else conditional tasks

You can use the If/else task type to specify conditionals based on some value. See Add branching logic to a job with the If/else task.

Jobs support taskValues that you define in your logic and allow you to return the results of some computation or state from a task to the jobs environment. You can define If/else conditions against taskValues, job parameters, or dynamic values.

Azure Databricks supports the following operands for conditionals:

==
!=
>
>=
<
<=

For each tasks

Use the For each task to run another task in a loop, passing a different set of parameters to each iteration of the task.

Adding the For each task to a job requires defining two tasks: The For each task and a nested task. The nested task is the task to run for each iteration of the For each task and is one of the standard Databricks Jobs task types. Multiple methods are supported for passing parameters to the nested task.

See Run a parameterized Azure Databricks job task in a loop.

Duration threshold

You can specify a duration threshold to send a warning or stop a task or job if a specified duration is exceeded. Examples of when you might want to configure this setting include the following:

You have tasks prone to getting stuck in a hung state.
You must warn an engineer if an SLA for a workflow is exceeded.
To avoid unexpected costs, you want to fail a job configured with a large cluster.

See Configure an expected completion time or a timeout for a job and Configure an expected completion time or a timeout for a task.

Concurrency

Most jobs are configured with the default concurrency of 1 concurrent job. This means that if a previous job run has not completed by the time a new job should be triggered, the next job run is skipped.

Some use cases exist for increased concurrency, but most workloads do not require altering this setting.

For more information about configuring concurrency, see Databricks Jobs queueing and concurrency settings.

How can I monitor jobs?

The jobs UI lets you see job runs, including runs in progress. See Monitoring and observability for Databricks Jobs.

You can receive notifications when a job or task starts, completes, or fails. You can send notifications to one or more email addresses or system destinations. See Add email and system notifications for job events.

System tables include a lakeflow schema where you can view records related to job activity in your account. See Jobs system table reference.

You can also join the jobs system tables with billing tables to monitor the cost of jobs across your account. See Monitor job costs with system tables.

Limitations

The following limitations exist:

A workspace is limited to 2000 concurrent task runs. A 429 Too Many Requests response is returned when you request a run that cannot start immediately.
The number of jobs a workspace can create in an hour is limited to 10000 (includes “runs submit”). This limit also affects jobs created by the REST API and notebook workflows.
A workspace can contain up to 12000 saved jobs.
A job can contain up to 100 tasks.

Can I manage workflows programmatically?

Databricks has tools and APIs that allow you to schedule and orchestrate your workflows programmatically, including the following:

For more information about developer tools, see Developer tools.

Workflow orchestration with Apache AirFlow

You can use Apache Airflow to manage and schedule your data workflows. With Airflow, you define your workflow in a Python file, and Airflow manages scheduling and running the workflow. See Orchestrate Azure Databricks jobs with Apache Airflow.

Workflow orchestration with Azure Data Factory

Azure Data Factory (ADF) is a cloud data integration service that lets you compose data storage, movement, and processing services into automated data pipelines. You can use ADF to orchestrate an Azure Databricks job as part of an ADF pipeline.

ADF also has built-in support to run Databricks notebooks, Python scripts, or code packaged in JARs in an ADF pipeline.

To learn how to run a Databricks notebook in an ADF pipeline, see Run a Databricks notebook with the Databricks notebook activity in Azure Data Factory, followed by Transform data by running a Databricks notebook.

To learn how to run a Python script in an ADF pipeline, see Transform data by running a Python activity in Azure Databricks.

To learn how to run code packaged in a JAR in an ADF pipeline, see Transform data by running a JAR activity in Azure Databricks.

Jaa