Create job setup and configuration
When you need to automate data processing or orchestrate multiple operations in Azure Databricks, you create a Lakeflow Job. A job coordinates tasks, manages their execution order, and allocates compute resources to run your workloads reliably.
In this unit, you learn how to create and configure a Lakeflow Job, including setting up tasks, selecting compute resources, and organizing task dependencies.
Understand job structure
A Lakeflow Job consists of one or more tasks organized as a Directed Acyclic Graph (DAG). The DAG defines execution order and dependencies between tasks, allowing you to build workflows that range from a single notebook execution to complex multi-step data pipelines.
Every job requires at minimum:
- A task containing the logic to run
- A compute resource to execute the task
- A unique name to identify the job
Tasks within a job can execute notebooks, Python scripts, SQL queries, or Lakeflow Spark Declarative Pipelines. You configure each task type differently, but all tasks follow the same pattern: define what to run, specify where to run it, and configure any required parameters.
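The same structure can also be expressed in code. The sketch below uses the Databricks SDK for Python to create a one-task job and illustrates the pattern of defining what to run, where to run it, and which parameters to pass. The job name, notebook path, and parameter values are placeholders, and it assumes the databricks-sdk package is installed and authenticated against your workspace.

```python
# Minimal sketch with the Databricks SDK for Python (pip install databricks-sdk).
# Assumes authentication to your workspace is already configured; the job name,
# notebook path, and parameters below are placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

job = w.jobs.create(
    name="daily-ingest",  # unique, descriptive job name
    tasks=[
        jobs.Task(
            task_key="ingest",  # identifies the task within the job
            notebook_task=jobs.NotebookTask(
                notebook_path="/Workspace/Users/someone@example.com/ingest",
                base_parameters={"source": "landing-zone"},  # task parameters
            ),
            # No compute is specified here; in workspaces where serverless jobs
            # compute is enabled, the task runs on serverless by default.
        ),
    ],
)
print(f"Created job {job.job_id}")
```

The UI steps in the next section produce an equivalent job definition; code-based creation is simply another way to express the same task, compute, and parameter choices.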
Create a job and add tasks
To create a new job in Azure Databricks:
- In your workspace sidebar, select Jobs & Pipelines.
- Select Create, then Job.
- Enter a descriptive name for your job.
- Configure your first task by specifying the Task name and selecting the Type (such as Notebook, Python script, or SQL).
The task type determines which configuration options appear. For a notebook task, you specify the notebook path and any parameters. For a SQL task, you select a query and SQL warehouse. The following table summarizes common task types and their configuration requirements:
| Task type | Key configuration | Compute options |
|---|---|---|
| Notebook | Notebook path, parameters | Serverless, classic jobs, all-purpose |
| Python script | Script path, CLI arguments | Serverless, classic jobs, all-purpose |
| Python wheel | Wheel package path, entry point | Serverless, classic jobs, all-purpose |
| SQL | Query or file, SQL warehouse | Serverless SQL warehouse, pro SQL warehouse |
| Pipeline | Existing pipeline selection | Serverless or classic pipeline compute |
| dbt / dbt platform | dbt project, profiles | Serverless or classic jobs compute |
| JAR | Main class, JAR path | Classic jobs compute |
| Spark Submit | Spark parameters | Classic jobs compute |
| Run Job | Existing job selection | Determined by the referenced job |
Additional task types support control flow within a job. If/else tasks evaluate a condition and route execution to different downstream paths. For each tasks apply the same logic across every item in an input array. These patterns are covered in the Design and implement data pipelines module.
After configuring a task, select Create task to add it to your job.
Configure task sources
Tasks that run code (notebooks, Python scripts, SQL files) need a source location. You have three options for specifying where your code lives:
Workspace stores code directly in your Azure Databricks workspace. Use the file browser to navigate to your notebook or script, then confirm your selection. This option works well for development and simple workflows.
Git provider connects to a remote repository. You specify the repository URL, branch or tag, and the relative path to your file. All tasks in a job share the same Git reference, ensuring consistent code versions across the workflow. When you use Git, Azure Databricks captures a snapshot of the code at run time, so your job executes against a specific commit.
DBFS/ADLS (for Python scripts) allows you to reference files stored in volumes or cloud storage. Provide the full URI, such as abfss://container@storage.dfs.core.windows.net/path/script.py.
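As a rough illustration, the fragment below shows how the Git provider and DBFS/ADLS options look when a job is defined with the Databricks SDK for Python. The repository URL, branch, relative notebook path, and storage URI are placeholders.

```python
# Sketch of task sources, continuing the SDK example above. The repository,
# branch, notebook path, and storage URI are placeholders.
from databricks.sdk.service import jobs

# Git provider: a single git_source is shared by every task in the job.
git_source = jobs.GitSource(
    git_url="https://github.com/example-org/example-repo",
    git_provider=jobs.GitProvider.GIT_HUB,
    git_branch="main",
)

tasks = [
    # Notebook resolved from the Git snapshot taken when the run starts.
    jobs.Task(
        task_key="transform",
        notebook_task=jobs.NotebookTask(
            notebook_path="notebooks/transform",  # relative to the repository root
            source=jobs.Source.GIT,
        ),
    ),
    # Python script referenced directly from ADLS by its abfss:// URI.
    jobs.Task(
        task_key="score",
        spark_python_task=jobs.SparkPythonTask(
            python_file="abfss://container@storage.dfs.core.windows.net/path/script.py",
        ),
    ),
]
# Pass git_source=git_source and tasks=tasks to w.jobs.create(...).
```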
Configure compute resources
Each task needs compute resources to execute. Azure Databricks offers several compute options optimized for different workloads.
Serverless compute is the default for supported task types. Azure Databricks manages the infrastructure, so you don't configure cluster settings. Serverless compute reduces operational overhead and scales automatically.
Classic jobs compute gives you control over cluster configuration. You specify the Spark version, instance types, and autoscaling policies. Use classic compute when you need specific configurations or libraries not supported by serverless.
SQL warehouses run SQL tasks. Select an existing serverless or pro SQL warehouse from your workspace.
When multiple tasks share the same compute resource, the cluster remains active until all tasks complete. Sharing compute reduces startup time between tasks but incurs cost during idle periods. You can balance this by grouping related tasks on shared compute while isolating resource-intensive operations.
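To make that trade-off concrete, the sketch below defines one classic jobs cluster and shares it between two related tasks, again using the Databricks SDK for Python. The runtime version, VM type, worker count, and notebook paths are placeholders you would adjust for your workspace.

```python
# Sketch: one classic jobs cluster shared by two tasks. The runtime version,
# node type, worker count, and notebook paths are placeholders.
from databricks.sdk.service import compute, jobs

shared_cluster = jobs.JobCluster(
    job_cluster_key="shared-etl-cluster",
    new_cluster=compute.ClusterSpec(
        spark_version="15.4.x-scala2.12",  # pick a runtime supported in your workspace
        node_type_id="Standard_DS3_v2",    # Azure VM type (placeholder)
        num_workers=2,
    ),
)

tasks = [
    jobs.Task(
        task_key="extract",
        job_cluster_key="shared-etl-cluster",  # both tasks reuse the same cluster
        notebook_task=jobs.NotebookTask(notebook_path="/Workspace/etl/extract"),
    ),
    jobs.Task(
        task_key="load",
        depends_on=[jobs.TaskDependency(task_key="extract")],
        job_cluster_key="shared-etl-cluster",
        notebook_task=jobs.NotebookTask(notebook_path="/Workspace/etl/load"),
    ),
]
# Pass job_clusters=[shared_cluster] and tasks=tasks to w.jobs.create(...).
```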
To view and modify compute configuration:
- Open your job and select the Job details panel.
- Under Compute, review the resources assigned to each task.
- Use Configure to modify classic jobs compute, or use Swap to change the compute for every task that uses a given resource.
Set up task dependencies
Jobs with multiple tasks use dependencies to control execution order. Dependencies create the DAG structure that determines which tasks run in sequence and which can run in parallel.
To add a dependency:
- Select a task in the task graph.
- In the Depends on field, select the upstream tasks that must complete first.
- Choose a Run if condition to specify when the downstream task should execute.
The available run-if conditions let you handle various scenarios:
| Condition | When the task runs |
|---|---|
| All succeeded | All upstream tasks completed successfully |
| At least one succeeded | Any upstream task succeeded |
| None failed | No upstream tasks failed (some may be skipped) |
| All done | All upstream tasks finished, regardless of outcome |
| At least one failed | At least one upstream task failed |
| All failed | All upstream tasks failed |
Use All done for cleanup tasks that should run regardless of earlier results. Use At least one failed to trigger error-handling logic when problems occur.
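For reference, this is roughly how those two patterns look in the Databricks SDK for Python. Task keys and notebook paths are placeholders, and the upstream tasks (extract and transform) are assumed to exist in the same job.

```python
# Sketch: a cleanup task that runs regardless of upstream outcomes, and an
# alerting task that runs only when an upstream task failed. Names are placeholders.
from databricks.sdk.service import jobs

cleanup = jobs.Task(
    task_key="cleanup",
    depends_on=[
        jobs.TaskDependency(task_key="extract"),
        jobs.TaskDependency(task_key="transform"),
    ],
    run_if=jobs.RunIf.ALL_DONE,  # run once all upstream tasks finish, success or not
    notebook_task=jobs.NotebookTask(notebook_path="/Workspace/etl/cleanup"),
)

alert_on_failure = jobs.Task(
    task_key="alert_on_failure",
    depends_on=[jobs.TaskDependency(task_key="transform")],
    run_if=jobs.RunIf.AT_LEAST_ONE_FAILED,  # error-handling path
    notebook_task=jobs.NotebookTask(notebook_path="/Workspace/etl/alert"),
)
```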
Add job parameters
Parameters make your jobs reusable by letting you pass different values to each run. You define parameters at the job level, and they're automatically available to all tasks that accept key-value inputs.
To add job parameters:
- In the Job details panel, locate the Parameters section.
- Select Add and enter a key-value pair.
Tasks access parameters differently based on their type. In notebooks, use dbutils.widgets.get("parameter_name") to retrieve parameter values. Python scripts receive parameters as command-line arguments.
You can also reference dynamic values in parameters. For example, {{job.trigger.time.iso_date}} inserts the trigger date, useful for processing data based on when the job runs.
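As a quick illustration, the notebook fragment below reads a job parameter named run_date (a placeholder name); the closing comment notes the Python script equivalent.

```python
# Inside a notebook task, job parameters surface as widgets.
# "run_date" is a placeholder parameter name; its value might be supplied by
# the dynamic value reference {{job.trigger.time.iso_date}} at trigger time.
run_date = dbutils.widgets.get("run_date")
print(f"Processing data for {run_date}")

# In a Python script task, the same values arrive as command-line arguments
# instead, so read them with argparse or sys.argv rather than dbutils.
```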
Organize with tags
Tags help you categorize and filter jobs in your workspace. Add tags as labels or key-value pairs to group related jobs by team, project, or environment.
To add tags:
- In the Job details panel, select + Tag.
- Enter a key and optionally a value.
Tags also propagate to job clusters, enabling consistent monitoring and cost attribution across your organization.
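If you define jobs in code, tags can be set at creation time as well. The sketch below continues the earlier Databricks SDK example; the tag keys and values are placeholders.

```python
# Sketch: tagging a job at creation time. Tag keys and values are placeholders;
# w and tasks come from the earlier sketches.
job = w.jobs.create(
    name="daily-ingest",
    tags={"team": "data-platform", "env": "prod", "project": "customer-360"},
    tasks=tasks,
)
```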
Configure job access permissions
Job permissions control who can view, run, and manage your Lakeflow Jobs independently from compute permissions. Azure Databricks provides four permission levels:
| Permission level | Capabilities |
|---|---|
| CAN VIEW | View job configuration, task definitions, and run history |
| CAN RUN | Everything in CAN VIEW plus trigger job runs manually |
| CAN MANAGE RUN | Everything in CAN RUN plus cancel runs, view output, and restart failed runs |
| CAN MANAGE | Everything in CAN MANAGE RUN plus edit configuration, modify tasks, change schedules, and set permissions |
To configure permissions, navigate to Jobs & Pipelines, select your job, open the Permissions tab, and add users or groups with the appropriate level. Job creators and workspace admins automatically receive CAN MANAGE permissions.
When a job runs, it executes with the job owner's permissions or the configured service principal's permissions—not the triggering user's permissions. For production jobs, grant CAN MANAGE to the pipeline team, CAN RUN to users who need manual execution, and CAN VIEW to stakeholders requiring visibility.
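Permissions can also be assigned programmatically. The sketch below uses the workspace permissions API through the Databricks SDK for Python; the group names are placeholders, and it assumes the job object returned by the earlier creation sketch.

```python
# Sketch: assigning job permissions with the Databricks SDK for Python.
# Group names are placeholders; `job` is the object returned by w.jobs.create.
from databricks.sdk.service import iam

w.permissions.set(
    request_object_type="jobs",
    request_object_id=str(job.job_id),
    access_control_list=[
        iam.AccessControlRequest(
            group_name="pipeline-team",
            permission_level=iam.PermissionLevel.CAN_MANAGE,
        ),
        iam.AccessControlRequest(
            group_name="stakeholders",
            permission_level=iam.PermissionLevel.CAN_VIEW,
        ),
    ],
)
# set replaces the existing access control list; use w.permissions.update
# to add entries without replacing what is already there.
```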
Configure run identity and Unity Catalog access
When your job accesses Unity Catalog objects, such as tables, views, or volumes, the job's run identity must hold the required Unity Catalog privileges. Verify this before configuring any job that reads from or writes to Unity Catalog-managed data.
The run identity is the principal whose permissions Unity Catalog evaluates during job execution. By default, jobs run as the job owner (the user who created the job). For production workloads, you can configure a service principal as the run identity to avoid dependency on individual user accounts.
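As a sketch of that configuration in code, the snippet below sets a service principal as the run identity when creating a job with the Databricks SDK for Python. The application ID shown is a placeholder, and the service principal must already exist in your workspace.

```python
# Sketch: running the job as a service principal instead of the job owner.
# The application ID is a placeholder for an existing service principal.
from databricks.sdk.service import jobs

run_as = jobs.JobRunAs(
    service_principal_name="00000000-0000-0000-0000-000000000000",
)
# Pass run_as=run_as to w.jobs.create(...), or apply it to an existing job
# with w.jobs.update(...) / w.jobs.reset(...).
```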
Before creating your job, verify that the run identity has the necessary privileges:
| Operation | Required Unity Catalog privilege |
|---|---|
| Read from a table | SELECT on the table |
| Write to a table | MODIFY on the table |
| Create tables in a schema | CREATE TABLE and USE SCHEMA on the schema |
| Access a volume | READ VOLUME or WRITE VOLUME on the volume |
To grant privileges to a service principal or user, use SQL commands like the following. Unity Catalog also requires USE CATALOG on the parent catalog and USE SCHEMA on the parent schema to reach an object, so grant those alongside the object-level privileges:

```sql
GRANT USE CATALOG ON CATALOG catalog TO `service-principal-id`;
GRANT USE SCHEMA ON SCHEMA catalog.schema TO `service-principal-id`;
GRANT SELECT, MODIFY ON TABLE catalog.schema.table TO `service-principal-id`;
```
If the run identity lacks the required privileges, the job fails at runtime with an authorization error—even if the job configuration itself is valid. Always verify Unity Catalog access before scheduling production jobs.
With your job created, tasks configured, dependencies set, and permissions assigned, you're ready to run your workflow. The next step is understanding how to monitor job execution and handle run outcomes.