Configure compute for jobs

Article
11/04/2024

This article contains recommendations and resources for configuring compute for Databricks Jobs.

Important

Limitations for serverless compute for jobs include the following:

No support for Continuous scheduling.
No support for default or time-based interval triggers in Structured Streaming.

For more limitations, see Serverless compute limitations.

Each job can have one or more tasks. You define compute resources for each task. Multiple tasks defined for the same job can use the same compute resource.

Image showing a job with multiple tasks and associated cloud compute resources

What is the recommended compute for each task?

The following table indicates the recommended and supported compute types for each task type.

Note

Serverless compute for jobs has limitations and does not support all workloads. See Serverless compute limitations.

Task	Recommended compute	Supported compute
Notebooks	Serverless jobs	Serverless jobs, classic jobs, classic all-purpose
Python script	Serverless jobs	Serverless jobs, classic jobs, classic all-purpose
Python wheel	Serverless jobs	Serverless jobs, classic jobs, classic all-purpose
SQL	Serverless SQL warehouse	Serverless SQL warehouse, pro SQL warehouse
Delta Live Tables pipeline	Serverless pipeline	Serverless pipeline, classic pipeline
dbt	Serverless SQL warehouse	Serverless SQL warehouse, pro SQL warehouse
dbt CLI commands	Serverless jobs	Serverless jobs, classic jobs, classic all-purpose
JAR	Classic jobs	Classic jobs, classic all-purpose
Spark Submit	Classic jobs	Classic jobs
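
The table above can also be kept as a small lookup for validating task definitions in tooling. The following is a minimal sketch that only transcribes the table; the names `COMPUTE_BY_TASK`, `recommended_compute`, and `is_supported` are illustrative and are not part of any Databricks API.

```python
# Recommended and supported compute per task type, transcribed from the table above.
# Each entry is (recommended compute, set of supported compute types).
_JOBS_ALL = {"serverless jobs", "classic jobs", "classic all-purpose"}
_WAREHOUSES = {"serverless sql warehouse", "pro sql warehouse"}

COMPUTE_BY_TASK = {
    "notebooks": ("serverless jobs", _JOBS_ALL),
    "python script": ("serverless jobs", _JOBS_ALL),
    "python wheel": ("serverless jobs", _JOBS_ALL),
    "sql": ("serverless sql warehouse", _WAREHOUSES),
    "delta live tables pipeline": ("serverless pipeline",
                                   {"serverless pipeline", "classic pipeline"}),
    "dbt": ("serverless sql warehouse", _WAREHOUSES),
    "dbt cli commands": ("serverless jobs", _JOBS_ALL),
    "jar": ("classic jobs", {"classic jobs", "classic all-purpose"}),
    "spark submit": ("classic jobs", {"classic jobs"}),
}


def recommended_compute(task_type: str) -> str:
    """Return the recommended compute type for a task type (case-insensitive)."""
    return COMPUTE_BY_TASK[task_type.lower()][0]


def is_supported(task_type: str, compute: str) -> bool:
    """Return True if the table lists `compute` as supported for `task_type`."""
    return compute.lower() in COMPUTE_BY_TASK[task_type.lower()][1]
```

For example, `is_supported("Spark Submit", "classic all-purpose")` is false because the table lists only classic jobs compute for Spark Submit tasks.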

Pricing for Jobs is tied to the compute used to run tasks. For more details, see Databricks pricing.

How do I configure compute for Jobs?

Classic jobs compute is configured directly from the Databricks Jobs UI, and these configurations are part of the job definition. All other available compute types store their configurations with other workspace assets. The following table has more details:

Compute type	Details
Classic jobs compute	You configure compute for classic jobs using the same UI and settings available for all-purpose compute. See Compute configuration reference.
Serverless compute for jobs	Serverless compute for jobs is the default for all tasks that support it. Databricks manages compute settings for serverless compute. See Run your Azure Databricks job with serverless compute for workflows. A workspace admin must enable serverless compute for this option to be visible. See Enable serverless compute.
SQL warehouses	Serverless and pro SQL warehouses are configured by workspace admins or users with unrestricted cluster creation privileges. You configure tasks to run against existing SQL warehouses. See Connect to a SQL warehouse.
Delta Live Tables pipeline compute	You configure compute settings for Delta Live Tables pipelines during pipeline configuration. See Configure compute for a Delta Live Tables pipeline. Azure Databricks manages compute resources for serverless Delta Live Tables pipelines. See Configure a serverless Delta Live Tables pipeline.
All-purpose compute	You can optionally configure tasks using classic all-purpose compute. Databricks does not recommend this configuration for production jobs. See Compute configuration reference and Should all-purpose compute ever be used for jobs?.

Share compute across tasks

For jobs that orchestrate multiple tasks, configure tasks to use the same jobs compute resource to optimize resource usage. Sharing compute across tasks can also reduce the latency associated with start-up times.

You can run all tasks in a job on a single job compute resource, or use multiple job compute resources optimized for specific workloads. Any job compute configured as part of a job is available to all other tasks in the job.

The following table highlights differences between job compute configured for a single task and job compute shared between tasks:

	Single task	Shared across tasks
Start	When the task run begins.	When the first task run configured to use the compute resource begins.
Terminate	After the task runs.	After the final task configured to use the compute resource runs.
Idle compute	Not applicable.	Compute remains on and idle while tasks not using the compute resource run.
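
The start, terminate, and idle rules for shared compute in the table above can be illustrated with a little arithmetic. The sketch below is a simplified model, not Databricks behavior: it assumes task times are given in minutes since the job run began and ignores cluster start-up time and any termination delay. The `TaskRun` class and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    name: str
    start: float        # minutes since the job run began
    end: float          # minutes since the job run began
    uses_shared: bool   # True if the task runs on the shared job compute


def shared_compute_window(tasks):
    """Shared compute starts with the first task that uses it and
    terminates after the final task that uses it."""
    using = [t for t in tasks if t.uses_shared]
    if not using:
        return None
    return min(t.start for t in using), max(t.end for t in using)


def idle_minutes(tasks):
    """Minutes the shared compute stays on while no task is using it."""
    window = shared_compute_window(tasks)
    if window is None:
        return 0.0
    start, end = window
    busy, cursor = 0.0, start
    # Merge overlapping busy intervals so parallel tasks are not double counted.
    for t in sorted((t for t in tasks if t.uses_shared), key=lambda t: t.start):
        if t.end > cursor:
            busy += t.end - max(t.start, cursor)
            cursor = t.end
    return (end - start) - busy


runs = [
    TaskRun("ingest", 0, 10, uses_shared=True),
    TaskRun("sql_report", 10, 25, uses_shared=False),  # runs on other compute
    TaskRun("publish", 25, 30, uses_shared=True),
]
```

With these made-up timings, the shared compute is on from minute 0 to minute 30 and is idle for the 15 minutes during which `sql_report` runs elsewhere.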

A shared job cluster is scoped to a single job run. It cannot be used by other jobs or by other runs of the same job.

Libraries cannot be declared in a shared job cluster configuration. You must add dependent libraries in task settings.
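
As a rough illustration, the following sketch shows what a job definition with shared job compute might look like when expressed as a Python dictionary for the Jobs API. This article does not show an API payload, so treat the shape as an assumption based on the Jobs API fields `job_clusters`, `job_cluster_key`, `tasks`, and `libraries`. All values (job name, runtime version, node type, notebook paths, package) are placeholders, and `check_job_spec` is a hypothetical helper.

```python
# Illustrative job definition: two tasks share one job cluster.
job_spec = {
    "name": "example-multi-task-job",                 # placeholder
    "job_clusters": [
        {
            "job_cluster_key": "shared_etl",
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",  # placeholder runtime
                "node_type_id": "Standard_DS3_v2",    # placeholder node type
                "num_workers": 2,
            },
        }
    ],
    "tasks": [
        {
            "task_key": "ingest",
            "job_cluster_key": "shared_etl",
            "notebook_task": {"notebook_path": "/Workspace/Shared/ingest"},
            # Dependent libraries are declared on the task, not on the cluster.
            "libraries": [{"pypi": {"package": "requests"}}],
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],
            "job_cluster_key": "shared_etl",
            "notebook_task": {"notebook_path": "/Workspace/Shared/transform"},
        },
    ],
}


def check_job_spec(spec: dict) -> list:
    """Return a list of problems found in a job definition (empty if none)."""
    problems = []
    clusters = spec.get("job_clusters", [])
    keys = {c["job_cluster_key"] for c in clusters}
    for c in clusters:
        if "libraries" in c or "libraries" in c.get("new_cluster", {}):
            problems.append(f"libraries declared on shared cluster {c['job_cluster_key']}")
    for t in spec.get("tasks", []):
        key = t.get("job_cluster_key")
        if key is not None and key not in keys:
            problems.append(f"task {t['task_key']} references unknown cluster {key}")
    return problems
```

Running `check_job_spec(job_spec)` on the example returns an empty list. A check like this is only a local convenience; the service performs its own validation.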

Review, configure, and swap jobs compute

The Compute section in the Job details panel lists all compute configured for tasks in the current job.

Tasks configured to use a compute resource are highlighted in the task graph when you hover over the compute specification.

Use the Swap button to change the compute for all tasks associated with a compute resource.

Classic jobs compute resources have a Configure option. Other compute resources give you options to view and modify compute configuration details.

Recommendations for configuring classic jobs compute

This section focuses on general recommendations about features and configurations that can benefit some workflows. Specific recommendations for configuring the size and types of compute resources vary based on the workload.

Databricks recommends enabling Photon Acceleration, using recent Databricks Runtime versions, and using compute configured for Unity Catalog.
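
These three recommendations can be checked against a classic cluster specification. The sketch below assumes the Clusters API field names `runtime_engine`, `spark_version`, and `data_security_mode`. The threshold `MIN_RUNTIME_MAJOR` is a hypothetical stand-in for "recent" that you would choose yourself, and `recommendation_gaps` is an illustrative helper, not a Databricks function.

```python
MIN_RUNTIME_MAJOR = 14  # hypothetical threshold for a "recent" runtime


def recommendation_gaps(cluster: dict) -> list:
    """Return the recommendations that a cluster specification does not meet."""
    gaps = []
    if cluster.get("runtime_engine") != "PHOTON":
        gaps.append("Photon is not enabled")
    # Runtime version strings are assumed to start with the major version,
    # for example "15.4.x-scala2.12".
    major = cluster.get("spark_version", "").split(".", 1)[0]
    if not major.isdigit() or int(major) < MIN_RUNTIME_MAJOR:
        gaps.append("Databricks Runtime version is not recent")
    # Assumed API values for the shared and single user access modes.
    if cluster.get("data_security_mode") not in {"USER_ISOLATION", "SINGLE_USER"}:
        gaps.append("access mode is not enabled for Unity Catalog")
    return gaps
```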

Serverless compute for jobs manages all infrastructure, so the considerations in the following sections do not apply to it. See Run your Azure Databricks job with serverless compute for workflows.

Note

Structured Streaming workflows have specific recommendations. See Production considerations for Structured Streaming.

Use shared access mode

Databricks recommends using shared access mode for jobs. See Access modes.

Note

Shared access mode does not support some workloads and features. Databricks recommends single user access mode for these workloads. See Compute access mode limitations for Unity Catalog.
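
In a cluster specification, the access mode is assumed here to map to the Clusters API field `data_security_mode`, with `USER_ISOLATION` for shared access mode and `SINGLE_USER` (plus `single_user_name`) for single user access mode. The helper `with_access_mode` is hypothetical and only sketches how the choice might be applied.

```python
from typing import Optional


def with_access_mode(cluster: dict, single_user: Optional[str] = None) -> dict:
    """Return a copy of `cluster` using shared access mode by default,
    or single user access mode when a user name is given."""
    spec = dict(cluster)
    if single_user is None:
        spec["data_security_mode"] = "USER_ISOLATION"   # shared access mode
        spec.pop("single_user_name", None)
    else:
        spec["data_security_mode"] = "SINGLE_USER"      # single user access mode
        spec["single_user_name"] = single_user
    return spec
```

Use the single user variant only for workloads that shared access mode does not support.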

Use cluster policies

Databricks recommends that workspace admins define cluster policies for jobs and enforce these policies for all users who configure jobs.

Cluster policies allow workspace admins to set cost controls and limit users’ configuration options. For details on configuring cluster policies, see Create and manage compute policies.

Azure Databricks provides a default policy configured for jobs. Admins can make this policy available to other workspace users. See Job Compute.
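
To make the idea of a policy concrete, the sketch below shows a small policy definition and a simplified local check. The attribute paths and the `fixed`, `range`, and `allowlist` rule types follow the compute policy definition format as an assumption; the limits and node types are made up. Azure Databricks enforces policies itself, so `policy_violations` is only an illustration of what a policy constrains.

```python
# Illustrative policy definition for job compute (all values are made up).
policy_definition = {
    "cluster_type": {"type": "fixed", "value": "job"},
    "autoscale.max_workers": {"type": "range", "maxValue": 8},
    "node_type_id": {"type": "allowlist",
                     "values": ["Standard_DS3_v2", "Standard_DS4_v2"]},
}


def lookup(spec: dict, dotted_path: str):
    """Resolve a dotted attribute path such as 'autoscale.max_workers'."""
    node = spec
    for part in dotted_path.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def policy_violations(spec: dict, policy: dict) -> list:
    """Return the attribute paths in `spec` that break a rule in `policy`.
    Simplification: attributes missing from `spec` are skipped."""
    found = []
    for path, rule in policy.items():
        value = lookup(spec, path)
        if value is None:
            continue
        kind = rule["type"]
        if kind == "fixed" and value != rule["value"]:
            found.append(path)
        elif kind == "allowlist" and value not in rule["values"]:
            found.append(path)
        elif kind == "range" and not (
            rule.get("minValue", value) <= value <= rule.get("maxValue", value)
        ):
            found.append(path)
    return found
```

For example, a specification that requests `"autoscale": {"min_workers": 2, "max_workers": 20}` would be reported as violating `autoscale.max_workers` under this policy.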

Use autoscaling

Configure autoscaling so that long-running tasks can dynamically add and remove worker nodes during job runs. See Enable autoscaling.
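
In a cluster specification, autoscaling is assumed here to be expressed with the Clusters API `autoscale` block, which replaces a fixed `num_workers`. The helper name `with_autoscaling` and the worker counts are illustrative.

```python
def with_autoscaling(cluster: dict, min_workers: int, max_workers: int) -> dict:
    """Return a copy of `cluster` that scales between min_workers and
    max_workers instead of using a fixed number of workers."""
    if not 0 <= min_workers <= max_workers:
        raise ValueError("expected 0 <= min_workers <= max_workers")
    # A fixed worker count and an autoscale range are alternatives.
    spec = {k: v for k, v in cluster.items() if k != "num_workers"}
    spec["autoscale"] = {"min_workers": min_workers, "max_workers": max_workers}
    return spec
```

The range check is a basic sanity check only; the service may apply stricter rules.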

Use a pool to reduce cluster start times

Compute pools allow you to reserve compute resources from your cloud provider. Pools decrease start times for new job clusters and help ensure that compute resources are available. See Pool configuration reference.
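
A job cluster specification can reference a pool by ID. The sketch below assumes the Clusters API field `instance_pool_id` and assumes that the node type comes from the pool, so `node_type_id` is dropped from the specification. The helper name and the pool ID in the usage are placeholders.

```python
def with_pool(cluster: dict, instance_pool_id: str) -> dict:
    """Return a copy of `cluster` that draws its nodes from a pool."""
    # Assumption: the pool determines the node type, so it is not repeated here.
    spec = {k: v for k, v in cluster.items() if k != "node_type_id"}
    spec["instance_pool_id"] = instance_pool_id
    return spec


pooled = with_pool(
    {"spark_version": "15.4.x-scala2.12", "node_type_id": "Standard_DS3_v2",
     "num_workers": 2},
    "example-pool-id",  # placeholder pool ID
)
```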

Use spot instances

Configure spot instances for workloads that have lax latency requirements to optimize costs. See Spot instances.
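
On Azure, spot configuration is assumed here to live in the `azure_attributes` block of the cluster specification. The following sketch keeps the first node on demand and requests spot capacity with fallback for the rest. The helper name is hypothetical, and the field values reflect one plausible configuration rather than a recommendation from this article.

```python
def with_spot_workers(cluster: dict, first_on_demand: int = 1) -> dict:
    """Return a copy of `cluster` that uses spot capacity, falling back to
    on-demand nodes when spot capacity is unavailable."""
    spec = dict(cluster)
    spec["azure_attributes"] = {
        "availability": "SPOT_WITH_FALLBACK_AZURE",
        # Keep the first node(s), typically the driver, on demand.
        "first_on_demand": first_on_demand,
        # -1 is assumed to mean "do not evict based on price".
        "spot_bid_max_price": -1,
    }
    return spec
```

Because spot nodes can be evicted, this pattern fits workloads that tolerate retries and delays.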

Should all-purpose compute ever be used for jobs?

There are numerous reasons that Databricks recommends against using all-purpose compute for jobs, including the following:

Azure Databricks bills for all-purpose compute at a different rate than jobs compute.
Jobs compute terminates automatically after a job run is complete. All-purpose compute supports auto-termination, which is tied to inactivity rather than the end of a job run.
All-purpose compute is often shared across teams of users. Jobs scheduled against all-purpose compute often have increased latency due to competition for compute resources.
Many recommendations for optimizing jobs compute configuration are not appropriate for the type of ad-hoc queries and interactive workloads run on all-purpose compute.

The following are use cases in which you might choose to use all-purpose compute for jobs:

You are iteratively developing or testing new jobs. Start-up times for jobs compute can make iterative development tedious. All-purpose compute allows you to apply changes and run your job quickly.
You have short-lived jobs that must be run frequently or on a specific schedule. All-purpose compute that is already running has no start-up time. Consider the costs associated with idle time if you use this pattern.

Serverless compute for jobs is the recommended substitute for most task types you might consider running against all-purpose compute.
