Compute configuration recommendations

Άρθρο
09/18/2024

This article includes recommendations and best practices related to compute configuration.

If your workload is supported, Databricks recommends using serverless compute rather than configuring your own compute resource. Serverless compute is the simplest and most reliable compute option. It requires no configuration, is always available, and scales according to your workload. Serverless compute is available for notebooks, jobs, and Delta Live Tables. See Connect to serverless compute.

Additionally, data analysts can use serverless SQL warehouses to query and explore data on Databricks. See What are Serverless SQL warehouses?.

Use compute policies

If you are creating new compute from scratch, Databricks recommends using compute policies. Compute policies let you create preconfigured compute resources designed for specific purposes, such as personal compute, shared compute, power users, and jobs. Policies limit the decisions you need to make when configuring compute settings.

If you don’t have access to policies, contact your workspace admin. See Default policies and policy families.

Compute sizing considerations

Note

The following recommendations assume that you have unrestricted cluster creation. Workspace admins should only grant this privilege to advanced users.

People often think of compute size in terms of the number of workers, but there are other important factors to consider:

Total executor cores (compute): The total number of cores across all executors. This determines the maximum parallelism of a compute.
Total executor memory: The total amount of RAM across all executors. This determines how much data can be stored in memory before spilling it to disk.
Executor local storage: The type and amount of local disk storage. Local disk is primarily used in the case of spills during shuffles and caching.

Additional considerations include worker instance type and size, which also influence the factors above. When sizing your compute, consider:

How much data will your workload consume?
What’s the computational complexity of your workload?
Where are you reading data from?
How is the data partitioned in external storage?
How much parallelism do you need?

Answering these questions will help you determine optimal compute configurations based on workloads.

There’s a balancing act between the number of workers and the size of worker instance types. Configuring compute with two workers, each with 16 cores and 128 GB of RAM, has the same compute and memory as configuring compute with 8 workers, each with 4 cores and 32 GB of RAM.

Compute configuration examples

The following examples show compute recommendations based on specific types of workloads. These examples also include configurations to avoid and why those configurations are not suitable for the workload types.

Note

All of the examples in this section (besides machine learning training) could benefit from using serverless compute rather than spinning up a new compute resource. If your workload isn’t supported on serverless, use the recommendations below to help configure your compute resource.

Data analysis

Data analysts typically perform processing requiring data from multiple partitions, leading to many shuffle operations. A compute resource with a smaller number of larger nodes can reduce the network and disk I/O needed to perform these shuffles.

A single-node compute with a large VM type is likely the best choice, particularly for a single analyst.

Analytical workloads will likely require reading the same data repeatedly, so recommended node types are storage optimized with disk cache enabled or instances with local storage.

Additional features recommended for analytical workloads include:

Enable auto termination to ensure compute is terminated after a period of inactivity.
Consider enabling autoscaling based on the analyst’s typical workload.

Basic batch ETL

Simple batch ETL jobs that don’t require wide transformations, such as joins or aggregations, typically benefit from Photon. So, select a general-purpose instance that supports Photon.

Instances with lower requirements for memory and storage might result in cost savings over other worker types.

Complex batch ETL

For a complex ETL job, such as one that requires unions and joins across multiple tables, Databricks recommends using fewer workers to reduce the amount of data shuffled. To compensate for having fewer workers, increase the size of your instances.

Complex transformations can be compute-intensive. If you observe significant spill to disk or OOM errors, increase the amount of memory available on your instances.

Optionally, use pools to decrease compute launch times and reduce total runtime when running job pipelines.

Training machine learning models

To train machine learning models, Databricks recommends creating a compute resource using the Personal compute policy.

You should use a single node compute with a large node type for initial experimentation with training machine learning models. Having fewer nodes reduces the impact of shuffles.

Adding more workers can help with stability, but you should avoid adding too many workers because of the overhead of shuffling data.

Recommended worker types are storage optimized with disk caching enabled, or an instance with local storage to account for repeated reads of the same data and to enable caching of training data.

Additional features recommended for machine learning workloads include:

Enable auto termination to ensure compute is terminated after a period of inactivity.
Use pools, which will allow restricting compute to pre-approved instance type.
Ensure consistent compute configurations using policies.

Κοινή χρήση μέσω