Optimization recommendations on Azure Databricks
Azure Databricks provides many optimizations supporting a variety of workloads on the lakehouse, ranging from large-scale ETL processing to ad-hoc, interactive queries. Many of these optimizations take place automatically. You get their benefits simply by using Azure Databricks. Additionally, most Databricks Runtime features require Delta Lake, the default format used to create tables in Azure Databricks.
Azure Databricks configures default values that optimize most workloads. But, in some cases, changing configuration settings improves performance.
Databricks Runtime performance enhancements
Note
Use the latest Databricks Runtime to leverage the newest performance enhancements. All behaviors documented here are enabled by default in Databricks Runtime 10.4 LTS and above.
- Disk caching accelerates repeated reads against Parquet data files by loading data to disk volumes attached to compute clusters.
- Dynamic file pruning improves query performance by skipping directories that do not contain data files that match query predicates.
- Low shuffle merge reduces the number of data files rewritten by
MERGE
operations and reduces the need to recaculateZORDER
clusters. - Apache Spark 3.0 introduced adaptive query execution, which provides enhanced performance for many operations.
Databricks recommendations for enhanced performance
- You can clone tables on Azure Databricks to make deep or shallow copies of source datasets.
- The cost-based optimizer accelerates query performance by leveraging table statistics.
- You can use Spark SQL to interact with JSON strings without parsing strings.
- Higher order functions provide built-in, optimized performance for many operations that do not have common Spark operators. Higher order functions provide a performance benefit over user defined functions.
- Azure Databricks provides a number of built-in operators and special syntax for working with complex data types, including arrays, structs, and JSON strings.
- You can manually tune settings for range joins. See Range join optimization.
Opt-in behaviors
- Azure Databricks provides a write serializable isolation guarantee by default; changing the isolation level to serializable can reduce throughput for concurrent operations, but might be necessary when read serializability is required.
- You can use bloom filter indexes to reduce the likelihood of scanning data files that don’t contain records matching a given condition.