Optimization recommendations on Azure Databricks
Azure Databricks provides many optimizations supporting a variety of workloads on the lakehouse, ranging from large-scale ETL processing to ad-hoc, interactive queries. Many of these optimizations take place automatically. You get their benefits simply by using Azure Databricks. Additionally, most Databricks Runtime features require Delta Lake, the default storage layer used to create tables in Azure Databricks.
Azure Databricks configures default values that optimize most workloads. But, in some cases, changing configuration settings improves performance.
Databricks Runtime performance enhancements
Use the latest Databricks Runtime to leverage the newest performance enhancements. All behaviors documented here are enabled by default in Databricks Runtime 10.4 LTS and above.
- Disk caching accelerates repeated reads against Parquet data files by loading data to disk volumes attached to compute clusters.
- Dynamic file pruning improves query performance by skipping directories that do not contain data files that match query predicates.
- Low shuffle merge reduces the number of data files rewritten by
MERGEoperations and reduces the need to recaculate
- Apache Spark 3.0 introduced adaptive query execution, which provides enhanced performance for many operations.
Databricks recommendations for enhanced performance
- You can clone tables on Azure Databricks to make deep or shallow copies of source datasets.
- The cost-based optimizer accelerates query performance by leveraging table statistics.
- You can use Spark SQL to interact with semi-structured JSON data without parsing strings.
- Higher order functions provide built-in, optimized performance for many operations that do not have common Spark operators. Higher order functions provide a performance benefit over user defined functions.
- Azure Databricks provides a number of built-in operators and special syntax for working with complex data types, including arrays, structs, and JSON strings.
- You can manually tune settings for joins that include ranges or contain data with substanial skew.
- Azure Databricks provides a write serializable isolation guarantee by default; changing the isolation level to serializable can reduce throughput for concurrent operations, but might be necessary when read serializability is required.
- You can use bloom filter indexes to reduce the likelihood of scanning data files that don’t contain records matching a given condition.