Optimize performance with caching on Azure Databricks
Azure Databricks uses disk caching to accelerate data reads by creating copies of remote Parquet data files in nodes’ local storage using a fast intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote location. Successive reads of the same data are then performed locally, which results in significantly improved reading speed. The cache works for all Parquet data files (including Delta Lake tables).
Delta cache renamed to disk cache
Disk caching on Azure Databricks was formerly referred to as the Delta cache and the DBIO cache. Disk caching behavior is a proprietary Azure Databricks feature. This name change seeks to resolve confusion that it was part of the Delta Lake protocol. Behavior remains unchanged with this rename, and is described below.
The Azure Databricks disk cache differs from Apache Spark caching. Azure Databricks recommends using automatic disk caching for most operations.
When the disk cache is enabled, data that has to be fetched from a remote source is automatically added to the cache. This process is fully transparent and does not require any action. However, to preload data into the cache beforehand, you can use the
CACHE SELECT command (see Cache a subset of the data). When you use the Spark cache, you must manually specify the tables and queries to cache.
The disk cache contains local copies of remote data. It can improve the performance of a wide range of queries, but cannot be used to store results of arbitrary subqueries. The Spark cache can store the result of any subquery data and data stored in formats other than Parquet (such as CSV, JSON, and ORC).
The data stored in the disk cache can be read and operated on faster than the data in the Spark cache. This is because the disk cache uses efficient decompression algorithms and outputs data in the optimal format for further processing using whole-stage code generation.
Unlike the Spark cache, disk caching does not use system memory. Due to the high read speeds of modern SSDs, the disk cache can be fully disk-resident without a negative impact on its performance.
The following table summarizes the key differences between disk and Apache Spark caching so that you can choose the best tool for your workflow:
|Apache Spark cache
|Local files on a worker node.
|In-memory blocks, but it depends on storage level.
|Any Parquet table stored on ABFS and other file systems.
|Any DataFrame or RDD.
|Automatically, on the first read (if cache is enabled).
|Manually, requires code changes.
CACHE SELECT command
.cache + any action to materialize the cache and
|Can be enabled or disabled with configuration flags, enabled by default on certain node types.
|Automatically in LRU fashion or on any file change, manually when restarting a cluster.
|Automatically in LRU fashion, manually with
Disk cache consistency
The disk cache automatically detects when data files are created, deleted, modified, or overwritten and updates its content accordingly. You can write, modify, and delete table data with no need to explicitly invalidate cached data. Any stale entries are automatically invalidated and evicted from the cache.
The recommended (and easiest) way to use disk caching is to choose a worker type with SSD volumes when you configure your cluster. Such workers are enabled and configured for disk caching.
The disk cache is configured to use at most half of the space available on the local SSDs provided with the worker nodes. For configuration options, see Configure the disk cache.
To explicitly select a subset of data to be cached, use the following syntax:
CACHE SELECT column_name[, column_name, ...] FROM [db_name.]table_name [ WHERE boolean_expression ]
You don’t need to use this command for the disk cache to work correctly (the data will be cached automatically when first accessed). But it can be helpful when you require consistent query performance.
For examples and more details, see
- Databricks Runtime 7.x and above: CACHE SELECT
Azure Databricks recommends that you choose cache-accelerated worker instance types for your clusters. Such instances are automatically configured optimally for the disk cache.
When a worker is decommissioned, the Spark cache stored on that worker is lost. So if autoscaling is enabled, there is some instability with the cache. Spark would then need to reread missing partitions from source as needed.
Configure disk usage
To configure how the disk cache uses the worker nodes’ local storage, specify the following Spark configuration settings during cluster creation:
spark.databricks.io.cache.maxDiskUsage: disk space per node reserved for cached data in bytes
spark.databricks.io.cache.maxMetaDataCache: disk space per node reserved for cached metadata in bytes
spark.databricks.io.cache.compression.enabled: should the cached data be stored in compressed format
To enable and disable the disk cache, run:
spark.conf.set("spark.databricks.io.cache.enabled", "[true | false]")
Disabling the cache does not result in dropping the data that is already in the local storage. Instead, it prevents queries from adding new data to the cache and reading data from the cache.