Optimize performance with caching on Azure Databricks
Azure Databricks uses disk caching to accelerate data reads by creating copies of remote Parquet data files in nodes’ local storage using a fast intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote location. Successive reads of the same data are then performed locally, which results in significantly improved reading speed. The cache works for all Parquet data files (including Delta Lake tables).
Note
In SQL warehouses and Databricks Runtime 14.2 and above, the CACHE SELECT
command is ignored. An enhanced disk caching algorithm is used instead.
Delta cache renamed to disk cache
Disk caching on Azure Databricks was formerly referred to as the Delta cache and the DBIO cache. Disk caching behavior is a proprietary Azure Databricks feature. This name change seeks to resolve confusion that it was part of the Delta Lake protocol.
Disk cache vs. Spark cache
The Azure Databricks disk cache differs from Apache Spark caching. Azure Databricks recommends using automatic disk caching.
The following table summarizes the key differences between disk and Apache Spark caching so that you can choose the best tool for your workflow:
Feature | disk cache | Apache Spark cache |
---|---|---|
Stored as | Local files on a worker node. | In-memory blocks, but it depends on storage level. |
Applied to | Any Parquet table stored on ABFS and other file systems. | Any DataFrame or RDD. |
Triggered | Automatically, on the first read (if cache is enabled). | Manually, requires code changes. |
Evaluated | Lazily. | Lazily. |
Availability | Can be enabled or disabled with configuration flags, enabled by default on certain node types. | Always available. |
Evicted | Automatically in LRU fashion or on any file change, manually when restarting a cluster. | Automatically in LRU fashion, manually with unpersist . |
Disk cache consistency
The disk cache automatically detects when data files are created, deleted, modified, or overwritten and updates its content accordingly. You can write, modify, and delete table data with no need to explicitly invalidate cached data. Any stale entries are automatically invalidated and evicted from the cache.
Selecting instance types to use disk caching
The recommended (and easiest) way to use disk caching is to choose a worker type with SSD volumes when you configure your cluster. Such workers are enabled and configured for disk caching.
The disk cache is configured to use at most half of the space available on the local SSDs provided with the worker nodes. For configuration options, see Configure the disk cache.
Configure the disk cache
Azure Databricks recommends that you choose cache-accelerated worker instance types for your compute. Such instances are automatically configured optimally for the disk cache.
Note
When a worker is decommissioned, the Spark cache stored on that worker is lost. So if autoscaling is enabled, there is some instability with the cache. Spark would then need to reread missing partitions from source as needed.
Configure disk usage
To configure how the disk cache uses the worker nodes’ local storage, specify the following Spark configuration settings during cluster creation:
spark.databricks.io.cache.maxDiskUsage
: disk space per node reserved for cached data in bytesspark.databricks.io.cache.maxMetaDataCache
: disk space per node reserved for cached metadata in bytesspark.databricks.io.cache.compression.enabled
: should the cached data be stored in compressed format
Example configuration:
spark.databricks.io.cache.maxDiskUsage 50g
spark.databricks.io.cache.maxMetaDataCache 1g
spark.databricks.io.cache.compression.enabled false
Enable or disable the disk cache
To enable and disable the disk cache, run:
spark.conf.set("spark.databricks.io.cache.enabled", "[true | false]")
Disabling the cache does not result in dropping the data that is already in the local storage. Instead, it prevents queries from adding new data to the cache and reading data from the cache.