Optimize performance with caching on Azure Databricks

Azure Databricks uses disk caching to accelerate data reads by creating copies of remote Parquet data files in nodes’ local storage using a fast intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote location. Successive reads of the same data are then performed locally, which results in significantly improved reading speed. The cache works for all Parquet data files (including Delta Lake tables).

Delta cache renamed to disk cache

Disk caching on Azure Databricks was formerly referred to as the Delta cache and the DBIO cache. Disk caching is a proprietary Azure Databricks feature; the rename resolves the misconception that the cache is part of the Delta Lake protocol. Behavior is unchanged under the new name, and is described below.

Automatic and manual caching

The Azure Databricks disk cache differs from Apache Spark caching. Azure Databricks recommends using automatic disk caching for most operations.

When the disk cache is enabled, data that has to be fetched from a remote source is automatically added to the cache. This process is fully transparent and does not require any action. However, to preload data into the cache beforehand, you can use the CACHE SELECT command (see Cache a subset of the data). When you use the Spark cache, you must manually specify the tables and queries to cache.

The disk cache contains local copies of remote data. It can improve the performance of a wide range of queries, but cannot be used to store the results of arbitrary subqueries. The Spark cache can store the result of any subquery, as well as data stored in formats other than Parquet (such as CSV, JSON, and ORC).

The data stored in the disk cache can be read and operated on faster than the data in the Spark cache. This is because the disk cache uses efficient decompression algorithms and outputs data in the optimal format for further processing using whole-stage code generation.

Unlike the Spark cache, disk caching does not use system memory. Due to the high read speeds of modern SSDs, the disk cache can be fully disk-resident without a negative impact on its performance.

Summary

The following table summarizes the key differences between disk and Apache Spark caching so that you can choose the best tool for your workflow:

| Feature | Disk cache | Apache Spark cache |
| --- | --- | --- |
| Stored as | Local files on a worker node. | In-memory blocks (depends on the storage level). |
| Applied to | Any Parquet table stored on ABFS and other file systems. | Any DataFrame or RDD. |
| Triggered | Automatically, on the first read (if the cache is enabled). | Manually, requires code changes. |
| Evaluated | Lazily. | Lazily. |
| Force cache | CACHE SELECT command. | .cache + any action to materialize the cache, and .persist. |
| Availability | Can be enabled or disabled with configuration flags; enabled by default on certain node types. | Always available. |
| Evicted | Automatically in LRU fashion, or on any file change; manually when restarting a cluster. | Automatically in LRU fashion; manually with unpersist. |

Disk cache consistency

The disk cache automatically detects when data files are created, deleted, modified, or overwritten and updates its content accordingly. You can write, modify, and delete table data with no need to explicitly invalidate cached data. Any stale entries are automatically invalidated and evicted from the cache.

Selecting instance types to use disk caching

The recommended (and easiest) way to use disk caching is to choose a worker type with SSD volumes when you configure your cluster. Such workers are enabled and configured for disk caching.

The disk cache is configured to use at most half of the space available on the local SSDs provided with the worker nodes. For configuration options, see Configure the disk cache.

Cache a subset of the data

To explicitly select a subset of data to be cached, use the following syntax:

CACHE SELECT column_name[, column_name, ...] FROM [db_name.]table_name [ WHERE boolean_expression ]

You don’t need to use this command for the disk cache to work correctly (the data will be cached automatically when first accessed). But it can be helpful when you require consistent query performance.
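For instance, you could warm the cache with just the columns a recurring report scans. This is a sketch that runs only on Azure Databricks; the table, column names, and filter are hypothetical:

```sql
-- Preload two columns of a (hypothetical) events table, restricted to recent rows
CACHE SELECT user_id, event_time FROM sales.events WHERE event_time >= '2024-01-01'
```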

For examples and more details, see the CACHE SELECT command reference.

Configure the disk cache

Azure Databricks recommends that you choose cache-accelerated worker instance types for your clusters. Such instances are automatically configured optimally for the disk cache.

Note

When a worker is decommissioned, the Spark cache stored on that worker is lost. If autoscaling is enabled, the cache is therefore somewhat unstable: Spark rereads the missing partitions from the source as needed.

Configure disk usage

To configure how the disk cache uses the worker nodes’ local storage, specify the following Spark configuration settings during cluster creation:

  • spark.databricks.io.cache.maxDiskUsage: disk space per node reserved for cached data, in bytes
  • spark.databricks.io.cache.maxMetaDataCache: disk space per node reserved for cached metadata, in bytes
  • spark.databricks.io.cache.compression.enabled: whether the cached data is stored in compressed format

Example configuration:

spark.databricks.io.cache.maxDiskUsage 50g
spark.databricks.io.cache.maxMetaDataCache 1g
spark.databricks.io.cache.compression.enabled false

Enable or disable the disk cache

To enable or disable the disk cache, run:

spark.conf.set("spark.databricks.io.cache.enabled", "[true | false]")

Disabling the cache does not result in dropping the data that is already in the local storage. Instead, it prevents queries from adding new data to the cache and reading data from the cache.