Optimize performance with caching on Azure Databricks

Άρθρο
05/01/2024

Azure Databricks uses disk caching to accelerate data reads by creating copies of remote Parquet data files in nodes’ local storage using a fast intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote location. Successive reads of the same data are then performed locally, which results in significantly improved reading speed. The cache works for all Parquet data files (including Delta Lake tables).

Note

In SQL warehouses and Databricks Runtime 14.2 and above, the CACHE SELECT command is ignored. An enhanced disk caching algorithm is used instead.

Delta cache renamed to disk cache

Disk caching on Azure Databricks was formerly referred to as the Delta cache and the DBIO cache. Disk caching behavior is a proprietary Azure Databricks feature. This name change seeks to resolve confusion that it was part of the Delta Lake protocol.

Disk cache vs. Spark cache

The Azure Databricks disk cache differs from Apache Spark caching. Azure Databricks recommends using automatic disk caching.

The following table summarizes the key differences between disk and Apache Spark caching so that you can choose the best tool for your workflow:

Feature	disk cache	Apache Spark cache
Stored as	Local files on a worker node.	In-memory blocks, but it depends on storage level.
Applied to	Any Parquet table stored on ABFS and other file systems.	Any DataFrame or RDD.
Triggered	Automatically, on the first read (if cache is enabled).	Manually, requires code changes.
Evaluated	Lazily.	Lazily.
Availability	Can be enabled or disabled with configuration flags, enabled by default on certain node types.	Always available.
Evicted	Automatically in LRU fashion or on any file change, manually when restarting a cluster.	Automatically in LRU fashion, manually with `unpersist`.

Disk cache consistency

The disk cache automatically detects when data files are created, deleted, modified, or overwritten and updates its content accordingly. You can write, modify, and delete table data with no need to explicitly invalidate cached data. Any stale entries are automatically invalidated and evicted from the cache.

Selecting instance types to use disk caching

The recommended (and easiest) way to use disk caching is to choose a worker type with SSD volumes when you configure your cluster. Such workers are enabled and configured for disk caching.

The disk cache is configured to use at most half of the space available on the local SSDs provided with the worker nodes. For configuration options, see Configure the disk cache.

Configure the disk cache

Azure Databricks recommends that you choose cache-accelerated worker instance types for your compute. Such instances are automatically configured optimally for the disk cache.

Note

When a worker is decommissioned, the Spark cache stored on that worker is lost. So if autoscaling is enabled, there is some instability with the cache. Spark would then need to reread missing partitions from source as needed.

Configure disk usage

To configure how the disk cache uses the worker nodes’ local storage, specify the following Spark configuration settings during cluster creation:

spark.databricks.io.cache.maxDiskUsage: disk space per node reserved for cached data in bytes
spark.databricks.io.cache.maxMetaDataCache: disk space per node reserved for cached metadata in bytes
spark.databricks.io.cache.compression.enabled: should the cached data be stored in compressed format

Example configuration:

spark.databricks.io.cache.maxDiskUsage 50g
spark.databricks.io.cache.maxMetaDataCache 1g
spark.databricks.io.cache.compression.enabled false

Enable or disable the disk cache

To enable and disable the disk cache, run:

spark.conf.set("spark.databricks.io.cache.enabled", "[true | false]")

Disabling the cache does not result in dropping the data that is already in the local storage. Instead, it prevents queries from adding new data to the cache and reading data from the cache.

Κοινή χρήση μέσω