Dynamic file pruning
Dynamic file pruning, can significantly improve the performance of many queries on Delta Lake tables. Dynamic file pruning triggers for queries that contain filter statements or WHERE
clauses. You must use Photon-enabled compute to use dynamic file pruning in MERGE
, UPDATE
, and DELETE
statements. Only SELECT
statements leverage dynamic file pruning when Photon is not used.
Dynamic file pruning is especially efficient for non-partitioned tables, or for joins on non-partitioned columns. The performance impact of dynamic file pruning is often correlated to the clustering of data so consider using Z-Ordering to maximize the benefit.
For background and use cases for dynamic file pruning, see Faster SQL queries on Delta Lake with dynamic file pruning.
Configuration
Dynamic file pruning is controlled by the following Apache Spark configuration options:
spark.databricks.optimizer.dynamicFilePruning
(default istrue
): The main flag that directs the optimizer to push down filters. When set tofalse
, dynamic file pruning will not be in effect.spark.databricks.optimizer.deltaTableSizeThreshold
(default is10,000,000,000 bytes (10 GB)
): Represents the minimum size (in bytes) of the Delta table on the probe side of the join required to trigger dynamic file pruning. If the probe side is not very large, it is probably not worthwhile to push down the filters and we can just simply scan the whole table. You can find the size of a Delta table by running theDESCRIBE DETAIL table_name
command and then looking at thesizeInBytes
column.spark.databricks.optimizer.deltaTableFilesThreshold
(default is10
): Represents the number of files of the Delta table on the probe side of the join required to trigger dynamic file pruning. When the probe side table contains fewer files than the threshold value, dynamic file pruning is not triggered. If a table has only a few files, it is probably not worthwhile to enable dynamic file pruning. You can find the size of a Delta table by running theDESCRIBE DETAIL table_name
command and then looking at thenumFiles
column.