What are deletion vectors?
Deletion vectors are a storage optimization feature that can be enabled on Delta Lake tables. By default, when a single row in a data file is deleted, the entire Parquet file containing the record must be rewritten. With deletion vectors enabled for the table, UPDATE operations use deletion vectors to mark existing rows as removed or changed without rewriting the Parquet file. Subsequent reads on the table resolve current table state by applying the deletions noted by deletion vectors to the most recent table version.
Databricks recommends using Databricks Runtime 14.1 and above to write tables with deletion vectors to leverage all optimizations. You can read tables with deletion vectors enabled in Databricks Runtime 12.1 and above.
In Databricks Runtime 14.2 and above, tables with deletion vectors support row-level concurrency. See Write conflicts with row-level concurrency.
Photon leverages deletion vectors for predictive I/O updates, accelerating UPDATE operations. All clients that support reading deletion vectors can read updates that produced deletion vectors, regardless of whether these updates were produced by predictive I/O. See Use predictive I/O to accelerate updates.
A workspace admin setting controls whether deletion vectors are auto-enabled for new Delta tables. See Auto-enable deletion vectors.
You enable support for deletion vectors on a Delta Lake table by setting a Delta Lake table property. You can set this property when you create a table or by altering an existing table, as in the following examples:
CREATE TABLE <table-name> [options] TBLPROPERTIES ('delta.enableDeletionVectors' = true);
ALTER TABLE <table-name> SET TBLPROPERTIES ('delta.enableDeletionVectors' = true);
When you enable deletion vectors, the table protocol is upgraded. After upgrading, the table will not be readable by Delta Lake clients that do not support deletion vectors. See How does Azure Databricks manage Delta Lake feature compatibility?.
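One way to confirm that the property is set and to see the resulting protocol versions is to inspect the table metadata. The following is a minimal sketch; the table name `events` is a hypothetical placeholder, not a name from this article.

```sql
-- Hypothetical table name: events

-- Show the current value of the deletion vectors table property.
SHOW TBLPROPERTIES events ('delta.enableDeletionVectors');

-- Show table details, including the minimum reader and writer
-- protocol versions required by the table.
DESCRIBE DETAIL events;
```

Checking the protocol versions before enabling the feature gives you a baseline to compare against if other Delta clients also read the table.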
In Databricks Runtime 14.1 and above, you can drop the deletion vectors table feature to enable compatibility with other Delta clients. See Drop Delta table features.
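As a rough sketch of what dropping the feature can look like, the statement below uses `deletionVectors` as the feature name and a hypothetical table named `events`. Completing the protocol downgrade may require further steps depending on your runtime, so treat this as a starting point and follow Drop Delta table features for the full procedure.

```sql
-- Hypothetical table name: events

-- Remove the deletion vectors table feature from the table.
-- Additional steps may be needed before older clients can read the
-- table again; see "Drop Delta table features".
ALTER TABLE events DROP FEATURE deletionVectors;
```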
Apply changes to Parquet data files
Deletion vectors indicate changes to rows as soft-deletes that logically modify existing Parquet data files in the Delta Lake table. These changes are applied physically when data files are rewritten, as triggered by one of the following events:
- An OPTIMIZE command is run on the table.
- Auto-compaction triggers a rewrite of a data file with a deletion vector.
- REORG TABLE ... APPLY (PURGE) is run against the table.
Events related to file compaction do not have strict guarantees for resolving changes recorded in deletion vectors, and some changes recorded in deletion vectors might not be applied if target data files would not otherwise be candidates for file compaction.
REORG TABLE ... APPLY (PURGE) rewrites all data files containing records with modifications recorded using deletion vectors. See REORG TABLE.
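A purge can be run against the whole table or limited with a predicate. The following is a sketch only: `events` is a hypothetical table name and `event_date` is a hypothetical partition column, and the optional WHERE clause is assumed to filter on partition columns.

```sql
-- Hypothetical table name: events

-- Rewrite every data file that has changes recorded in deletion vectors.
REORG TABLE events APPLY (PURGE);

-- Limit the rewrite to files matching a predicate.
-- event_date is a hypothetical partition column.
REORG TABLE events WHERE event_date >= '2024-01-01' APPLY (PURGE);
```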
Modified data might still exist in the old files. You can run VACUUM to physically delete the old files.
REORG TABLE ... APPLY (PURGE) creates a new version of the table at the time it completes, which is the timestamp you must consider for the retention threshold for your VACUUM operation to fully remove deleted files. See Remove unused data files with vacuum.
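The timing described above can be checked from the table history before vacuuming. This is a minimal sketch that assumes a hypothetical table named `events` and the default retention threshold.

```sql
-- Hypothetical table name: events

-- List table versions with their timestamps. Look for the entry
-- written by the purge to see when the retention window starts.
DESCRIBE HISTORY events;

-- Preview the files that VACUUM would delete, without deleting them.
VACUUM events DRY RUN;

-- Delete files that are no longer referenced and are older than the
-- retention threshold.
VACUUM events;
```

Running the dry run first is a cheap way to confirm that the retention threshold has passed for the files rewritten by the purge.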
Azure Databricks leverages deletion vectors to power predictive I/O for updates on Photon-enabled compute. See Use predictive I/O to accelerate updates.
Support for leveraging deletion vectors for reads and writes varies by client.
The following table denotes required client versions for reading and writing Delta tables with deletion vectors enabled and specifies which write operations leverage deletion vectors:
| Client | Write deletion vectors | Read deletion vectors |
|---|---|---|
| Databricks Runtime with Photon | Supports | Requires Databricks Runtime 12.1 or above. |
| Databricks Runtime without Photon | Supports | Requires Databricks Runtime 12.1 or above. |
| OSS Apache Spark with OSS Delta Lake | Supports | Requires OSS Delta 2.3.0 or above. |
| Delta Sharing recipients | Delta Sharing recipients cannot write to shared data. | Supported on Azure Databricks using Databricks Runtime 14.1 and above. Not supported in other Delta Sharing clients. |
For support in other Delta clients, see the OSS Delta Lake integrations documentation.
Do not enable deletion vectors for streaming tables when using either Databricks SQL or Delta Live Tables.
You can enable deletion vectors for materialized views. To disable deletion vectors for a materialized view, you must drop the materialized view and recreate it.
In Databricks Runtime 12.1 and above, the following limitations exist:
- Delta Sharing is not supported on tables with deletion vectors enabled.
- You cannot generate a manifest file for a table with deletion vectors present. Run REORG TABLE ... APPLY (PURGE) and ensure no concurrent write operations are running in order to generate a manifest.
- You cannot incrementally generate manifest files for a table with deletion vectors enabled.
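The manifest workaround in the list above can be sketched as two statements run back to back. The table name `events` is hypothetical, and the sketch assumes nothing else writes to the table between the two commands.

```sql
-- Hypothetical table name: events

-- Apply all changes recorded in deletion vectors by rewriting the
-- affected data files.
REORG TABLE events APPLY (PURGE);

-- With no deletion vectors present, generate the manifest.
GENERATE symlink_format_manifest FOR TABLE events;
```

If a concurrent write adds new deletion vectors between the two statements, manifest generation might fail again, which is why the writes need to be paused first.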