Deletion vectors in Databricks

Deletion vectors are a storage optimization feature that accelerates modifications to tables. By default, deleting a single row requires rewriting the entire Parquet file containing that record. Deletion vectors avoid this overhead. When deletion vectors are enabled, DELETE, UPDATE, and MERGE operations mark rows as modified without rewriting the Parquet file. Reads then resolve the current table state by applying the modifications recorded in deletion vectors.

Databricks recommends Databricks Runtime 14.3 LTS and above for writing to tables with deletion vectors, so that all write optimizations are available. To read tables with deletion vectors enabled, use Databricks Runtime 12.2 LTS and above.

In Databricks Runtime 14.2 and above, tables with deletion vectors support row-level concurrency. See Row-level concurrency.

Note

Photon uses deletion vectors for predictive I/O updates, accelerating DELETE, MERGE, and UPDATE operations. See Use predictive I/O to accelerate updates.

Enable deletion vectors

In the workspace settings, you can enable deletion vectors on new tables when you use a SQL warehouse or Databricks Runtime 14.3 LTS or above. Default settings vary by region; see Auto-enable deletion vectors.

Deletion vectors are not enabled by default for materialized views and streaming tables stored in Hive metastore.

To manually enable or remove support for deletion vectors on any table or view, including streaming tables and materialized views, use the delta.enableDeletionVectors table property. To enable deletion vectors when you create or alter a table:

-- For Delta tables
CREATE TABLE <table-name> [options] TBLPROPERTIES ('delta.enableDeletionVectors' = true);

ALTER TABLE <table-name> SET TBLPROPERTIES ('delta.enableDeletionVectors' = true);

You can't use an ALTER statement to enable or remove deletion vectors on a materialized view or streaming table.

For Iceberg tables, use iceberg.enableDeletionVectors instead of delta.enableDeletionVectors.
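
To stop writing new deletion vectors, or to confirm a table's current setting, you can set the same property to false or inspect it with SHOW TBLPROPERTIES. As a sketch, with my_table as a placeholder table name:

-- Stop producing deletion vectors for future writes
-- (existing deletion vectors still apply to reads)
ALTER TABLE my_table SET TBLPROPERTIES ('delta.enableDeletionVectors' = false);

-- Inspect the current setting
SHOW TBLPROPERTIES my_table ('delta.enableDeletionVectors');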

Warning

When you enable deletion vectors, Databricks upgrades the table protocol. After upgrading, clients without deletion vector support can't read the table. See Delta Lake feature compatibility and protocols.

In Databricks Runtime 14.1 and above, you can drop the deletion vectors table feature to enable compatibility with other clients. See Drop a Delta Lake table feature and downgrade table protocol.
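
As a sketch, dropping the table feature might look like the following, with my_table as a placeholder. The drop succeeds only once no traces of deletion vectors remain, as described in the linked article:

-- Remove the deletion vectors table feature to restore
-- compatibility with clients that don't support it
ALTER TABLE my_table DROP FEATURE deletionVectors;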

Apply changes to Parquet data files

Deletion vectors indicate changes to rows as soft-deletes that logically modify existing Parquet data files in the table. These changes are applied physically when one of the following events causes the data files to be rewritten:

  • An OPTIMIZE command is run on the table.
  • Auto-compaction triggers a rewrite of a data file with a deletion vector.
  • REORG TABLE ... APPLY (PURGE) is run against the table.

Events related to file compaction don't have strict guarantees for resolving changes recorded in deletion vectors. Some changes recorded in deletion vectors might not be physically applied if target data files are not candidates for file compaction. REORG TABLE ... APPLY (PURGE) rewrites all data files containing records with modifications recorded using deletion vectors. See REORG TABLE.
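
A minimal purge might look like the following, assuming a table named my_table; the WHERE clause is optional and the partition column date is a placeholder:

-- Rewrite all data files that carry deletion vectors
REORG TABLE my_table APPLY (PURGE);

-- Optionally restrict the rewrite to a subset of partitions
REORG TABLE my_table WHERE date >= '2024-01-01' APPLY (PURGE);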

Physically delete old data

Modified data might still exist in a table's old data files after a purge operation. You might want to physically remove the data, for example, to reduce storage costs with your cloud provider or to comply with GDPR requests.

Run VACUUM to physically delete the old files. The REORG TABLE ... APPLY (PURGE) operation creates a new version of the table when it completes. To fully remove deleted files from previous table versions, you must set the retention threshold for VACUUM to the purge operation's completion timestamp. See Purge metadata-only deletes to force data rewrite.
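
As a sketch of the full sequence, with my_table and the retention interval as placeholders; the retention must cover the time since the purge completed and respect your workspace's minimum retention checks:

-- Step 1: physically apply deletion vectors by rewriting files
REORG TABLE my_table APPLY (PURGE);

-- Step 2: after the retention threshold has passed, remove the
-- old, now-unreferenced data files
VACUUM my_table RETAIN 168 HOURS;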

Improve performance for large tables

To improve performance when you purge soft-deleted data on large tables, set spark.databricks.delta.reorg.purgeMode to rows. For example, set this configuration when you purge data manually with REORG TABLE ... APPLY (PURGE) or when you remove deletion vectors with ALTER TABLE DROP FEATURE deletionVectors.

By default, spark.databricks.delta.reorg.purgeMode is set to all. On large tables, this operation might be slow because purge operations must scan all Parquet file footers to check for both dropped column data and soft-deleted rows.

The rows value limits the operation to handle only files with soft-deleted rows. On large tables, this might improve performance if many files don't contain soft-deleted rows and the table has no dropped columns.
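
Setting the purge mode for a session before running the purge can be sketched as follows, with my_table as a placeholder:

-- Limit the purge scan to files that contain soft-deleted rows
SET spark.databricks.delta.reorg.purgeMode = rows;

REORG TABLE my_table APPLY (PURGE);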

Client compatibility

Azure Databricks uses deletion vectors to power predictive I/O for updates on Photon-enabled compute. See Use predictive I/O to accelerate updates.

Support for using deletion vectors for reads and writes varies by client.

The following table shows required client versions for reading and writing tables with deletion vectors enabled and specifies which write operations use deletion vectors:

  • Databricks Runtime with Photon
    Write deletion vectors: Supports MERGE, UPDATE, and DELETE using Databricks Runtime 12.2 LTS and above.
    Read deletion vectors: Requires Databricks Runtime 12.2 LTS or above.
  • Databricks Runtime without Photon
    Write deletion vectors: Supports DELETE using Databricks Runtime 12.2 LTS and above. Supports UPDATE using Databricks Runtime 14.1 and above. Supports MERGE using Databricks Runtime 14.3 LTS and above.
    Read deletion vectors: Requires Databricks Runtime 12.2 LTS or above.
  • OSS Apache Spark with OSS Delta Lake
    Write deletion vectors: Supports DELETE using OSS Delta 2.4.0 and above. Supports UPDATE using OSS Delta 3.0.0 and above.
    Read deletion vectors: Requires OSS Delta 2.3.0 or above.
  • Delta Sharing recipients
    Write deletion vectors: Writes are not supported on Delta Sharing tables.
    Read deletion vectors: Databricks: Requires Databricks Runtime 14.1 or above. Open source Apache Spark: Requires delta-sharing-spark 3.1 or above.

For support with other clients, see the OSS Delta Lake integrations documentation.

Limitations

  • UniForm Iceberg v2 doesn't support deletion vectors. Apache Iceberg v3 supports deletion vectors on tables with UniForm enabled. See Use Apache Iceberg v3 features.
  • You cannot use a GENERATE statement to generate a manifest file for a table that has files using deletion vectors. To generate a manifest, first run a REORG TABLE … APPLY (PURGE) statement and then run the GENERATE statement. You must ensure that no concurrent write operations are running when you submit the REORG statement.
  • You cannot incrementally generate manifest files for a table with deletion vectors enabled (for example, by setting the table property delta.compatibility.symlinkFormatManifest.enabled=true).
  • If you enable deletion vectors on a materialized view or streaming table and subsequently remove them, future writes no longer produce deletion vectors, but existing deletion vectors remain.
  • You cannot downgrade the table protocol after enabling deletion vectors on a materialized view or streaming table. After enabling, the table feature for deletion vectors cannot be removed, even if you subsequently disable deletion vectors on the view or table.
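
The manifest workaround described above can be sketched as follows, with my_table as a placeholder; ensure no concurrent writes run while the REORG statement executes:

-- First physically apply all changes recorded in deletion vectors
REORG TABLE my_table APPLY (PURGE);

-- Then generate the manifest for external readers
GENERATE symlink_format_manifest FOR TABLE my_table;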