Selectively overwrite data with Delta Lake

Azure Databricks leverages Delta Lake functionality to support two distinct options for selective overwrites:

The replaceWhere option atomically replaces all records that match a given predicate.
You can replace directories of data based on how tables are partitioned using dynamic partition overwrites.

For most operations, Databricks recommends using replaceWhere to specify which data to overwrite.

Important

If data has been accidentally overwritten, you can use restore to undo the change.

Arbitrary selective overwrite with `replaceWhere`

You can selectively overwrite only the data that matches an arbitrary expression.

Note

SQL requires Databricks Runtime 12.2 LTS or above.

The following command atomically replaces events in January in the target table, which is partitioned by start_date, with the data in replace_data:

Python

(replace_data.write
  .mode("overwrite")
  .option("replaceWhere", "start_date >= '2017-01-01' AND end_date <= '2017-01-31'")
  .table("events")
)

Scala

replace_data.write
  .mode("overwrite")
  .option("replaceWhere", "start_date >= '2017-01-01' AND end_date <= '2017-01-31'")
  .table("events")

SQL

INSERT INTO TABLE events REPLACE WHERE start_date >= '2017-01-01' AND end_date <= '2017-01-31' SELECT * FROM replace_data

This sample code writes out the data in replace_data, validates that all rows match the predicate, and performs an atomic replacement using overwrite semantics. If any values in the operation fall outside the constraint, this operation fails with an error by default.

You can change this behavior to overwrite values within the predicate range and insert records that fall outside the specified range. To do so, disable the constraint check by setting spark.databricks.delta.replaceWhere.constraintCheck.enabled to false using one of the following settings:

Python

spark.conf.set("spark.databricks.delta.replaceWhere.constraintCheck.enabled", False)

Scala

spark.conf.set("spark.databricks.delta.replaceWhere.constraintCheck.enabled", false)

SQL

SET spark.databricks.delta.replaceWhere.constraintCheck.enabled=false

Legacy behavior

Legacy default behavior had replaceWhere overwrite data matching a predicate over partition columns only. With this legacy model, the following command would atomically replace the month January in the target table, which is partitioned by date, with the data in df:

Python

(df.write
  .mode("overwrite")
  .option("replaceWhere", "birthDate >= '2017-01-01' AND birthDate <= '2017-01-31'")
  .table("people10m")
)

Scala

df.write
  .mode("overwrite")
  .option("replaceWhere", "birthDate >= '2017-01-01' AND birthDate <= '2017-01-31'")
  .table("people10m")

If you want to fall back to the old behavior, you can disable the spark.databricks.delta.replaceWhere.dataColumns.enabled flag:

Python

spark.conf.set("spark.databricks.delta.replaceWhere.dataColumns.enabled", False)

Scala

spark.conf.set("spark.databricks.delta.replaceWhere.dataColumns.enabled", false)

SQL

SET spark.databricks.delta.replaceWhere.dataColumns.enabled=false

Dynamic partition overwrites

Dynamic partition overwrite updates only the partitions for which the write commits new data. It overwrites all existing data in those partitions and leaves others unchanged.

Azure Databricks supports two approaches:

REPLACE USING (recommended) - Works across all compute types, including Databricks SQL warehouses, serverless compute, and classic compute. Doesn't require setting a Spark session configuration.
partitionOverwriteMode (legacy) - Requires classic compute and setting a Spark session configuration. Not supported on Databricks SQL or serverless compute.

The sections below demonstrate how to use each approach.

Dynamic partition overwrites with `REPLACE USING`

Databricks Runtime 16.3 and above supports dynamic partition overwrites for partitioned tables using REPLACE USING. This method lets you selectively overwrite data across all compute types, without needing to set a Spark session configuration. REPLACE USING enables compute-independent, atomic overwrite behavior that works on Databricks SQL warehouses, serverless compute, and classic compute.

REPLACE USING only overwrites the partitions targeted by the incoming data. All other partitions remain unchanged.

The following example demonstrates how to use dynamic partition overwrite with REPLACE USING. Currently, you can only use SQL, not Python or Scala. For details, see INSERT in the SQL language reference.

INSERT INTO TABLE events
  REPLACE USING (event_id, start_date)
  SELECT * FROM source_data

Keep the following constraints and behaviors in mind for dynamic partition overwrites:

You must specify the full set of the table's partition columns in the USING clause.
Always validate that the data written touches only the expected partitions. A single row in the wrong partition can unintentionally overwrite the entire partition.

If you need more customizable matching logic than what REPLACE USING supports, such as treating NULL values as equal, use the complementary REPLACE ON instead. See INSERT for details.

Dynamic partition overwrites with `partitionOverwriteMode` (legacy)

Important

This feature is in Public Preview.

Databricks Runtime 11.3 LTS and above supports dynamic partition overwrites for partitioned tables using overwrite mode: either INSERT OVERWRITE in SQL, or a DataFrame write with df.write.mode("overwrite"). This type of overwrite is only available for classic compute, not Databricks SQL warehouses or serverless compute.

Configure dynamic partition overwrite mode by setting the Spark session configuration spark.sql.sources.partitionOverwriteMode to dynamic. Alternatively, you can set the DataFrameWriter option partitionOverwriteMode to dynamic. If present, the query-specific option overrides the mode defined in the session configuration. The default for spark.sql.sources.partitionOverwriteMode is static.

The following example demonstrates using partitionOverwriteMode:

SQL

SET spark.sql.sources.partitionOverwriteMode=dynamic;
INSERT OVERWRITE TABLE default.people10m SELECT * FROM morePeople;

Python

(df.write
  .mode("overwrite")
  .option("partitionOverwriteMode", "dynamic")
  .saveAsTable("default.people10m")
)

Scala

df.write
  .mode("overwrite")
  .option("partitionOverwriteMode", "dynamic")
  .saveAsTable("default.people10m")

Keep the following constraints and behaviors in mind for partitionOverwriteMode:

You can't set overwriteSchema to true.
You can't specify both partitionOverwriteMode and replaceWhere in the same DataFrameWriter operation.
If you specify a replaceWhere condition using a DataFrameWriter option, Delta Lake applies that condition to control which data is overwritten. This option takes precedence over the partitionOverwriteMode session-level configuration.
Always validate that the data written touches only the expected partitions. A single row in the wrong partition can unintentionally overwrite the entire partition.

Ανάδραση

Ήταν χρήσιμη αυτή η σελίδα;

Last updated on 2025-09-03

Κοινοποίηση μέσω

Selectively overwrite data with Delta Lake

Arbitrary selective overwrite with replaceWhere

Python

Scala

SQL

Python

Scala

SQL

Legacy behavior

Python

Scala

Python

Scala

SQL

Dynamic partition overwrites

Dynamic partition overwrites with REPLACE USING

Dynamic partition overwrites with partitionOverwriteMode (legacy)

SQL

Python

Scala

Ανάδραση

Πρόσθετοι πόροι

Arbitrary selective overwrite with `replaceWhere`

Dynamic partition overwrites with `REPLACE USING`

Dynamic partition overwrites with `partitionOverwriteMode` (legacy)