Delta Lake has the following distinct options for selective overwrites:
| Option | Use case | Supported compute types | Minimum version |
|---|---|---|---|
| REPLACE WHERE | Atomically overwrite rows that match a predicate. Use for replacements with a fixed matching condition, such as colA = 5 or int_col IN (1, 2, 3). | All compute types. | SQL in Databricks Runtime 12.2 LTS and above. Python and Scala in Databricks Runtime 9.1 LTS and above. |
| REPLACE USING | Dynamic data overwrite. Replaces all rows in the target that match the specified columns, based on equality comparison of the column values in the provided data set. | All compute types. | SQL in Databricks Runtime 16.3 and above. Python and Scala in Databricks Runtime 18.2 and above. |
| REPLACE ON | Dynamic data overwrite by boolean expression. Use for replacements with a complex or NULL-safe matching condition, such as s.colA <=> t.colA AND s.colB <=> t.colB. | All compute types. | SQL in Databricks Runtime 17.1 and above. Python and Scala in Databricks Runtime 18.2 and above. |
| partitionOverwriteMode | Legacy dynamic partition overwrite, which overwrites all existing data in each partition for which the write commits new data. Not recommended for new workloads. | SQL supports classic compute only. Python and Scala support all compute types. | SQL, Python, and Scala in Databricks Runtime 11.3 LTS and above. |
For most use cases, Databricks recommends using REPLACE USING or REPLACE WHERE. Use REPLACE ON only if your use case requires complex or NULL-safe matching conditions.
For detail on each option's replacement behavior, see INSERT.
In Scala and Python, replaceOn and replaceUsing can't be used in combination with replaceWhere, partitionOverwriteMode, or overwriteSchema.
For empty source queries, neither REPLACE USING nor REPLACE ON deletes data; REPLACE WHERE, however, might delete data.
Important
If data has been accidentally overwritten, you can use restore to undo the change.
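As an illustrative sketch, a restore can be issued as a Delta Lake RESTORE SQL statement; the table name and version number below are placeholders, and you would look up the version to restore with DESCRIBE HISTORY:

```python
# Sketch: build a Delta Lake RESTORE statement to undo an accidental
# overwrite. "events" and version 5 are illustrative placeholders.
def restore_statement(table: str, version: int) -> str:
    return f"RESTORE TABLE {table} TO VERSION AS OF {version}"

stmt = restore_statement("events", 5)
# On an active Spark session: spark.sql(stmt)
```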
REPLACE WHERE
With REPLACE WHERE, you can selectively overwrite only the data that matches an arbitrary expression.
To atomically replace events in January in the target table, which is partitioned by start_date, with the data in replace_data:
Python
(replace_data.write
.mode("overwrite")
.option("replaceWhere", "start_date >= '2017-01-01' AND end_date <= '2017-01-31'")
.saveAsTable("events")
)
Scala
replace_data.write
.mode("overwrite")
.option("replaceWhere", "start_date >= '2017-01-01' AND end_date <= '2017-01-31'")
.saveAsTable("events")
SQL
INSERT INTO TABLE events REPLACE WHERE start_date >= '2017-01-01' AND end_date <= '2017-01-31' SELECT * FROM replace_data
This sample code writes out the data in replace_data, validates that all rows match the predicate, and performs an atomic replacement using overwrite semantics. If any values in the operation fall outside the predicate, this operation fails with an error by default.
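As a small illustrative helper (not part of the Delta Lake API), the predicate string passed to replaceWhere can be built programmatically for any month window:

```python
import calendar
from datetime import date

# Hypothetical helper: build a replaceWhere predicate covering one
# calendar month. Column names are caller-supplied.
def month_predicate(start_col: str, end_col: str, year: int, month: int) -> str:
    first = date(year, month, 1)
    last = date(year, month, calendar.monthrange(year, month)[1])
    return f"{start_col} >= '{first}' AND {end_col} <= '{last}'"

pred = month_predicate("start_date", "end_date", 2017, 1)
# pred == "start_date >= '2017-01-01' AND end_date <= '2017-01-31'"
# Use it with the writer:
# replace_data.write.mode("overwrite").option("replaceWhere", pred).saveAsTable("events")
```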
On classic compute, to change this behavior to overwrite values within the predicate range and insert records outside the specified range, remove the constraint check by setting spark.databricks.delta.replaceWhere.constraintCheck.enabled to false:
Python
spark.conf.set("spark.databricks.delta.replaceWhere.constraintCheck.enabled", False)
Scala
spark.conf.set("spark.databricks.delta.replaceWhere.constraintCheck.enabled", false)
SQL
SET spark.databricks.delta.replaceWhere.constraintCheck.enabled=false
Note
REPLACE WHERE accepts a boolean_expression with some restrictions. See INSERT in the SQL language reference.
For empty source queries, REPLACE WHERE might delete table rows.
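A plain-Python model of this behavior (an illustrative assumption about the semantics, not Spark code) shows why: REPLACE WHERE removes every row matching the predicate before inserting the source rows, so an empty source still deletes:

```python
# Pure-Python model of REPLACE WHERE semantics (illustrative only):
# rows matching the predicate are removed, then source rows are inserted.
def replace_where(target, source, predicate):
    kept = [row for row in target if not predicate(row)]
    return kept + list(source)

events = [{"start_date": "2017-01-15"}, {"start_date": "2017-02-10"}]
january = lambda row: "2017-01-01" <= row["start_date"] <= "2017-01-31"

# An empty source still removes the January row:
result = replace_where(events, [], january)
# result == [{"start_date": "2017-02-10"}]
```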
Legacy behavior
Legacy replaceWhere is only available on classic compute. See Classic compute overview.
If you use the legacy behavior of replaceWhere, queries overwrite data that matches a predicate only over partition columns. The following command would atomically replace the month of January in the target table, which is partitioned by date, with the data in df:
Python
(df.write
.mode("overwrite")
.option("replaceWhere", "birthDate >= '2017-01-01' AND birthDate <= '2017-01-31'")
.saveAsTable("people10m")
)
Scala
df.write
.mode("overwrite")
.option("replaceWhere", "birthDate >= '2017-01-01' AND birthDate <= '2017-01-31'")
.saveAsTable("people10m")
To use legacy behavior, set spark.databricks.delta.replaceWhere.dataColumns.enabled to false:
Python
spark.conf.set("spark.databricks.delta.replaceWhere.dataColumns.enabled", False)
Scala
spark.conf.set("spark.databricks.delta.replaceWhere.dataColumns.enabled", false)
SQL
SET spark.databricks.delta.replaceWhere.dataColumns.enabled=false
Dynamic data overwrites
Dynamic data overwrites selectively replace data that matches the specified key columns or boolean expression, leaving all other data unchanged. Partitioned tables, unpartitioned tables, and tables with liquid clustering are all supported.
Dynamic partition overwrites are a subset of dynamic data overwrite behavior. A dynamic partition overwrite replaces all the existing data in each partition for which the write commits new data and leaves all other partitions unchanged. Only partitioned tables are supported.
REPLACE USING
SQL supported in Databricks Runtime 16.3 and above. Python and Scala supported in Databricks Runtime 18.2 and above. For behavior differences in Databricks Runtime 16.3 to 17.1, see Legacy behavior.
REPLACE USING enables compute-independent, atomic overwrite behavior that works on Databricks SQL warehouses, serverless compute, and classic compute. REPLACE USING doesn't require that you set a Spark session configuration.
REPLACE USING replaces target rows whose values in the specified columns are equal to the values in a source row. All other data remains unchanged.
Use dynamic data overwrite with REPLACE USING:
Python
(sourceDataDF.write
.mode("overwrite")
.option("replaceUsing", "event_id, start_date")
.saveAsTable("events")
)
Scala
sourceDataDF.write
.mode("overwrite")
.option("replaceUsing", "event_id, start_date")
.saveAsTable("events")
SQL
INSERT INTO TABLE events
REPLACE USING (event_id, start_date)
SELECT * FROM source_data
For empty source queries, REPLACE USING does not delete any table rows.
For complex or NULL-safe matching conditions, use REPLACE ON instead. See REPLACE ON.
See INSERT in the SQL language reference.
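To make the matching rule concrete, here is a pure-Python model of REPLACE USING semantics (an illustrative sketch, not Spark code): target rows whose key columns match any source row are replaced, and an empty source neither replaces nor deletes anything.

```python
# Pure-Python model of REPLACE USING semantics (illustrative only).
def replace_using(target, source, keys):
    source_keys = {tuple(row[k] for k in keys) for row in source}
    kept = [r for r in target if tuple(r[k] for k in keys) not in source_keys]
    return kept + list(source)

target = [
    {"event_id": 1, "start_date": "2017-01-01", "value": "old"},
    {"event_id": 2, "start_date": "2017-01-02", "value": "old"},
]
source = [{"event_id": 1, "start_date": "2017-01-01", "value": "new"}]

result = replace_using(target, source, ["event_id", "start_date"])
# The row keyed (1, 2017-01-01) is replaced; (2, 2017-01-02) is untouched.
untouched = replace_using(target, [], ["event_id", "start_date"])
# An empty source query deletes no rows: untouched == target.
```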
Legacy behavior
In Databricks Runtime 16.3 to 17.1, REPLACE USING uses legacy behavior and only allows dynamic partition overwrites; Databricks Runtime 17.2 and above allows dynamic data overwrites.
Keep the following constraints and behaviors in mind for REPLACE USING legacy behavior:
- You must specify the full set of the table's partition columns in the USING clause.
- Always validate that the data written touches only the expected partitions. A single row in the wrong partition can unintentionally overwrite the entire partition.
REPLACE ON
SQL supported in Databricks Runtime 17.1 and above. Python and Scala supported in Databricks Runtime 18.2 and above.
REPLACE ON replaces rows when they match a user-defined boolean condition, unlike REPLACE USING, which matches rows by column equality. Use REPLACE ON when you need matching logic that REPLACE USING doesn't support, such as treating NULL values as equal.
Optionally, use the targetAlias option to specify an alias for the target table and the .as() or .alias() APIs to specify an alias for the source data.
For SQL syntax, see INSERT.
Python
(sourceDataDF.alias("s")
.write
.mode("overwrite")
.option("targetAlias", "t")
.option("replaceOn", "s.event_id <=> t.event_id AND s.start_date <=> t.start_date")
.saveAsTable("events")
)
Scala
sourceDataDF.as("s")
.write
.mode("overwrite")
.option("targetAlias", "t")
.option("replaceOn", "s.event_id <=> t.event_id AND s.start_date <=> t.start_date")
.saveAsTable("events")
SQL
INSERT INTO TABLE events AS t
REPLACE ON (s.event_id <=> t.event_id AND s.start_date <=> t.start_date)
(SELECT * FROM source_data) AS s
For empty source queries, REPLACE ON does not delete any table rows.
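The NULL-safe operator is the key difference from REPLACE USING. A small illustrative model of SQL's null-safe equality (<=>), as used in the REPLACE ON condition above:

```python
# Illustrative model of SQL null-safe equality (<=>): unlike ordinary SQL
# equality, NULL <=> NULL evaluates to true, so rows with NULL key values
# can still match and be replaced.
def null_safe_eq(a, b):
    if a is None or b is None:
        return a is None and b is None
    return a == b

assert null_safe_eq(None, None) is True   # ordinary equality returns NULL here
assert null_safe_eq(None, 1) is False
assert null_safe_eq(1, 1) is True
```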
Dynamic partition overwrites with partitionOverwriteMode (legacy)
Important
This feature is in Public Preview.
Databricks Runtime 11.3 LTS and above supports dynamic partition overwrites for partitioned tables using overwrite mode: either INSERT OVERWRITE in SQL, or a DataFrame write with df.write.mode("overwrite"). This type of overwrite is only available for classic compute, not Databricks SQL warehouses or serverless compute.
Warning
When possible, use INSERT REPLACE USING instead of partition overwrites with INSERT OVERWRITE PARTITION and spark.sql.sources.partitionOverwriteMode=dynamic. Partition overwrites may use stale data when the partitioning changes.
To use dynamic partition overwrite mode, set the Spark session configuration spark.sql.sources.partitionOverwriteMode to dynamic. Alternatively, you can set the DataFrameWriter option partitionOverwriteMode to dynamic. If present, the query-specific option overrides the mode defined in the session configuration. The default value for spark.sql.sources.partitionOverwriteMode is static.
The following example uses partitionOverwriteMode:
SQL
SET spark.sql.sources.partitionOverwriteMode=dynamic;
INSERT OVERWRITE TABLE default.people10m SELECT * FROM morePeople;
Python
(df.write
.mode("overwrite")
.option("partitionOverwriteMode", "dynamic")
.saveAsTable("default.people10m")
)
Scala
df.write
.mode("overwrite")
.option("partitionOverwriteMode", "dynamic")
.saveAsTable("default.people10m")
Keep the following constraints and behaviors in mind for partitionOverwriteMode:
- You can't set overwriteSchema to true.
- You can't specify both partitionOverwriteMode and replaceWhere in the same DataFrameWriter operation.
- If you specify a replaceWhere condition using a DataFrameWriter option, Delta Lake applies that condition to control which data is overwritten. This option takes precedence over the partitionOverwriteMode session-level configuration.
- Always validate that the data written touches only the expected partitions. A single row in the wrong partition can unintentionally overwrite the entire partition.
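The last constraint is worth modeling. In this pure-Python sketch of dynamic partition overwrite semantics (illustrative, not Spark code), any partition that receives even one new row is fully replaced:

```python
# Pure-Python model of dynamic partition overwrite (illustrative only):
# every partition receiving at least one new row is fully replaced;
# partitions with no new rows are left unchanged.
def dynamic_partition_overwrite(target, new_rows, part_col):
    touched = {row[part_col] for row in new_rows}
    kept = [row for row in target if row[part_col] not in touched]
    return kept + list(new_rows)

target = [
    {"date": "2017-01-01", "name": "a"},
    {"date": "2017-01-01", "name": "b"},
    {"date": "2017-02-01", "name": "c"},
]
# A single stray row in the 2017-01-01 partition wipes both existing rows:
result = dynamic_partition_overwrite(
    target, [{"date": "2017-01-01", "name": "x"}], "date"
)
# Only "c" (untouched partition) and the new row "x" remain.
```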