Incrementally clone Parquet and Iceberg tables to Delta Lake
You can use Azure Databricks clone functionality to incrementally convert data from Parquet or Iceberg data sources to managed or external Delta tables.
Azure Databricks clone for Parquet and Iceberg combines functionality used to clone Delta tables and convert tables to Delta Lake. This article describes use cases and limitations for this feature and provides examples.
Important
This feature is in Public Preview.
Note
This feature requires Databricks Runtime 11.3 LTS or above.
When to use clone for incremental ingestion of Parquet or Iceberg data
Azure Databricks provides a number of options for ingesting data into the lakehouse. Databricks recommends using clone to ingest Parquet or Iceberg data in the following situations:
Note
The term source table refers to the table and data files to be cloned, while the target table refers to the Delta table created or updated by the operation.
- You are performing a migration from Parquet or Iceberg to Delta Lake, but need to continue using source tables.
- You need to maintain an ingest-only sync between a target table and a production source table that receives appends, updates, and deletes.
- You want to create an ACID-compliant snapshot of source data for reporting, machine learning, or batch ETL, as shown in the sketch after this list.
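For example, a minimal sketch of the snapshot use case, assuming an unpartitioned Parquet source at a hypothetical path and a hypothetical target name:

CREATE OR REPLACE TABLE sales_snapshot CLONE parquet.`/mnt/raw/sales`;

Rerunning the same statement later refreshes the snapshot with any changes from the source.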
What is the syntax for clone?
Clone for Parquet and Iceberg uses the same basic syntax as cloning Delta tables, with support for shallow and deep clones. For more information, see Clone types.
Databricks recommends using clone incrementally for most workloads. Clone support for Parquet and Iceberg uses SQL syntax.
Note
Clone for Parquet and Iceberg has different requirements and guarantees than either clone or convert to Delta. See Requirements and limitations for cloning Parquet and Iceberg tables.
To deep clone a Parquet or Iceberg table using a file path, use the following syntax:
CREATE OR REPLACE TABLE <target-table-name> CLONE parquet.`/path/to/data`;
CREATE OR REPLACE TABLE <target-table-name> CLONE iceberg.`/path/to/data`;
To shallow clone a Parquet or Iceberg table using a file path, use the following syntax:
CREATE OR REPLACE TABLE <target-table-name> SHALLOW CLONE parquet.`/path/to/data`;
CREATE OR REPLACE TABLE <target-table-name> SHALLOW CLONE iceberg.`/path/to/data`;
You can also create deep or shallow clones for Parquet tables registered to the metastore, as shown in the following examples:
CREATE OR REPLACE TABLE <target-table-name> CLONE <source-table-name>;
CREATE OR REPLACE TABLE <target-table-name> SHALLOW CLONE <source-table-name>;
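Because clone can create managed or external Delta tables, you can specify a storage path for the target with a LOCATION clause, as in standard clone syntax for Delta tables. A minimal sketch, using a hypothetical ABFSS path:

CREATE OR REPLACE TABLE <target-table-name> CLONE <source-table-name> LOCATION 'abfss://<container>@<storage-account>.dfs.core.windows.net/<path>';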
Requirements and limitations for cloning Parquet and Iceberg tables
Whether you use deep or shallow clones, changes applied to the target table after the clone occurs cannot be synced back to the source table. Incremental syncing with clone is unidirectional: changes to source tables are automatically applied to target Delta tables.
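For example, a minimal sketch of this one-way sync, assuming hypothetical table names:

-- The first run creates the Delta target; each later run incrementally
-- syncs appends, updates, and deletes from the source.
CREATE OR REPLACE TABLE sales_bronze CLONE sales_raw;
-- Writes made directly to sales_bronze are never propagated back to sales_raw.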
The following additional limitations apply when using clone with Parquet and Iceberg tables:
- You must register Parquet tables with partitions to a catalog such as Unity Catalog or the legacy Hive metastore before cloning, and you must use the table name to identify the source table. You cannot use path-based clone syntax for Parquet tables with partitions. See the sketch after this list.
- You cannot clone Iceberg tables that have experienced partition evolution.
- You cannot clone Iceberg merge-on-read tables that have experienced updates, deletions, or merges.
- The following are limitations for cloning Iceberg tables with partitions defined on truncated columns:
  - In Databricks Runtime 12.2 LTS and below, the only supported truncated column type is string.
  - In Databricks Runtime 13.3 LTS and above, you can work with truncated columns of types string, long, or int.
  - Azure Databricks does not support working with truncated columns of type decimal.
- Incremental clone syncs schema changes and properties from the source table. Any schema changes and data files written directly to the cloned table are overridden.
- Unity Catalog does not support shallow clones for Parquet or Iceberg tables.
- You cannot use glob patterns when defining a path.
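To illustrate the first limitation above, a minimal sketch that registers a partitioned Parquet source in the metastore before cloning, assuming a hypothetical schema, path, and table names:

-- Register the partitioned Parquet data as a metastore table.
CREATE TABLE sales_parquet (id BIGINT, amount DOUBLE, dt DATE)
USING PARQUET
PARTITIONED BY (dt)
LOCATION '/mnt/raw/sales_partitioned';

-- Discover the existing partition directories.
MSCK REPAIR TABLE sales_parquet;

-- Clone by table name; path-based syntax is not supported for
-- partitioned Parquet sources.
CREATE OR REPLACE TABLE sales_bronze CLONE sales_parquet;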
Note
In Databricks Runtime 11.3 LTS, this operation does not collect file-level statistics. As such, target tables do not benefit from Delta Lake data skipping. File-level statistics are collected in Databricks Runtime 12.2 LTS and above.