Recommended workloads for Delta Live Tables

This article describes the recommended workloads for Delta Live Tables on Azure Databricks.

Data ingestion

Delta Live Tables can ingest data from append-only sources and sources that contain changes, such as change data capture (CDC) feeds. Streaming tables support functionality for both types of source data.

Ingest data from append-only data sources

Streaming tables are recommended for ingesting append-only data. Append-only means that only new data is added to the source data, and existing data is never updated or deleted. Examples of append-only data include:

  • A Delta table with the table property delta.appendOnly = true.
  • A cloud storage location that receives new files periodically.
  • A Kafka topic with events.

To learn more about ingesting data with Delta Live Tables and streaming tables, including examples, see Load data with Delta Live Tables.

Ingest data from a single append-only source

A streaming table can ingest data from any append-only data source.

Diagram: Delta Live Tables workload with a single append-only source.
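As a minimal sketch, a streaming table over a single append-only source might be defined as follows in Python. The storage path and file format are illustrative assumptions, not values from this article:

```python
import dlt

# Streaming table over one append-only source: new files arriving in
# cloud storage, ingested incrementally with Auto Loader.
@dlt.table
def raw_events():
    return (
        spark.readStream
        .format("cloudFiles")                 # Auto Loader
        .option("cloudFiles.format", "json")  # hypothetical file format
        .load("/Volumes/my_catalog/my_schema/landing/")  # hypothetical path
    )
```

Because the source only ever receives new files, each pipeline update processes only the data that arrived since the last update.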

Ingest data from multiple append-only sources

You can also ingest data from multiple append-only data sources into a streaming table. For example, you can write events from multiple Kafka topics into a single streaming table. To do this, define the query for the streaming table to read from one source, and for the other sources, use append flows.

Diagram: Delta Live Tables workload with multiple append-only sources.
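A sketch of this pattern, assuming two hypothetical Kafka topics (`topic_a`, `topic_b`) and broker address: the main query defines the streaming table from the first topic, and an append flow adds the second.

```python
import dlt

# The streaming table's own query reads the first append-only source.
@dlt.table
def all_events():
    return (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
        .option("subscribe", "topic_a")
        .load()
    )

# Each additional append-only source is written with an append flow
# that targets the same streaming table.
@dlt.append_flow(target="all_events")
def topic_b_events():
    return (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "topic_b")
        .load()
    )
```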

Ingest historical data from an append-only source

You can use a backfill when you have an existing dataset that ingests data from an append-only source and want to append historical data to the dataset exactly once. To do this, load the historical data using an append flow query. See backfill.

Diagram: Delta Live Tables workload with an append-only source and a one-time backfill.
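A sketch of a backfill, assuming a hypothetical live source table and a static archive table holding the historical data:

```python
import dlt

# Ongoing ingestion from the live append-only source (hypothetical name).
@dlt.table
def events():
    return spark.readStream.table("my_catalog.my_schema.live_feed")

# One-time backfill: a streaming read of a static archive table appends
# its rows exactly once, because the archive receives no further changes.
@dlt.append_flow(target="events")
def backfill_2023():
    return spark.readStream.table("my_catalog.my_schema.events_archive_2023")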

Process change data feeds and database snapshots

Databricks recommends using Delta Live Tables to process change data feeds (CDF) that contain sequences of potentially out-of-order changes from one or more tables. Change data feeds are produced by Delta tables, in addition to systems such as Debezium, Qlik, and Amazon DMS. You can use either Python or SQL to process a change data feed with Delta Live Tables.

Databricks also recommends using Delta Live Tables when, instead of a change data feed, you need to process database snapshots, such as snapshots generated from an Oracle database, a MySQL database, or a data warehouse. Processing database snapshots is supported by the Delta Live Tables Python interface.

To process a CDF, use the APPLY CHANGES API. See How is CDC implemented with the APPLY CHANGES API?.

Diagram: Delta Live Tables workload processing an ongoing change feed.
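A sketch of the APPLY CHANGES API in Python, assuming a hypothetical CDC source with `customer_id` keys, a `sequence_num` ordering column, and an `operation` column marking deletes:

```python
import dlt
from pyspark.sql.functions import col, expr

# Target table that the change feed is applied into.
dlt.create_streaming_table("customers")

# Apply a potentially out-of-order change feed; sequence_by resolves ordering.
dlt.apply_changes(
    target="customers",
    source="customers_cdc_feed",           # hypothetical source name
    keys=["customer_id"],
    sequence_by=col("sequence_num"),
    apply_as_deletes=expr("operation = 'DELETE'"),
    stored_as_scd_type=1,                  # keep only the latest state per key
)
```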

To process database snapshots, use the APPLY CHANGES FROM SNAPSHOT API. See How is CDC implemented with the APPLY CHANGES FROM SNAPSHOT API?.

Diagram: Delta Live Tables workload processing a sequence of complete snapshots.
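A sketch of the Python snapshot API, assuming a hypothetical snapshot table that is periodically overwritten with a full copy of the source:

```python
import dlt

dlt.create_streaming_table("customers")

# Delta Live Tables derives inserts, updates, and deletes by comparing
# each new snapshot with the previous one.
dlt.apply_changes_from_snapshot(
    target="customers",
    source="customers_snapshot",   # hypothetical, refreshed per snapshot
    keys=["customer_id"],
    stored_as_scd_type=2,          # keep history of changes
)
```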

Transform data

Delta Live Tables offers two solutions for transforming data. Materialized views are a good default choice because they always provide the correct result and automatically reprocess source data when needed. Streaming tables are recommended for low-complexity transformations over very large streams and for advanced use cases.

Transform data with materialized views

Materialized views are the recommended default for transformations in Delta Live Tables. They are simple and accurate. However, their downside is higher latency because materialized views might process all input data to ensure queries against the materialized view return the correct result.

Transform a single table with a materialized view

A materialized view can read from a Delta table or streaming table and perform arbitrary transformations on the input data. Materialized views can read all Delta tables, including those produced by systems other than Databricks, making them useful for migrations and hybrid pipelines.

Diagram: Delta Live Tables materialized view for a single table.
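As a sketch, a materialized view aggregating a single Delta table might look like this in Python. The table name and columns are hypothetical; in a Delta Live Tables pipeline, `@dlt.table` with a batch read defines a materialized view:

```python
import dlt
from pyspark.sql.functions import sum as sum_

# Materialized view: a batch read plus arbitrary transformations.
# The input can be any Delta table, including tables written by
# systems other than Databricks.
@dlt.table
def daily_revenue():
    return (
        spark.read.table("my_catalog.my_schema.orders")  # hypothetical table
        .groupBy("order_date")
        .agg(sum_("amount").alias("revenue"))
    )
```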

Join a fact table with a dimension table (stream-snapshot join) with a materialized view

Materialized views can perform efficient, incremental joins between a base Delta table or streaming table and a “lookup” Delta table. These joins will be processed incrementally whenever possible. You do not need to use watermarks with materialized views and stream-snapshot joins.

Diagram: Delta Live Tables materialized view stream-snapshot join.
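A sketch of a stream-snapshot join as a materialized view, with hypothetical fact (`sales`) and dimension (`customers`) tables; note that no watermarks appear anywhere:

```python
import dlt

# Materialized view joining a fact table with a "lookup" dimension table.
# Delta Live Tables processes the join incrementally whenever possible.
@dlt.table
def enriched_sales():
    sales = spark.read.table("my_catalog.my_schema.sales")          # fact (hypothetical)
    customers = spark.read.table("my_catalog.my_schema.customers")  # dimension (hypothetical)
    return sales.join(customers, on="customer_id", how="left")
```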

Join two fact tables (stream-stream join)

Materialized views can perform efficient, incremental joins between two streaming tables or Delta tables. This is known as a stream-stream join, and materialized views will perform it incrementally whenever possible. You do not need to use watermarks with materialized views and stream-stream joins.

Diagram: Delta Live Tables materialized view stream-stream join.
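A sketch of a stream-stream join as a materialized view, with two hypothetical fact tables; as above, no watermarks are required:

```python
import dlt

# Materialized view joining two fact tables (stream-stream join).
# The join is incrementalized whenever possible, and correctness is
# guaranteed without watermarks.
@dlt.table
def orders_with_shipments():
    orders = spark.read.table("my_catalog.my_schema.orders")        # hypothetical
    shipments = spark.read.table("my_catalog.my_schema.shipments")  # hypothetical
    return orders.join(shipments, on="order_id", how="inner")
```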

Transform data with streaming tables

Streaming tables are recommended when you need to transform high-volume streaming data with low latency.

Transform a single table with a streaming table

Streaming tables can be used to transform data from any Delta table or another streaming table.

The following caveat applies to this use case:

  • When you update the streaming table’s definition, existing data in the streaming table won’t be updated to reflect the change unless you fully refresh it.

Diagram: Delta Live Tables streaming table transforming a single table.
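A sketch of a streaming table transforming an upstream table in the same pipeline; the upstream name and columns are hypothetical:

```python
import dlt
from pyspark.sql.functions import col, upper

# Streaming table that incrementally transforms another dataset.
# If this definition changes later, rows already processed are NOT
# rewritten unless the table is fully refreshed.
@dlt.table
def clean_events():
    return (
        dlt.read_stream("raw_events")  # hypothetical upstream dataset
        .where(col("event_type").isNotNull())
        .withColumn("event_type", upper(col("event_type")))
    )
```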

Join a fact table with a dimension table (stream-snapshot join) using a streaming table

Streaming tables can join a fact table with a dimension table.

The following caveats apply to this use case:

  • When you update the streaming table’s definition, existing data in the streaming table won’t be updated to reflect the change unless you fully refresh it.
  • When you update the lookup table, existing data in the streaming table won’t be updated to reflect the change unless you fully refresh it.

Diagram: Delta Live Tables streaming table stream-snapshot join.
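A sketch of this join as a streaming table, with a hypothetical streaming fact dataset (`clicks`) and a static dimension table (`users`):

```python
import dlt

# Streaming fact table joined against a dimension table.
# Caveat: rows already written are not revised when the dimension
# table changes later; a full refresh is required to pick that up.
@dlt.table
def enriched_clicks():
    clicks = dlt.read_stream("clicks")                      # streaming fact (hypothetical)
    users = spark.read.table("my_catalog.my_schema.users")  # dimension (hypothetical)
    return clicks.join(users, on="user_id", how="left")
```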

Join two fact tables (stream-stream join) using a streaming table

Streaming tables can join two or more fact tables, also known as a stream-stream join.

The following caveats apply to this use case:

  • When you update the streaming table’s definition, existing data in the streaming table won’t be updated to reflect the change unless you fully refresh it.
  • To avoid out-of-memory errors, you must use watermarks on both sides of the join and in aggregations.
  • Out-of-order and late-arriving data is not handled automatically, which can lead to inaccurate results. You must handle such data manually.

See Use watermarks with stream-stream joins.

Diagram: Delta Live Tables streaming table stream-stream join.
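A sketch of a stream-stream join as a streaming table, with hypothetical `impressions` and `clicks` datasets. Watermarks on both sides bound the join state that must be kept in memory:

```python
import dlt
from pyspark.sql.functions import expr

# Stream-stream join between two fact streams. Watermarks on BOTH sides
# limit state size and avoid out-of-memory errors; events later than the
# watermark are dropped, so late data must be handled separately.
@dlt.table
def impressions_with_clicks():
    impressions = (
        dlt.read_stream("impressions")                     # hypothetical dataset
        .withWatermark("impression_time", "10 minutes")
        .alias("i")
    )
    clicks = (
        dlt.read_stream("clicks")                          # hypothetical dataset
        .withWatermark("click_time", "10 minutes")
        .alias("c")
    )
    return impressions.join(
        clicks,
        expr("""
            i.ad_id = c.ad_id AND
            c.click_time BETWEEN i.impression_time
                             AND i.impression_time + INTERVAL 15 minutes
        """),
    )
```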