Delta Live Tables is a declarative framework for building reliable, maintainable, and testable data processing pipelines. You define the transformations to perform on your data and Delta Live Tables manages task orchestration, cluster management, monitoring, data quality, and error handling.
Note
Delta Live Tables requires the Premium plan. Contact your Databricks account team for more information.
Instead of defining your data pipelines using a series of separate Apache Spark tasks, you define streaming tables and materialized views that the system should create and keep up to date. Delta Live Tables manages how your data is transformed based on queries you define for each processing step. You can also enforce data quality with Delta Live Tables expectations, which allow you to define expected data quality and specify how to handle records that fail those expectations.
To learn more about the benefits of building and running your ETL pipelines with Delta Live Tables, see the Delta Live Tables product page.
What are Delta Live Tables datasets?
Delta Live Tables datasets are the streaming tables, materialized views, and views maintained as the results of declarative queries. The following table describes how each dataset is processed:
| Dataset type | How are records processed through defined queries? |
| --- | --- |
| Streaming table | Each record is processed exactly once. This assumes an append-only source. |
| Materialized view | Records are processed as required to return accurate results for the current data state. Materialized views should be used for data processing tasks such as transformations, aggregations, or pre-computing slow queries and frequently used computations. |
| View | Records are processed each time the view is queried. Use views for intermediate transformations and data quality checks that should not be published to public datasets. |
The following sections provide more detailed descriptions of each dataset type. To learn more about selecting dataset types to implement your data processing requirements, see When to use views, materialized views, and streaming tables.
Streaming table
A streaming table is a Delta table with extra support for streaming or incremental data processing. Streaming tables allow you to process a growing dataset, handling each row only once. Because most datasets grow continuously over time, streaming tables are good for most ingestion workloads. Streaming tables are optimal for pipelines that require data freshness and low latency. Streaming tables can also be useful for massive-scale transformations, because results can be calculated incrementally as new data arrives, keeping results up to date without fully recomputing all source data with each update. Streaming tables are designed for data sources that are append-only.
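The incremental behavior described above can be illustrated with a small, self-contained Python model. This is only a conceptual sketch and not Delta Live Tables code: the `StreamingTotals` class, the `source_log` list, and the offset bookkeeping are all hypothetical stand-ins for how an append-only source lets each record be handled once.

```python
from dataclasses import dataclass, field


@dataclass
class StreamingTotals:
    """Toy model of a streaming table that keeps per-key running totals."""

    offset: int = 0  # number of source records already processed
    totals: dict = field(default_factory=dict)

    def update(self, source_log):
        """Process only the records appended since the previous update."""
        new_records = source_log[self.offset:]
        for key, amount in new_records:
            self.totals[key] = self.totals.get(key, 0) + amount
        self.offset = len(source_log)
        return len(new_records)


# Hypothetical append-only source of (customer, amount) pairs.
source_log = [("alice", 10), ("bob", 5)]
table = StreamingTotals()
first = table.update(source_log)    # both existing records are processed

source_log.append(("alice", 7))     # new data arrives
second = table.update(source_log)   # only the newly appended record is processed
```

Because the model only ever reads past its stored offset, it gives correct totals only while the source is append-only. A changed or deleted earlier record would go unnoticed, which is why the append-only assumption matters.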
Note
By default, streaming tables require append-only data sources. When a streaming source is another streaming table that requires updates or deletes, you can override this behavior with the skipChangeCommits flag.
Materialized view
A materialized view is a view where the results have been precomputed. Materialized views are refreshed according to the update schedule of the pipeline in which they’re contained. Materialized views are powerful because they can handle any changes in the input. Each time the pipeline updates, query results are recalculated to reflect changes in upstream datasets that might have occurred because of compliance, corrections, aggregations, or general CDC. Delta Live Tables implements materialized views as Delta tables, but abstracts away complexities associated with efficient application of updates, allowing users to focus on writing queries.
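As a rough illustration of why a materialized view can absorb arbitrary upstream changes, the following Python sketch recomputes an aggregate from the current state of its source on every update. It is not Delta Live Tables code: the `materialize_totals` function and the `orders` data are hypothetical. It also recomputes everything on each call, whereas the text above notes that the real system abstracts away how updates are applied efficiently.

```python
def materialize_totals(source_rows):
    """Recompute per-customer totals from the current state of the source."""
    totals = {}
    for row in source_rows.values():
        customer = row["customer"]
        totals[customer] = totals.get(customer, 0) + row["amount"]
    return totals


# Hypothetical upstream table keyed by order id.
orders = {
    1: {"customer": "alice", "amount": 10},
    2: {"customer": "bob", "amount": 5},
    3: {"customer": "alice", "amount": 7},
}
view_v1 = materialize_totals(orders)

# Upstream changes between updates: one correction and one delete.
orders[1]["amount"] = 12
del orders[2]
view_v2 = materialize_totals(orders)
```

After the second call, the result reflects both the corrected amount and the deleted row. An offset-based incremental reader would have missed both.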
Views
All views in Azure Databricks compute results from source datasets as they are queried, leveraging caching optimizations when available. Delta Live Tables does not publish views to the catalog, so views can be referenced only within the pipeline in which they are defined. Views are useful as intermediate queries that should not be exposed to end users or systems. Databricks recommends using views to enforce data quality constraints or transform and enrich datasets that drive multiple downstream queries.
Delta Live Tables separates dataset definitions from update processing, and Delta Live Tables notebooks are not intended for interactive execution. See What is a Delta Live Tables pipeline?.
What is a Delta Live Tables pipeline?
A pipeline is the main unit used to configure and run data processing workflows with Delta Live Tables.
A pipeline contains materialized views and streaming tables declared in Python or SQL source files. Delta Live Tables infers the dependencies between these tables, ensuring updates occur in the correct order. For each dataset, Delta Live Tables compares the current state with the desired state and proceeds to create or update datasets using efficient processing methods.
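The idea of inferring dependencies and updating datasets in the correct order can be sketched with a topological sort. The example below uses Python's standard `graphlib` module on a hypothetical dependency map. The dataset names are invented, and this is only an illustration of the ordering concept, not how Delta Live Tables is implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical datasets mapped to the datasets each one reads from.
dependencies = {
    "raw_orders": set(),
    "clean_orders": {"raw_orders"},
    "daily_revenue": {"clean_orders"},
    "top_customers": {"clean_orders"},
}

# static_order() yields each dataset only after all of its upstream datasets.
update_order = list(TopologicalSorter(dependencies).static_order())
```

Any order produced this way places `raw_orders` before `clean_orders`, and `clean_orders` before both downstream datasets. The relative order of the two downstream datasets is not constrained.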
The settings of Delta Live Tables pipelines fall into two broad categories:
Configurations that define a collection of notebooks or files (known as source code) that use Delta Live Tables syntax to declare datasets.
Configurations that control pipeline infrastructure, dependency management, how updates are processed, and how tables are saved in the workspace.
Most configurations are optional, but some require careful attention, especially when configuring production pipelines. These include the following:
To make data available outside the pipeline, you must declare a target schema to publish to the Hive metastore or a target catalog and target schema to publish to Unity Catalog.
Data access permissions are configured through the cluster used for execution. Make sure your cluster has appropriate permissions configured for data sources and the target storage location, if specified.
Before processing data with Delta Live Tables, you must configure a pipeline. Once a pipeline is configured, you can trigger an update to calculate results for each dataset in your pipeline. To get started using Delta Live Tables pipelines, see Tutorial: Run your first Delta Live Tables pipeline.
What is a pipeline update?
Pipelines deploy infrastructure and recompute data state when you start an update. An update does the following:
Starts a cluster with the correct configuration.
Discovers all the tables and views defined, and checks for any analysis errors such as invalid column names, missing dependencies, and syntax errors.
Creates or updates tables and views with the most recent data available.
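The analysis step in the list above can be pictured as a check that runs before any data is processed. The following Python sketch covers only one of the listed error types, missing dependencies. The `find_missing_dependencies` function and the pipeline definition are hypothetical and do not correspond to any Delta Live Tables API.

```python
def find_missing_dependencies(datasets, external_sources=frozenset()):
    """Return {dataset: [missing inputs]} for inputs that are defined nowhere."""
    known = set(datasets) | set(external_sources)
    missing = {}
    for name, inputs in datasets.items():
        unknown = sorted(set(inputs) - known)
        if unknown:
            missing[name] = unknown
    return missing


# Hypothetical pipeline: each dataset lists the datasets it reads from.
pipeline = {
    "raw_orders": [],
    "clean_orders": ["raw_orders"],
    "daily_revenue": ["clean_order"],  # typo: should be "clean_orders"
}
errors = find_missing_dependencies(pipeline)

# Correcting the reference clears the analysis error.
pipeline["daily_revenue"] = ["clean_orders"]
errors_after_fix = find_missing_dependencies(pipeline)
```

Reporting such problems during analysis means a misspelled reference is caught before any table is created or updated.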
Ingest data with Delta Live Tables
Delta Live Tables supports all data sources available in Azure Databricks.
Databricks recommends using streaming tables for most ingestion use cases. For files arriving in cloud object storage, Databricks recommends Auto Loader. You can directly ingest data with Delta Live Tables from most message buses.
For formats not supported by Auto Loader, you can use Python or SQL to query any format supported by Apache Spark. See Load data with Delta Live Tables.
Monitor and enforce data quality
You can use expectations to specify data quality controls on the contents of a dataset. Unlike a CHECK constraint in a traditional database, which prevents adding any records that fail the constraint, expectations provide flexibility when processing data that fails data quality requirements. This flexibility allows you to process and store both data that you expect to be messy and data that must meet strict quality requirements. See Manage data quality with Delta Live Tables.
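To make the contrast with a CHECK constraint concrete, the sketch below models three ways of handling records that fail a rule: keep them and count the failures, drop them, or fail the update. In the Delta Live Tables Python interface these behaviors are exposed through the `@dlt.expect`, `@dlt.expect_or_drop`, and `@dlt.expect_or_fail` decorators. The code here does not call them. The `apply_expectation` function, its `on_violation` parameter, and the sample rows are hypothetical and exist only to illustrate the semantics in plain Python.

```python
def apply_expectation(records, predicate, on_violation="warn"):
    """Apply one data quality rule and return (kept_records, failed_count)."""
    if on_violation not in {"warn", "drop", "fail"}:
        raise ValueError(f"unknown action: {on_violation!r}")
    passed = [record for record in records if predicate(record)]
    failed_count = len(records) - len(passed)
    if on_violation == "fail" and failed_count:
        # Strict rule: a single invalid record stops processing.
        raise RuntimeError(f"{failed_count} record(s) failed the expectation")
    # "drop" keeps only valid records; "warn" keeps everything and reports the count.
    kept = passed if on_violation == "drop" else list(records)
    return kept, failed_count


def valid_amount(record):
    """Hypothetical rule: amounts must be non-negative."""
    return record["amount"] >= 0


rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": -3}]
warned, warn_failures = apply_expectation(rows, valid_amount, "warn")
dropped, drop_failures = apply_expectation(rows, valid_amount, "drop")
```

With `"warn"`, messy data is stored and the failure count is available for monitoring. With `"fail"`, the same input would raise an error, which suits datasets that must meet strict quality requirements.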
How are Delta Live Tables and Delta Lake related?
Delta Live Tables extends the functionality of Delta Lake. Because tables created and managed by Delta Live Tables are Delta tables, they have the same guarantees and features provided by Delta Lake. See What is Delta Lake?.
How tables are created and managed by Delta Live Tables
Azure Databricks automatically manages tables created with Delta Live Tables, determining how updates need to be processed to correctly compute the current state of a table and performing a number of maintenance and optimization tasks.
For most operations, you should allow Delta Live Tables to process all updates, inserts, and deletes to a target table. For details and limitations, see Retain manual deletes or updates.
Maintenance tasks performed by Delta Live Tables
Delta Live Tables performs maintenance tasks within 24 hours of a table being updated. Maintenance can improve query performance and reduce cost by removing old versions of tables. By default, the system performs a full OPTIMIZE operation followed by VACUUM. You can disable OPTIMIZE for a table by setting pipelines.autoOptimize.managed = false in the table properties for the table. Maintenance tasks are performed only if a pipeline update has run in the 24 hours before the maintenance tasks are scheduled.
Limitations
The following limitations apply:
All tables created and updated by Delta Live Tables are Delta tables.
Delta Lake time travel queries are supported only with streaming tables, and are not supported with materialized views. See Work with Delta Lake table history.
Delta Live Tables tables can only be defined once, meaning they can only be the target of a single operation in all Delta Live Tables pipelines.
Identity columns are not supported with tables that are the target of APPLY CHANGES INTO and might be recomputed during updates for materialized views. For this reason, Databricks recommends using identity columns in Delta Live Tables only with streaming tables. See Use identity columns in Delta Lake.
An Azure Databricks workspace is limited to 100 concurrent pipeline updates.