What is Lakeflow Spark Declarative Pipelines

Learn what Lakeflow Spark Declarative Pipelines (SDP) is, the core concepts (such as pipelines, streaming tables, and materialized views) that define it, the relationships between those concepts, and the benefits of using it in your data processing workflows.

Note

Lakeflow Spark Declarative Pipelines requires the Premium plan. Contact your Databricks account team for more information.

What is SDP?

Lakeflow Spark Declarative Pipelines is a declarative framework for developing and running batch and streaming data pipelines in SQL and Python. Lakeflow SDP extends and is interoperable with Apache Spark Declarative Pipelines, while running on the performance-optimized Databricks Runtime, and the Lakeflow Spark Declarative Pipelines flows API uses the same DataFrame API as Apache Spark and Structured Streaming.

Common use cases for SDP include:

  • Incremental data ingestion from sources such as cloud storage (Amazon S3, Azure ADLS Gen2, and Google Cloud Storage) and message buses (Apache Kafka, Amazon Kinesis, Google Pub/Sub, Azure EventHub, and Apache Pulsar).
  • Incremental batch and streaming transformations with stateless and stateful operators.
  • Real-time stream processing between transactional stores such as message buses and databases.

For more details on declarative data processing, see Procedural vs. declarative data processing in Databricks.

What are the benefits of SDP?

The declarative nature of SDP provides the following benefits compared to developing data processes with the Apache Spark and Spark Structured Streaming APIs and running them with the Databricks Runtime using manual orchestration via Lakeflow Jobs.

  • Automatic orchestration: SDP orchestrates processing steps (called "flows") automatically to ensure the correct order of execution and the maximum level of parallelism for optimal performance. Additionally, pipelines automatically and efficiently retry transient failures. The retry process begins with the most granular and cost-effective unit: the Spark task. If the task-level retry fails, SDP proceeds to retry the flow, and then finally the entire pipeline if necessary.
  • Declarative processing: SDP provides declarative functions that can reduce hundreds or even thousands of lines of manual Spark and Structured Streaming code to only a few lines. The SDP AUTO CDC API simplifies processing of Change Data Capture (CDC) events with support for both SCD Type 1 and SCD Type 2. It eliminates the need for manual code to handle out-of-order events, and it does not require an understanding of streaming semantics or concepts like watermarks.
  • Incremental processing: SDP provides an incremental processing engine for materialized views. To use it, you write your transformation logic with batch semantics, and the engine only processes new data and changes in the data sources whenever possible. Incremental processing reduces inefficient reprocessing when new data or changes occur in the sources and eliminates the need for manual code to handle incremental processing.

Key concepts

The diagram below illustrates the most important concepts of Lakeflow Spark Declarative Pipelines.

A diagram that shows how the core concepts of SDP relate to each other at a very high level

Datasets

A pipeline produces three types of datasets, each with different processing semantics:

Dataset type How records are processed
Streaming table Each record is processed exactly once, assuming an append-only source. Streaming tables are suited for ingestion and incremental processing of continuously growing data.
Materialized view Results are recomputed as needed to reflect the current state of the data. Materialized views are suited for transformations, aggregations, or pre-computing results consumed by multiple downstream datasets.
View Evaluated on demand, not persisted. Use views for intermediate transformations and checks that do not need to be published to a catalog.

A streaming table is a form of Unity Catalog managed table that is also a streaming target. A streaming table can have one or more streaming flows (Append, AUTO CDC) written into it. You can define streaming flows explicitly and separately from their target streaming table, or implicitly as part of a streaming table definition.

A materialized view is also a form of Unity Catalog managed table and is a batch target. A materialized view can have one or more materialized view flows written into it. Materialized views differ from streaming tables in that you always define the flows implicitly as part of the materialized view definition.

For details, see Streaming tables and Materialized views.

When to use views, materialized views, and streaming tables

When implementing pipeline queries, choose the dataset type that best fits your use case.

Consider using a view to:

  • Break a large or complex query into easier-to-manage queries.
  • Validate intermediate results using expectations.
  • Reduce storage and compute costs for results you don't need to persist. Because tables are materialized, they require additional computation and storage resources.

Consider using a materialized view when:

  • Multiple downstream queries consume the table. Because views are computed on demand, a view is re-computed every time it is queried.
  • Other pipelines, jobs, or queries consume the table. Because views are not materialized, they can only be used within the same pipeline.
  • You want to inspect the results of a query during development. Because tables are materialized and can be queried outside of the pipeline, using tables during development can help validate the correctness of computations. After validating, convert queries that do not require materialization into views.

Consider using a streaming table when:

  • A query is defined against a data source that is continuously or incrementally growing.
  • Query results should be computed incrementally.
  • The pipeline needs high throughput and low latency.

Note

Streaming tables are always defined against streaming sources. You can also use streaming sources with AUTO CDC ... INTO to apply updates from CDC feeds. See The AUTO CDC APIs: Simplify change data capture with pipelines.

Flows

A flow is the foundational data processing concept in SDP which supports both streaming and batch semantics. A flow reads data from a source, applies user-defined processing logic, and writes the result into a target. SDP shares the same streaming flow type (Append, Update, Complete) as Spark Structured Streaming. (Currently, only the Append and Update flows are exposed.) For more details, see output modes in Structured Streaming.

Lakeflow Spark Declarative Pipelines also provides additional flow types:

  • AUTO CDC is a unique streaming flow in Lakeflow SDP that handles out of order CDC events and supports both SCD Type 1 and SCD Type 2. Auto CDC is not available in Apache Spark Declarative Pipelines.
  • Materialized view is a batch flow in SDP that only processes new data and changes in the source tables whenever possible.

For details, see Load and process data incrementally with Lakeflow Spark Declarative Pipelines flows.

Sinks

A sink is a streaming target for a pipeline and supports Delta tables, Apache Kafka topics, Azure EventHubs topics, and custom Python data sources. A sink can have one or more streaming flows (Append, Update) written into it.

For details, see Sinks in Lakeflow Spark Declarative Pipelines.

Pipelines

A pipeline is the unit of development and execution in Lakeflow Spark Declarative Pipelines, and is the container for the flows, streaming tables, materialized views, and sinks that you define. You use SDP by defining these objects in your pipeline source code and then running the pipeline. While your pipeline runs, it analyzes the dependencies of your defined objects and orchestrates their order of execution and parallelization automatically.

For details, see What are pipelines?.

Data ingestion

Pipelines support all data sources available in Azure Databricks. Databricks recommends using streaming tables for most ingestion use cases. For files in cloud object storage, Auto Loader provides incremental, idempotent loading. For streaming data, pipelines can ingest directly from message buses such as Apache Kafka, Azure Event Hubs, Amazon Kinesis, and Google Pub/Sub. See Load data in pipelines.

Data quality

Expectations are optional clauses on datasets that validate data as it flows through the pipeline. You define an expectation as a SQL boolean constraint and specify what happens when a record fails: warn, drop the record, or fail the update. See Manage data quality with pipeline expectations.

Delta integration

All tables created and managed by pipelines are Delta tables. They have the same guarantees as Delta Lake, including ACID transactions, time travel, and schema enforcement. Pipelines add additional table properties and perform automatic maintenance using predictive optimization, including OPTIMIZE and VACUUM operations. See What is Delta Lake in Azure Databricks?.

Additional resources