Note
Access to this page requires authorization. You can try signing in or changing directories.
Access to this page requires authorization. You can try changing directories.
Lakeflow Spark Declarative Pipelines (SDP) supports both SQL and Python interfaces for defining batch and streaming pipelines. Both interfaces produce the same underlying dataflow graph, so they provide equivalent functionality for most data processing. They differ in flexibility, accessibility, and feature coverage.
Use this guidance to decide which interface to use:
- If you can express your logic in SQL, use SQL.
- If you need programmatic control or a Python-only feature, use Python.
- If you're more comfortable with Python, use Python. It covers the full pipeline feature set, so familiarity is reason enough. The same isn't true in reverse: SQL doesn't cover every feature, so don't choose it on familiarity alone.
You can also combine both interfaces in the same pipeline. See Mix SQL and Python.
When to use SQL
SQL is a good fit when you want:
- Readable, declarative definitions: Clear logic that data engineers and analysts can maintain.
- Standard table types: Pipelines built mostly from streaming tables and materialized views.
- Linear transformation chains: Straightforward ingestion and transformation, such as a bronze-to-silver-to-gold flow, without procedural logic.
- Standalone tables: Standalone streaming tables or materialized views, which you author in SQL.
For an overview of developing pipelines in SQL, see Develop Lakeflow Spark Declarative Pipelines code with SQL.
When to use Python
Python is a good fit when you need:
- Programmatic control: Loops, conditionals, and metaprogramming to generate pipeline definitions dynamically.
- External libraries: Python packages such as
fakerorboto3. See Manage Python dependencies for pipelines. - User-defined functions (UDFs): You define UDFs in Python and can call them from both Python and SQL source files. See User-defined scalar functions - Python.
- Python-only features:
create_auto_cdc_from_snapshot_flow()to apply change data capture from a database snapshot.create_sink()andforeach_batch_sink()to write to external event streaming or Delta destinations.
For an overview of developing pipelines in Python, see Develop pipeline code with Python.
Mix SQL and Python
A single pipeline can combine SQL and Python definitions, but each language must be in a separate source file. For example, you can define your bronze and silver tables in Python and your gold tables in SQL.
Feature availability
The following table compares how each interface supports common pipeline features:
| Feature | SQL | Python |
|---|---|---|
| streaming table | CREATE STREAMING TABLE |
create_streaming_table(), table() |
| materialized view | CREATE MATERIALIZED VIEW |
materialized_view() |
| Temporary view | CREATE TEMPORARY VIEW |
temporary_view() |
| Private table | CREATE PRIVATE STREAMING TABLE, CREATE PRIVATE MATERIALIZED VIEW |
table(private=True) |
| Auto CDC | AUTO CDC ... INTO |
create_auto_cdc_flow() |
| Auto CDC from snapshot | Not supported | create_auto_cdc_from_snapshot_flow() |
| Flow | CREATE FLOW |
append_flow() |
| Sink | Not supported | create_sink(), foreach_batch_sink() |
| Expectations | CONSTRAINT ... EXPECT |
expect(), expect_or_drop(), expect_or_fail(), and the expect_all variants |
Decision summary
If you need...
| Goal | Recommended interface |
|---|---|
| Simplicity and readability | SQL |
| Quick declarative setup | SQL |
| Standalone streaming table or materialized view | SQL |
| Conditional or looping logic | Python |
| UDFs or external Python libraries | Python |
| Auto CDC from snapshot or sinks | Python |
| Full programmatic control and modularity | Python |