What to Use for Kafka-Based Structured Streaming: Databricks or Delta Live Tables?

Janice Chi 140 Reputation points
2025-06-04T15:42:25.9666667+00:00

We are implementing a real-time streaming architecture where change data from on-prem DB2 flows through Kafka (already hosted on GCP) into Azure for downstream processing and storage.

Our requirements:

  • Process incoming Kafka messages in structured streaming mode
  • Apply basic transformations (column selection, casting, filtering)
  • Perform deduplication based on natural keys
  • Write the cleaned data to Azure SQL Hyperscale (via JDBC)
  • Maintain checkpointing and fault-tolerant streaming logic
  • Process approximately 800 topics (each representing one DB2 table); a rough sketch of one per-topic job is included below for context
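
For context, each per-topic job would look roughly like the sketch below (PySpark). The topic name, payload schema, and connection details are illustrative placeholders rather than our actual values:

```python
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Illustrative schema for one DB2 table's CDC payload
payload_schema = StructType([
    StructField("customer_id", StringType()),   # natural key
    StructField("name", StringType()),
    StructField("updated_at", TimestampType()),
])

# Read one Kafka topic (one topic per DB2 table)
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "<gcp-kafka-broker>:9092")
    .option("subscribe", "db2.customer")
    .option("startingOffsets", "latest")
    .load()
)

# Basic transformations: parse the JSON value, select/cast columns, filter
parsed = (
    raw.select(from_json(col("value").cast("string"), payload_schema).alias("p"))
    .select("p.*")
    .filter(col("customer_id").isNotNull())
)

def write_to_hyperscale(batch_df, batch_id):
    # Deduplicate on the natural key within each micro-batch; cross-batch
    # dedup would need watermarked dropDuplicates or a MERGE on the target.
    deduped = batch_df.dropDuplicates(["customer_id"])
    (
        deduped.write.format("jdbc")
        .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
        .option("dbtable", "dbo.customer")
        .option("user", "<user>")
        .option("password", "<secret>")
        .mode("append")
        .save()
    )

# Fault tolerance via checkpointing; a restart resumes from committed offsets
query = (
    parsed.writeStream
    .foreachBatch(write_to_hyperscale)
    .option("checkpointLocation", "abfss://checkpoints@<storage>.dfs.core.windows.net/db2.customer")
    .trigger(processingTime="1 minute")
    .start()
)
```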

We are evaluating whether to use regular Databricks structured streaming (notebooks/jobs) or Delta Live Tables (DLT) for this scenario.

Our main concerns:

  • Cost vs. value when scaling to 800 topics
  • Flexibility for debugging and custom logic (like recon, retries)
  • Operational manageability in production
  • Whether DLT truly simplifies or adds complexity at this scale

Question: Given the above, is Delta Live Tables the recommended solution, or should we use standard Databricks structured streaming with jobs/notebooks? Are there specific cases where DLT is not advisable for large-scale Kafka topic ingestion?

Appreciate any guidance based on best practices for large-scale ingestion and real-time processing in Azure.


1 answer

  1. Chandra Boorla 14,585 Reputation points Microsoft External Staff Moderator
    2025-06-04T17:06:49+00:00

    @Janice Chi

    Thanks for the detailed question. Your use case is a classic example of large-scale real-time ingestion, where choosing the right tool can significantly impact cost, maintainability, and operational efficiency.

    Given your requirements, especially the need to process ~800 Kafka topics, apply light transformations, perform deduplication, and maintain production-grade reliability, here is a comparison of Delta Live Tables (DLT) vs. standard Databricks Structured Streaming:

    Structured Streaming (Notebooks/Jobs)

    | Category | Pros | Cons |
    | --- | --- | --- |
    | Flexibility | Full control for custom logic like reconciliation, conditional retries, and topic-specific handling | Requires manual implementation of advanced features |
    | Scalability | Scales better with high topic counts; supports dynamic job generation or orchestration | Needs orchestration setup for managing multiple jobs |
    | Cost Efficiency | Lower cost footprint: pay only for compute, with no added DLT charges | No built-in optimization or resource management features as in DLT |
    | Observability & Lineage | Can integrate with Unity Catalog to enable lineage (manually) | Lacks native lineage, monitoring, and logging; these must be custom-built |
    | Operational Overhead | More customizable for fine-grained operations | Higher DevOps overhead: orchestration, monitoring, and error handling need to be developed externally |

    Delta Live Tables (DLT)

    | Category | Pros | Cons |
    | --- | --- | --- |
    | Ease of Use | Declarative pipeline syntax simplifies development and onboarding | Limited flexibility for complex, topic-specific logic or custom error handling |
    | Built-in Features | Automatically manages checkpointing, retries, pipeline orchestration, and schema enforcement | May introduce abstraction overhead and reduce fine-grained control |
    | Observability & Lineage | Native integration with Unity Catalog provides built-in lineage, monitoring, and data quality tracking | Less transparency for debugging deeply nested or dynamic processing logic |
    | Operational Simplicity | Reduces DevOps burden with built-in error handling, recovery, and scheduling | Harder to dynamically scale or templatize for very large numbers of Kafka topics (like 800) |
    | Cost Considerations | Predictable managed-service billing; good operational value for small-to-moderate pipelines | Higher total cost at scale due to additional charges beyond compute (especially with many pipelines) |

    Recommendation

    Given your need to handle 800+ Kafka topics (each potentially with unique logic or schema) and your requirements for fault tolerance, deduplication, and operational control:

    Databricks Structured Streaming (with notebooks or jobs) is the better fit at this scale.

    It offers the flexibility, cost-efficiency, and scalability needed for high-volume streaming workloads. You can template and parameterize your logic and use orchestrators like Databricks Workflows or Azure Data Factory to manage these jobs effectively.
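
    As an illustration of the templating idea (not part of the original guidance): a single parameterized notebook can be reused for every topic by passing per-topic settings as job parameters from Databricks Workflows or ADF. The widget names, checkpoint path, and the run_topic_stream helper below are assumptions for the sketch:

    ```python
    # Generic per-topic streaming notebook, instantiated many times by the
    # orchestrator with different parameter values (one job/task per topic).
    dbutils.widgets.text("topic", "db2.customer")          # Kafka topic to consume
    dbutils.widgets.text("target_table", "dbo.customer")   # Hyperscale target table

    topic = dbutils.widgets.get("topic")
    target_table = dbutils.widgets.get("target_table")
    checkpoint_path = f"abfss://checkpoints@<storage>.dfs.core.windows.net/{topic}"

    # run_topic_stream is a hypothetical shared helper that wraps the common
    # read -> transform -> deduplicate -> foreachBatch JDBC write logic.
    run_topic_stream(topic=topic,
                     target_table=target_table,
                     checkpoint_path=checkpoint_path)
    ```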

    Delta Live Tables can be a good fit if:

    • You consolidate or group topics,
    • Pipelines share similar schemas or logic,
    • You want a simplified operational model for a smaller set of high-priority data flows (a minimal DLT sketch follows below).
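
    For comparison, a DLT pipeline for one (consolidated) topic might look like the minimal sketch below; the topic, schema, and column names are illustrative. Note that DLT writes its outputs to Delta tables managed by the pipeline, so the JDBC load into Azure SQL Hyperscale would still run as a separate downstream job:

    ```python
    import dlt
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    # Illustrative payload schema for one DB2 table's CDC messages
    payload_schema = StructType([
        StructField("customer_id", StringType()),
        StructField("updated_at", TimestampType()),
    ])

    @dlt.table(comment="Raw CDC events from one Kafka topic")
    def customer_raw():
        return (
            spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "<gcp-kafka-broker>:9092")
            .option("subscribe", "db2.customer")
            .load()
        )

    @dlt.table(comment="Cleaned, deduplicated customer rows")
    @dlt.expect_or_drop("valid_key", "customer_id IS NOT NULL")
    def customer_clean():
        return (
            dlt.read_stream("customer_raw")
            .select(from_json(col("value").cast("string"), payload_schema).alias("p"))
            .select("p.*")
            .withWatermark("updated_at", "10 minutes")
            .dropDuplicates(["customer_id", "updated_at"])
        )
    ```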

    For additional information, please refer to the Microsoft documentation on Structured Streaming and Delta Live Tables in Azure Databricks.

    I hope this information helps. Please do let us know if you have any further queries.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

    Thank you.

