Cost, Scaling, and Migration Considerations – Spark Structured Streaming (DBR) vs. Delta Live Tables (DLT)

Janice Chi 140 Reputation points
2025-06-25T13:44:11.0633333+00:00

We are currently designing a near real-time streaming pipeline for a healthcare analytics workload and are evaluating Databricks Spark Structured Streaming (using DBR) versus Delta Live Tables (DLT) for implementation.

Project Context:

Ingestion source: Kafka-based CDC stream (~3,000 to 30,000 events/sec)

Target: Azure SQL Hyperscale

Current plan: Initially use standard Spark Structured Streaming in Databricks, without enabling auto-scaling

Cluster: We plan to manually tune min/max workers (e.g., 2 to 10 nodes), possibly using asynchronous auto-scaling (newer feature) to improve scale-in behavior

Concern: We've observed known limitations in Databricks auto-scaling around slow scale-in for streaming clusters, and have read that Databricks itself recommends avoiding auto-scaling for strict SLA-driven streaming workloads


Guidance Requested:

Can we safely start with DBR-based Spark Structured Streaming (without auto-scaling) for a real-time Kafka CDC pipeline, and defer DLT adoption until we observe any actual bottlenecks?

If we later decide to migrate the same Spark streaming logic to DLT, what will be the estimated effort:

• Will this be a complete rewrite (SQL/Python format change)?

• Are there any known **incompatibilities or manual conversion steps** (e.g., `foreachBatch`, streaming joins, window operations, checkpointing)?

• Does DLT support seamless porting of existing notebook logic from DBR?

• Based on recent customer feedback and platform evolution, are there **clear cost benchmarks** or **guidance on which workloads justify DLT** over manually managed Spark streaming clusters?
     

Any clarification or recommended best practices for migrating from DBR to DLT with minimal disruption would be greatly appreciated.


1 answer

  1. Venkat Reddy Navari 3,125 Reputation points Microsoft External Staff Moderator
    2025-06-25T15:26:31.8633333+00:00

    @Janice Chi Yes, absolutely. Starting with Spark Structured Streaming (without enabling autoscaling) is a solid and widely used approach, especially when:

    • You want full control over cluster size and performance tuning
    • You're working with SLA-sensitive streaming (like Kafka CDC)
    • You’re still experimenting and don’t want the overhead of Delta Live Tables (DLT) just yet

    Also, you’re right to be cautious about autoscaling. In practice, Databricks’ autoscaling can lag a bit on scale-in, and it’s something they themselves recommend avoiding in latency-sensitive streaming pipelines. Your plan to manually set a worker range and possibly use asynchronous autoscaling for smoother scale-in is a good move.
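    For illustration, here is a rough sketch of the kind of fixed-size cluster definition this implies, submitted through the Databricks Clusters REST API; the workspace URL, token, node type, DBR version, and worker counts are all placeholders, so treat it as a starting point rather than a recommended configuration:

    ```python
    # Rough sketch: a fixed-size streaming cluster created via the Databricks Clusters REST API.
    # Host, token, node type, DBR version, and worker counts are placeholders.
    import requests

    DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"   # placeholder
    TOKEN = "<personal-access-token>"                                   # placeholder

    cluster_spec = {
        "cluster_name": "cdc-streaming-fixed",
        "spark_version": "14.3.x-scala2.12",     # example DBR LTS version
        "node_type_id": "Standard_D8ds_v5",      # example Azure VM size
        "num_workers": 6,                        # fixed size: predictable latency, no scale-in surprises
        # Alternative: a manually tuned autoscale range instead of a fixed size
        # "autoscale": {"min_workers": 2, "max_workers": 10},
    }

    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.1/clusters/create",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=cluster_spec,
    )
    resp.raise_for_status()
    print(resp.json()["cluster_id"])
    ```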

    Migrating from DBR to DLT Later — What’s the Effort?

    Code Format & Rewrite Requirements

    It depends on how your current logic is written, but if you structure things thoughtfully, the transition can be relatively smooth.

    • If your code uses standard DataFrame APIs (SQL or Python) and writes to Delta tables, you can likely reuse most of it in DLT with only minor adjustments.
    • If you're using foreachBatch to push data directly into Azure SQL (common in CDC), that part won't carry over, because DLT doesn't support custom sink logic like foreachBatch. In DLT, you'd typically write to Delta tables first, then sync to Azure SQL via ADF or another downstream process (see the sketch after this list).
    • More complex logic (e.g., streaming joins, window aggregations, manual checkpointing) might also need rework since DLT manages orchestration and state differently.
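    For reference, here is a minimal sketch of the foreachBatch pattern in question, reading the Kafka CDC stream and pushing each micro-batch into Azure SQL over JDBC. The topic name, JDBC connection details, checkpoint path, and target table are hypothetical, and the write shown is a plain append; real CDC upserts would normally land in a staging table and be applied with a MERGE on the SQL side:

    ```python
    # Minimal sketch of the DBR-only pattern: Structured Streaming + foreachBatch into Azure SQL.
    # Kafka topic, JDBC details, checkpoint path, and table names are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    jdbc_url = (
        "jdbc:sqlserver://<server>.database.windows.net:1433;"
        "database=<hyperscale-db>;encrypt=true"
    )
    jdbc_props = {
        "user": "<user>",
        "password": "<password>",
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    }

    cdc_raw = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "<broker1>:9093")
        .option("subscribe", "cdc-events")          # hypothetical topic
        .option("startingOffsets", "latest")
        .load()
    )

    def write_to_sql(batch_df, batch_id):
        # Plain append for illustration; CDC merges usually land in a staging table
        # and are applied with a MERGE statement on the SQL side.
        (batch_df
         .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS payload")
         .write
         .jdbc(url=jdbc_url, table="dbo.cdc_staging", mode="append", properties=jdbc_props))

    query = (
        cdc_raw.writeStream
        .foreachBatch(write_to_sql)
        .option("checkpointLocation", "abfss://chk@<storage>.dfs.core.windows.net/cdc")  # placeholder
        .trigger(processingTime="30 seconds")
        .start()
    )
    ```

    This is exactly the piece with no direct DLT equivalent, which is why the write-to-Delta-then-sync pattern comes up.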

    Notebook Reuse & Compatibility

    Yes, you can reuse notebooks inside DLT pipelines. Just be mindful of structure:

    • DLT expects your logic to be wrapped in decorators like @dlt.table or @dlt.view (see the Delta Live Tables Python API reference, and the sketch after this list)
    • It organizes your pipeline as a DAG, so breaking your logic into clear steps/modules helps
    • Logging, checkpointing, and retries are handled by DLT automatically, so some parts of your DBR logic may become redundant
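    To make the structural difference concrete, here is a minimal sketch of the same ingestion step expressed with the DLT Python API; the table names, Kafka options, and expectation are hypothetical, and note that there is no explicit checkpoint or sink because DLT writes to managed Delta tables and handles state, retries, and logging itself:

    ```python
    # Minimal sketch of the equivalent logic expressed as a DLT pipeline (Python API).
    # Table names, the Kafka options, and the expectation are placeholders.
    import dlt

    @dlt.table(comment="Raw CDC events ingested from Kafka.")
    def cdc_raw():
        return (
            spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "<broker1>:9093")
            .option("subscribe", "cdc-events")      # hypothetical topic
            .load()
        )

    @dlt.table(comment="Parsed CDC events; DLT manages checkpoints, retries, and the Delta target.")
    @dlt.expect_or_drop("valid_key", "key IS NOT NULL")   # example data-quality expectation
    def cdc_parsed():
        return (
            dlt.read_stream("cdc_raw")
            .selectExpr("CAST(key AS STRING) AS key",
                        "CAST(value AS STRING) AS payload",
                        "timestamp AS event_ts")
        )
    ```

    The hop into Azure SQL Hyperscale would then happen downstream (for example, via ADF reading the Delta output), since the foreachBatch sink does not carry over.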

    Cost and When to Choose DLT

    There’s no official public cost benchmark doc yet, but based on customer feedback and usage patterns:

    DLT is Ideal When:

    • You want auto-managed recovery, testing, and data quality checks
    • Your pipeline has multiple dependencies across batch and streaming tables
    • You need built-in observability (event logs, quality metrics)
    • You’re managing multiple pipelines and need better governance and automation

    DBR Might Be Better When:

    • You need custom sink logic, like writing directly to Azure SQL
    • You want fine-grained control over cluster resources and scheduling
    • Budget is tight and you want to avoid the premium DBU rates that DLT pipelines are billed at compared with standard job clusters

    For pricing, you can model both options using the Azure Pricing Calculator by comparing Databricks compute usage (standard vs. DLT pipelines) for your projected workloads.


    I hope this information helps. Please do let us know if you have any further queries.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

