@Janice Chi Yes, absolutely. Starting with Spark Structured Streaming (without enabling autoscaling) is a solid and widely used approach, especially when:
- You want full control over cluster size and performance tuning
- You're working with SLA-sensitive streaming (like Kafka CDC)
- You’re still experimenting and don’t want the overhead of Delta Live Tables (DLT) just yet
Also, you’re right to be cautious about autoscaling. In practice, Databricks autoscaling can lag on scale-in, and Databricks itself recommends avoiding it for latency-sensitive streaming pipelines. Your plan to manually set a worker range, and possibly use asynchronous autoscaling for smoother scale-in, is a good move.
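If it helps, here’s a rough sketch (not a prescription) of pinning a worker range in the cluster spec; the runtime version, VM size, and worker counts are placeholders to replace with your own sizing:

```python
# Hypothetical cluster spec, expressed as the Python dict you would pass to the
# Databricks Clusters/Jobs API. All values below are placeholder assumptions.
cluster_spec = {
    "spark_version": "15.4.x-scala2.12",    # placeholder: your actual DBR version
    "node_type_id": "Standard_DS3_v2",      # placeholder: your Azure VM size
    "autoscale": {
        "min_workers": 4,                   # floor chosen for your steady Kafka CDC load
        "max_workers": 8,                   # cap to keep cost and rebalancing predictable
    },
    # For strict latency SLAs you could pin a fixed size instead of autoscaling:
    # "num_workers": 6,
}
```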
Migrating from DBR to DLT Later — What’s the Effort?
Code Format & Rewrite Requirements
It depends on how your current logic is written, but if you structure things thoughtfully, the transition can be relatively smooth.
- If your code uses standard DataFrame APIs (SQL or Python) and writes to Delta tables, you can likely reuse most of it in DLT with only minor adjustments.
- If you're using `foreachBatch` to push data directly into Azure SQL (a common CDC pattern), that part won’t carry over, since DLT doesn’t support custom sink logic like `foreachBatch`. In DLT, you’d typically write to Delta tables first, then sync to SQL via ADF or another downstream process (a minimal sketch of the `foreachBatch` pattern follows this list).
- More complex logic (e.g., streaming joins, window aggregations, manual checkpointing) might also need rework, since DLT manages orchestration and state differently.
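For reference, here’s roughly what that `foreachBatch` sink looks like in a plain Structured Streaming (DBR) job; the broker, topic, JDBC target, secret scope, and checkpoint path are all placeholders, and `spark`/`dbutils` are provided by the Databricks notebook environment:

```python
# Minimal sketch of the foreachBatch-to-Azure-SQL pattern described above.
# This is the part that would NOT carry over to DLT as-is.
from pyspark.sql import DataFrame

def upsert_to_azure_sql(batch_df: DataFrame, batch_id: int) -> None:
    # Custom sink logic: push each micro-batch to Azure SQL over JDBC.
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:sqlserver://<server>.database.windows.net;databaseName=<db>")  # placeholder
        .option("dbtable", "dbo.customer_cdc")                                               # placeholder
        .option("user", dbutils.secrets.get("my-scope", "sql-user"))                         # placeholder
        .option("password", dbutils.secrets.get("my-scope", "sql-password"))                 # placeholder
        .mode("append")
        .save())

(spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<broker>:9092")    # placeholder
    .option("subscribe", "cdc-topic")                      # placeholder
    .load()
    .writeStream
    .foreachBatch(upsert_to_azure_sql)                     # custom sink; DLT has no equivalent hook
    .option("checkpointLocation", "/mnt/checkpoints/cdc")  # placeholder path
    .start())
```

In DLT, the equivalent flow would stop at a Delta table, and the load into Azure SQL would move into ADF or another downstream job.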
Notebook Reuse & Compatibility
Yes, you can reuse notebooks inside DLT pipelines. Just be mindful of structure:
- DLT expects your logic to be wrapped in decorators like `@dlt.table` or `@dlt.view`; the DLT Python API reference covers the details (a short sketch follows this list)
- It organizes your pipeline as a DAG, so breaking your logic into clear steps/modules helps
- Logging, checkpointing, and retries are handled by DLT automatically, so some parts of your DBR logic may become redundant
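As a rough illustration (not the definitive structure of your pipeline), the same ingestion step might look like this inside a DLT notebook; the topic, broker, and table names are assumptions:

```python
# Minimal DLT sketch: the decorators replace the manual writeStream/checkpoint wiring.
import dlt
from pyspark.sql.functions import col

@dlt.table(
    name="bronze_customer_cdc",                   # placeholder table name
    comment="Raw CDC events ingested from Kafka"
)
def bronze_customer_cdc():
    return (
        spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "<broker>:9092")  # placeholder
            .option("subscribe", "cdc-topic")                    # placeholder
            .load()
            .select(col("key").cast("string"), col("value").cast("string"))
    )

@dlt.view
def silver_customer_cdc():
    # Reads from the table above; DLT infers the DAG edge and manages state/checkpoints.
    return dlt.read_stream("bronze_customer_cdc").filter(col("value").isNotNull())
```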
Cost and When to Choose DLT
There’s no official public cost benchmark doc yet, but based on customer feedback and usage patterns:
DLT is Ideal When:
- You want auto-managed recovery, testing, and data quality checks
- Your pipeline has multiple dependencies across batch and streaming tables
- You need built-in observability (event logs, quality metrics)
- You’re managing multiple pipelines and need better governance and automation
DBR Might Be Better When:
- You need custom sink logic, like writing directly to Azure SQL
- You want fine-grained control over cluster resources and scheduling
- Budget is tight and you want to avoid the per-pipeline cost model of DLT
For pricing, you can model both options using the Azure Pricing Calculator by comparing Databricks compute usage (standard vs. DLT pipelines) for your projected workloads.
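If you want a starting point for that comparison, here’s a back-of-the-envelope sketch; every rate in it is a placeholder you’d replace with the actual figures from the pricing calculator for your region, tier, and VM size:

```python
# Rough monthly cost model for an always-on streaming cluster.
# ALL rates below are placeholder assumptions, not real Azure/Databricks prices.
hours_per_month = 24 * 30
workers = 6                        # placeholder: your planned worker count

jobs_compute_dbu_rate = 0.30       # assumption: $/DBU for standard Jobs Compute
dlt_dbu_rate = 0.36                # assumption: $/DBU for a DLT pipeline
dbu_per_worker_hour = 2.0          # assumption: depends on the VM size you pick
vm_cost_per_hour = 0.50            # assumption: Azure VM price per worker-hour

def monthly_cost(dbu_rate: float) -> float:
    dbu_cost = workers * hours_per_month * dbu_per_worker_hour * dbu_rate
    vm_cost = workers * hours_per_month * vm_cost_per_hour
    return dbu_cost + vm_cost

print(f"DBR (Jobs Compute): ~${monthly_cost(jobs_compute_dbu_rate):,.0f}/month")
print(f"DLT pipeline:       ~${monthly_cost(dlt_dbu_rate):,.0f}/month")
```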
I hope this information helps. Please do let us know if you have any further queries.
Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.