Clarification on Autoscaling Limitations for Structured Streaming and Use of DLT in Azure Databricks

Janice Chi 140 Reputation points
2025-05-27T11:17:15.6166667+00:00

In our current streaming architecture on Azure Databricks, we run Structured Streaming workloads on auto-scaling clusters. However, we've noticed in the link below

https://learn.microsoft.com/en-us/azure/databricks/structured-streaming/production

that compute auto-scaling tends to scale up efficiently but does not always scale down as expected, especially during low-throughput periods or idle stages.

We came across a recommendation in Databricks documentation suggesting that DLT (Delta Live Tables) with enhanced autoscaling capabilities might handle streaming workloads more efficiently, particularly in terms of resource optimization and cost control during variable load conditions.

We want to confirm the following with the Microsoft Databricks engineering team:

Is the current limitation in compute auto-scaling down behavior for Structured Streaming clusters officially acknowledged on Azure Databricks?

What specific enhancements does DLT autoscaling offer compared to traditional job clusters running Structured Streaming?

Can you provide Microsoft-backed benchmarks or configuration guidelines where switching to DLT has demonstrated better cost or performance efficiency for similar streaming workloads?

  1. Are there any trade-offs or prerequisites we should be aware of when considering a migration from our current Structured Streaming pipelines to DLT, especially for enterprise-scale ingestion pipelines?
  2. Please compare DBR and DLT from a cost perspective and an operational-complexity perspective for near-real-time streaming.

We are evaluating this as part of our long-term architectural planning and would appreciate clear guidance on whether migrating to DLT would be a future-proof and cost-efficient move in an Azure environment.

Azure Databricks

1 answer

  1. Smaran Thoomu 24,110 Reputation points Microsoft External Staff Moderator
    2025-05-27T12:19:15.8433333+00:00

    Hi @Janice Chi

    Thank you for your detailed and thoughtful question regarding autoscaling behavior in Azure Databricks and your evaluation of Delta Live Tables (DLT) for structured streaming workloads.

    Scale-down limitations in Structured Streaming autoscaling

    You're right to observe this behavior - the scale-down limitations for autoscaling clusters running Structured Streaming are acknowledged in the Databricks documentation. Autoscaling works well during data surges but may retain executors during idle phases due to factors like:

    • Streaming state management and checkpointing requirements
    • The need to maintain active Spark sessions
    • Latency in decommissioning executors safely without data loss
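    For reference, a job-cluster definition for a Structured Streaming workload only exposes min/max worker bounds; there is no setting to make scale-down more aggressive. A sketch (the runtime version, node type, and worker counts below are placeholder values):

    ```json
    {
      "new_cluster": {
        "spark_version": "15.4.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "autoscale": {
          "min_workers": 2,
          "max_workers": 8
        }
      }
    }
    ```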

    DLT Autoscaling enhancements over DBR

    Delta Live Tables (DLT), especially when using the Enhanced Autoscaling feature on Photon-enabled clusters, offers improvements such as:

    • Aggressive downscaling during idle or low-volume periods
    • Dynamic scaling tied to load inference, reducing unnecessary compute spend
    • Built-in retry and recovery, which makes scaling decisions safer and less disruptive
    • Declarative pipeline definitions, which simplify optimization and auto-tuning

    Note: In contrast, traditional DBR-based streaming jobs need manual cluster tuning or over-provisioning for reliability, which often increases cost.
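    For context, enhanced autoscaling is enabled in the DLT pipeline settings by setting the autoscale mode on the pipeline's cluster. A sketch with placeholder worker bounds:

    ```json
    {
      "clusters": [
        {
          "label": "default",
          "autoscale": {
            "min_workers": 1,
            "max_workers": 5,
            "mode": "ENHANCED"
          }
        }
      ]
    }
    ```

    With `mode` set to `ENHANCED`, the pipeline can shed workers during idle or low-volume periods within the min/max bounds, rather than holding capacity the way a reactive job cluster tends to.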

    Cost and operational complexity comparison

    | Aspect | Structured Streaming on DBR | Delta Live Tables (DLT) |
    | --- | --- | --- |
    | Autoscaling | Reactive, slow to scale down | Enhanced, responsive to idle periods |
    | Cost optimization | Higher cost during idle phases | More cost-efficient under variable loads |
    | Management overhead | Manual handling of checkpoints, retries | Managed checkpoints, auto-retries, lineage |
    | Monitoring | Spark UI, custom logs | Built-in event logs, lineage, data quality metrics |
    | Dev/ops simplicity | Requires Spark expertise | Declarative SQL/Python-based configuration |
    | Best use case | Custom, fine-grained control needed | Enterprise pipelines needing reliability + scale |
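    To illustrate the declarative style referenced above, a minimal DLT table definition in Python might look like the following. This is a sketch: the source path and table name are placeholders, and the code only runs inside a DLT pipeline (where `dlt` and `spark` are provided by the runtime), not as a standalone script:

    ```python
    import dlt
    from pyspark.sql.functions import col

    @dlt.table(
        name="events_bronze",
        comment="Raw events ingested incrementally with Auto Loader."
    )
    def events_bronze():
        # Auto Loader incrementally picks up new files; DLT manages
        # checkpoints, retries, and (with enhanced autoscaling) cluster size.
        return (
            spark.readStream
                .format("cloudFiles")
                .option("cloudFiles.format", "json")
                .load("/mnt/raw/events")
                .where(col("event_type").isNotNull())
        )
    ```

    The equivalent hand-rolled Structured Streaming job would need its own checkpoint location, restart logic, and cluster sizing, which is the operational overhead the table above contrasts.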

    Benchmarks or configuration guidance

    While Microsoft does not currently publish official benchmarks comparing DBR vs. DLT for every workload scenario, many enterprise customers have reported cost savings and operational simplification by switching to DLT - particularly when dealing with micro-batch streaming, schema enforcement, and data quality checks.

    You can review the following resources for guidance:

    I hope this information helps. Please do let us know if you have any further queries.


    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

