Hi @Janice Chi
It looks like you're tackling some serious challenges with your Azure Databricks Structured Streaming pipeline - particularly around managing Kafka load spikes without using autoscaling.
You're absolutely right to avoid autoscaling in your setup - with stateful structured streaming and merge operations, autoscaling can introduce instability during scale-in, risking state loss and memory issues.
To address your key question - Databricks currently does not support dynamic cluster resizing (scale-up) for an active streaming job without stopping and restarting it. Structured Streaming jobs are tied to a fixed Spark session, so any cluster size change requires a restart.
Given that, here are some practical and safe approaches to handle unpredictable Kafka load spikes:
Recommended Strategy:
Two-Job Model (Baseline + Surge):
- Run a small, fixed-size baseline job continuously to handle regular load.
- When Kafka lag increases beyond a defined threshold (e.g., using consumer lag metrics), trigger a second, larger cluster job to temporarily consume and process the backlog.
- This lets you scale out temporarily without touching the original streaming job (a minimal sketch of a rate-capped baseline job is shown below).
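To make the baseline side concrete, here is a minimal sketch of a fixed-capacity ingestion job. The topic name, broker list, and paths are placeholders, and maxOffsetsPerTrigger is just one way to cap per-batch intake so the small cluster stays predictable:

```python
# Minimal sketch of a fixed-capacity baseline streaming job.
# Broker list, topic, and paths below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

baseline_stream = (
    spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "<broker1:9092,broker2:9092>")  # placeholder
        .option("subscribe", "orders-topic")                               # placeholder topic
        .option("startingOffsets", "latest")
        # Cap records per micro-batch so the fixed-size cluster never over-commits.
        .option("maxOffsetsPerTrigger", 500000)
        .load()
)

query = (
    baseline_stream.writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/checkpoints/orders_baseline")  # placeholder path
        .outputMode("append")
        .start("/mnt/delta/orders_bronze")                                 # placeholder path
)
```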
Kafka Lag Monitoring + Triggering Logic:
- Use tools like Azure Monitor, Prometheus, or custom logic to watch Kafka lag.
- When lag crosses a threshold, trigger the larger cluster job using the Databricks Jobs REST API, an Azure Function, or a Logic App (see the triggering sketch after this list).
- Once load returns to normal, stop the surge job gracefully.
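As one possible wiring for that trigger, the sketch below calls the Databricks Jobs run-now endpoint when lag exceeds a threshold. The host, token, job ID, and the get_consumer_lag() helper are all placeholders you would replace with your own monitoring source (Azure Monitor, Prometheus, or Kafka admin APIs):

```python
# Sketch: trigger a pre-defined "surge" Databricks job when Kafka consumer lag exceeds a threshold.
# DATABRICKS_HOST, DATABRICKS_TOKEN, SURGE_JOB_ID, and get_consumer_lag() are placeholders.
import requests

DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"
DATABRICKS_TOKEN = "<personal-access-token-or-managed-identity-token>"
SURGE_JOB_ID = 123456          # the job configured with the larger cluster
LAG_THRESHOLD = 1_000_000      # tune to your workload


def get_consumer_lag() -> int:
    """Placeholder: return total consumer-group lag from your monitoring source."""
    raise NotImplementedError


def trigger_surge_job_if_needed() -> None:
    lag = get_consumer_lag()
    if lag > LAG_THRESHOLD:
        resp = requests.post(
            f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
            headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
            json={"job_id": SURGE_JOB_ID},
            timeout=30,
        )
        resp.raise_for_status()
        print(f"Lag {lag} exceeded threshold; surge run_id={resp.json()['run_id']}")
    else:
        print(f"Lag {lag} within threshold; no action taken.")
```

The same logic can be scheduled in an Azure Function or a small notebook job; the key point is that it only starts the surge job and never touches the running baseline query.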
Checkpoint Consistency:
- If a cluster scale-up is unavoidable, stop the job cleanly and reuse the exact same checkpoint location when you restart. The restarted query then resumes from the last committed offsets and state, so you avoid data loss or reprocessing (see the sketch below).
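For illustration, a clean stop-and-restart against the same checkpoint might look like the sketch below; enriched_stream stands in for your existing transformed streaming DataFrame and the paths are placeholders:

```python
# Sketch: graceful stop before resizing, then restart against the SAME checkpoint location.
# `enriched_stream` stands in for your transformed streaming DataFrame; paths are placeholders.
CHECKPOINT_PATH = "/mnt/checkpoints/orders_merge"   # must stay identical across the restart

# 1. Before resizing: stop the running query cleanly on its StreamingQuery handle.
#    query.stop()

# 2. After the cluster is resized: restart with the exact same checkpoint path so
#    offsets and state are picked up where the previous run left off.
query = (
    enriched_stream.writeStream
        .format("delta")
        .option("checkpointLocation", CHECKPOINT_PATH)
        .outputMode("update")
        .start("/mnt/delta/orders_silver")
)
```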
Delta Performance Tuning:
- Since you're using merge/deduplication, ensure your Delta tables are partitioned appropriately, Z-ordered on the merge keys, and tuned at the micro-batch level (a fixed trigger interval via trigger(), plus maxOffsetsPerTrigger for the Kafka source or maxFilesPerTrigger for file sources) to control throughput efficiently; a sketch follows this list.
- Consider splitting your logic: use streaming for ingestion and defer heavier merge logic to periodic batch jobs if needed.
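Putting those knobs together, here is an illustrative sketch of a throughput-controlled micro-batch feeding a Delta MERGE via foreachBatch; the table paths, merge key (order_id), and the kafka_stream DataFrame are placeholders for your own names:

```python
# Sketch: throughput-controlled micro-batches feeding a Delta MERGE.
# `kafka_stream`, paths, and the merge key are placeholders; `spark` is the notebook session.
from delta.tables import DeltaTable


def upsert_batch(batch_df, batch_id):
    """Deduplicate the micro-batch and MERGE it into the target Delta table."""
    deduped = batch_df.dropDuplicates(["order_id"])                 # placeholder merge key
    target = DeltaTable.forPath(spark, "/mnt/delta/orders_silver")  # placeholder path
    (
        target.alias("t")
        .merge(deduped.alias("s"), "t.order_id = s.order_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )


query = (
    kafka_stream.writeStream
        .foreachBatch(upsert_batch)
        .option("checkpointLocation", "/mnt/checkpoints/orders_merge")  # placeholder path
        .trigger(processingTime="1 minute")   # fixed trigger interval to smooth throughput
        .start()
)
```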
While autoscaling is not advisable in your case, a two-job model with Kafka-lag-based job triggering gives you the flexibility to handle dynamic spikes without compromising reliability or overpaying during idle periods. There’s no supported way today to dynamically increase cluster size without restarting the streaming job.
For more info, check out the articles I shared in earlier threads with you:
- https://learn.microsoft.com/en-us/azure/databricks/structured-streaming/production
- https://community.databricks.com/t5/data-engineering/exploring-additional-cost-saving-options-for-structured/td-p/22770
I hope this information helps. Please do let us know if you have any further queries.
Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.