Best Practices for Handling Kafka Load Spikes in Structured Streaming Without Autoscaling

Janice Chi 140 Reputation points
2025-06-15T11:45:19.4633333+00:00

❓Question for Microsoft/Databricks Team:

We are working on a stateful real-time CDC ingestion pipeline using Azure Databricks Structured Streaming, where:

Kafka (CDC topics from on-prem DB2 via IBM CDC) is our source.

Azure Databricks reads these topics using readStream and writes to Delta tables with merge/deduplication logic (a simplified sketch of this pattern is shown after this list).

Our incoming load can vary dynamically between 3,000 and 30,000 events per second, with no fixed schedule or advance indication of peak load times.
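
For context, a minimal sketch of this pattern (all topic, table, schema, and path names below are illustrative placeholders, not our production values):

```python
# Simplified sketch only: read CDC events from Kafka, deduplicate per key within each
# micro-batch, and MERGE into the Delta target via foreachBatch.
# `spark` is the ambient SparkSession available on Databricks.
from delta.tables import DeltaTable
from pyspark.sql import functions as F
from pyspark.sql.window import Window

cdc_schema = "pk STRING, change_ts TIMESTAMP, op STRING, payload STRING"  # placeholder schema

def upsert_batch(batch_df, batch_id):
    # Keep only the latest change per key inside the micro-batch, then upsert.
    latest = Window.partitionBy("pk").orderBy(F.col("change_ts").desc())
    deduped = (batch_df
               .withColumn("rn", F.row_number().over(latest))
               .filter("rn = 1")
               .drop("rn"))
    (DeltaTable.forName(spark, "cdc.target_table").alias("t")
        .merge(deduped.alias("s"), "t.pk = s.pk")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "<broker:9093>")
       .option("subscribe", "db2.cdc.topic")
       .load())

parsed = (raw.select(F.from_json(F.col("value").cast("string"), cdc_schema).alias("r"))
             .select("r.*"))

(parsed.writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/mnt/checkpoints/cdc_target")
    .start())
```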


🔍 Problem:

As per best practices, we have disabled autoscaling due to structured streaming stateful processing risks (e.g., state loss, memory inconsistency, Spark driver instability during scale-in).

However, this creates a challenge:

If we fix a large cluster size (e.g., 8 nodes) to handle 30K EPS, we unnecessarily incur high costs during low-volume periods (e.g., 3K EPS).

If we fix a small cluster for cost-efficiency, it will fail or lag during peak load.


❓We would like to understand:

  1. Is there any officially supported or recommended pattern to dynamically switch to a larger cluster without autoscaling and without stopping the streaming job?
  2. If this is not possible with the current architecture, what is the safest and recommended approach?

🔧 Key Constraints:

We cannot use autoscaling due to streaming + merge requirements.

We cannot pre-know the spike timings.

We must ensure zero data loss, idempotent merge, and smooth cluster switch with minimum latency.


1 answer

  1. Smaran Thoomu 24,110 Reputation points Microsoft External Staff Moderator
    2025-06-16T01:30:37.5533333+00:00

    Hi @Janice Chi
    It looks like you're tackling some serious challenges with your Azure Databricks Structured Streaming pipeline - particularly around managing Kafka load spikes without using autoscaling.

    You're absolutely right to avoid autoscaling in your setup - with stateful structured streaming and merge operations, autoscaling can introduce instability during scale-in, risking state loss and memory issues.

    To address your key question - Databricks currently does not support dynamic cluster resizing (scale-up) for an active streaming job without stopping and restarting it. Structured Streaming jobs are tied to a fixed Spark session, so any cluster size change requires a restart.

    Given that, here are some practical and safe approaches to handle unpredictable Kafka load spikes:

    Recommended Strategy:

    Two-Job Model (Baseline + Surge):

    • Run a small, fixed-size baseline job continuously to handle regular load.
    • When Kafka lag increases beyond a defined threshold (e.g., using consumer lag metrics), trigger a second, larger cluster job to temporarily consume and process the backlog.
    • This lets you scale out temporarily without touching the original streaming job (one possible shape of such a surge job is sketched below).
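
    For illustration, one possible shape of the surge job - assuming the backlog's starting offsets are captured by your lag monitoring, and that upsert_batch and cdc_schema are the same parsing/merge helpers the baseline job already uses (all names, offsets, and paths are placeholders):

```python
# Hypothetical surge job, launched on demand when lag crosses the threshold.
# It reads from explicitly supplied starting offsets and uses Trigger.AvailableNow,
# so it drains the backlog that existed at start time and then stops on its own.
# Any overlap with the baseline job is harmless because the MERGE is idempotent.
from pyspark.sql import functions as F

backlog_start = '{"db2.cdc.topic":{"0":120000,"1":118500}}'  # example offsets from monitoring

surge = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "<broker:9093>")
         .option("subscribe", "db2.cdc.topic")
         .option("startingOffsets", backlog_start)
         .load())

parsed = (surge.select(F.from_json(F.col("value").cast("string"), cdc_schema).alias("r"))
               .select("r.*"))

(parsed.writeStream
    .foreachBatch(upsert_batch)                                   # same idempotent MERGE as baseline
    .option("checkpointLocation", "/mnt/checkpoints/cdc_surge")   # separate checkpoint from baseline
    .trigger(availableNow=True)                                   # process pending data, then stop
    .start())
```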

    Kafka Lag Monitoring + Triggering Logic:

    • Use tools like Azure Monitor, Prometheus, or custom logic to watch Kafka lag.
    • When lag crosses a threshold, trigger the larger cluster job using Databricks REST API, Azure Function, or Logic App.
    • Once load returns to normal, stop the surge job gracefully. (A minimal sketch of the lag check and trigger call follows below.)
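
    A minimal sketch of the triggering piece - the job ID, thresholds, topic, and endpoints are placeholders, and it assumes consumer-group offsets are visible for the topic (for example from a dedicated monitoring consumer); the same threshold check could equally be driven from Azure Monitor or the streaming query's progress metrics:

```python
# Hypothetical lag check + "run now" call, e.g. executed from a timer-based Azure Function.
import requests
from kafka import KafkaAdminClient, KafkaConsumer

LAG_THRESHOLD = 500_000   # backlog (events) that justifies launching the surge cluster
SURGE_JOB_ID = 123456     # Databricks job that wraps the larger surge cluster

def total_lag(bootstrap, group_id, topic):
    """Sum of (latest offset - committed offset) over the topic's partitions."""
    admin = KafkaAdminClient(bootstrap_servers=bootstrap)
    consumer = KafkaConsumer(bootstrap_servers=bootstrap)
    lag = 0
    for tp, meta in admin.list_consumer_group_offsets(group_id).items():
        if tp.topic != topic:
            continue
        end = consumer.end_offsets([tp])[tp]
        lag += max(0, end - meta.offset)
    return lag

def trigger_surge(workspace_url, token):
    """Start the pre-defined surge job through the Databricks Jobs 2.1 run-now API."""
    resp = requests.post(
        f"{workspace_url}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {token}"},
        json={"job_id": SURGE_JOB_ID},
        timeout=30,
    )
    resp.raise_for_status()

if total_lag("<broker:9093>", "cdc-monitoring-group", "db2.cdc.topic") > LAG_THRESHOLD:
    trigger_surge("https://<workspace>.azuredatabricks.net", "<access-token>")
```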

    Checkpoint Consistency:

    • If a cluster scale-up is unavoidable, ensure you're stopping the job cleanly and reusing the exact same checkpoint location. This prevents data loss, and any replay of an uncommitted micro-batch is absorbed by your idempotent merge (see the sketch below).
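
    A sketch of what a controlled stop and restart could look like - the essential point is that the restarted job reuses the identical checkpointLocation (the path below is a placeholder):

```python
# Controlled stop before switching to the larger cluster. Offsets already committed to the
# checkpoint are never reprocessed; an interrupted, uncommitted micro-batch is simply
# re-run on restart, which the idempotent MERGE absorbs.
CHECKPOINT = "/mnt/checkpoints/cdc_target"   # must not change across restarts

for q in spark.streams.active:
    q.stop()                 # request shutdown; committed progress stays in the checkpoint
    q.awaitTermination()     # block until the query has fully terminated

# On the new, larger cluster the identical writeStream code is started again, e.g.:
# (parsed.writeStream
#     .foreachBatch(upsert_batch)
#     .option("checkpointLocation", CHECKPOINT)   # unchanged path -> resumes from committed offsets
#     .start())
```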

    Delta Performance Tuning:

    • Since you're using merge/deduplication, ensure your Delta tables are partitioned appropriately, Z-ordered on the merge keys, and tuned at the micro-batch level (a trigger interval plus Kafka-source rate limiting via maxOffsetsPerTrigger) to control throughput efficiently; illustrative settings are sketched after this list.
    • Consider splitting your logic - use streaming for ingestion and defer heavier merge logic to periodic batch jobs if needed.
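
    Illustrative settings for those knobs (values are placeholders to be sized against your peak; upsert_batch and cdc_schema are the helpers from the baseline job):

```python
# Rate-limit the Kafka source per micro-batch and run on a fixed trigger cadence so each
# MERGE stays a predictable size; compact and Z-order the merge target periodically.
from pyspark.sql import functions as F

throttled = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "<broker:9093>")
             .option("subscribe", "db2.cdc.topic")
             .option("maxOffsetsPerTrigger", 600000)   # cap records pulled per micro-batch
             .load())

parsed = (throttled.select(F.from_json(F.col("value").cast("string"), cdc_schema).alias("r"))
                   .select("r.*"))

(parsed.writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/mnt/checkpoints/cdc_target")
    .trigger(processingTime="30 seconds")              # fixed cadence for the merge workload
    .start())

# Separately (e.g. as a scheduled batch job), maintain the merge target:
spark.sql("OPTIMIZE cdc.target_table ZORDER BY (pk)")
```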

    While autoscaling is not advisable in your case, a two-job model with Kafka-lag-based job triggering gives you the flexibility to handle dynamic spikes without compromising reliability or overpaying during idle periods. There’s no supported way today to dynamically increase cluster size without restarting the streaming job.
    For more info, check out the articles I shared in the earlier threads with you.

    I hope this information helps. Please do let us know if you have any further queries.


    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

