Guidance on Real-Time Streaming with Reconciliation and Domain Cutover Handling – Kafka → Databricks → Azure SQL Hyperscale

Janice Chi 320 Reputation points
2025-06-27T05:41:37.3766667+00:00

We are currently designing a real-time ingestion and reconciliation architecture using Databricks Structured Streaming that reads IBM DB2 CDC messages via Kafka and writes them to Azure SQL Hyperscale. The context is a healthcare data migration initiative involving four interdependent domains.

Architecture Background:

  • Source: IBM DB2 (on-prem) with InfoSphere CDC pushing data into Kafka.

  • Ingestion tool: Databricks Structured Streaming (Scala/PySpark).

  • Sink: Azure SQL Hyperscale (main table).

Cutover Behavior:

  • Catch-up completes per domain (e.g., Domain A).

  • Domain A becomes live in Hyperscale only, while Domains B, C, and D remain active in both source and target during streaming.

We would appreciate your recommendations on the following technical design challenges:


1. **Trigger design for Structured Streaming.** What is the recommended practice for defining micro-batch triggers in Databricks Structured Streaming under production workloads?

   • Should the trigger interval be time-based (e.g., 1 minute, 2 minutes) or based on a record-count threshold (if supported)?
   • Which approach offers the most balanced trade-off between **latency**, **throughput**, and **cluster stability**, especially in healthcare-grade workloads with JSON messages?

2. **Performing reconciliation before write to Hyperscale.** For each micro-batch, we must perform reconciliation (e.g., row counts or hash checks) before writing to Azure SQL Hyperscale.

   • What methods are recommended to ensure this **reconciliation happens within the same batch** without introducing latency?
   • Is it appropriate to implement this using `foreachBatch`, and if so, what considerations should be followed to avoid blocking ingestion?

3. **Handling CDC data overlap during domain cutover.** After a domain cutover (e.g., Domain A), old records may still arrive in Kafka due to lag or late CDC entries.

   • How should we design our streaming pipeline to **prevent re-inserting previously merged records** from the catch-up load?
   • Would it be appropriate to use techniques such as **load-timestamp filtering**, **watermarking**, or **deduplication by primary key + timestamp**?
   • Is there a need to pause or restart the streaming job during cutover, or can it handle this logic dynamically?

4. **Data write into Hyperscale with minimal latency.** We need to ensure that, after reconciliation passes, each validated batch is written to Hyperscale with minimal delay.

   • What is the best practice for writing from Databricks to Azure SQL Hyperscale in streaming mode?
   • Should we consider a two-step approach (e.g., insert into staging + merge to main), or insert directly into the main table if transformations and merges are already applied upstream?


Accepted answer

  Venkat Reddy Navari 3,630 Reputation points Microsoft External Staff Moderator
  2025-06-27T07:48:42.6066667+00:00

    Hi @Janice Chi,

    Trigger Design in Databricks Structured Streaming: In most production setups, I’ve had the best results using a time-based trigger, like `trigger(processingTime="1 minute")`. It gives a good trade-off between low latency and stability, especially when you're dealing with JSON payloads from CDC like you are.

    Record-count-based triggers aren’t natively supported in Structured Streaming (though the Kafka source's `maxOffsetsPerTrigger` option can cap how many records land in each micro-batch), and they tend to be harder to control at scale anyway. A one-minute interval gives your cluster time to process batches predictably, and it plays nicely with autoscaling if you’re using it.

    If you notice lag or batch processing getting spiky, you can always fine-tune this, but a minute is a solid place to start.

    Reconciliation Before Writing to Azure SQL: Yes, `foreachBatch` is the right place to do in-batch reconciliation. Inside that function, you can validate counts, do hash checks, or even compare expected vs. actual rows before pushing to Azure SQL.

    Keep that logic efficient. Try to avoid joins or lookups that could slow the batch down. If you log audit results (e.g., counts, checksums) into a side table or log store, it’ll help you track issues without blocking the stream.

    Just make sure whatever reconciliation you do stays within the same micro-batch; that way you don’t break consistency when writing to Hyperscale.
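    To make the in-batch check concrete, here is a minimal sketch in plain Python of the kind of count-plus-checksum reconciliation you might run inside a `foreachBatch` handler. In the real pipeline you would compute the count and hashes on the Spark DataFrame itself; `reconcile_batch` and the audit-record shape below are illustrative names, not an existing API.

```python
import hashlib

def reconcile_batch(rows, expected_count):
    """Validate a micro-batch before it is written to the sink.

    `rows` stands in for the micro-batch contents (dicts here; a Spark
    DataFrame in practice). The checksum is an order-independent XOR of
    per-row hashes, so it stays stable even if partitions arrive in a
    different order across retries.
    """
    actual_count = len(rows)
    checksum = 0
    for row in rows:
        digest = hashlib.sha256(repr(sorted(row.items())).encode()).digest()
        checksum ^= int.from_bytes(digest[:8], "big")
    passed = actual_count == expected_count
    # The returned dict is what you'd log to an audit/side table.
    return {"count": actual_count, "checksum": checksum, "passed": passed}

# Example: two CDC rows, and the source reported two rows for this batch.
batch = [{"id": 1, "val": "a"}, {"id": 2, "val": "b"}]
audit = reconcile_batch(batch, expected_count=2)
print(audit["passed"])  # True
```

    Logging the returned audit record to a side table (rather than gating on it synchronously with heavy joins) keeps the check from blocking ingestion.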

    Handling Late CDC Records After Domain Cutover: This is a common headache, especially when Domain A has been cut over but Kafka is still delivering lagged messages.

    Best options here:

    • Deduplicate by primary key + timestamp before writing.
    • Watermarking can help drop messages that are too late, though you need to balance latency tolerance.
    • Add cutover logic to your stream: reference a "domain state" table (e.g., Domain A = live) and drop or ignore any old records that shouldn’t be processed anymore.

    You don’t need to pause the stream during cutover. It’s better to handle this through logic: build your stream to dynamically check the cutover state and filter accordingly.
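    The dedup-by-primary-key-plus-timestamp and cutover-filter ideas above can be sketched as follows. This is plain Python for clarity (in the pipeline you'd express the same logic with DataFrame operations, e.g. a window over the key); the record shape, `cutover_ts`, and `live_domains` are assumptions for illustration.

```python
from datetime import datetime

def filter_and_dedupe(records, cutover_ts, live_domains):
    """Drop late CDC records for already-live domains and keep only the
    newest record per primary key.

    Each record is a dict with 'pk', 'domain', and 'ts' (event timestamp).
    Records for a live domain stamped at or before the cutover timestamp
    are assumed to have been applied by the catch-up load, so they are
    skipped rather than re-inserted.
    """
    latest = {}
    for rec in records:
        if rec["domain"] in live_domains and rec["ts"] <= cutover_ts:
            continue  # already merged during catch-up; don't re-insert
        prev = latest.get(rec["pk"])
        if prev is None or rec["ts"] > prev["ts"]:
            latest[rec["pk"]] = rec  # newer event for this key wins
    return list(latest.values())

cutover = datetime(2025, 6, 1)
records = [
    {"pk": 1, "domain": "A", "ts": datetime(2025, 5, 30)},  # late, pre-cutover: drop
    {"pk": 2, "domain": "A", "ts": datetime(2025, 6, 2)},   # post-cutover: keep
    {"pk": 2, "domain": "A", "ts": datetime(2025, 6, 3)},   # newer duplicate wins
    {"pk": 3, "domain": "B", "ts": datetime(2025, 5, 30)},  # B not yet live: keep
]
kept = filter_and_dedupe(records, cutover, live_domains={"A"})
print(sorted(r["pk"] for r in kept))  # [2, 3]
```

    The `live_domains` set plays the role of the "domain state" table mentioned above; flipping a domain to live changes the filter without restarting the stream.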

    Writing to Azure SQL Hyperscale with Minimal Latency: In streaming scenarios, writing directly into your main tables can be risky, especially if reconciliation or merge logic is still needed. A safer and more scalable pattern is:

    • Write to a staging table.
    • Run a stored proc or merge operation into your target tables.

    This way, you can retry failed batches, audit changes, or even roll back bad loads more cleanly.

    If you do choose to write directly into the main tables, just make sure you’ve already applied all your business rules, transformations, and deduplication upstream. Also tune your JDBC settings (batch size, retries, etc.) so you’re not bottlenecked at write time.
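    As a sketch of the staging-plus-merge step, the stored procedure invoked after each batch write might run a T-SQL `MERGE` like the one below. The `stg`/`dbo` table and column names are placeholders, and the "newer event wins" condition is one possible idempotency rule, not a prescribed one.

```sql
-- Assumption: stg.PatientEvents was just loaded by the Spark batch write;
-- all object and column names here are illustrative.
MERGE dbo.PatientEvents AS tgt
USING stg.PatientEvents AS src
    ON tgt.PatientId = src.PatientId
WHEN MATCHED AND src.EventTs > tgt.EventTs THEN
    -- Only apply the change if it is newer than what we already hold,
    -- which also makes replayed/late CDC rows harmless.
    UPDATE SET tgt.Payload = src.Payload,
               tgt.EventTs = src.EventTs
WHEN NOT MATCHED BY TARGET THEN
    INSERT (PatientId, Payload, EventTs)
    VALUES (src.PatientId, src.Payload, src.EventTs);

TRUNCATE TABLE stg.PatientEvents;  -- clear staging for the next micro-batch
```

    Because the merge is conditional on the event timestamp, retrying a failed batch simply re-applies rows that are already current, which is what makes the two-step pattern safe to retry and audit.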


    I hope this information helps. Please do let us know if you have any further queries.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

