Recon in NRT (Near Real Time)

Janice Chi 100 Reputation points
2025-06-04T14:10:52.1+00:00

We are currently implementing a large-scale data migration project involving three distinct phases:

  1. One-Time Historical Migration from IBM DB2 (snapshot-based) to Azure ADLS Gen2 and finally into Azure SQL Hyperscale using ADF and Databricks.
  2. CDC (Change Data Capture) using log-based CDC via Kafka from DB2 to Databricks, followed by ingestion into Hyperscale.
  3. Real-Time Streaming, where committed transactions from DB2 are pushed via Kafka to Databricks and then streamed directly into Azure SQL Hyperscale.

We have finalized reconciliation (recon) approaches for the first two stages:

  • For one-time historical migration, recon is done after the complete ingestion, using row count and hash value comparison between DB2 and Hyperscale.

  • For catch-up CDC, recon is again post-ingestion, where each batch or logical window is validated after it fully lands in Hyperscale. Is this the right approach, and why?

However, we need your expert recommendation for the real-time streaming phase, which is now in design.

Our Streaming Architecture:

Source: DB2 (via IBM CDC)

Streaming Channel: Kafka on GCP

Processing Engine: Databricks Structured Streaming (Delta Bronze/Silver)

Sink: Azure SQL Hyperscale

Key Questions:

  1. What is the recommended approach for recon in real-time streaming pipelines, given that data flows continuously?
    • We want to perform on-the-fly recon, but recon will only be technically possible after data lands in Hyperscale. Is this acceptable? If yes, how should we do the recon (please explain the steps), or should some validation be done earlier in the pipeline?
  2. If a recon check fails in real-time streaming, what are the best practices to:
    • Identify failed rows/partitions
    • Handle retries or rollbacks in Hyperscale
    • Avoid data duplication or inconsistent state
  3. Are there built-in or Microsoft-recommended patterns (e.g., watermarking, offset tracking, Delta-to-Hyperscale integrity validation) to ensure reconciliation remains robust yet cost-effective in high-velocity streams?
  4. Can you share any architecture references or best practices for implementing real-time reconciliation pipelines that involve streaming data from Kafka to Hyperscale?
    

Our goal is to ensure row-level and column-level consistency between the DB2 source and Hyperscale target, with minimal latency and full auditability, even during real-time data movement.

Azure Databricks

Accepted answer
  J N S S Kasyap 3,300 Reputation points Microsoft External Staff Moderator
    2025-06-04T15:28:45.2866667+00:00

    Hi @Janice Chi

    For catch-up CDC, recon is again post-ingestion, where each batch or logical window is validated after it fully lands in Hyperscale. Is this the right approach, and why?

      Yes, validating each batch or logical window after it lands in Azure SQL Hyperscale for catch-up CDC is the correct and recommended approach, and here's why:

    Post-ingestion validation is effective, deterministic, and aligns with logical data boundaries emitted by CDC. It reduces the risk of checking incomplete or in-flight data, and allows for clean rollback/retry if mismatches are found. It's a production-grade, scalable approach widely used in real-world data platforms. 
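
    As a concrete illustration of that post-ingestion batch check, here is a minimal PySpark sketch. It assumes the catch-up CDC batch is staged in a Delta Silver table, that the same logical window can be read back from Hyperscale over JDBC, and that a batch_id column exists on both sides; every table name, key/data column, and connection setting is an illustrative assumption, not part of your pipeline.

```python
# Minimal sketch (Databricks notebook, PySpark). Table names, key/data columns,
# the batch_id column, and the JDBC settings are illustrative assumptions.
from pyspark.sql import functions as F

BATCH_ID = "catchup_2025_06_04_017"            # hypothetical logical window id
KEY_COLS = ["customer_id"]                     # hypothetical primary key
DATA_COLS = ["customer_id", "name", "balance", "updated_at"]

def with_row_hash(df):
    # Hash a canonical string form of the tracked columns. Both sides are hashed
    # by Spark, so there is no cross-engine hash mismatch to worry about.
    canon = [F.coalesce(F.col(c).cast("string"), F.lit("<null>")) for c in DATA_COLS]
    return df.withColumn("row_hash", F.sha2(F.concat_ws("||", *canon), 256))

source = with_row_hash(
    spark.table("silver.customers_cdc").where(F.col("batch_id") == BATCH_ID)
)

target = with_row_hash(
    spark.read.format("jdbc")
        .option("url", "jdbc:sqlserver://<server>.database.windows.net;database=<db>")
        .option("dbtable", f"(SELECT * FROM dbo.customers WHERE batch_id = '{BATCH_ID}') t")
        .option("user", dbutils.secrets.get("recon", "sql-user"))
        .option("password", dbutils.secrets.get("recon", "sql-password"))
        .load()
)

# Row-count check, then row-level check: keys missing on either side or with
# differing hashes are the rows to flag for replay.
counts_match = source.count() == target.count()
mismatches = (
    source.select(*KEY_COLS, "row_hash")
          .join(target.select(*KEY_COLS, F.col("row_hash").alias("target_hash")),
                on=KEY_COLS, how="full_outer")
          .where(F.col("row_hash").isNull()
                 | F.col("target_hash").isNull()
                 | (F.col("row_hash") != F.col("target_hash")))
)
print(counts_match, mismatches.count())
```

    Row counts give the quick pass/fail signal; the full-outer join gives the row-level drill-down needed for targeted replay.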

    What is the recommended approach for recon in real-time streaming pipelines, given that data flows continuously?

    Real-time recon is best handled using micro-batch post-ingestion validation, while ensuring traceability via metadata. 

    Key Practices:

    1. Carry Kafka offset metadata (offset, partition, topic, commit timestamp) with every message ingested into the Databricks Bronze tables (see the sketch after this list). 
    2. Process via Databricks Structured Streaming using the Delta architecture: Bronze → Silver → Hyperscale. 
    3. Store micro-batch stats: 
      • min_offset, max_offset 
      • record_count, record_hash 
      • batch_id, timestamp 
    4. After the sink writes to Azure SQL Hyperscale, validate by comparing these metrics against the target table. 
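
    A minimal sketch of practices 1–2 (Kafka → Bronze with offset metadata preserved), assuming a Databricks notebook; the broker, topic, checkpoint path, and table names are placeholders.

```python
# Minimal sketch (PySpark on Databricks): land each Kafka message in a Bronze Delta
# table together with the offset metadata needed later for recon and replay.
# Broker, topic, checkpoint path, and table names are illustrative assumptions.
from pyspark.sql import functions as F

bronze = (
    spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "<gcp-broker>:9092")
        .option("subscribe", "db2.customers.cdc")             # hypothetical topic
        .option("startingOffsets", "latest")
        .load()
        .select(
            F.col("key").cast("string").alias("msg_key"),
            F.col("value").cast("string").alias("payload"),   # raw CDC record
            "topic", "partition", "offset",                   # recon/replay metadata
            F.col("timestamp").alias("kafka_commit_ts"),
        )
)

(
    bronze.writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/recon/_checkpoints/bronze_customers_cdc")
        .toTable("bronze.customers_cdc")
)
```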

    Is recon after landing in Hyperscale acceptable? If yes, how should we do it? 

    It’s both acceptable and recommended. Real-time pipelines are designed for continuous data flow, and pausing them for mid-stream validation is impractical. 

    How to perform Recon:

    1. Track Kafka metadata and hashes in Databricks control tables. 
    2. Use foreachBatch to write each micro-batch to Hyperscale and log offsets + record count (sketched after this list). 
    3. Run a validation SQL job post-ingestion that: 
      a. Joins the Hyperscale table with the control table. 
      b. Compares row counts and hash values. 
      c. Logs the outcome to a Recon Table. 
    4. Send an alert on mismatch and flag the offset range as failed. 
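
    A minimal sketch of steps 2–3: the foreachBatch sink writes each micro-batch to Hyperscale over JDBC and logs its offset range, row count, and aggregate hash to a Delta control table. It assumes the Silver stream already carries the Kafka offset and a per-row row_hash computed upstream; connection settings, secret scopes, and all table names are illustrative.

```python
# Minimal sketch (PySpark): foreachBatch writes each micro-batch to Hyperscale over
# JDBC and logs its offset range, row count, and aggregate hash to a Delta control
# table that the post-ingestion validation job joins against the target.
from pyspark.sql import functions as F

JDBC_URL = "jdbc:sqlserver://<server>.database.windows.net;database=<db>"
JDBC_PROPS = {
    "user": dbutils.secrets.get("recon", "sql-user"),
    "password": dbutils.secrets.get("recon", "sql-password"),
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

def write_and_log(batch_df, batch_id):
    batch_df.persist()

    # 1. Write the micro-batch to Hyperscale. Plain append shown for brevity;
    #    an idempotent dedupe/stage/MERGE variant is sketched later in this answer.
    (batch_df.drop("topic", "partition", "offset")
             .write.mode("append")
             .jdbc(JDBC_URL, "dbo.customers_stream", properties=JDBC_PROPS))

    # 2. Log recon metadata for this micro-batch into the Delta control table.
    #    collect_list is fine for modest micro-batches; swap in a cheaper aggregate
    #    for very large ones.
    stats = (
        batch_df.agg(
            F.min("offset").alias("min_offset"),
            F.max("offset").alias("max_offset"),
            F.count(F.lit(1)).alias("record_count"),
            F.sha2(F.concat_ws("", F.sort_array(F.collect_list("row_hash"))), 256)
                .alias("record_hash"),
        )
        .withColumn("batch_id", F.lit(batch_id))
        .withColumn("logged_at", F.current_timestamp())
    )
    stats.write.mode("append").saveAsTable("recon.stream_control")

    batch_df.unpersist()

(
    spark.readStream.table("silver.customers_cdc")
        .writeStream
        .foreachBatch(write_and_log)
        .option("checkpointLocation", "/mnt/recon/_checkpoints/silver_to_hyperscale")
        .start()
)
```

    The post-ingestion validation job (step 3) then joins recon.stream_control against per-batch counts and hashes computed on the Hyperscale side and records the pass/fail outcome in the Recon Table.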

    What if a recon check fails? How to: 

    1. Identify the failed rows or partitions: 
      • Each micro-batch should log min_offset, max_offset, and batch_id into a control table. 
      • If the validation fails, the affected offset range or primary keys are marked as dirty. 
      • You can replay them from Kafka using the stored offsets. 
    2. Handle retries or rollbacks in Hyperscale: To ensure idempotent writes to Azure SQL Hyperscale during real-time streaming, implement logic that prevents duplicate or inconsistent data. Use MERGE INTO statements to update or insert records based on a unique key (such as a primary key or composite business key). Additionally, enforce unique constraints and optionally apply hashing to detect changes at the row level. In case of reconciliation failures, trigger automated retry workflows using Azure Data Factory (ADF), allowing failed batches to be reprocessed without impacting already-ingested data. This approach ensures consistency, traceability, and resilience in your streaming pipeline. 
    3. Avoid duplication and inconsistent state:
      In the Silver layer of Databricks, use the dropDuplicates() function on a combination of the primary key and commit timestamp to eliminate duplicate records and ensure data accuracy before loading into the final destination. Maintain a staging table before writing to Azure SQL Hyperscale to provide a rollback point in case of failures or validation mismatches, enhancing data reliability and control in the streaming pipeline. A sketch of this dedupe, stage, and MERGE pattern follows. 
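
    A minimal sketch of that dedupe, stage, and MERGE pattern, suitable for use inside the same foreachBatch; pyodbc, the ODBC connection string, and every table and column name are assumptions rather than a prescribed Microsoft pattern.

```python
# Minimal sketch: de-duplicate in Silver, land the micro-batch in a Hyperscale
# staging table, then MERGE into the final table on the business key so that
# replays of the same batch cannot duplicate rows. All names are assumptions.
import pyodbc

JDBC_URL = "jdbc:sqlserver://<server>.database.windows.net;database=<db>"
JDBC_PROPS = {"user": "<user>", "password": "<password>",
              "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}

MERGE_SQL = """
MERGE dbo.customers AS tgt
USING dbo.customers_stage AS src
    ON tgt.customer_id = src.customer_id
WHEN MATCHED AND tgt.row_hash <> src.row_hash THEN
    UPDATE SET tgt.name = src.name,
               tgt.balance = src.balance,
               tgt.updated_at = src.updated_at,
               tgt.row_hash = src.row_hash
WHEN NOT MATCHED BY TARGET THEN
    INSERT (customer_id, name, balance, updated_at, row_hash)
    VALUES (src.customer_id, src.name, src.balance, src.updated_at, src.row_hash);
"""

def upsert_batch(batch_df, batch_id):
    # Drop replayed duplicates before touching Hyperscale.
    deduped = batch_df.dropDuplicates(["customer_id", "commit_ts"])

    # Stage the micro-batch; truncate-overwrite makes a retry of the batch harmless.
    (deduped.write.mode("overwrite")
            .option("truncate", "true")
            .jdbc(JDBC_URL, "dbo.customers_stage", properties=JDBC_PROPS))

    # Idempotent upsert: re-running the same batch leaves the target unchanged.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};"
        "SERVER=<server>.database.windows.net;DATABASE=<db>;UID=<user>;PWD=<password>"
    )
    try:
        conn.execute(MERGE_SQL)
        conn.commit()
    finally:
        conn.close()
```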

    Architecture reference or best practices for recon in Kafka → Databricks → Hyperscale pipelines? 

    Recommended architecture (from my side):

    Source: DB2 → IBM CDC → Kafka (GCP) 

    Stream Processor: Databricks Structured Streaming (Bronze, Silver layers) 

    Sink: Azure SQL Hyperscale via foreachBatch 

    Recon: Control + Audit tables tracking Kafka offsets, row counts, and hashes 

    Replay: On failure, use Kafka offset-based replay with replay logs 
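
    For the replay step, a minimal sketch of an offset-bounded batch read; the topic, partitions, and offsets are placeholders standing in for whatever the control and recon tables flagged as failed.

```python
# Minimal sketch: re-read only the flagged offset range as a bounded batch and push
# it back through the same dedupe/stage/MERGE path. Topic and offsets are placeholders
# that would normally come from the control and recon tables.
import json

start = {"db2.customers.cdc": {"0": 120000, "1": 98500}}    # failed min_offset per partition
end   = {"db2.customers.cdc": {"0": 125001, "1": 101201}}   # upper bound per partition (exclusive)

replay_df = (
    spark.read.format("kafka")
        .option("kafka.bootstrap.servers", "<gcp-broker>:9092")
        .option("subscribe", "db2.customers.cdc")
        .option("startingOffsets", json.dumps(start))
        .option("endingOffsets", json.dumps(end))
        .load()
)
# replay_df contains only the dirty offset range; reprocessing it through the
# idempotent sink cannot duplicate rows that already landed correctly.
```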

    I hope this info helps. If you have any questions, we are happy to help.

    1 person found this answer helpful.