Hi @Janice Chi
For catch-up CDC, recon is again post-ingestion, where each batch or logical window is validated after it fully lands in Hyperscale. Is this the right approach, and why?
Yes, validating each batch or logical window after it lands in Azure SQL Hyperscale for catch-up CDC is the correct and recommended approach, and here's why:
Post-ingestion validation is effective and deterministic, and it aligns with the logical data boundaries emitted by CDC. It avoids checking incomplete or in-flight data, and it allows a clean rollback/retry of a batch if mismatches are found. It is a production-grade, scalable approach widely used in real-world data platforms.
What is the recommended approach for recon in real-time streaming pipelines, given that data flows continuously?
Real-time recon is best handled using micro-batch post-ingestion validation, while ensuring traceability via metadata.
Key Practices:
- Use Kafka offset metadata (offset, partition, topic, commit timestamp) in every message ingested into Databricks Bronze tables.
- Process via Databricks Structured Streaming using Delta architecture: Bronze → Silver → Hyperscale.
- Store micro-batch stats:
  - min_offset, max_offset
  - record_count, record_hash
  - batch_id, timestamp
- After sinking into Azure SQL Hyperscale, validate by comparing these metrics against the target table.
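The stats above can be sketched in plain Python (an illustration only — field names follow the list above, and the sample record shape with an `offset` and payload fields is an assumption; in Databricks this logic would run inside a `foreachBatch` function over the micro-batch DataFrame):

```python
import hashlib
import json

def batch_stats(batch_id, records):
    """Compute recon metadata for one micro-batch.

    `records` is a list of dicts carrying the Kafka offset plus the
    payload. The hash is order-independent: each row is hashed
    individually, the row hashes are sorted, then hashed together.
    """
    offsets = [r["offset"] for r in records]
    row_hashes = sorted(
        hashlib.sha256(json.dumps(r, sort_keys=True).encode()).hexdigest()
        for r in records
    )
    batch_hash = hashlib.sha256("".join(row_hashes).encode()).hexdigest()
    return {
        "batch_id": batch_id,
        "min_offset": min(offsets),
        "max_offset": max(offsets),
        "record_count": len(records),
        "record_hash": batch_hash,
    }

stats = batch_stats("b001", [
    {"offset": 100, "id": 1, "amt": 10.0},
    {"offset": 101, "id": 2, "amt": 20.0},
])
```

The order-independent hash matters because micro-batch rows may land in the target in a different physical order than they were read.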
Is recon after landing in Hyperscale acceptable? If yes, how should we do it?
It’s both acceptable and recommended. Real-time pipelines are designed for continuous data flow, and pausing them for mid-stream validation is impractical.
How to perform Recon:
- Track Kafka metadata and hashes in Databricks control tables.
- Use foreachBatch to write each micro-batch to Hyperscale and log offsets + record count.
- Run a validation SQL job post-ingestion that:
  a. Joins the Hyperscale table with the control table.
  b. Compares row counts and hash values.
  c. Logs the outcome to a Recon Table.
- Send an alert on mismatch and flag the offset range as failed.
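The join-and-compare step can be sketched as follows (a minimal illustration, assuming `control_row` comes from the Databricks control table and the target count/hash were measured on Hyperscale; in practice this runs as a SQL job):

```python
def reconcile(control_row, target_count, target_hash):
    """Compare control-table stats for one micro-batch against the
    counts/hashes measured on the Hyperscale target table.
    Returns a row for the Recon Table; on mismatch the offset range
    is flagged so it can be replayed.
    """
    matched = (control_row["record_count"] == target_count
               and control_row["record_hash"] == target_hash)
    return {
        "batch_id": control_row["batch_id"],
        "status": "MATCHED" if matched else "FAILED",
        "failed_offset_range": None if matched else
            (control_row["min_offset"], control_row["max_offset"]),
    }

row = {"batch_id": "b001", "record_count": 2, "record_hash": "abc",
       "min_offset": 100, "max_offset": 101}
ok = reconcile(row, target_count=2, target_hash="abc")
bad = reconcile(row, target_count=1, target_hash="abc")
```

The `FAILED` row carries the exact offset range, which is what drives the alert and the Kafka replay described below.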
What if a recon check fails? How to:
- Identify the failed rows or partitions:
- Each micro-batch should log min_offset, max_offset, and batch_id into a control table.
- If the validation fails, the affected offset range or primary keys are marked as dirty.
- You can replay them from Kafka using the stored offsets.
- Handle retries or rollbacks in Hyperscale: To ensure idempotent writes to Azure SQL Hyperscale during real-time streaming, implement logic that prevents duplicate or inconsistent data. Use MERGE INTO statements to update or insert records based on a unique key (such as a primary key or composite business key). Additionally, enforce unique constraints and optionally apply row-level hashing to detect changes. If a reconciliation check fails, trigger an automated retry workflow in Azure Data Factory (ADF), allowing failed batches to be reprocessed without impacting already-ingested data. This keeps the streaming pipeline consistent, traceable, and resilient.
- Avoid duplication and inconsistent state:
In the Databricks Silver layer, use dropDuplicates() on the combination of the primary key and commit timestamp to eliminate duplicate records before loading into the final destination. Maintain a staging table before writing to Azure SQL Hyperscale to provide a rollback point in case of failures or validation mismatches, which improves reliability and control in the streaming pipeline.
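The failure-handling steps above can be sketched together in plain Python (illustrative only — the `commit_ts` field and the partition-to-offset map are assumed shapes; in the real pipeline the replay JSON feeds Structured Streaming's Kafka `startingOffsets` option, the upsert is a `MERGE INTO` on Hyperscale, and the dedup is `dropDuplicates()` in Databricks):

```python
import json

def replay_starting_offsets(topic, dirty_ranges):
    """Build the Kafka `startingOffsets` JSON for replaying a failed
    range. `dirty_ranges` maps partition number -> first offset to
    replay, taken from min_offset in the control table."""
    return json.dumps({topic: {str(p): o for p, o in dirty_ranges.items()}})

def drop_duplicates(rows, keys=("id", "commit_ts")):
    """Silver-layer dedup on (primary key, commit timestamp):
    keep the first occurrence of each key pair."""
    seen, out = set(), []
    for r in rows:
        k = tuple(r[key] for key in keys)
        if k not in seen:
            seen.add(k)
            out.append(r)
    return out

def upsert(target, rows, key="id"):
    """Idempotent upsert keyed on the primary key: replaying the same
    batch leaves the target unchanged, which is the property a
    MERGE INTO gives you on Hyperscale."""
    for r in rows:
        target[r[key]] = r
    return target

# A replayed batch containing an exact duplicate:
batch = [{"id": 1, "commit_ts": 10, "amt": 5.0},
         {"id": 1, "commit_ts": 10, "amt": 5.0}]
clean = drop_duplicates(batch)
t = upsert({}, clean)
t = upsert(t, clean)  # replaying the same batch changes nothing
opts = replay_starting_offsets("cdc.orders", {0: 100, 1: 250})
```

Because the upsert is idempotent, a failed offset range can simply be re-consumed from Kafka and rewritten without first deleting the partially landed rows.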
Architecture reference or best practices for recon in Kafka → Databricks → Hyperscale pipelines?
Recommended architecture from my side:
Source: DB2 → IBM CDC → Kafka (GCP)
Stream Processor: Databricks Structured Streaming (Bronze, Silver layers)
Sink: Azure SQL Hyperscale via foreachBatch
Recon: Control + Audit tables tracking Kafka offsets, row counts, and hashes
Replay: On failure, use Kafka offset-based replay with replay logs
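As a sketch of what one control/audit-table row might carry (the field names are assumptions consistent with the stats listed earlier, not a prescribed schema):

```python
from dataclasses import dataclass, asdict

@dataclass
class ControlRow:
    """One row of the recon control table: which Kafka slice a
    micro-batch covered, and what it should look like in Hyperscale."""
    batch_id: str
    topic: str
    partition: int
    min_offset: int
    max_offset: int
    record_count: int
    record_hash: str
    status: str = "PENDING"  # PENDING -> MATCHED / FAILED

row = ControlRow("b001", "cdc.orders", 0, 100, 101, 2, "abc")
```

A row like this is written by `foreachBatch` at ingest time and updated by the validation job, so the same table drives both recon and offset-based replay.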
I hope this info helps. If you have any questions, we are happy to help.