Hi @Janice Chi
For catch-up CDC, recon is again post-ingestion, where each batch or logical window is validated after it fully lands in Hyperscale. Is this the right approach, and why?
Yes, validating each batch or logical window after it lands in Azure SQL Hyperscale for catch-up CDC is the correct and recommended approach, and here's why:
Post-ingestion validation is effective and deterministic, and it aligns with the logical data boundaries emitted by CDC. It avoids checking incomplete or in-flight data, and it allows a clean rollback/retry of a batch if mismatches are found. It is a production-grade, scalable approach widely used in real-world data platforms.
What is the recommended approach for recon in real-time streaming pipelines, given that data flows continuously?
Real-time recon is best handled using micro-batch post-ingestion validation, while ensuring traceability via metadata.
Key Practices:
- Use Kafka offset metadata (offset, partition, topic, commit timestamp) in every message ingested into Databricks Bronze tables.
- Process via Databricks Structured Streaming using Delta architecture: Bronze → Silver → Hyperscale.
- Store micro-batch stats:
  - min_offset, max_offset
  - record_count, record_hash
  - batch_id, timestamp
- After sinking into Azure SQL Hyperscale, validate by comparing these metrics against the target table.
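The stats above can be sketched in plain Python (an illustration only — field names follow the list above, and the sample record shape with an `offset` and payload fields is an assumption; in Databricks this logic would run inside a `foreachBatch` function over the micro-batch DataFrame):

```python
import hashlib
import json

def batch_stats(batch_id, records):
    """Compute recon metadata for one micro-batch.

    `records` is a list of dicts carrying the Kafka offset plus the
    payload. The hash is order-independent: each row is hashed
    individually, the row hashes are sorted, then hashed together.
    """
    offsets = [r["offset"] for r in records]
    row_hashes = sorted(
        hashlib.sha256(json.dumps(r, sort_keys=True).encode()).hexdigest()
        for r in records
    )
    batch_hash = hashlib.sha256("".join(row_hashes).encode()).hexdigest()
    return {
        "batch_id": batch_id,
        "min_offset": min(offsets),
        "max_offset": max(offsets),
        "record_count": len(records),
        "record_hash": batch_hash,
    }

stats = batch_stats("b001", [
    {"offset": 100, "id": 1, "amt": 10.0},
    {"offset": 101, "id": 2, "amt": 20.0},
])
```

The order-independent hash matters because micro-batch rows may land in the target in a different physical order than they were read.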
Is recon after landing in Hyperscale acceptable? If yes, how should we do it?
It’s both acceptable and recommended. Real-time pipelines are designed for continuous data flow, and pausing them for mid-stream validation is impractical.
How to perform Recon:
- Track Kafka metadata and hashes in Databricks control tables.
- Use foreachBatch to write each micro-batch to Hyperscale and log offsets + record count.
- Run a validation SQL job post-ingestion that:
  a. Joins the Hyperscale table with the control table.
  b. Compares row counts and hash values.
  c. Logs the outcome to a Recon Table.
- Send an alert on mismatch and flag the offset range as failed.
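The join-and-compare step can be sketched as follows (a minimal illustration, assuming `control_row` comes from the Databricks control table and the target count/hash were measured on Hyperscale; in practice this runs as a SQL job):

```python
def reconcile(control_row, target_count, target_hash):
    """Compare control-table stats for one micro-batch against the
    counts/hashes measured on the Hyperscale target table.
    Returns a row for the Recon Table; on mismatch the offset range
    is flagged so it can be replayed.
    """
    matched = (control_row["record_count"] == target_count
               and control_row["record_hash"] == target_hash)
    return {
        "batch_id": control_row["batch_id"],
        "status": "MATCHED" if matched else "FAILED",
        "failed_offset_range": None if matched else
            (control_row["min_offset"], control_row["max_offset"]),
    }

row = {"batch_id": "b001", "record_count": 2, "record_hash": "abc",
       "min_offset": 100, "max_offset": 101}
ok = reconcile(row, target_count=2, target_hash="abc")
bad = reconcile(row, target_count=1, target_hash="abc")
```

The `FAILED` row carries the exact offset range, which is what drives the alert and the Kafka replay described below.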
What if a recon check fails? How to:
- Identify the failed rows or partitions:
- Each micro-batch should log min_offset, max_offset, and batch_id into a control table.
- If the validation fails, the affected offset range or primary keys are marked as dirty.
- You can replay them from Kafka using the stored offsets.
- Handle retries or rollbacks in Hyperscale: To ensure idempotent writes to Azure SQL Hyperscale during real-time streaming, implement logic that prevents duplicate or inconsistent data. Use MERGE INTO statements to update or insert records based on a unique key (such as a primary key or composite business key). Additionally, enforce unique constraints and optionally apply row-level hashing to detect changes. If a reconciliation check fails, trigger an automated retry workflow in Azure Data Factory (ADF), allowing failed batches to be reprocessed without impacting already-ingested data. This keeps the streaming pipeline consistent, traceable, and resilient.
- Avoid duplication and inconsistent state:
In the Databricks Silver layer, use dropDuplicates() on the combination of the primary key and commit timestamp to eliminate duplicate records before loading into the final destination. Maintain a staging table before writing to Azure SQL Hyperscale to provide a rollback point in case of failures or validation mismatches, which improves reliability and control in the streaming pipeline.
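The failure-handling steps above can be sketched together in plain Python (illustrative only — the `commit_ts` field and the partition-to-offset map are assumed shapes; in the real pipeline the replay JSON feeds Structured Streaming's Kafka `startingOffsets` option, the upsert is a `MERGE INTO` on Hyperscale, and the dedup is `dropDuplicates()` in Databricks):

```python
import json

def replay_starting_offsets(topic, dirty_ranges):
    """Build the Kafka `startingOffsets` JSON for replaying a failed
    range. `dirty_ranges` maps partition number -> first offset to
    replay, taken from min_offset in the control table."""
    return json.dumps({topic: {str(p): o for p, o in dirty_ranges.items()}})

def drop_duplicates(rows, keys=("id", "commit_ts")):
    """Silver-layer dedup on (primary key, commit timestamp):
    keep the first occurrence of each key pair."""
    seen, out = set(), []
    for r in rows:
        k = tuple(r[key] for key in keys)
        if k not in seen:
            seen.add(k)
            out.append(r)
    return out

def upsert(target, rows, key="id"):
    """Idempotent upsert keyed on the primary key: replaying the same
    batch leaves the target unchanged, which is the property a
    MERGE INTO gives you on Hyperscale."""
    for r in rows:
        target[r[key]] = r
    return target

# A replayed batch containing an exact duplicate:
batch = [{"id": 1, "commit_ts": 10, "amt": 5.0},
         {"id": 1, "commit_ts": 10, "amt": 5.0}]
clean = drop_duplicates(batch)
t = upsert({}, clean)
t = upsert(t, clean)  # replaying the same batch changes nothing
opts = replay_starting_offsets("cdc.orders", {0: 100, 1: 250})
```

Because the upsert is idempotent, a failed offset range can simply be re-consumed from Kafka and rewritten without first deleting the partially landed rows.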
Architecture reference or best practices for recon in Kafka → Databricks → Hyperscale pipelines?
Recommended architecture from my side:
Source: DB2 → IBM CDC → Kafka (GCP)
Stream Processor: Databricks Structured Streaming (Bronze, Silver layers)
Sink: Azure SQL Hyperscale via foreachBatch
Recon: Control + Audit tables tracking Kafka offsets, row counts, and hashes
Replay: On failure, use Kafka offset-based replay with replay logs
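As a sketch of what one control/audit-table row might carry (the field names are assumptions consistent with the stats listed earlier, not a prescribed schema):

```python
from dataclasses import dataclass, asdict

@dataclass
class ControlRow:
    """One row of the recon control table: which Kafka slice a
    micro-batch covered, and what it should look like in Hyperscale."""
    batch_id: str
    topic: str
    partition: int
    min_offset: int
    max_offset: int
    record_count: int
    record_hash: str
    status: str = "PENDING"  # PENDING -> MATCHED / FAILED

row = ControlRow("b001", "cdc.orders", 0, 100, 101, 2, "abc")
```

A row like this is written by `foreachBatch` at ingest time and updated by the validation job, so the same table drives both recon and offset-based replay.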
I hope this info helps. If you have any questions, we are happy to help.