In hybrid CDC pipelines involving Kafka and Azure Databricks, it's common to face delayed events records that are committed early (e.g., during the catch-up phase) but appear in Kafka only after the real-time streaming pipeline has started. These can be missed if the system only tracks Kafka offsets.
To reliably detect and capture delayed CDC events on Azure, follow this strategy:
Add a commit Timestamp to each CDC record
Ensure every Kafka message includes a commit_timestamp (or equivalent business timestamp) from the source system. This is critical because Kafka offsets are not time-ordered.
Define and store a cutoff Timestamp
Before switching from catch-up to real-time, capture the max commit timestamp from the catch-up load and store it in a config table (e.g., Delta table, Azure SQL, or Key Vault). This cutoff_time will act as a logical boundary.
Filter and divert late events in structured streaming
In your Databricks Structured Streaming job:
- Compare each record’s commit_timestamp with the cutoff_time.
- Route late events (commit_timestamp < cutoff_time) to a holding Delta table for delayed processing.
- Allow only current/future events to flow into your main target system (e.g., Azure SQL Hyperscale).
late_df = df.filter(col("commit_timestamp") < cutoff_time)
current_df = df.filter(col("commit_timestamp") >= cutoff_time)
Reconcile late events periodically
Use a scheduled Databricks job or Azure Data Factory pipeline to upsert delayed records from the holding table into the main target using MERGE INTO logic. This ensures data consistency.
Log and monitor the volume of delayed events. If the rate increases, it could indicate CDC lag or upstream issues.
Always trust commit timestamps, not Kafka offsets, for ordering and reconciliation. This makes your pipeline resilient to out-of-order or delayed CDC events and aligns with best practices in cloud-scale data engineering on Azure.
I hope this information helps. Please do let us know if you have any further queries.
Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.