How to Detect and Handle Delayed CDC Events in Kafka During Transition from Catch-Up to Real-Time Streaming?

Janice Chi 100 Reputation points
2025-06-09T07:48:06.8766667+00:00

We are implementing a hybrid data ingestion pipeline from an on-prem IBM DB2 source into Azure SQL Hyperscale using Kafka as the transport layer and Databricks Structured Streaming as the consumer.

The pipeline has three stages:

  1. Historical Load – One-time bulk load from snapshot.
  2. Catch-Up Load – Offset-based CDC reprocessing from Kafka (post-snapshot but before streaming starts).
  3. Real-Time Streaming – Continuous ingestion from Kafka via Databricks using structured streaming.

The CDC data is published to Kafka by IBM InfoSphere CDC. Our concern is with delayed CDC events — i.e., changes that logically belong to the catch-up range (based on commit timestamp), but get published to Kafka after the real-time streaming pipeline has already started consuming newer offsets.

Our question:

What is the recommended approach on Azure to detect and capture such delayed CDC events, even if they are published to Kafka after the stream offset has advanced?

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

1 answer

  1. J N S S Kasyap 3,300 Reputation points Microsoft External Staff Moderator
    2025-06-09T09:46:44.0266667+00:00

    @Janice Chi

    In hybrid CDC pipelines involving Kafka and Azure Databricks, it is common to face delayed events: records that were committed early (e.g., during the catch-up phase) but appear in Kafka only after the real-time streaming pipeline has started. These can be missed if the system tracks only Kafka offsets.

    To reliably detect and capture delayed CDC events on Azure, follow this strategy:  

    Add a commit timestamp to each CDC record

    Ensure every Kafka message includes a commit_timestamp (or equivalent business timestamp) from the source system. This is critical because Kafka offsets are not time-ordered.  
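    As a minimal sketch of what this looks like on the consumer side (the broker address, topic name, and payload field names are assumptions, not the actual InfoSphere CDC output format):

    from pyspark.sql.functions import col, from_json, to_timestamp
    from pyspark.sql.types import StructType, StringType

    # Assumed CDC payload schema -- adjust to your InfoSphere CDC message format.
    cdc_schema = (StructType()
                  .add("operation", StringType())
                  .add("commit_timestamp", StringType())
                  .add("payload", StringType()))

    raw_df = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "<broker>:9092")  # placeholder
              .option("subscribe", "db2.cdc.topic")                # placeholder topic
              .load())

    # Parse the Kafka value and expose commit_timestamp as a proper timestamp column.
    cdc_df = (raw_df
              .select(from_json(col("value").cast("string"), cdc_schema).alias("r"))
              .select("r.*")
              .withColumn("commit_timestamp", to_timestamp("commit_timestamp")))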

    Define and store a cutoff timestamp

    Before switching from catch-up to real-time, capture the max commit timestamp from the catch-up load and store it in a config table (e.g., Delta table, Azure SQL, or Key Vault). This cutoff_time will act as a logical boundary.  
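    For example (a sketch only; the table names below are placeholders for your environment):

    # Compute the catch-up boundary once, after the catch-up load completes.
    cutoff_time = (spark.table("cdc_catchup.events")
                   .agg({"commit_timestamp": "max"})
                   .collect()[0][0])

    # Persist it in a small config Delta table so both jobs read the same value.
    spark.createDataFrame([(cutoff_time,)], ["cutoff_time"]) \
         .write.mode("overwrite").saveAsTable("cdc_config.cutoff")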

    Filter and divert late events in Structured Streaming

    In your Databricks Structured Streaming job:  

    • Compare each record’s commit_timestamp with the cutoff_time.
    • Route late events (commit_timestamp < cutoff_time) to a holding Delta table for delayed processing (a fuller routing sketch follows below).
    • Allow only current/future events to flow into your main target system (e.g., Azure SQL Hyperscale).

    from pyspark.sql.functions import col  # required for the column expression

    late_df = df.filter(col("commit_timestamp") < cutoff_time)      # diverted to the holding table
    current_df = df.filter(col("commit_timestamp") >= cutoff_time)  # continues to the main target
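    A fuller sketch of the routing step using foreachBatch, so each micro-batch can be written to both sinks (the table names, JDBC URL, and checkpoint path are placeholders; cdc_df and cutoff_time are the variables prepared above):

    from pyspark.sql.functions import col

    def route_batch(batch_df, batch_id):
        # Late events: commit timestamp falls inside the catch-up window.
        (batch_df.filter(col("commit_timestamp") < cutoff_time)
                 .write.format("delta").mode("append")
                 .saveAsTable("cdc_holding.late_events"))

        # Current/future events: forward to the main target, e.g. via JDBC to Azure SQL Hyperscale.
        (batch_df.filter(col("commit_timestamp") >= cutoff_time)
                 .write.format("jdbc")
                 .option("url", "jdbc:sqlserver://<server>.database.windows.net;database=<db>")
                 .option("dbtable", "dbo.cdc_target")
                 .mode("append")
                 .save())

    query = (cdc_df.writeStream
             .foreachBatch(route_batch)
             .option("checkpointLocation", "/mnt/checkpoints/cdc_route")
             .start())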
    
    

    Reconcile late events periodically 

    Use a scheduled Databricks job or Azure Data Factory pipeline to upsert delayed records from the holding table into the main target using MERGE INTO logic. This ensures data consistency.  
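    A minimal sketch of the reconciliation step, assuming the delayed records are merged into a Delta staging copy of the target (table and key names are placeholders); the same MERGE pattern can be expressed in T-SQL directly against Hyperscale:

    # Scheduled reconciliation job (Databricks Workflows or ADF-triggered notebook).
    spark.sql("""
        MERGE INTO target_staging AS t
        USING cdc_holding.late_events AS s
          ON t.primary_key = s.primary_key
        WHEN MATCHED AND s.commit_timestamp > t.commit_timestamp THEN
          UPDATE SET *
        WHEN NOT MATCHED THEN
          INSERT *
    """)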

    Log and monitor the volume of delayed events. If the rate increases, it could indicate CDC lag or upstream issues.  
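    One simple way to track this (a sketch; the holding table name is a placeholder):

    from pyspark.sql.functions import to_date, count

    # Daily volume of delayed events landing in the holding table.
    (spark.table("cdc_holding.late_events")
          .groupBy(to_date("commit_timestamp").alias("commit_date"))
          .agg(count("*").alias("late_events"))
          .orderBy("commit_date")
          .show())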

    Always trust commit timestamps, not Kafka offsets, for ordering and reconciliation. This makes your pipeline resilient to out-of-order or delayed CDC events and aligns with best practices in cloud-scale data engineering on Azure.  

    I hope this information helps. Please do let us know if you have any further queries. 

     

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

