Best Practice for Scalable Catch-up CDC Framework – 800 Kafka Topics with Distinct Schemas from IBM DB2

Janice Chi 140 Reputation points
2025-06-10T16:51:13.0166667+00:00

We are implementing a large-scale Change Data Capture (CDC) architecture with the following setup:

• Source: IBM DB2
• CDC Tool: IBM InfoSphere CDC publishes change events to Kafka, one topic per DB2 table.
• Topics: ~800 Kafka topics, each representing one DB2 table with a distinct schema.

Target Architecture:

• **Bronze Layer (ADLS):** Stores raw flattened JSON payloads per topic.
• **Silver Layer (Delta Lake):** Applies MERGE (Insert/Update/Delete) logic using keys from metadata.
• **Processing Engine:** All processing is done in **Databricks** (Spark).
• **Orchestration:** Azure Data Factory (ADF) drives the overall catch-up workflow.
• **Control Table:** A metadata-driven control table maintains per-topic details such as:
  • Kafka topic name
  • Business unit
  • Target table name
  • Start/End offset for catch-up
  • Primary keys
  • Partition info, etc.

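For context, the control table is shaped roughly like this (table and column names are simplified for illustration):

```python
# Illustrative shape of the catch-up control table (names are examples only)
spark.sql("""
CREATE TABLE IF NOT EXISTS ops.cdc_catchup_control (
  kafka_topic    STRING,          -- one row per Kafka topic / DB2 table
  business_unit  STRING,
  target_table   STRING,          -- fully qualified Silver Delta table
  primary_keys   ARRAY<STRING>,   -- keys used by the Silver MERGE
  start_offsets  STRING,          -- JSON offset spec for catch-up start
  end_offsets    STRING,          -- JSON offset spec for catch-up end
  partition_info STRING
)
USING DELTA
""")
```
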
❓ Questions to Microsoft:

What is the Microsoft-recommended approach to implement a scalable, maintainable catch-up CDC processing framework across 800 Kafka topics with differing schemas?

Should we:

• Create 800 individual notebooks/scripts for Bronze ingestion and another 800 for Silver MERGE?
• Or is there a recommended **generic pattern** (e.g., notebook parameterization, reusable frameworks) to handle each topic/table dynamically?

**How should we address the following challenges in a Microsoft-native or hybrid architecture:**

• Schema evolution handling per topic
• Parallelism and scale-out execution (especially in ADF or Databricks jobs)
• Managing dependencies, retries, and logging per topic
• Operational simplicity for 800+ independent streams during catch-up mode
              

1 answer

  1. Chandra Boorla 14,585 Reputation points Microsoft External Staff Moderator
    2025-06-10T18:16:48.35+00:00

    @Janice Chi

    Thanks for sharing the detailed overview of your CDC architecture; it's a complex but increasingly common scenario in large enterprise data platforms. You're absolutely right to focus on scalability, schema variability, and operational simplicity given the number of Kafka topics you're dealing with.

    Here are our recommendations for implementing a Microsoft-aligned, scalable, and maintainable CDC catch-up framework across your 800 Kafka topics:

    Parameterized & Metadata-Driven Framework

    Rather than maintaining 800 individual notebooks/scripts, we recommend using a small set of reusable, parameterized Databricks notebooks. These notebooks should ingest configuration from a metadata control table, which includes:

    • Kafka topic name
    • Schema location or inferred schema
    • Source/target table mapping
    • Primary keys
    • Start and end offsets for catch-up
    • Merge strategy

    This way, a single notebook can dynamically process any topic based on input parameters passed via ADF or Databricks jobs.
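
    As a rough sketch (not a prescriptive template), a single parameterized notebook driven by the control table could look like the following. The control table name, the schema_ddl column, and the op_type delete marker are illustrative assumptions, not fixed names:

    ```python
    from pyspark.sql import functions as F
    from delta.tables import DeltaTable

    # Topic to process, passed in by the ADF ForEach iteration or Databricks job parameters
    dbutils.widgets.text("kafka_topic", "")
    topic = dbutils.widgets.get("kafka_topic")

    # 1. Look up this topic's configuration from the control table
    cfg = (spark.table("ops.cdc_catchup_control")
                .filter(F.col("kafka_topic") == topic)
                .first())

    # 2. Bounded batch read of the catch-up offset range from Kafka
    raw = (spark.read.format("kafka")
           .option("kafka.bootstrap.servers", "<broker-list>")
           .option("subscribe", topic)
           .option("startingOffsets", cfg["start_offsets"])   # JSON offset spec from the control table
           .option("endingOffsets", cfg["end_offsets"])
           .load())

    # 3. Parse the flattened JSON payload with the schema recorded for this topic
    changes = (raw.select(F.from_json(F.col("value").cast("string"), cfg["schema_ddl"]).alias("r"))
                  .select("r.*"))
    # (de-duplication to the latest change per key is omitted here for brevity)

    # 4. Build the MERGE dynamically from the primary keys held in the control table
    merge_cond = " AND ".join(f"t.{k} = s.{k}" for k in cfg["primary_keys"])

    (DeltaTable.forName(spark, cfg["target_table"]).alias("t")
     .merge(changes.alias("s"), merge_cond)
     .whenMatchedDelete(condition="s.op_type = 'D'")   # delete-marker column is an assumption
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())
    ```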

    Schema evolution handling

    To address differing and evolving schemas:

    • Use Kafka Schema Registry (e.g., Confluent) to store and version your schemas.
    • Ingest the schema dynamically using tools like Spark’s schema_of_json() or by pulling from the registry.
    • Databricks Auto Loader (if used) also supports schema inference and evolution, and Delta writes can absorb additive changes via the mergeSchema option.

    Maintain a versioned schema reference in your metadata table to control breaking changes and support rollback.
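
    For illustration, here is a minimal sketch of both ideas; the Bronze/Silver table names, payload column, and sample schema are placeholders:

    ```python
    from pyspark.sql import functions as F

    # Infer a DDL-style schema from a sample payload (useful when onboarding a new topic)
    sample_payload = '{"id": 1, "name": "abc", "updated_at": "2025-06-10T00:00:00Z"}'
    inferred_ddl = (spark.range(1)
                        .select(F.schema_of_json(F.lit(sample_payload)).alias("ddl"))
                        .first()["ddl"])
    print(inferred_ddl)   # e.g. STRUCT<id: BIGINT, name: STRING, updated_at: STRING>

    # In steady state, parse with the versioned schema recorded in your metadata table
    parsed = (spark.table("bronze.customer_changes")          # illustrative Bronze table
                   .select(F.from_json("payload", "id BIGINT, name STRING, updated_at TIMESTAMP")
                            .alias("r"))
                   .select("r.*"))

    # Additive schema changes can flow through on append writes...
    (parsed.write.format("delta")
           .mode("append")
           .option("mergeSchema", "true")
           .saveAsTable("silver.customer_changes"))

    # ...and for MERGE-based upserts, enable automatic schema evolution explicitly
    spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
    ```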

    Parallelism & Scale-Out Execution

    To scale processing for 800+ topics:

    • In Azure Data Factory, use the ForEach activity with controlled concurrency (the batch count setting, up to 50 parallel iterations), sized to your compute resources.
    • Each ADF iteration can trigger a Databricks job with topic-specific parameters.
    • Use job clusters or multi-task jobs in Databricks for better cluster utilization and isolation.

    This approach ensures high throughput and avoids bottlenecks during catch-up mode.
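
    If you prefer to fan out from inside Databricks as well (for example, one ADF activity launching a driver notebook that works through a batch of topics), a bounded thread pool around the shared notebook is a common pattern. A minimal sketch, assuming a status column in the control table and an illustrative notebook path:

    ```python
    from concurrent.futures import ThreadPoolExecutor, as_completed

    # Topics still pending catch-up, read from the control table
    pending = [r["kafka_topic"]
               for r in spark.table("ops.cdc_catchup_control")
                             .filter("status IN ('PENDING', 'FAILED')")
                             .collect()]

    MAX_PARALLEL = 20   # tune to cluster size, Kafka throughput, and Delta write concurrency

    def run_topic(topic):
        # Run the shared parameterized notebook for one topic (path, timeout_seconds, arguments)
        return dbutils.notebook.run("/Repos/cdc/bronze_silver_catchup", 3600,
                                    {"kafka_topic": topic})

    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
        futures = {pool.submit(run_topic, t): t for t in pending}
        for f in as_completed(futures):
            t = futures[f]
            try:
                print(f"{t}: {f.result()}")
            except Exception as e:     # failed topics stay flagged in the control table for retry
                print(f"{t} failed: {e}")
    ```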

    Dependency Management, Logging & Retries

    For reliable operations:

    • Implement status tracking in your control table (e.g., Success, Failed, Retrying).
    • Use try-catch logic in Databricks notebooks to capture failures and update the control table (see the sketch below).
    • Configure ADF retry logic or orchestrate custom retry pipelines for failed topics.
    • Route logs and metrics to Azure Monitor or Log Analytics for centralized observability.
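
    For example, the status-tracking and retry pattern inside the per-topic notebook could look like this (a minimal sketch; process_topic and the status, last_error, and last_updated columns are illustrative additions to the control table):

    ```python
    def set_status(topic, status, error=None):
        # Record per-topic progress so ADF and retry pipelines can react via the control table
        err = "NULL" if error is None else "'" + error.replace("'", "") + "'"
        spark.sql(f"""
            UPDATE ops.cdc_catchup_control
            SET status = '{status}',
                last_error = {err},
                last_updated = current_timestamp()
            WHERE kafka_topic = '{topic}'
        """)

    set_status(topic, "RUNNING")
    try:
        process_topic(topic)                       # your Bronze ingest + Silver MERGE for this topic
        set_status(topic, "SUCCESS")
    except Exception as e:
        set_status(topic, "FAILED", str(e)[:500])  # truncate long stack traces
        raise                                      # surface the failure so ADF retry/alerting kicks in
    ```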

    Operational Simplicity at Scale

    By combining:

    • Metadata-driven orchestration
    • Reusable notebooks
    • Control table governance
    • ADF’s orchestration and scheduling capabilities

    you can minimize manual touchpoints. Onboarding a new table or adapting to schema changes becomes as simple as updating a row in your control table; no additional code changes are required.

    Conclusion:

    This design supports a robust, extensible ingestion framework capable of handling not just the 800 current topics but also scaling as more are added. It aligns with Microsoft best practices for modern data engineering using Azure-native tools and Databricks.

    I hope this information helps. Please do let us know if you have any further queries.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

    Thank you.

