Best Practice for Scalable Catch-up CDC Framework – 800 Kafka Topics with Distinct Schemas from IBM DB2

Janice Chi 140 Reputation points
2025-06-10T16:51:13.0166667+00:00

We are implementing a large-scale Change Data Capture (CDC) architecture with the following setup:

• Source: IBM DB2
• CDC Tool: IBM InfoSphere CDC publishes change events to Kafka, one topic per DB2 table.
• Topics: ~800 Kafka topics, each representing one DB2 table with a distinct schema.

Target Architecture:

• **Bronze Layer (ADLS):** Stores raw flattened JSON payloads per topic.
• **Silver Layer (Delta Lake):** Applies MERGE (Insert/Update/Delete) logic using keys from metadata.
• **Processing Engine:** All processing is done in **Databricks** (Spark).
• **Orchestration:** Azure Data Factory (ADF) drives the overall catch-up workflow.
• **Control Table:** A metadata-driven control table maintains per-topic details such as:
  • Kafka topic name
  • Business unit
  • Target table name
  • Start/End offset for catch-up
  • Primary keys
  • Partition info, etc.

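For context, the control table is shaped roughly like this (table and column names are simplified for illustration):

```python
# Illustrative shape of the catch-up control table (names are examples only)
spark.sql("""
CREATE TABLE IF NOT EXISTS ops.cdc_catchup_control (
  kafka_topic    STRING,          -- one row per Kafka topic / DB2 table
  business_unit  STRING,
  target_table   STRING,          -- fully qualified Silver Delta table
  primary_keys   ARRAY<STRING>,   -- keys used by the Silver MERGE
  start_offsets  STRING,          -- JSON offset spec for catch-up start
  end_offsets    STRING,          -- JSON offset spec for catch-up end
  partition_info STRING
)
USING DELTA
""")
```
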
❓ Questions to Microsoft:

What is the Microsoft-recommended approach to implement a scalable, maintainable catch-up CDC processing framework across 800 Kafka topics with differing schemas?

Should we:

• Create 800 individual notebooks/scripts for Bronze ingestion and another 800 for Silver MERGE?
• Or is there a recommended **generic pattern** (e.g., notebook parameterization, reusable frameworks) to handle each topic/table dynamically?

**How should we address the following challenges in a Microsoft-native or hybrid architecture:**

• Schema evolution handling per topic
• Parallelism and scale-out execution (especially in ADF or Databricks jobs)
• Managing dependencies, retries, and logging per topic
• Operational simplicity for 800+ independent streams during catch-up mode
              

1 answer

  1. Chandra Boorla 14,585 Reputation points Microsoft External Staff Moderator
    2025-06-10T18:16:48.35+00:00

    @Janice Chi

    Thanks for sharing the detailed overview of your CDC architecture; it's a complex but increasingly common scenario in large enterprise data platforms. You're absolutely right to focus on scalability, schema variability, and operational simplicity given the number of Kafka topics you're dealing with.

    Here are our recommendations for implementing a Microsoft-aligned, scalable, and maintainable CDC catch-up framework across your 800 Kafka topics:

    Parameterized & Metadata-Driven Framework

    Rather than maintaining 800 individual notebooks/scripts, we recommend using a small set of reusable, parameterized Databricks notebooks. These notebooks should ingest configuration from a metadata control table, which includes:

    • Kafka topic name
    • Schema location or inferred schema
    • Source/target table mapping
    • Primary keys
    • Start and end offsets for catch-up
    • Merge strategy

    This way, a single notebook can dynamically process any topic based on input parameters passed via ADF or Databricks jobs.
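
    As a rough sketch (not a prescriptive template), a single parameterized notebook driven by the control table could look like the following. The control table name, the schema_ddl column, and the op_type delete marker are illustrative assumptions, not fixed names:

    ```python
    from pyspark.sql import functions as F
    from delta.tables import DeltaTable

    # Topic to process, passed in by the ADF ForEach iteration or Databricks job parameters
    dbutils.widgets.text("kafka_topic", "")
    topic = dbutils.widgets.get("kafka_topic")

    # 1. Look up this topic's configuration from the control table
    cfg = (spark.table("ops.cdc_catchup_control")
                .filter(F.col("kafka_topic") == topic)
                .first())

    # 2. Bounded batch read of the catch-up offset range from Kafka
    raw = (spark.read.format("kafka")
           .option("kafka.bootstrap.servers", "<broker-list>")
           .option("subscribe", topic)
           .option("startingOffsets", cfg["start_offsets"])   # JSON offset spec from the control table
           .option("endingOffsets", cfg["end_offsets"])
           .load())

    # 3. Parse the flattened JSON payload with the schema recorded for this topic
    changes = (raw.select(F.from_json(F.col("value").cast("string"), cfg["schema_ddl"]).alias("r"))
                  .select("r.*"))
    # (de-duplication to the latest change per key is omitted here for brevity)

    # 4. Build the MERGE dynamically from the primary keys held in the control table
    merge_cond = " AND ".join(f"t.{k} = s.{k}" for k in cfg["primary_keys"])

    (DeltaTable.forName(spark, cfg["target_table"]).alias("t")
     .merge(changes.alias("s"), merge_cond)
     .whenMatchedDelete(condition="s.op_type = 'D'")   # delete-marker column is an assumption
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())
    ```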

    Schema evolution handling

    To address differing and evolving schemas:

    • Use Kafka Schema Registry (e.g., Confluent) to store and version your schemas.
    • Ingest the schema dynamically using tools like Spark’s schema_of_json() or by pulling from the registry.
    • Databricks Auto Loader (if used) also supports schema inference and evolution, and Delta writes can absorb additive changes via the mergeSchema option.

    Maintain a versioned schema reference in your metadata table to control breaking changes and support rollback.
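
    For illustration, here is a minimal sketch of both ideas; the Bronze/Silver table names, payload column, and sample schema are placeholders:

    ```python
    from pyspark.sql import functions as F

    # Infer a DDL-style schema from a sample payload (useful when onboarding a new topic)
    sample_payload = '{"id": 1, "name": "abc", "updated_at": "2025-06-10T00:00:00Z"}'
    inferred_ddl = (spark.range(1)
                        .select(F.schema_of_json(F.lit(sample_payload)).alias("ddl"))
                        .first()["ddl"])
    print(inferred_ddl)   # e.g. STRUCT<id: BIGINT, name: STRING, updated_at: STRING>

    # In steady state, parse with the versioned schema recorded in your metadata table
    parsed = (spark.table("bronze.customer_changes")          # illustrative Bronze table
                   .select(F.from_json("payload", "id BIGINT, name STRING, updated_at TIMESTAMP")
                            .alias("r"))
                   .select("r.*"))

    # Additive schema changes can flow through on append writes...
    (parsed.write.format("delta")
           .mode("append")
           .option("mergeSchema", "true")
           .saveAsTable("silver.customer_changes"))

    # ...and for MERGE-based upserts, enable automatic schema evolution explicitly
    spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
    ```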

    Parallelism & Scale-Out Execution

    To scale processing for 800+ topics:

    • In Azure Data Factory, use the ForEach activity with controlled concurrency (the batch count setting, up to 50 parallel iterations), sized to your compute resources.
    • Each ADF iteration can trigger a Databricks job with topic-specific parameters.
    • Use job clusters or multi-task jobs in Databricks for better cluster utilization and isolation.

    This approach ensures high throughput and avoids bottlenecks during catch-up mode.
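
    If you prefer to fan out from inside Databricks as well (for example, one ADF activity launching a driver notebook that works through a batch of topics), a bounded thread pool around the shared notebook is a common pattern. A minimal sketch, assuming a status column in the control table and an illustrative notebook path:

    ```python
    from concurrent.futures import ThreadPoolExecutor, as_completed

    # Topics still pending catch-up, read from the control table
    pending = [r["kafka_topic"]
               for r in spark.table("ops.cdc_catchup_control")
                             .filter("status IN ('PENDING', 'FAILED')")
                             .collect()]

    MAX_PARALLEL = 20   # tune to cluster size, Kafka throughput, and Delta write concurrency

    def run_topic(topic):
        # Run the shared parameterized notebook for one topic (path, timeout_seconds, arguments)
        return dbutils.notebook.run("/Repos/cdc/bronze_silver_catchup", 3600,
                                    {"kafka_topic": topic})

    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
        futures = {pool.submit(run_topic, t): t for t in pending}
        for f in as_completed(futures):
            t = futures[f]
            try:
                print(f"{t}: {f.result()}")
            except Exception as e:     # failed topics stay flagged in the control table for retry
                print(f"{t} failed: {e}")
    ```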

    Dependency Management, Logging & Retries

    For reliable operations:

    • Implement status tracking in your control table (e.g., Success, Failed, Retrying).
    • Use try-catch logic in Databricks notebooks to capture failures and update the control table (see the sketch below).
    • Configure ADF retry logic or orchestrate custom retry pipelines for failed topics.
    • Route logs and metrics to Azure Monitor or Log Analytics for centralized observability.
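
    For example, the status-tracking and retry pattern inside the per-topic notebook could look like this (a minimal sketch; process_topic and the status, last_error, and last_updated columns are illustrative additions to the control table):

    ```python
    def set_status(topic, status, error=None):
        # Record per-topic progress so ADF and retry pipelines can react via the control table
        err = "NULL" if error is None else "'" + error.replace("'", "") + "'"
        spark.sql(f"""
            UPDATE ops.cdc_catchup_control
            SET status = '{status}',
                last_error = {err},
                last_updated = current_timestamp()
            WHERE kafka_topic = '{topic}'
        """)

    set_status(topic, "RUNNING")
    try:
        process_topic(topic)                       # your Bronze ingest + Silver MERGE for this topic
        set_status(topic, "SUCCESS")
    except Exception as e:
        set_status(topic, "FAILED", str(e)[:500])  # truncate long stack traces
        raise                                      # surface the failure so ADF retry/alerting kicks in
    ```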

    Operational Simplicity at Scale

    By combining:

    • Metadata-driven orchestration
    • Reusable notebooks
    • Control table governance
    • ADF’s orchestration and scheduling capabilities

    you can minimize manual touchpoints. Onboarding a new table or adapting to schema changes becomes as simple as updating a row in your control table; no additional code changes are required.

    Conclusion:

    This design supports a robust, extensible ingestion framework capable of handling not just the 800 current topics but also scaling as more are added. It aligns with Microsoft best practices for modern data engineering using Azure-native tools and Databricks.

    I hope this information helps. Please do let us know if you have any further queries.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

    Thank you.

