optimize ETL implementation between CDC catch-up and real-time streaming

Janice Chi 140 Reputation points
2025-06-05T15:06:52.3033333+00:00

Question: We are working on a regulated data migration project where data flows from an on-prem IBM DB2 system through IBM InfoSphere CDC into Kafka on GCP, and is finally processed using Azure Databricks and written to Azure SQL Hyperscale. Azure Data Factory is used for orchestration.

Our architecture supports two distinct ingestion modes:

  1. Catch-up (CDC) using fixed offset ranges (batch-oriented)
  2. Real-time streaming using watermark logic (structured streaming)

From a design and implementation perspective, we want to optimize development and maintenance effort by reusing as many components as possible across both ingestion modes — without compromising scalability or data correctness.

Could you please advise:

  • Which components can be safely reused across both batch and streaming pipelines?
  • Are there any performance or architectural risks if we share transformation and reconciliation modules across both?
  • Any best practices from Microsoft’s reference implementations or guidelines that support modular, mode-agnostic design in ETL using Databricks and Azure SQL?

We want to ensure consistency across both pipelines while still respecting the different trigger patterns, offset logic, and reconciliation frequency in CDC vs. streaming.
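
To make the reuse question concrete, here is a minimal, framework-free sketch of the split being asked about: two thin, mode-specific sources feed one shared transformation and reconciliation step. Every name in it (`ChangeEvent`, `catch_up_source`, `streaming_source`, `transform`, `reconcile`, `run_pipeline`) and the operation codes are hypothetical and not taken from the actual project. It also assumes a single partition and treats the watermark as a plain event-time cutoff, which simplifies how Structured Streaming handles late data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One CDC record as it might arrive from Kafka (hypothetical shape)."""
    offset: int            # offset within a single partition (assumption)
    event_time: int        # epoch seconds carried by the CDC record
    op: str                # "I", "U" or "D" (assumed operation codes)
    key: str
    value: Optional[dict]  # None for deletes


def catch_up_source(log: List[ChangeEvent], start: int, end: int) -> List[ChangeEvent]:
    """Catch-up mode: bounded read of the half-open offset range [start, end)."""
    return [e for e in log if start <= e.offset < end]


def streaming_source(micro_batch: List[ChangeEvent], watermark: int) -> List[ChangeEvent]:
    """Streaming mode: keep only events at or after the event-time watermark."""
    return [e for e in micro_batch if e.event_time >= watermark]


def transform(event: ChangeEvent) -> dict:
    """Shared, mode-agnostic mapping from a CDC event to a target-row action."""
    action = "delete" if event.op == "D" else "upsert"
    return {"key": event.key, "action": action,
            "value": event.value, "offset": event.offset}


def reconcile(events: List[ChangeEvent], rows: List[dict]) -> bool:
    """Shared placeholder check: every input offset produced exactly one row."""
    return [e.offset for e in events] == [r["offset"] for r in rows]


def run_pipeline(events: List[ChangeEvent]) -> List[dict]:
    """Shared core: identical for both modes, only the source differs."""
    rows = [transform(e) for e in events]
    if not reconcile(events, rows):
        raise ValueError("reconciliation failed: input and output offsets differ")
    return rows


if __name__ == "__main__":
    log = [
        ChangeEvent(0, 100, "I", "a", {"v": 1}),
        ChangeEvent(1, 110, "U", "a", {"v": 2}),
        ChangeEvent(2, 120, "D", "a", None),
    ]
    print(run_pipeline(catch_up_source(log, 0, 2)))
    print(run_pipeline(streaming_source(log, watermark=110)))
```

In this sketch only the two source functions know about offsets or watermarks; the trigger pattern and reconciliation frequency would live in whatever calls `run_pipeline`, which is the part we would expect to differ between the two modes.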

Azure Databricks

1 answer

  1. J N S S Kasyap 3,625 Reputation points Microsoft External Staff Moderator
    2025-06-05T15:36:10.23+00:00

    Hi @Janice Chi
    To give more specific advice, could you please answer a few follow-up questions?

    1. What data volume or load are you expecting for both batch and streaming? This may impact how components should be optimized or designed. 
    2. Are there specific transformation operations that you find complex or error-prone in your current implementation? 
    3. Do you have existing performance metrics from your current setup that might indicate areas for improvement? 
    4. How critical is real-time processing in your application? Would minor delays in the streaming pipeline be acceptable? 
