Question: We are working on a regulated data migration project where data flows from an on-prem IBM DB2 system through IBM InfoSphere CDC into Kafka on GCP, and is finally processed using Azure Databricks and written to Azure SQL Hyperscale. Azure Data Factory is used for orchestration.
Our architecture supports two distinct ingestion modes:
- Catch-up (CDC) using fixed offset ranges (batch-oriented)
- Real-time streaming using watermark logic (Structured Streaming)
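
To illustrate what we mean by the two modes, here is a minimal sketch of the two read paths on the Databricks side; the topic name, broker address, and offset values are placeholders, not our actual configuration:

```python
# Minimal sketch of the two read paths (PySpark on Databricks).
# Topic name, bootstrap servers, and offset values are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

KAFKA_OPTS = {
    "kafka.bootstrap.servers": "broker1:9092",   # placeholder
    "subscribe": "db2.cdc.orders",               # placeholder topic
}

# 1) Catch-up (CDC) mode: bounded batch read over a fixed offset range.
catchup_df = (
    spark.read.format("kafka")
    .options(**KAFKA_OPTS)
    .option("startingOffsets", '{"db2.cdc.orders": {"0": 1000}}')
    .option("endingOffsets",   '{"db2.cdc.orders": {"0": 5000}}')
    .load()
)

# 2) Real-time mode: unbounded streaming read with an event-time watermark
#    on the Kafka ingestion timestamp column.
streaming_df = (
    spark.readStream.format("kafka")
    .options(**KAFKA_OPTS)
    .option("startingOffsets", "latest")
    .load()
    .withWatermark("timestamp", "10 minutes")
)
```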
From a design and implementation perspective, we want to minimize development and maintenance effort by reusing as many components as possible across both ingestion modes, without compromising scalability or data correctness.
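
As one example of the kind of reuse we have in mind, a pure DataFrame-to-DataFrame transformation should be callable from both pipelines, since batch and Structured Streaming share the DataFrame API. The payload schema and column names below are only illustrative assumptions:

```python
# Sketch of a mode-agnostic transformation module: a pure
# DataFrame -> DataFrame function works on both batch and
# streaming inputs, since both share the DataFrame API.
# The JSON payload schema and column names are illustrative.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

PAYLOAD_SCHEMA = StructType([
    StructField("order_id", StringType()),
    StructField("op_code",  StringType()),      # I/U/D flag from InfoSphere CDC
    StructField("event_ts", TimestampType()),
])

def standardize_cdc_events(df: DataFrame) -> DataFrame:
    """Parse the Kafka value payload and normalize CDC metadata.

    Works unchanged on batch and streaming DataFrames.
    """
    return (
        df.select(F.from_json(F.col("value").cast("string"), PAYLOAD_SCHEMA).alias("r"))
          .select("r.*")
          .withColumn("is_delete", F.col("op_code") == F.lit("D"))
    )

# The same function would be reused by both pipelines:
# batch_out  = standardize_cdc_events(catchup_df)
# stream_out = standardize_cdc_events(streaming_df)
```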
Could you please advise:
- Which components can be safely reused across both batch and streaming pipelines?
- Are there any performance or architectural risks if we share transformation and reconciliation modules across both modes?
- Are there best practices from Microsoft's reference implementations or guidance that support a modular, mode-agnostic ETL design using Databricks and Azure SQL?
We want to ensure consistency across both pipelines while still respecting the different trigger patterns, offset logic, and reconciliation frequency in CDC vs. streaming.
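
As a further illustration of the reuse we are aiming for, a single writer function could serve as both the batch sink and the streaming `foreachBatch` callback, so the Azure SQL write logic lives in one place. The connection details, table name, and checkpoint path below are placeholders:

```python
# Sketch of a shared writer: the same function is called directly by the
# catch-up pipeline and registered as the streaming foreachBatch callback,
# so the Azure SQL write logic is defined exactly once.
# Connection details, table name, and checkpoint path are placeholders.
from pyspark.sql import DataFrame

JDBC_URL = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"  # placeholder

def write_to_azure_sql(df: DataFrame, batch_id: int = -1) -> None:
    (
        df.write.format("jdbc")
          .option("url", JDBC_URL)
          .option("dbtable", "dbo.orders_staging")   # placeholder target table
          .option("user", "<user>")                  # use a secret scope in practice
          .option("password", "<password>")
          .mode("append")
          .save()
    )

# Catch-up pipeline: direct call on the bounded DataFrame.
# write_to_azure_sql(batch_out)

# Streaming pipeline: the same function as the foreachBatch callback.
# (stream_out.writeStream
#    .foreachBatch(write_to_azure_sql)
#    .option("checkpointLocation", "/mnt/checkpoints/orders")  # placeholder path
#    .trigger(processingTime="1 minute")
#    .start())
```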