optimize ETL implementation between CDC catch-up and real-time streaming


Anonymous

Question: We are working on a regulated data migration project where data flows from an on-prem IBM DB2 system through IBM InfoSphere CDC into Kafka on GCP, and is finally processed using Azure Databricks and written to Azure SQL Hyperscale. Azure Data Factory is used for orchestration.

Our architecture supports two distinct ingestion modes:

Catch-up (CDC) using fixed offset ranges (batch-oriented)

Real-time streaming using watermark logic (structured streaming)

From a design and implementation perspective, we want to optimize development and maintenance effort by reusing as many components as possible across both ingestion modes — without compromising scalability or data correctness.

Could you please advise:

Which components can be safely reused across both batch and streaming pipelines?

Are there any performance or architectural risks if we share transformation and reconciliation modules across both?

Any best practices from Microsoft’s reference implementations or guidelines that support modular, mode-agnostic design in ETL using Databricks and Azure SQL?

We want to ensure consistency across both pipelines while still respecting the different trigger patterns, offset logic, and reconciliation frequency in CDC vs. streaming.

Chandra Boorla 15,475 Reputation points Microsoft External Staff Moderator

2025-06-09T21:08:18.0133333+00:00

@Janice Chi

We haven’t heard back from you on the last response and are just checking whether you have a resolution yet. If you do, please share it with the community, as it can help others. Otherwise, reply with more details and we will try to help.


Answer 1

Anonymous

Hi @Janice Chi
To give more appropriate advice, here are a few follow-up questions. Your answers will help:

What data volume or load are you expecting for both batch and streaming? This may impact how components should be optimized or designed.
Are there specific transformation operations that you find complex or error-prone in your current implementation?
Do you have existing performance metrics from your current setup that might indicate areas for improvement?
How critical is real-time processing in your application? Would minor delays in the streaming pipeline be acceptable?

Anonymous

2025-06-05T17:10:12.79+00:00
Q1. What data volume or load are you expecting for both batch and streaming?

Answer:

Batch (Catch-Up): We are handling 88 TB of total historical volume, with 15 large tables contributing ~70 TB. In catch-up mode, each batch may range from 100 GB to 1 TB, depending on the Kafka offset window defined per table.

Streaming: Expected load is 3600–29,000 CDC events/sec across 891 Kafka topics (each mapped to one table). Each micro-batch in streaming could carry several MBs to a few GBs depending on burst size and watermark interval.

Q2. Are there specific transformation operations that you find complex or error-prone in your current implementation?

Answer: Yes, a few areas are transformation-sensitive:

Deduplication using CDC flags (I/U/D) and composite keys is critical and error-prone during late-arriving messages in streaming.

Partition logic is complex, especially for large tables with skewed data when pushed into Hyperscale.

Data type normalization and string trimming/padding mismatches between DB2 and Azure SQL Hyperscale can introduce silent errors.

In streaming, idempotency logic (MERGE) must be precise to avoid duplicate inserts due to replays from Kafka.

Q3. Do you have existing performance metrics from your current setup that might indicate areas for improvement?

Answer: Preliminary metrics (from batch POC runs):

Kafka Read Throughput (batch): ~200 MB/sec per Databricks executor

Write to Hyperscale (via JDBC): ~5–10 million rows/hour per table using upsert logic with index tuning

Bronze-to-Silver latency: ~8–15 mins per batch depending on volume

Areas flagged for improvement:

Slowdowns when joining with enrichment tables inside Spark

Occasional JDBC bottlenecks during simultaneous writes to Hyperscale across large tables

Metadata overhead in ADF pipelines adds orchestration latency in batch mode

Q4. How critical is real-time processing in your application? Would minor delays in the streaming pipeline be acceptable?

Answer:

Real-time is important but not ultra-strict. The use case is operational decision-making, not transactional fraud detection.

A delay of 1–2 minutes per watermark window is acceptable. However, accuracy and auditability are non-negotiable, meaning recon and retry logic must be solid.

Data correctness and reconciliation are more important than sub-second latency.

Let me know if you want these stitched into an email format or included in your design document as assumptions and justifications.

Excellent, here are model answers to those follow-up questions, fully aligned with our healthcare project involving DB2, Kafka (GCP), Databricks, and Azure SQL Hyperscale:

Q1. What data volume or load are you expecting for both batch and streaming?

Answer:

Batch (Catch-Up):
We are handling 80 TB of total historical volume, with 17 large tables contributing ~70 TB. In catch-up mode, each batch may range from 100 GB to 1 TB, depending on the Kafka offset window defined per table.

Streaming:
Expected load is 3000–24,000 CDC events/sec across 800 Kafka topics (each mapped to one table). Each micro-batch in streaming could carry several MBs to a few GBs depending on burst size and watermark interval.

Q2. Are there specific transformation operations that you find complex or error-prone in your current implementation?

Answer:
Yes, a few areas are transformation-sensitive:

Deduplication using CDC flags (I/U/D) and composite keys is critical and error-prone during late-arriving messages in streaming.

Partition logic is complex, especially for large tables with skewed data when pushed into Hyperscale.

Data type normalization and string trimming/padding mismatches between DB2 and Azure SQL Hyperscale can introduce silent errors.

In streaming, idempotency logic (MERGE) must be precise to avoid duplicate inserts due to replays from Kafka.

Q3. Do you have existing performance metrics from your current setup that might indicate areas for improvement?

Answer:
Preliminary metrics (from batch POC runs):

Kafka Read Throughput (batch): ~200 MB/sec per Databricks executor

Write to Hyperscale (via JDBC): ~5–10 million rows/hour per table using upsert logic with index tuning

Bronze-to-Silver latency: ~8–15 mins per batch depending on volume

Areas flagged for improvement:

Slowdowns when joining with enrichment tables inside Spark

Occasional JDBC bottlenecks during simultaneous writes to Hyperscale across large tables

Metadata overhead in ADF pipelines adds orchestration latency in batch mode

Q4. How critical is real-time processing in your application? Would minor delays in the streaming pipeline be acceptable?

Answer:

Real-time is important but not ultra-strict.
The use case is operational decision-making, not transactional fraud detection.

A delay of 1–2 minutes per watermark window is acceptable.
However, accuracy and auditability are non-negotiable, meaning recon and retry logic must be solid.

Data correctness and reconciliation are more important than sub-second latency.
Smaran Thoomu 35,125 Reputation points Microsoft External Staff Moderator

2025-06-06T04:56:24.87+00:00
@Janice Chi Thanks for the details. It's very helpful to understand both the architecture and your goals around optimizing for reuse between your CDC catch-up and real-time streaming pipelines.
Let me walk through each of your key questions and provide guidance with that in mind:

Which components can be safely reused across both batch and streaming pipelines?

You're definitely on the right track thinking modular. Here are some components that are typically reusable across both ingestion modes:

Transformation logic: If you abstract your parsing, column renaming, deduplication, and type casting into modular Spark functions or helper notebooks, you can apply them in both batch and streaming jobs.

Schema enforcement and type normalization: Common especially with DB2 → Azure SQL mappings (e.g., trimming, padding, decimals).

CDC flag handling: Your I/U/D logic using composite keys can be wrapped into reusable UDFs or Delta Lake MERGE strategies.

Error logging, validation, and enrichment logic: Can be reused with parameterization (e.g., whether to run with full lookup join or broadcast join depending on mode).

Delta table outputs (Bronze/Silver): If you're using Delta Lake, you can unify batch and streaming writes - both support MERGE/upserts.
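
As a rough illustration of the shared normalization and CDC-flag handling listed above, here is a minimal, engine-agnostic sketch in plain Python. The field names (op, seq), the column sets, and both helper names are assumptions made for the example, not something defined by InfoSphere CDC or Databricks. In Spark the same rules would normally be written as DataFrame operations, but the logic is the part that can be shared by both modes.

```python
from decimal import Decimal

# Hypothetical column metadata: fixed-width DB2 CHAR columns arrive space-padded.
CHAR_COLUMNS = {"member_id", "status"}
DECIMAL_COLUMNS = {"amount"}


def normalize_row(row):
    """Trim CHAR padding and convert decimals so both modes emit identical values."""
    out = dict(row)
    for col in CHAR_COLUMNS:
        if out.get(col) is not None:
            out[col] = out[col].rstrip(" ")
    for col in DECIMAL_COLUMNS:
        if out.get(col) is not None:
            out[col] = Decimal(str(out[col]))
    return out


def deduplicate(rows, primary_keys, seq_col="seq"):
    """Keep only the latest change event per composite key (highest sequence wins)."""
    latest = {}
    for row in rows:
        key = tuple(row[k] for k in primary_keys)
        if key not in latest or row[seq_col] > latest[key][seq_col]:
            latest[key] = row
    return list(latest.values())


# Example: two keys, each with an older and a newer event.
events = [
    {"member_id": "A1   ", "status": "NEW ", "amount": "10.50", "op": "I", "seq": 1},
    {"member_id": "A1   ", "status": "OPEN", "amount": "12.00", "op": "U", "seq": 2},
    {"member_id": "B2   ", "status": "NEW ", "amount": "7.25", "op": "I", "seq": 3},
    {"member_id": "B2   ", "status": "NEW ", "amount": "7.25", "op": "D", "seq": 4},
]
clean = deduplicate([normalize_row(e) for e in events], primary_keys=["member_id"])
```

Normalizing before deduplicating matters here: if the padded and trimmed forms of a key were compared, the same business key could be treated as two different rows.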

Any performance or architectural risks if we share transformation and reconciliation modules across both?

There are a few important ones to be aware of:

Offset logic differs: Batch uses predefined Kafka offset windows, while streaming uses watermarking. Any shared component must respect how “completeness” is defined in each mode.

Idempotency risks in streaming: Late or replayed Kafka events can lead to duplicates unless your MERGE logic is solid. You’ll need to ensure your logic can handle reprocessing safely - batch is more forgiving here.

Resource contention: Large batch jobs and real-time micro-batches can stress Hyperscale if they hit simultaneously. This applies especially to JDBC writes, so it is better to tune concurrency and retry logic separately for each mode.

Reconciliation granularity: In batch, recon might be row-level or table-level post-load. In streaming, you may need recon at the micro-batch level and checkpointing.

So yes, transformation logic can often be shared, but orchestration, retries, and reconciliation must be mode aware.
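
To make the idempotency point concrete, here is a small sketch of a replay-safe apply step, written in plain Python with dictionaries standing in for the target table. It assumes every event carries a sequence value (seq) that increases per key, for example a Kafka offset when a given key always lands on the same partition. That assumption, the field names, and the function name are illustrative only. In the real pipeline the same guard would typically live in the MERGE condition.

```python
def apply_changes(target, applied_seq, events, primary_keys, seq_col="seq"):
    """Apply I/U/D events to `target` so that replaying the same events is a no-op.

    target:      dict mapping composite key -> row (stands in for the SQL table)
    applied_seq: dict mapping composite key -> highest sequence already applied
    Returns the number of events that actually changed state.
    """
    changed = 0
    for ev in sorted(events, key=lambda e: e[seq_col]):
        key = tuple(ev[k] for k in primary_keys)
        if ev[seq_col] <= applied_seq.get(key, -1):
            continue  # replayed or already superseded event: skip it
        if ev["op"] == "D":
            target.pop(key, None)
        else:  # "I" and "U" are both treated as upserts
            target[key] = {k: v for k, v in ev.items() if k not in ("op", seq_col)}
        applied_seq[key] = ev[seq_col]
        changed += 1
    return changed


target, applied = {}, {}
batch = [
    {"id": 1, "name": "a", "op": "I", "seq": 10},
    {"id": 1, "name": "b", "op": "U", "seq": 11},
    {"id": 2, "name": "c", "op": "I", "seq": 12},
    {"id": 2, "name": "c", "op": "D", "seq": 13},
]
first = apply_changes(target, applied, batch, ["id"])
replay = apply_changes(target, applied, batch, ["id"])  # simulated Kafka replay
```

The count returned per call can also feed a micro-batch reconciliation check, since a replayed batch should report zero applied changes.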

Best practices from Microsoft or Databricks for modular, mode-agnostic ETL design?

Here are a few:

Parameterize ingestion notebooks: Pass in mode (catchup vs streaming), table name, and other metadata. Let the same notebook handle both with slight branching.

Use Delta Lake as a unifying format: Write batch and streaming both to Bronze, process to Silver using shared logic.

Build shared helper libraries/notebooks:

apply_schema(df)

deduplicate(df, primary_keys)

enrich_with_reference_data(df)

Abstract write logic with retry policies: JDBC to Hyperscale often needs batching, retry with exponential backoff, and attention to indexes.

Structured Streaming tips:

Use Trigger.Once for batch-like runs.

Checkpoint frequently and test replay scenarios.

Monitor watermark lag and micro-batch execution times.
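
The retry-with-backoff practice mentioned above can be sketched as a small wrapper around whatever function performs the write. This is only a sketch: the function name, the default limits, and the choice of retryable exception types are assumptions, and the actual exceptions raised by a JDBC write from Spark will differ. It is only safe to retry if the write itself is idempotent.

```python
import random
import time


def with_retry(write_fn, max_attempts=5, base_delay=1.0, max_delay=30.0,
               retryable=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call write_fn(); on a retryable error wait base_delay * 2**attempt (capped, plus jitter)."""
    for attempt in range(max_attempts):
        try:
            return write_fn()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the orchestrator
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 10))


# Example: a write that fails twice with a transient error, then succeeds.
attempts = {"n": 0}


def flaky_write():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient failure")
    return "written"


delays = []
result = with_retry(flaky_write, sleep=delays.append)  # record delays instead of sleeping
```

Passing different max_attempts and base_delay values per mode is one way to keep the wrapper shared while letting catch-up and streaming use different retry budgets.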

Reference: You might find Microsoft’s Medallion Architecture a good guiding principle - works equally well for CDC and real-time if structured carefully.

Given the scale you're working with (88 TB historical data and up to 29,000 CDC events/sec), your hybrid ingestion design is well justified. The key transformation challenges you mentioned - such as deduplication logic, partition skew, and data type mismatches - are common but manageable with modularization.

Since your real-time processing requirement is flexible (1–2 minutes delay acceptable), it’s advisable to prioritize data correctness and reconciliation first, rather than sub-second latency.

You may consider aligning both ingestion modes using a unified micro-batch architecture with watermark logic, allowing maximum reuse of transformation, validation, and write components.

I hope this information helps. Please do let us know if you have any further queries.

Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.
Anonymous

2025-06-06T09:14:16.32+00:00

What would change in the above, and how, if the acceptable delay were only 1–10 seconds instead of 1–2 minutes? Also, please explain this in more detail: "You may consider aligning both ingestion modes using a unified micro-batch architecture with watermark logic, allowing maximum reuse of transformation, validation, and write components."
Smaran Thoomu 35,125 Reputation points Microsoft External Staff Moderator

2025-06-09T05:03:29.3733333+00:00
Hi @Janice Chi
To clarify your follow-up:

If your latency requirement is now 1–10 seconds instead of 1–2 minutes:

This changes the equation. You’re no longer in relaxed operational streaming territory - you’re approaching near-real-time SLA.

Key changes needed:

Trigger interval in your streaming job must be reduced to 1–5 seconds.

Cluster sizing needs to be increased or tightly autoscaled to handle small, high-frequency micro-batches.

JDBC writes to Hyperscale may become a bottleneck — at this frequency, write buffering or alternate sync strategies may be required.

Deduplication, idempotency, and watermarking logic must be extremely efficient. You can’t afford retries or slow merge performance under 10s SLA.

If your existing CDC catch-up logic wasn’t built with this SLA in mind, you’ll need to refactor those pieces for low-latency readiness - there’s no shortcut.
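
A quick back-of-the-envelope check, using only the figures quoted earlier in this thread, gives a feel for the write budget. The 5-second trigger interval is an assumption, and the JDBC rate comes from batch POC runs, so it may not carry over to many small, frequent writes.

```python
PEAK_EVENTS_PER_SEC = 29_000                   # upper bound quoted for the streaming load
TOPICS = 891                                   # one Kafka topic per table
JDBC_ROWS_PER_HOUR = (5_000_000, 10_000_000)   # upsert rate per table from the batch POC
TRIGGER_SECONDS = 5                            # assumed trigger interval for a 1-10 s SLA

events_per_trigger = PEAK_EVENTS_PER_SEC * TRIGGER_SECONDS
avg_events_per_table_per_sec = PEAK_EVENTS_PER_SEC / TOPICS
jdbc_rows_per_sec = tuple(rate / 3600 for rate in JDBC_ROWS_PER_HOUR)

print(f"events per {TRIGGER_SECONDS}s trigger (all tables): {events_per_trigger:,}")
print(f"average events/sec per table at peak: {avg_events_per_table_per_sec:.1f}")
print(f"JDBC upsert rate per table: "
      f"{jdbc_rows_per_sec[0]:,.0f} to {jdbc_rows_per_sec[1]:,.0f} rows/sec")
```

On these averages (about 145,000 events per trigger overall, about 32.5 events/sec per table, against roughly 1,389 to 2,778 rows/sec of measured upsert capacity per table), raw throughput is unlikely to be the first limit. The likelier pressure points are skew toward a few hot tables and the fixed per-write overhead of issuing a MERGE every few seconds across hundreds of tables.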

On the “Unified Micro-Batch Architecture with Watermark Logic”:

What it means in practice:

Use Structured Streaming for both batch and real-time (just switch the trigger type).

Trigger.Once() for catch-up.

Trigger.ProcessingTime("5 seconds") for real-time.

Same logic, different parameters:

Same notebook/codebase can run both modes by passing flags like mode=catchup|streaming.

Watermarks help ensure correctness for both:

For batch, they're just boundaries.

For streaming, they help with late data control and deduplication.

Example: Instead of writing two separate pipelines, you write one ingestion notebook with parameterized logic:

trigger = {"once": True} if mode == "catchup" else {"processingTime": "5 seconds"}  # PySpark: writeStream.trigger(**trigger)

This allows shared code for:

schema enforcement

deduplication

enrichment

reconciliation

output writes
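
A minimal PySpark-style skeleton of that single parameterized entry point might look like the following. Treat it as a sketch under assumptions: the function names, the process_batch placeholder, and the option values are hypothetical, and the Spark session is passed in rather than created here. In PySpark the trigger is supplied as keyword arguments to trigger() rather than as Trigger objects.

```python
def trigger_kwargs(mode, interval="5 seconds"):
    """Map the pipeline mode to keyword arguments for DataStreamWriter.trigger()."""
    if mode == "catchup":
        return {"once": True}               # process what is available, then stop
    if mode == "streaming":
        return {"processingTime": interval}
    raise ValueError(f"unknown mode: {mode!r}")


def process_batch(batch_df, batch_id):
    """Shared per-micro-batch logic (placeholder): schema, dedup, enrich, recon, write."""
    # Hypothetical shared helpers would be called here, for example
    # apply_schema(batch_df) and deduplicate(batch_df, primary_keys).


def start_ingestion(spark, mode, topic, bootstrap_servers, checkpoint_path):
    """Start one Structured Streaming query for either mode; only the trigger differs."""
    source = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .option("startingOffsets", "earliest")
        .load()
    )
    return (
        source.writeStream.foreachBatch(process_batch)
        .option("checkpointLocation", checkpoint_path)
        .trigger(**trigger_kwargs(mode))
        .start()
    )
```

Two caveats apply. On recent Spark and Databricks runtimes, availableNow=True is generally preferred over once=True, so check the runtime version in use. Also, the streaming Kafka source does not take a fixed end offset, so catch-up runs that must stop at an exact offset boundary may still need a batch read (spark.read with endingOffsets) that calls the same shared functions.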

If this isn't acceptable, then you’ll need to build separate pipelines per mode - but that’s a conscious trade-off between latency and maintainability.

I hope this information helps. Please do let us know if you have any further queries.

Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.
