Hi @Janice Chi
I see you are comparing Azure Data Factory (ADF) and Databricks for loading data into Azure SQL Hyperscale in different scenarios. Let's address your questions:
Q1: DB2 Partition Size vs. Kafka Offset Volume
Yes, your assumption is generally correct - DB2 table partitions will typically represent much larger data volumes than a given range of Kafka offsets.
- DB2 Partitions (especially in large on-prem systems) are usually defined on physical data characteristics - e.g., date ranges, row counts, or size thresholds. Each partition may hold millions of rows or several GBs/TBs.
- Kafka Offsets, by contrast, are message sequence numbers within a topic partition - typically fine-grained CDC change events (INSERT/UPDATE/DELETE) representing only deltas.
- So, even a range like offsets 1 to 100 might represent only about a hundred change events, while a single DB2 partition could hold billions of rows.
This distinction is key in understanding why your one-time load throughput requirements are drastically higher than those for streaming catch-up.
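To make the contrast concrete, here is a minimal Python sketch (assuming the confluent-kafka client; the broker, topic, and group id are placeholders) that counts how many messages a partition holds between its low and high watermark offsets - the result is a message count, not a data-volume figure comparable to a DB2 partition:

```python
# Minimal sketch: measure how many CDC messages sit in a Kafka topic partition.
# Assumes the confluent-kafka package; broker, topic, and group id are placeholders.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "broker:9092",   # placeholder broker
    "group.id": "offset-inspection",      # throwaway group; nothing is committed
    "enable.auto.commit": False,
})

tp = TopicPartition("db2.cdc.orders", 0)  # hypothetical CDC topic, partition 0

# Low/high watermarks bound the offsets currently retained in the partition.
low, high = consumer.get_watermark_offsets(tp, timeout=10)
print(f"Partition holds {high - low} change events (offsets {low}..{high - 1})")
# A DB2 table partition, by contrast, is measured in rows/GB, not message counts.

consumer.close()
```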
Q2: ADF for One-Time Load & Databricks for Catch-Up – Is That the Right Split?
Yes - your proposed separation of tools makes complete architectural sense. Here's why:
ADF for One-Time Historical Load (Batch)
- Best suited for large-scale batch ETL.
- Can leverage parallel copy, partitioned reads (with SHIR), and robust retry logic.
- Optimized for structured ingestion workflows, with cost-effective compute and easier orchestration.
- Supports DB2 on-prem extraction and Hyperscale ingestion in a highly manageable pipeline.
Best Practices:
- Use parallel copy activities with source partitioning to optimize throughput (a sketch for pre-computing range boundaries follows this list).
- Tune batch size, `writeBatchTimeout`, and Hyperscale ingestion concurrency for performance.
- Consider staging to ADLS if any transient transformation is needed before load.
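If the DB2 connector's built-in partitioning options don't fit your table, a common pattern is to pre-compute key ranges and feed them to parallel copy activities as pipeline parameters. A minimal sketch, assuming pyodbc with an IBM DB2 ODBC DSN and a hypothetical SALES.ORDERS table keyed by a numeric ORDER_ID:

```python
# Sketch: derive N evenly sized key ranges to drive parallel ADF copy activities.
# Assumes pyodbc with an IBM DB2 ODBC DSN; table and column names are placeholders.
import pyodbc

NUM_RANGES = 8  # one range per parallel copy activity

conn = pyodbc.connect("DSN=DB2PROD;UID=etl_user;PWD=***")  # placeholder DSN/credentials
cur = conn.cursor()

cur.execute("SELECT MIN(ORDER_ID), MAX(ORDER_ID) FROM SALES.ORDERS")
min_id, max_id = cur.fetchone()

step = (max_id - min_id) // NUM_RANGES + 1
ranges = [
    (min_id + i * step, min(min_id + (i + 1) * step - 1, max_id))
    for i in range(NUM_RANGES)
]

# Each (lower, upper) pair becomes a parameterized source query in ADF, e.g.:
#   SELECT * FROM SALES.ORDERS WHERE ORDER_ID BETWEEN @lower AND @upper
for lower, upper in ranges:
    print(f"ORDER_ID BETWEEN {lower} AND {upper}")

conn.close()
```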
Databricks for Catch-Up & Real-Time Ingestion (Streaming)
- Spark Structured Streaming is ideal for continuous CDC-based Kafka ingestion.
- Handles schema evolution, windowing, out-of-order events, and custom transformation logic.
- Supports autoscaling clusters, checkpointing, and can write directly to Hyperscale (via JDBC) or stage to Delta → Synapse/ADF.
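For illustration, a minimal PySpark Structured Streaming sketch that reads the CDC topic and lands it in a staging Delta table with checkpointing (the broker, topic, table, and checkpoint path are placeholders; `spark` is the Databricks notebook session):

```python
# Sketch: continuous Kafka ingestion into a staging Delta table on Databricks.
# Broker, topic, table, and checkpoint path are placeholders; `spark` is the notebook session.
from pyspark.sql import functions as F

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "db2.cdc.orders")             # hypothetical CDC topic
    .option("startingOffsets", "earliest")             # catch up from the earliest retained offset
    .load()
)

# Kafka delivers key/value as binary; decode before any downstream parsing.
events = raw.select(
    F.col("key").cast("string").alias("key"),
    F.col("value").cast("string").alias("payload"),
    "topic", "partition", "offset", "timestamp",
)

(
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/chk/orders_cdc")  # placeholder checkpoint path
    .outputMode("append")
    .toTable("staging.orders_cdc")                        # hypothetical staging table
)
```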
Best Practices:
- Use trigger-once (e.g., `availableNow`) or micro-batch mode if Hyperscale cannot keep up with low-latency writes.
- Consider writing to staging Delta tables first, then bulk-load to Hyperscale to avoid JDBC bottlenecks.
- Monitor JDBC write performance to Hyperscale; use batch size tuning (e.g., 10K–50K rows per batch).
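Building on the staging table above, here is a hedged sketch of micro-batch writes to Hyperscale via `foreachBatch` with a tuned JDBC batch size (the JDBC URL, credentials, target table, and the 50K batch size are illustrative assumptions):

```python
# Sketch: push each micro-batch from the staging stream into Hyperscale over JDBC.
# JDBC URL, credentials, and target table are placeholders; batchsize is a starting point to tune.
JDBC_URL = "jdbc:sqlserver://myserver.database.windows.net:1433;database=hyperscale_db"

def write_to_hyperscale(batch_df, batch_id):
    (
        batch_df.write
        .format("jdbc")
        .option("url", JDBC_URL)
        .option("dbtable", "dbo.orders_cdc")  # hypothetical target table
        .option("user", "etl_user")           # placeholder credentials
        .option("password", "***")
        .option("batchsize", 50000)           # rows per JDBC batch; tune in the 10K-50K range
        .mode("append")
        .save()
    )

(
    spark.readStream.table("staging.orders_cdc")  # staging Delta table from the previous sketch
    .writeStream
    .foreachBatch(write_to_hyperscale)
    .option("checkpointLocation", "/mnt/chk/orders_to_hyperscale")
    .trigger(availableNow=True)                   # trigger-once style catch-up; remove for continuous
    .start()
)
```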
Recommendation:
| Scenario | Tool | Rationale |
|---|---|---|
| One-time historical load | ADF | Efficient, scalable batch ETL with DB2 + SQL Hyperscale |
| CDC-based catch-up / real-time | Databricks | Flexible, streaming-capable, handles high EPS + Kafka |
I hope this information helps.
Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.