ADF vs Databricks for Load in ETL into Hyperscale

Janice Chi 140 Reputation points
2025-06-18T07:28:38.9333333+00:00

We are working on a highly sensitive healthcare data migration project involving:

Source: IBM DB2 (on-prem) with partitioned tables (up to 18 TB in size).

CDC: IBM InfoSphere CDC → Kafka Topics (on GCP).

Target: Azure SQL Hyperscale.

There are two ingestion modes:

  1. One-Time Historical Migration: 79 TB total across 790 tables, using ADF to extract from DB2 and load into Hyperscale, with Databricks only doing minor transformations.
  2. Catch-Up & Real-Time Streaming: CDC events flow continuously through Kafka (390 EPS avg, 39,000 EPS peak), and must be loaded into Hyperscale after transformation.

Questions:

We’ve observed that:

- DB2 table partitions (based on data size) do **not** align with Kafka topic partitions (based on offsets).
- In one-time load, we read full partitions from DB2.
- In catch-up, we read CDC data using Kafka offsets.

**Q1:** Is it correct to assume that the size of a DB2 table partition (in one-time load) will almost always be much **larger** than the volume of data associated with a range of Kafka offsets (e.g., offset 1 to 100) in catch-up ingestion? Could you please explain the technical reasoning?
For loading data into Azure SQL Hyperscale:

- In the **one-time load**, we’re currently using ADF.
- For **catch-up**, we’re considering Databricks due to its streaming nature.

**Q2:** Should we continue using ADF for the one-time load and switch to Databricks for catch-up and real-time ingestion? What are the best practices, performance trade-offs, and scaling limitations in deciding between ADF and Databricks for writing into Hyperscale across both scenarios?

1 answer

  1. Smaran Thoomu 24,110 Reputation points Microsoft External Staff Moderator
    2025-06-18T08:10:44.63+00:00

    Hi @Janice Chi
    I see you are comparing Azure Data Factory (ADF) and Databricks for loading data into Azure SQL Hyperscale in different scenarios. Let's address your questions:

    Q1: DB2 Partition Size vs. Kafka Offset Volume

    Yes, your assumption is generally correct - DB2 table partitions will typically represent much larger data volumes than a given range of Kafka offsets.

    • DB2 Partitions (especially in large on-prem systems) are usually defined by physical data characteristics - e.g., date ranges, row counts, or size thresholds. Each partition may hold millions of rows and several GBs/TBs of data.
    • Kafka Offsets, by contrast, are message sequence numbers within a topic partition - typically fine-grained CDC change events (INSERT/UPDATE/DELETE) representing only deltas.
    • So even a range like offset 1 to 100 represents only about a hundred change records, while a single DB2 partition could hold billions of rows.

    This distinction is key in understanding why your one-time load throughput requirements are drastically higher than those for streaming catch-up.
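    To make the offset-range point concrete, here is a minimal PySpark sketch of a bounded batch read from Kafka; the broker address, topic name, and offset values are hypothetical placeholders, not taken from your environment:

    ```python
    # Minimal sketch; assumes the spark-sql-kafka connector is available on the cluster.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("offset-range-size-check").getOrCreate()

    # Bounded batch read of offsets 1-100 from partition 0 of a CDC topic (placeholders).
    bounded = (
        spark.read.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")        # placeholder broker
        .option("subscribe", "db2_cdc_topic")                      # placeholder topic
        .option("startingOffsets", '{"db2_cdc_topic":{"0":1}}')
        .option("endingOffsets", '{"db2_cdc_topic":{"0":100}}')
        .load()
    )

    # Each Kafka record is one CDC change event, so ~100 offsets yield roughly 100 rows -
    # orders of magnitude smaller than a multi-TB DB2 table partition.
    print(bounded.count())
    ```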

    Q2: ADF for One-Time Load & Databricks for Catch-Up – Is That the Right Split?

    Yes - your proposed separation of tools makes complete architectural sense. Here's why:

    ADF for One-Time Historical Load (Batch)

    1. Best suited for large-scale batch ETL.
    2. Can leverage parallel copy, partitioned reads (via a self-hosted integration runtime, SHIR), and robust retry logic.
    3. Optimized for structured ingestion workflows, with cost-effective compute and easier orchestration.
    4. Supports DB2 on-prem extraction and Hyperscale ingestion in a highly manageable pipeline.

    Best Practices:

    1. Use parallel copy activities with source partitioning to optimize throughput.
    2. Tune writeBatchSize, writeBatchTimeout, and Hyperscale ingestion concurrency for performance.
    3. Consider staging to ADLS if any transient transformation is needed before load (a minimal sketch of this pattern follows below).
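    As a rough illustration of best practice 3, the following PySpark sketch shows the Databricks transformation step reading a staged DB2 partition extract from ADLS, applying a minor transformation, and bulk-writing it to Hyperscale over JDBC. The storage path, table names, and credentials are hypothetical placeholders, and your pipeline may instead let ADF perform the final copy:

    ```python
    # Minimal sketch, assuming the DB2 extract was staged to ADLS as Parquet.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("historical-partition-load").getOrCreate()

    # Read one staged DB2 partition extract from ADLS (path is a placeholder).
    staged = spark.read.parquet(
        "abfss://staging@examplestorage.dfs.core.windows.net/db2/claims/part=2021_q1/"
    )

    # Minor transformation only, as described for the one-time load.
    cleaned = staged.withColumn("load_ts", F.current_timestamp())

    # Bulk write to Azure SQL Hyperscale over JDBC with a tuned batch size.
    (
        cleaned.write.format("jdbc")
        .option("url", "jdbc:sqlserver://exampleserver.database.windows.net:1433;database=exampledb")
        .option("dbtable", "dbo.claims_stage")
        .option("user", "loader")           # use a Databricks secret scope in practice
        .option("password", "<secret>")
        .option("batchsize", 50000)
        .mode("append")
        .save()
    )
    ```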

    Databricks for Catch-Up & Real-Time Ingestion (Streaming)

    1. Spark Structured Streaming is ideal for continuous CDC-based Kafka ingestion.
    2. Handles schema evolution, windowing, out-of-order events, and custom transformation logic.
    3. Supports autoscaling clusters and checkpointing, and can write directly to Hyperscale (via JDBC) or stage to Delta tables for a downstream Synapse/ADF load.

    Best Practices:

    1. Use trigger-once or micro-batch mode if Hyperscale cannot keep up with low-latency ingestion.
    2. Consider writing to staging Delta tables first, then bulk-loading into Hyperscale to avoid JDBC bottlenecks.
    3. Monitor JDBC write performance to Hyperscale and tune the batch size (e.g., 10K–50K rows per batch); a minimal sketch combining these practices follows below.
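    Putting these practices together, here is a minimal Structured Streaming sketch for the catch-up path: each Kafka micro-batch is landed in a staging Delta table and then pushed to Hyperscale inside foreachBatch with a tuned JDBC batch size. The broker, topic, paths, and connection details are hypothetical placeholders, and the payload parsing will depend on your InfoSphere CDC message format:

    ```python
    # Minimal sketch of Kafka -> Delta staging -> Hyperscale via foreachBatch.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("cdc-catchup").getOrCreate()

    cdc_raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
        .option("subscribe", "db2_cdc_topic")                 # placeholder topic
        .option("startingOffsets", "earliest")
        .option("maxOffsetsPerTrigger", 500000)               # cap micro-batches at the 39K EPS peak
        .load()
    )

    # Payload parsing depends on the InfoSphere CDC serialization; kept generic here.
    events = cdc_raw.select(
        F.col("value").cast("string").alias("payload"),
        F.col("timestamp").alias("kafka_ts"),
    )

    def write_micro_batch(batch_df, batch_id):
        # 1) Land the micro-batch in a staging Delta table for replay and audit.
        batch_df.write.format("delta").mode("append").saveAsTable("staging.cdc_events")
        # 2) Push the same micro-batch to Hyperscale over JDBC with a tuned batch size.
        (
            batch_df.write.format("jdbc")
            .option("url", "jdbc:sqlserver://exampleserver.database.windows.net:1433;database=exampledb")
            .option("dbtable", "dbo.cdc_events_stage")
            .option("user", "loader")          # use a secret scope in practice
            .option("password", "<secret>")
            .option("batchsize", 20000)
            .mode("append")
            .save()
        )

    query = (
        events.writeStream
        .foreachBatch(write_micro_batch)
        .option("checkpointLocation", "abfss://chk@examplestorage.dfs.core.windows.net/cdc_catchup/")
        .trigger(processingTime="1 minute")    # or availableNow=True for trigger-once catch-up
        .start()
    )
    ```

    Keeping the Delta staging write and the JDBC push inside the same foreachBatch call ties both to the stream's checkpoint, which simplifies replay and reconciliation after a failure.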

    Recommendation:

    | Scenario | Tool | Rationale |
    |---|---|---|
    | One-time Historical Load | ADF | Efficient, scalable batch ETL with DB2 + SQL Hyperscale |
    | CDC-based Catch-up / Real-time | Databricks | Flexible, streaming-capable, handles high EPS + Kafka |

    I hope this information helps.


    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

