ADF vs Databricks for Load in ETL into Hyperscale

Janice Chi 140 Reputation points
2025-06-18T07:28:38.9333333+00:00

We are working on a highly sensitive healthcare data migration project involving:

Source: IBM DB2 (on-prem) with partitioned tables (up to 18 TB in size).

CDC: IBM InfoSphere CDC → Kafka Topics (on GCP).

Target: Azure SQL Hyperscale.

There are two ingestion modes:

  1. One-Time Historical Migration: 79 TB total across 790 tables, using ADF to extract from DB2 and load into Hyperscale, with Databricks only doing minor transformations.
  2. Catch-Up & Real-Time Streaming: CDC events flow continuously through Kafka (390 EPS avg, 39,000 EPS peak), and must be loaded into Hyperscale after transformation.

Questions:

We’ve observed that:

- DB2 table partitions (based on data size) do **not** align with Kafka topic partitions (based on offsets).
- In one-time load, we read full partitions from DB2.
- In catch-up, we read CDC data using Kafka offsets.

**Q1:** Is it correct to assume that the size of a DB2 table partition (in one-time load) will almost always be much **larger** than the volume of data associated with a range of Kafka offsets (e.g., offset 1 to 100) in catch-up ingestion? Could you please explain the technical reasoning?
For loading data into Azure SQL Hyperscale:

- In the **one-time load**, we’re currently using ADF.
- For **catch-up**, we’re considering Databricks due to its streaming nature.

**Q2:** Should we continue using ADF for the one-time load and switch to Databricks for catch-up and real-time ingestion? What are the best practices, performance trade-offs, and scaling limitations in deciding between ADF and Databricks for writing into Hyperscale across both scenarios?

1 answer

  1. Smaran Thoomu 24,110 Reputation points Microsoft External Staff Moderator
    2025-06-18T08:10:44.63+00:00

    Hi @Janice Chi
    I see you are comparing Azure Data Factory (ADF) and Databricks for loading data into Azure SQL Hyperscale in different scenarios. Let's address your questions:

    Q1: DB2 Partition Size vs. Kafka Offset Volume

    Yes, your assumption is generally correct - DB2 table partitions will typically represent much larger data volumes than a given range of Kafka offsets.

    • DB2 Partitions (especially in large on-prem systems) are usually defined by physical data characteristics - e.g., date ranges, row counts, or size thresholds. Each partition may hold millions of rows and several GBs/TBs of data.
    • Kafka Offsets, by contrast, are message sequence numbers within a topic partition - typically fine-grained CDC change events (INSERT/UPDATE/DELETE) representing only deltas.
    • So even a range like offset 1 to 100 represents only about a hundred change records, while a single DB2 partition could hold billions of rows.

    This distinction is key in understanding why your one-time load throughput requirements are drastically higher than those for streaming catch-up.
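    To make the offset-range point concrete, here is a minimal PySpark sketch of a bounded batch read from Kafka; the broker address, topic name, and offset values are hypothetical placeholders, not taken from your environment:

    ```python
    # Minimal sketch; assumes the spark-sql-kafka connector is available on the cluster.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("offset-range-size-check").getOrCreate()

    # Bounded batch read of offsets 1-100 from partition 0 of a CDC topic (placeholders).
    bounded = (
        spark.read.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")        # placeholder broker
        .option("subscribe", "db2_cdc_topic")                      # placeholder topic
        .option("startingOffsets", '{"db2_cdc_topic":{"0":1}}')
        .option("endingOffsets", '{"db2_cdc_topic":{"0":100}}')
        .load()
    )

    # Each Kafka record is one CDC change event, so ~100 offsets yield roughly 100 rows -
    # orders of magnitude smaller than a multi-TB DB2 table partition.
    print(bounded.count())
    ```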

    Q2: ADF for One-Time Load & Databricks for Catch-Up – Is That the Right Split?

    Yes - your proposed separation of tools makes complete architectural sense. Here's why:

    ADF for One-Time Historical Load (Batch)

    1. Best suited for large-scale batch ETL.
    2. Can leverage parallel copy, partitioned reads (via a self-hosted integration runtime, SHIR), and robust retry logic.
    3. Optimized for structured ingestion workflows, with cost-effective compute and easier orchestration.
    4. Supports DB2 on-prem extraction and Hyperscale ingestion in a highly manageable pipeline.

    Best Practices:

    1. Use parallel copy activities with source partitioning to optimize throughput.
    2. Tune writeBatchSize, writeBatchTimeout, and Hyperscale ingestion concurrency for performance.
    3. Consider staging to ADLS if any transient transformation is needed before load (a minimal sketch of this pattern follows below).
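    As a rough illustration of best practice 3, the following PySpark sketch shows the Databricks transformation step reading a staged DB2 partition extract from ADLS, applying a minor transformation, and bulk-writing it to Hyperscale over JDBC. The storage path, table names, and credentials are hypothetical placeholders, and your pipeline may instead let ADF perform the final copy:

    ```python
    # Minimal sketch, assuming the DB2 extract was staged to ADLS as Parquet.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("historical-partition-load").getOrCreate()

    # Read one staged DB2 partition extract from ADLS (path is a placeholder).
    staged = spark.read.parquet(
        "abfss://staging@examplestorage.dfs.core.windows.net/db2/claims/part=2021_q1/"
    )

    # Minor transformation only, as described for the one-time load.
    cleaned = staged.withColumn("load_ts", F.current_timestamp())

    # Bulk write to Azure SQL Hyperscale over JDBC with a tuned batch size.
    (
        cleaned.write.format("jdbc")
        .option("url", "jdbc:sqlserver://exampleserver.database.windows.net:1433;database=exampledb")
        .option("dbtable", "dbo.claims_stage")
        .option("user", "loader")           # use a Databricks secret scope in practice
        .option("password", "<secret>")
        .option("batchsize", 50000)
        .mode("append")
        .save()
    )
    ```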

    Databricks for Catch-Up & Real-Time Ingestion (Streaming)

    1. Spark Structured Streaming is ideal for continuous CDC-based Kafka ingestion.
    2. Handles schema evolution, windowing, out-of-order events, and custom transformation logic.
    3. Supports autoscaling clusters and checkpointing, and can write directly to Hyperscale (via JDBC) or stage to Delta tables for a downstream Synapse/ADF load.

    Best Practices:

    1. Use trigger-once or micro-batch mode if Hyperscale cannot keep up with low-latency ingestion.
    2. Consider writing to staging Delta tables first, then bulk-loading into Hyperscale to avoid JDBC bottlenecks.
    3. Monitor JDBC write performance to Hyperscale and tune the batch size (e.g., 10K–50K rows per batch); a minimal sketch combining these practices follows below.
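    Putting these practices together, here is a minimal Structured Streaming sketch for the catch-up path: each Kafka micro-batch is landed in a staging Delta table and then pushed to Hyperscale inside foreachBatch with a tuned JDBC batch size. The broker, topic, paths, and connection details are hypothetical placeholders, and the payload parsing will depend on your InfoSphere CDC message format:

    ```python
    # Minimal sketch of Kafka -> Delta staging -> Hyperscale via foreachBatch.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("cdc-catchup").getOrCreate()

    cdc_raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
        .option("subscribe", "db2_cdc_topic")                 # placeholder topic
        .option("startingOffsets", "earliest")
        .option("maxOffsetsPerTrigger", 500000)               # cap micro-batches at the 39K EPS peak
        .load()
    )

    # Payload parsing depends on the InfoSphere CDC serialization; kept generic here.
    events = cdc_raw.select(
        F.col("value").cast("string").alias("payload"),
        F.col("timestamp").alias("kafka_ts"),
    )

    def write_micro_batch(batch_df, batch_id):
        # 1) Land the micro-batch in a staging Delta table for replay and audit.
        batch_df.write.format("delta").mode("append").saveAsTable("staging.cdc_events")
        # 2) Push the same micro-batch to Hyperscale over JDBC with a tuned batch size.
        (
            batch_df.write.format("jdbc")
            .option("url", "jdbc:sqlserver://exampleserver.database.windows.net:1433;database=exampledb")
            .option("dbtable", "dbo.cdc_events_stage")
            .option("user", "loader")          # use a secret scope in practice
            .option("password", "<secret>")
            .option("batchsize", 20000)
            .mode("append")
            .save()
        )

    query = (
        events.writeStream
        .foreachBatch(write_micro_batch)
        .option("checkpointLocation", "abfss://chk@examplestorage.dfs.core.windows.net/cdc_catchup/")
        .trigger(processingTime="1 minute")    # or availableNow=True for trigger-once catch-up
        .start()
    )
    ```

    Keeping the Delta staging write and the JDBC push inside the same foreachBatch call ties both to the stream's checkpoint, which simplifies replay and reconciliation after a failure.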

    Recommendation:

    | Scenario | Tool | Rationale |
    |---|---|---|
    | One-time Historical Load | ADF | Efficient, scalable batch ETL with DB2 + SQL Hyperscale |
    | CDC-based Catch-up / Real-time | Databricks | Flexible, streaming-capable, handles high EPS + Kafka |

    I hope this information helps.


    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

