Guidance on Best Practices for Loading CDC-Merged Data from ADLS to Azure SQL Hyperscale

Janice Chi
2025-06-21T09:23:23.7866667+00:00

As part of our healthcare data modernization initiative, we are currently implementing the Catch-Up phase of a large-scale CDC pipeline. The architecture involves:

  • All insert/update/delete (CDC merge) logic is handled in the Databricks branch layer (ADLS Gen2)

  • The resulting merged Delta data is to be loaded into Azure SQL Hyperscale

  • The main Hyperscale table already contains the full historical data from a previous one-time load (~80 TB)

We are reaching out to confirm Microsoft-recommended best practices for the following scenario:


Question:

Given that all CDC merge logic is completed in Databricks and the data in ADLS is already final and de-duplicated, what is the recommended and performance-optimal approach for loading merged Delta data from ADLS to Azure SQL Hyperscale?

Specifically:

Is it recommended to perform a direct batch insert (via Databricks JDBC) into Hyperscale main table without using an intermediate staging table in this case?

Are there size limits, concurrency concerns, or tuning guidelines we should follow when using Spark JDBC for large volumes (partitioned by Kafka offset windows)?

Can Microsoft suggest checkpointing or retry-safe patterns that are compatible with ADLS + Databricks → Hyperscale ingestion for CDC-based pipelines?

Should Hyperscale be prepared with any specific indexing or minimal constraints to ensure optimal write performance from Spark?

We are aiming for a pattern that is idempotent, retry-safe, and aligned with reconciliation — where control tables track offset window status per topic + partition.

Azure Data Factory

1 answer

  1. Vinodh247 (MVP, Volunteer Moderator)
    2025-06-21T11:33:40.5333333+00:00

    Hi,

    Thanks for reaching out to Microsoft Q&A.

    Your architecture aligns well with common data modernization patterns, especially in the healthcare sector, where historical data and CDC-driven changes are unified. Below are some Microsoft- and real-world-tested best practices for optimal performance, idempotency, and reliability.

    Use of Staging Table: Recommended

    Why: Even if your Delta output is deduplicated and finalized, using an intermediate staging table (in Hyperscale) provides the following benefits:

    • Supports idempotent retry logic (load -> validate -> merge -> commit).

    • Enables bulk insert operations without locking or impacting the main table.

    • Allows partition-wise reconciliation using Kafka offset metadata.

    Best Practice:

    • Create a staging table with a schema identical to the main table but with no constraints (no primary keys, minimal indexes).

    • Include metadata columns (kafka_topic, partition, offset_window) to track CDC batch context.
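    The load -> validate -> merge -> commit cycle against the control table can be sketched as below. This is a minimal in-memory illustration of the idempotency contract only (the control dict stands in for a Hyperscale control table, and the table/status names are hypothetical, not from any Microsoft sample); in practice the status reads and writes would be SQL statements against Hyperscale:

    ```python
    def load_window(control, topic, partition, offset_window, load_fn):
        """Idempotently process one Kafka offset window.

        control       -- dict keyed by (topic, partition, offset_window),
                         standing in for a control table in Hyperscale
        load_fn       -- callable performing staging insert + validation + merge
        Returns "SKIPPED" for already-committed windows, "LOADED" otherwise.
        """
        key = (topic, partition, offset_window)

        # Already committed: a retry of this window must be a no-op.
        if control.get(key) == "COMMITTED":
            return "SKIPPED"

        # Mark in-progress before loading so a crash leaves evidence behind.
        control[key] = "IN_PROGRESS"

        # Staging load + validation + MERGE into the main table happens here.
        # If it raises, the window stays IN_PROGRESS and is safe to retry.
        load_fn()

        # Commit only after the merge succeeded.
        control[key] = "COMMITTED"
        return "LOADED"
    ```

    A retried batch therefore either skips cleanly (already committed) or reruns the staging load, which is safe because the merge into the main table is keyed on the business key rather than blind appends.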

    Insert Mode from Databricks: Via JDBC with Partitioning

    Recommended Pattern: Use Spark JDBC + batch insert with the following:

    • Partition by Kafka offset window or logical CDC batch time window (ingestion_timestamp, window_id).
    • Set:
      • batchsize = 10,000 to 100,000 depending on Hyperscale throughput.
      • numPartitions = <logical partitions count> (usually based on Kafka topic-partition structure).
    • Set isolationLevel to "NONE" or "READ_UNCOMMITTED" to avoid locks.
    • JDBC write options: avoid single-threaded inserts or massive unpartitioned writes; they will choke Hyperscale log I/O.

      # jdbc_url and the staging table name below are illustrative placeholders
      (df.write
          .format("jdbc")
          .option("url", jdbc_url)
          .option("dbtable", "dbo.stg_cdc_batch")
          .option("batchsize", 50000)
          .option("truncate", "false")
          .option("isolationLevel", "NONE")
          .option("numPartitions", 8)
          .option("reliabilityLevel", "best_effort")  # optional; Microsoft Spark connector only
          .mode("append")
          .save())
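    For retry safety, the batch write can be wrapped in a bounded-retry loop with exponential backoff. A minimal sketch, assuming the write itself is idempotent per offset window (the attempt counts and delays here are illustrative defaults, not Microsoft-recommended numbers, and `write_fn` stands in for the `df.write...save()` call above):

    ```python
    import time

    def write_with_retry(write_fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
        """Retry an idempotent batch write with exponential backoff.

        write_fn -- zero-argument callable performing the JDBC write
        sleep    -- injectable for testing; defaults to time.sleep
        Raises the last exception if all attempts fail.
        """
        for attempt in range(1, max_attempts + 1):
            try:
                return write_fn()
            except Exception:
                if attempt == max_attempts:
                    raise  # exhausted: surface the failure to the orchestrator
                # Back off 1x, 2x, 4x ... the base delay between attempts.
                sleep(base_delay * 2 ** (attempt - 1))
    ```

    Because the control table marks each offset window before and after loading, a retried write replays into the staging table and re-merges, rather than double-inserting into the main table.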

    Please 'Upvote' (thumbs up) and 'Accept as answer' if the reply was helpful. This will benefit other community members who face the same issue.

