Guidance on Best Practices for Loading CDC-Merged Data from ADLS to Azure SQL Hyperscale

Janice Chi
2025-06-21T09:23:23.7866667+00:00

As part of our healthcare data modernization initiative, we are currently implementing the Catch-Up phase of a large-scale CDC pipeline. The architecture involves:

  • All insert/update/delete (CDC merge) logic is handled in the Databricks branch layer (ADLS Gen2)

  • The resulting merged Delta data is to be loaded into Azure SQL Hyperscale

  • The main Hyperscale table already contains the full historical data from a previous one-time load (~80 TB)

We are reaching out to confirm Microsoft-recommended best practices for the following scenario:


Question:

Given that all CDC merge logic is completed in Databricks and the data in ADLS is already final and de-duplicated, what is the recommended and performance-optimal approach for loading merged Delta data from ADLS to Azure SQL Hyperscale?

Specifically:

Is it recommended to perform a direct batch insert (via Databricks JDBC) into Hyperscale main table without using an intermediate staging table in this case?

Are there size limits, concurrency concerns, or tuning guidelines we should follow when using Spark JDBC for large volumes (partitioned by Kafka offset windows)?

Can Microsoft suggest checkpointing or retry-safe patterns that are compatible with ADLS + Databricks → Hyperscale ingestion for CDC-based pipelines?

Should Hyperscale be prepared with any specific indexing or minimal constraints to ensure optimal write performance from Spark?

We are aiming for a pattern that is idempotent, retry-safe, and aligned with reconciliation — where control tables track offset window status per topic + partition.

Azure Data Factory

1 answer

  1. Vinodh247 (MVP, Volunteer Moderator)
    2025-06-21T11:33:40.5333333+00:00

    Hi,

    Thanks for reaching out to Microsoft Q&A.

    Your architecture aligns well with common data modernization patterns, especially in the healthcare sector, where historical data and CDC-driven changes are unified. Below are some Microsoft- and real-world-tested best practices for optimal performance, idempotency, and reliability.

    Use of Staging Table: Recommended

    Why: Even if your Delta output is deduplicated and finalized, using an intermediate staging table (in Hyperscale) provides the following benefits:

    • Supports idempotent retry logic (load -> validate -> merge -> commit).

    • Enables bulk insert operations without locking or impacting the main table.

    • Allows partition-wise reconciliation using Kafka offset metadata.

    Best Practice:

    • Create a staging table with a schema identical to the main table but with no constraints (no primary keys, minimal indexes).

    • Include metadata columns (kafka_topic, partition, offset_window) to track CDC batch context.
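    The load -> validate -> merge -> commit cycle against the control table can be sketched as below. This is a minimal in-memory illustration of the idempotency contract only (the control dict stands in for a Hyperscale control table, and the table/status names are hypothetical, not from any Microsoft sample); in practice the status reads and writes would be SQL statements against Hyperscale:

    ```python
    def load_window(control, topic, partition, offset_window, load_fn):
        """Idempotently process one Kafka offset window.

        control       -- dict keyed by (topic, partition, offset_window),
                         standing in for a control table in Hyperscale
        load_fn       -- callable performing staging insert + validation + merge
        Returns "SKIPPED" for already-committed windows, "LOADED" otherwise.
        """
        key = (topic, partition, offset_window)

        # Already committed: a retry of this window must be a no-op.
        if control.get(key) == "COMMITTED":
            return "SKIPPED"

        # Mark in-progress before loading so a crash leaves evidence behind.
        control[key] = "IN_PROGRESS"

        # Staging load + validation + MERGE into the main table happens here.
        # If it raises, the window stays IN_PROGRESS and is safe to retry.
        load_fn()

        # Commit only after the merge succeeded.
        control[key] = "COMMITTED"
        return "LOADED"
    ```

    A retried batch therefore either skips cleanly (already committed) or reruns the staging load, which is safe because the merge into the main table is keyed on the business key rather than blind appends.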

    Insert Mode from Databricks: Via JDBC with Partitioning

    Recommended Pattern: Use Spark JDBC + batch insert with the following:

    • Partition by Kafka offset window or logical CDC batch time window (ingestion_timestamp, window_id).
    • Set:
      • batchsize = 10,000 to 100,000 depending on Hyperscale throughput.
      • numPartitions = <logical partitions count> (usually based on Kafka topic-partition structure).
    • Set isolationLevel to "NONE" or "READ_UNCOMMITTED" to avoid locks.
    • JDBC write options: avoid single-threaded inserts or massive unpartitioned writes; they will choke Hyperscale log I/O.

      # jdbc_url and the staging table name below are illustrative placeholders
      (df.write
          .format("jdbc")
          .option("url", jdbc_url)
          .option("dbtable", "dbo.stg_cdc_batch")
          .option("batchsize", 50000)
          .option("truncate", "false")
          .option("isolationLevel", "NONE")
          .option("numPartitions", 8)
          .option("reliabilityLevel", "best_effort")  # optional; Microsoft Spark connector only
          .mode("append")
          .save())
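    For retry safety, the batch write can be wrapped in a bounded-retry loop with exponential backoff. A minimal sketch, assuming the write itself is idempotent per offset window (the attempt counts and delays here are illustrative defaults, not Microsoft-recommended numbers, and `write_fn` stands in for the `df.write...save()` call above):

    ```python
    import time

    def write_with_retry(write_fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
        """Retry an idempotent batch write with exponential backoff.

        write_fn -- zero-argument callable performing the JDBC write
        sleep    -- injectable for testing; defaults to time.sleep
        Raises the last exception if all attempts fail.
        """
        for attempt in range(1, max_attempts + 1):
            try:
                return write_fn()
            except Exception:
                if attempt == max_attempts:
                    raise  # exhausted: surface the failure to the orchestrator
                # Back off 1x, 2x, 4x ... the base delay between attempts.
                sleep(base_delay * 2 ** (attempt - 1))
    ```

    Because the control table marks each offset window before and after loading, a retried write replays into the staging table and re-merges, rather than double-inserting into the main table.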

    Please 'Upvote' (thumbs up) and 'Accept as answer' if the reply was helpful. This will benefit other community members who face the same issue.

