Hi,
Thanks for reaching out to Microsoft Q&A.
Your architecture aligns well with common data modernization patterns, especially in the healthcare sector, where historical loads and CDC-driven changes are being unified. Below are a couple of Microsoft-recommended, real-world-tested best practices for optimal performance, idempotency, and reliability.
Use of a Staging Table: Recommended
Why: Even if your Delta output is deduplicated and finalized, using an intermediate staging table (in Hyperscale) provides the following benefits:
- Supports idempotent retry logic (load -> validate -> merge -> commit).
- Enables bulk insert operations without locking or impacting the main table.
- Allows partition-wise reconciliation using Kafka offset metadata.
Best Practice:
- Create a staging table with a schema identical to the main table but with no constraints (no PKs, minimal indexes).
- Include metadata columns (`kafka_topic`, `partition`, `offset_window`) to track CDC batch context; see the sketch after this list.
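To make the load -> validate -> merge -> commit flow concrete, here is a minimal sketch run after the staging load completes. The table names (`dbo.claims_staging`, `dbo.claims`), the `claim_id` merge key, the `offset_window` value, and the `pyodbc` connection details are all illustrative assumptions, not part of your setup; adapt them to your actual schema and authentication.

```python
import pyodbc

# Placeholder connection details for the Hyperscale database - adjust for your environment.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<your-server>.database.windows.net;Database=<your-db>;"
    "UID=<user>;PWD=<password>;Encrypt=yes;"
)
cursor = conn.cursor()

# Hypothetical batch identifier carried in the staging table's Kafka metadata columns.
offset_window = "2024-06-01T00"

# 1) Load: Databricks has already bulk-inserted this CDC batch into dbo.claims_staging via JDBC
#    (see the write example further down), tagged with kafka_topic / partition / offset_window.

# 2) Validate: confirm the staged row count for this offset window before merging.
cursor.execute("SELECT COUNT(*) FROM dbo.claims_staging WHERE offset_window = ?", offset_window)
staged_rows = cursor.fetchone()[0]
if staged_rows == 0:
    raise ValueError(f"No staged rows found for offset window {offset_window}; aborting merge.")

# 3) Merge: idempotent upsert into the main table keyed on the business key,
#    so re-running the same batch after a failure leaves the same end state.
merge_sql = """
MERGE dbo.claims AS tgt
USING (SELECT * FROM dbo.claims_staging WHERE offset_window = ?) AS src
    ON tgt.claim_id = src.claim_id
WHEN MATCHED THEN
    UPDATE SET tgt.status = src.status, tgt.updated_at = src.updated_at
WHEN NOT MATCHED THEN
    INSERT (claim_id, status, updated_at) VALUES (src.claim_id, src.status, src.updated_at);
"""
cursor.execute(merge_sql, offset_window)

# 4) Commit: clear only the reconciled offset window from staging, then commit both
#    statements in one transaction (pyodbc autocommit is off by default).
cursor.execute("DELETE FROM dbo.claims_staging WHERE offset_window = ?", offset_window)
conn.commit()
```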
Insert Mode from Databricks: Via JDBC with Partitioning
Recommended Pattern: Use Spark JDBC + batch insert with the following:
- Partition by Kafka offset window or logical CDC batch time window (`ingestion_timestamp`, `window_id`).
- Set:
  - `batchsize` = 10,000 to 100,000, depending on Hyperscale throughput.
  - `numPartitions` = <logical partition count> (usually based on the Kafka topic-partition structure).
  - `isolationLevel` = `"NONE"` or `"READ_UNCOMMITTED"` to avoid locks.
- JDBC write options:

  ```python
  .option("batchsize", 50000)
  .option("truncate", "false")
  .option("isolationLevel", "NONE")
  .option("numPartitions", 8)
  .option("reliabilityLevel", "best_effort")  # optional
  ```

- Avoid single-threaded inserts or massive unpartitioned writes; they will choke Hyperscale log I/O.
Please 'Upvote' (Thumbs-up) and 'Accept as answer' if the reply was helpful. This will benefit other community members who face the same issue.