CDC Merge Hyperscale Options

Janice Chi 140 Reputation points
2025-06-17T16:08:22.0066667+00:00

In our current project, we have already completed a historical load of ~80 TB into Azure SQL Hyperscale, and the table content in Hyperscale is in sync with our "branch" Delta Lake table in Databricks.

For catch-up CDC ingestion, incremental I/U/D operations come via IBM InfoSphere CDC to Kafka, and we flatten those CDC events into a Delta table (CDC_Flattened).

We are evaluating two architecture options to bring these CDC changes into Hyperscale:

Option 1: Merge into Delta Branch, then copy to Hyperscale Staging → Hyperscale Main

Option 2: Skip Delta Branch; merge directly from CDC_Flattened into Hyperscale Staging/Main

Our customer prefers Option 2 to avoid extra hops. Could you please confirm:

Is Option 2 (direct CDC merge into Hyperscale) technically safe and supported under enterprise-grade reliability?

Are there any performance, merge-conflict, or transaction-boundary concerns when doing a direct CDC merge into Hyperscale at scale (tens of millions of rows per day)?

What are the recommended best practices, especially around:

• Merge keys (e.g., PK, natural key)

• Error handling and retries

• Recon and deduplication

• Merge batching (e.g., per partition, per offset range)

Any known **limitations or considerations** in Hyperscale (concurrency, LSN locks, write throughput) when used for frequent streaming-like merges?
        

We are using Azure Databricks for all transformation and orchestration and writing to Hyperscale via JDBC.


1 answer

  1. Chandra Boorla 14,585 Reputation points Microsoft External Staff Moderator
    2025-06-17T17:04:52.92+00:00

    @Janice Chi

    Thank you for outlining your architecture and the two options under evaluation. Given your scale and the criticality of maintaining reliable CDC ingestion into Azure SQL Hyperscale, here’s a breakdown of Option 2 and best practices based on Microsoft’s guidance and large-scale implementations.

    Is Option 2 (Direct Merge into Hyperscale) Technically Safe and Enterprise-Grade?

    Yes, Option 2 is technically feasible and can be made enterprise-grade if implemented with proper orchestration. Azure SQL Hyperscale is architected for high concurrency and throughput, but some important considerations apply when performing frequent MERGE or UPSERT operations at scale.

    Key considerations for Hyperscale with Direct CDC Merge

    Merge Performance & Concurrency

    • Hyperscale supports concurrent reads/writes well due to its architecture (separated compute & storage).
    • However, large or frequent MERGE statements can still contend on hot pages, indexes, and transaction logs.
    • For tens of millions of rows per day, it’s recommended to:
      • Break merges into micro-batches (e.g., 10k–100k rows per batch).
      • Use partition-based logic (e.g., by date, Kafka offset, shard ID); see the batching sketch after this list.
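
    To make this concrete, here is a minimal PySpark sketch of offset-range micro-batching. It assumes CDC_Flattened carries a kafka_offset column and that the staging table is named dbo.stg_cdc_events; both names are illustrative, not taken from your setup:

    ```python
    # Minimal sketch: write CDC_Flattened to Hyperscale staging in micro-batches
    # keyed by Kafka offset range. Table/column names are illustrative assumptions.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    JDBC_URL = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"
    BATCH_SIZE = 100_000  # target rows per micro-batch (10k-100k, as above)

    cdc = spark.read.table("cdc_flattened")

    # Derive a batch id from the Kafka offset so each micro-batch covers a
    # contiguous, replayable offset range.
    cdc = cdc.withColumn("batch_id", (F.col("kafka_offset") / BATCH_SIZE).cast("long"))

    batch_ids = [r["batch_id"] for r in
                 cdc.select("batch_id").distinct().orderBy("batch_id").collect()]
    for bid in batch_ids:
        batch = cdc.filter(F.col("batch_id") == bid).drop("batch_id")
        (batch.write
              .format("jdbc")
              .option("url", JDBC_URL)
              .option("dbtable", "dbo.stg_cdc_events")
              .option("user", "<user>")
              .option("password", "<password>")
              .option("batchsize", 10_000)  # rows per JDBC insert round-trip
              .mode("append")
              .save())
    ```

    Each staged batch would then be merged and the staging table cleared before the next batch lands, so a failed batch can simply be replayed from its offset range.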

    Transaction Scope

    • Long-running transactions may lead to versioning pressure or log retention issues, even in Hyperscale.
    • Prefer smaller transactions, ideally idempotent operations (to allow retries).

    Recommended best practices

    Merge Keys - Use Primary Keys where available. If using a natural key, ensure it’s consistent and immutable across systems (to avoid mismatches).
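
    As a hedged sketch of what the per-batch merge could look like when keyed on the primary key, here is a T-SQL MERGE executed from Databricks via pyodbc; dbo.main_table, pk_id, op_type, and the column list are assumptions about your schema. Keeping the MERGE and staging cleanup in one small transaction also addresses the transaction-scope point above:

    ```python
    # Hedged sketch: merge one staged micro-batch into the main table, keyed on
    # the primary key. Assumes pyodbc is installed on the cluster and that the
    # staging rows were deduplicated to one row per pk_id (see the dedup sketch
    # below); MERGE fails if the source matches the same target row twice.
    import pyodbc

    MERGE_SQL = """
    MERGE dbo.main_table AS tgt
    USING dbo.stg_cdc_events AS src
        ON tgt.pk_id = src.pk_id
    WHEN MATCHED AND src.op_type = 'D' THEN
        DELETE
    WHEN MATCHED THEN
        UPDATE SET tgt.col1 = src.col1, tgt.col2 = src.col2
    WHEN NOT MATCHED AND src.op_type <> 'D' THEN
        INSERT (pk_id, col1, col2) VALUES (src.pk_id, src.col1, src.col2);
    TRUNCATE TABLE dbo.stg_cdc_events;
    """

    CONN_STR = (
        "Driver={ODBC Driver 18 for SQL Server};"
        "Server=<server>.database.windows.net;Database=<db>;"
        "UID=<user>;PWD=<password>;Encrypt=yes;"
    )

    conn = pyodbc.connect(CONN_STR, autocommit=False)
    try:
        conn.cursor().execute(MERGE_SQL)  # one short transaction per micro-batch
        conn.commit()
    except Exception:
        conn.rollback()  # MERGE and TRUNCATE both revert, so the batch is replayable
        raise
    finally:
        conn.close()
    ```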

    Error Handling & Retry Logic - Implement upsert logic with retry policies in Databricks (try-catch blocks + retries for transient JDBC failures). Consider dead-lettering irrecoverable errors for offline review.
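
    A minimal retry wrapper could look like the following, assuming the stage-and-merge step above is wrapped in a function and that some dead-letter table exists; write_batch and dead_letter are hypothetical placeholders:

    ```python
    # Hedged sketch: retry-with-backoff around a per-batch merge, dead-lettering
    # batches that exhaust their retries. In practice, narrow the except clause
    # to transient SQL error codes (connection resets, deadlocks) rather than
    # retrying every failure.
    import time

    MAX_RETRIES = 3

    def merge_with_retry(batch_id, write_batch, dead_letter):
        """Attempt one batch merge, retrying transient failures with backoff."""
        for attempt in range(1, MAX_RETRIES + 1):
            try:
                write_batch(batch_id)  # stage + MERGE, as sketched above
                return True
            except Exception as exc:
                if attempt == MAX_RETRIES:
                    dead_letter(batch_id, str(exc))  # park for offline review
                    return False
                time.sleep(2 ** attempt)  # backoff: 2s, 4s, 8s
    ```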

    Reconciliation & Deduplication - Maintain an event_id or change_sequence_number column in CDC_Flattened. Use it to ensure idempotent writes and to support exactly-once processing.
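
    For the deduplication itself, a common PySpark pattern is to keep only the highest change_sequence_number per key before staging, so replayed batches and duplicate Kafka deliveries collapse to a single row per key (pk_id and the column names are again assumptions):

    ```python
    # Hedged sketch: reduce CDC_Flattened to the latest event per primary key
    # using change_sequence_number, making downstream merges idempotent.
    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()
    cdc = spark.read.table("cdc_flattened")

    latest_first = Window.partitionBy("pk_id").orderBy(
        F.col("change_sequence_number").desc())

    deduped = (cdc
               .withColumn("rn", F.row_number().over(latest_first))
               .filter(F.col("rn") == 1)
               .drop("rn"))
    ```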

    Merge Batching Strategy - Batch by Kafka offset ranges, ingestion timestamp, or source table partition. Each batch should complete in under 5 minutes to avoid long-running transactions.

    Hyperscale-Specific considerations

    | Area | Consideration |
    | --- | --- |
    | Concurrency | Hyperscale handles concurrency well, but watch for contention on PKs or clustered indexes during merges. |
    | Transaction Log (LSN) | Large merges may retain log records longer than expected, affecting tempdb or causing log growth. |
    | Write Throughput | JDBC write speed depends on your Databricks driver config. Consider using parallel JDBC writes if needed. |
    | Index Maintenance | Ensure indexes are optimized post-merge. Consider periodic rebuild/reorg if merge volume is high. |
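
    On the index maintenance point, a scheduled Databricks job could check fragmentation and reorganize or rebuild as needed. A hedged sketch via pyodbc follows; the 10%/30% thresholds are common rules of thumb rather than Hyperscale-specific guidance, and dbo.main_table is illustrative:

    ```python
    # Hedged sketch: check index fragmentation on the merge target and
    # reorganize/rebuild as needed. Run from a scheduled maintenance job,
    # not inside the merge path.
    import pyodbc

    FRAG_SQL = """
    SELECT i.name, s.avg_fragmentation_in_percent
    FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.main_table'),
                                        NULL, NULL, 'LIMITED') AS s
    JOIN sys.indexes AS i
      ON i.object_id = s.object_id AND i.index_id = s.index_id
    WHERE i.name IS NOT NULL;
    """

    conn = pyodbc.connect(CONN_STR, autocommit=True)  # CONN_STR as in the merge sketch
    try:
        for name, frag in conn.cursor().execute(FRAG_SQL).fetchall():
            if frag > 30:
                action = "REBUILD"      # heavy fragmentation: full rebuild
            elif frag > 10:
                action = "REORGANIZE"   # moderate: cheaper, always online
            else:
                continue
            conn.cursor().execute(
                f"ALTER INDEX [{name}] ON dbo.main_table {action};")
    finally:
        conn.close()
    ```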

    Conclusion:

    Option 2 is a valid and efficient approach, especially when minimizing latency is a priority. To ensure reliability and scalability:

    • Batch and partition your merge logic.
    • Design for idempotency and retries.
    • Monitor log and index health.
    • Apply strong governance over key definitions.

    If your use case is mission-critical, also consider maintaining an audit or reconciliation table to validate data integrity post-merge.
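
    One lightweight way to do that is to append an audit row per micro-batch, capturing staged versus merged counts; the dbo.cdc_merge_audit table and its schema below are assumptions:

    ```python
    # Hedged sketch: record one audit row per micro-batch so recon jobs can
    # compare rows staged vs. rows affected by the MERGE. rows_merged can come
    # from cursor.rowcount after executing the MERGE statement.
    def record_audit(conn, batch_id, rows_staged, rows_merged):
        conn.cursor().execute(
            "INSERT INTO dbo.cdc_merge_audit "
            "(batch_id, rows_staged, rows_merged, merged_at_utc) "
            "VALUES (?, ?, ?, SYSUTCDATETIME());",
            batch_id, rows_staged, rows_merged,
        )
        conn.commit()
    ```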

    I hope this information helps. Please do let us know if you have any further queries.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

    Thank you.

