CDC Merge Hyperscale Options

Janice Chi 140 Reputation points
2025-06-17T16:08:22.0066667+00:00

In our current project, we have already completed a historical load of ~80 TB into Azure SQL Hyperscale, and the table content in Hyperscale is in sync with our "branch" Delta Lake table in Databricks.

For catch-up CDC ingestion, incremental I/U/D operations come via IBM InfoSphere CDC to Kafka, and we flatten those CDC events into a Delta table (CDC_Flattened).

We are evaluating two architecture options to bring these CDC changes into Hyperscale:

Option 1: Merge into Delta Branch, then copy to Hyperscale Staging → Hyperscale Main

Option 2: Skip Delta Branch; merge directly from CDC_Flattened into Hyperscale Staging/Main

Our customer prefers Option 2 to avoid extra hops. Could you please confirm:

Is Option 2 (direct CDC merge into Hyperscale) technically safe and supported under enterprise-grade reliability?

Are there any performance, merge-conflict, or transaction-boundary concerns when doing a direct CDC merge into Hyperscale at scale (tens of millions of rows per day)?

What are the recommended best practices, especially around:

• Merge keys (e.g., PK, natural key)

• Error handling and retries

• Recon and deduplication

• Merge batching (e.g., per partition, per offset range)

Any known **limitations or considerations** in Hyperscale (concurrency, LSN locks, write throughput) when used for frequent streaming-like merges?
        

We are using Azure Databricks for all transformation and orchestration and writing to Hyperscale via JDBC.


1 answer

  1. Chandra Boorla 14,585 Reputation points Microsoft External Staff Moderator
    2025-06-17T17:04:52.92+00:00

    @Janice Chi

    Thank you for outlining your architecture and the two options under evaluation. Given your scale and the criticality of maintaining reliable CDC ingestion into Azure SQL Hyperscale, here’s a breakdown of Option 2 and best practices based on Microsoft’s guidance and large-scale implementations.

    Is Option 2 (Direct Merge into Hyperscale) Technically Safe and Enterprise-Grade?

    Yes, Option 2 is technically feasible and can be made enterprise-grade if implemented with proper orchestration. Azure SQL Hyperscale is architected for high concurrency and throughput, but some important considerations apply when performing frequent MERGE or UPSERT operations at scale.

    Key considerations for Hyperscale with Direct CDC Merge

    Merge Performance & Concurrency

    • Hyperscale supports concurrent reads/writes well due to its architecture (separated compute & storage).
    • However, large or frequent MERGE statements can still contend on hot pages, indexes, and transaction logs.
    • For tens of millions of rows per day, it’s recommended to:
      • Break merges into micro-batches (e.g., 10k–100k rows per batch).
      • Use partition-based logic (e.g., by date, Kafka offset, shard ID); see the batching sketch after this list.
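
    To make this concrete, here is a minimal PySpark sketch of offset-range micro-batching. It assumes CDC_Flattened carries a kafka_offset column and that the staging table is named dbo.stg_cdc_events; both names are illustrative, not taken from your setup:

    ```python
    # Minimal sketch: write CDC_Flattened to Hyperscale staging in micro-batches
    # keyed by Kafka offset range. Table/column names are illustrative assumptions.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    JDBC_URL = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"
    BATCH_SIZE = 100_000  # target rows per micro-batch (10k-100k, as above)

    cdc = spark.read.table("cdc_flattened")

    # Derive a batch id from the Kafka offset so each micro-batch covers a
    # contiguous, replayable offset range.
    cdc = cdc.withColumn("batch_id", (F.col("kafka_offset") / BATCH_SIZE).cast("long"))

    batch_ids = [r["batch_id"] for r in
                 cdc.select("batch_id").distinct().orderBy("batch_id").collect()]
    for bid in batch_ids:
        batch = cdc.filter(F.col("batch_id") == bid).drop("batch_id")
        (batch.write
              .format("jdbc")
              .option("url", JDBC_URL)
              .option("dbtable", "dbo.stg_cdc_events")
              .option("user", "<user>")
              .option("password", "<password>")
              .option("batchsize", 10_000)  # rows per JDBC insert round-trip
              .mode("append")
              .save())
    ```

    Each staged batch would then be merged and the staging table cleared before the next batch lands, so a failed batch can simply be replayed from its offset range.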

    Transaction Scope

    • Long-running transactions may lead to versioning pressure or log retention issues, even in Hyperscale.
    • Prefer smaller transactions, ideally idempotent operations (to allow retries).

    Recommended best practices

    Merge Keys - Use Primary Keys where available. If using a natural key, ensure it’s consistent and immutable across systems (to avoid mismatches).
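
    As a hedged sketch of what the per-batch merge could look like when keyed on the primary key, here is a T-SQL MERGE executed from Databricks via pyodbc; dbo.main_table, pk_id, op_type, and the column list are assumptions about your schema. Keeping the MERGE and staging cleanup in one small transaction also addresses the transaction-scope point above:

    ```python
    # Hedged sketch: merge one staged micro-batch into the main table, keyed on
    # the primary key. Assumes pyodbc is installed on the cluster and that the
    # staging rows were deduplicated to one row per pk_id (see the dedup sketch
    # below); MERGE fails if the source matches the same target row twice.
    import pyodbc

    MERGE_SQL = """
    MERGE dbo.main_table AS tgt
    USING dbo.stg_cdc_events AS src
        ON tgt.pk_id = src.pk_id
    WHEN MATCHED AND src.op_type = 'D' THEN
        DELETE
    WHEN MATCHED THEN
        UPDATE SET tgt.col1 = src.col1, tgt.col2 = src.col2
    WHEN NOT MATCHED AND src.op_type <> 'D' THEN
        INSERT (pk_id, col1, col2) VALUES (src.pk_id, src.col1, src.col2);
    TRUNCATE TABLE dbo.stg_cdc_events;
    """

    CONN_STR = (
        "Driver={ODBC Driver 18 for SQL Server};"
        "Server=<server>.database.windows.net;Database=<db>;"
        "UID=<user>;PWD=<password>;Encrypt=yes;"
    )

    conn = pyodbc.connect(CONN_STR, autocommit=False)
    try:
        conn.cursor().execute(MERGE_SQL)  # one short transaction per micro-batch
        conn.commit()
    except Exception:
        conn.rollback()  # MERGE and TRUNCATE both revert, so the batch is replayable
        raise
    finally:
        conn.close()
    ```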

    Error Handling & Retry Logic - Implement upsert logic with retry policies in Databricks (try-catch blocks + retries for transient JDBC failures). Consider dead-lettering irrecoverable errors for offline review.
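
    A minimal retry wrapper could look like the following, assuming the stage-and-merge step above is wrapped in a function and that some dead-letter table exists; write_batch and dead_letter are hypothetical placeholders:

    ```python
    # Hedged sketch: retry-with-backoff around a per-batch merge, dead-lettering
    # batches that exhaust their retries. In practice, narrow the except clause
    # to transient SQL error codes (connection resets, deadlocks) rather than
    # retrying every failure.
    import time

    MAX_RETRIES = 3

    def merge_with_retry(batch_id, write_batch, dead_letter):
        """Attempt one batch merge, retrying transient failures with backoff."""
        for attempt in range(1, MAX_RETRIES + 1):
            try:
                write_batch(batch_id)  # stage + MERGE, as sketched above
                return True
            except Exception as exc:
                if attempt == MAX_RETRIES:
                    dead_letter(batch_id, str(exc))  # park for offline review
                    return False
                time.sleep(2 ** attempt)  # backoff: 2s, 4s, 8s
    ```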

    Reconciliation & Deduplication - Maintain an event_id or change_sequence_number column in CDC_Flattened. Use it to ensure idempotent writes and to support exactly-once processing.
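
    For the deduplication itself, a common PySpark pattern is to keep only the highest change_sequence_number per key before staging, so replayed batches and duplicate Kafka deliveries collapse to a single row per key (pk_id and the column names are again assumptions):

    ```python
    # Hedged sketch: reduce CDC_Flattened to the latest event per primary key
    # using change_sequence_number, making downstream merges idempotent.
    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()
    cdc = spark.read.table("cdc_flattened")

    latest_first = Window.partitionBy("pk_id").orderBy(
        F.col("change_sequence_number").desc())

    deduped = (cdc
               .withColumn("rn", F.row_number().over(latest_first))
               .filter(F.col("rn") == 1)
               .drop("rn"))
    ```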

    Merge Batching Strategy - Batch by Kafka offset ranges, ingestion timestamp, or source table partition. Each batch should complete in under 5 minutes to avoid long-running transactions.

    Hyperscale-Specific considerations

    | Area | Consideration |
    | --- | --- |
    | Concurrency | Hyperscale handles concurrency well, but watch for contention on PKs or clustered indexes during merges. |
    | Transaction Log (LSN) | Large merges may retain log records longer than expected, affecting tempdb or causing log growth. |
    | Write Throughput | JDBC write speed depends on your Databricks driver config. Consider using parallel JDBC writes if needed. |
    | Index Maintenance | Ensure indexes are optimized post-merge. Consider periodic rebuild/reorg if merge volume is high. |
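
    On the index maintenance point, a scheduled Databricks job could check fragmentation and reorganize or rebuild as needed. A hedged sketch via pyodbc follows; the 10%/30% thresholds are common rules of thumb rather than Hyperscale-specific guidance, and dbo.main_table is illustrative:

    ```python
    # Hedged sketch: check index fragmentation on the merge target and
    # reorganize/rebuild as needed. Run from a scheduled maintenance job,
    # not inside the merge path.
    import pyodbc

    FRAG_SQL = """
    SELECT i.name, s.avg_fragmentation_in_percent
    FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.main_table'),
                                        NULL, NULL, 'LIMITED') AS s
    JOIN sys.indexes AS i
      ON i.object_id = s.object_id AND i.index_id = s.index_id
    WHERE i.name IS NOT NULL;
    """

    conn = pyodbc.connect(CONN_STR, autocommit=True)  # CONN_STR as in the merge sketch
    try:
        for name, frag in conn.cursor().execute(FRAG_SQL).fetchall():
            if frag > 30:
                action = "REBUILD"      # heavy fragmentation: full rebuild
            elif frag > 10:
                action = "REORGANIZE"   # moderate: cheaper, always online
            else:
                continue
            conn.cursor().execute(
                f"ALTER INDEX [{name}] ON dbo.main_table {action};")
    finally:
        conn.close()
    ```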

    Conclusion:

    Option 2 is a valid and efficient approach, especially when minimizing latency is a priority. To ensure reliability and scalability:

    • Batch and partition your merge logic.
    • Design for idempotency and retries.
    • Monitor log and index health.
    • Apply strong governance over key definitions.

    If your use case is mission-critical, also consider maintaining an audit or reconciliation table to validate data integrity post-merge.
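
    One lightweight way to do that is to append an audit row per micro-batch, capturing staged versus merged counts; the dbo.cdc_merge_audit table and its schema below are assumptions:

    ```python
    # Hedged sketch: record one audit row per micro-batch so recon jobs can
    # compare rows staged vs. rows affected by the MERGE. rows_merged can come
    # from cursor.rowcount after executing the MERGE statement.
    def record_audit(conn, batch_id, rows_staged, rows_merged):
        conn.cursor().execute(
            "INSERT INTO dbo.cdc_merge_audit "
            "(batch_id, rows_staged, rows_merged, merged_at_utc) "
            "VALUES (?, ?, ?, SYSUTCDATETIME());",
            batch_id, rows_staged, rows_merged,
        )
        conn.commit()
    ```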

    I hope this information helps. Please do let us know if you have any further queries.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

    Thank you.

