Perform Change Data Capture

Anand Deshpande 20 Reputation points
2025-06-26T13:49:47.1466667+00:00

Hi All,

Do you have any pointers on how to perform CDC (change data capture) for inserts, updates, and deletes on Parquet files in ADLS Gen2 storage? I have another ETL tool, Ab Initio, but we want Synapse to do the heavy lifting.

Azure Synapse Analytics

Accepted answer
  1. Venkat Reddy Navari 2,975 Reputation points Microsoft External Staff Moderator
    2025-06-26T14:36:55.66+00:00

    Hi @Anand Deshpande, tracking inserts, updates, and deletes on raw Parquet files in ADLS Gen2 is a common scenario, especially since Parquet doesn't store historical changes out of the box. To handle CDC (Change Data Capture) effectively, you'll need to implement the comparison logic externally.

    Practical Steps to Implement CDC in Synapse

    Load the Latest Snapshot: Start by loading your latest Parquet file into Synapse. You can use serverless SQL with OPENROWSET, or go the Spark route if you're dealing with larger volumes or complex transformations.

    
    SELECT *  
    FROM OPENROWSET(  
        BULK 'https://<storage-account>.dfs.core.windows.net/<container>/<path>/*.parquet',  
        FORMAT = 'PARQUET'  
    ) AS [new_data];
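
    If you want the later comparison queries to reference this load by name, one option on the serverless SQL pool is to wrap the OPENROWSET call in a view. This is only a sketch: the storage URL is the same placeholder as above, and the view name new_data is chosen purely to match the comparison step below.

    
    -- Serverless SQL pool: expose the latest Parquet load as a named view
    -- so the CDC comparison queries can reference it as new_data.
    CREATE VIEW dbo.new_data AS
    SELECT *
    FROM OPENROWSET(
        BULK 'https://<storage-account>.dfs.core.windows.net/<container>/<path>/*.parquet',
        FORMAT = 'PARQUET'
    ) AS rows;
    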
    

    Compare Against Previous Snapshot: Keep a previous version of the data in a Synapse table. Then compare it with the new load using SQL logic:

    • EXCEPT works well for basic insert/delete detection.
    • For updates, calculate a hash (using HASHBYTES) across the value columns, then compare rows that share the same key but have different hash values, as sketched after this list.
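
    A minimal sketch of that hash calculation, assuming a hypothetical key column id and value columns col1 and col2 (swap in your actual schema):

    
    -- Derive a per-row hash over the value columns; rows are later matched
    -- on id and flagged as updates when the hashes differ.
    SELECT  id,
            HASHBYTES('SHA2_256', CONCAT(col1, '|', col2)) AS row_hash
    FROM    new_data;
    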

    Identify Inserts, Deletes, Updates: Here's how you might break it down (a combined sketch follows the list):

    • Inserts: new_data EXCEPT old_snapshot
    • Deletes: old_snapshot EXCEPT new_data
    • Updates: join on key columns and compare hash values
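
    A sketch of those three checks, again assuming tables (or views) new_data and old_snapshot with a hypothetical key column id and value columns col1 and col2:

    
    -- Inserts: keys that appear only in the new load
    SELECT id FROM new_data
    EXCEPT
    SELECT id FROM old_snapshot;
    
    -- Deletes: keys that appear only in the previous snapshot
    SELECT id FROM old_snapshot
    EXCEPT
    SELECT id FROM new_data;
    
    -- Updates: same key, different hash of the value columns
    SELECT  n.id
    FROM    new_data AS n
    JOIN    old_snapshot AS o
            ON o.id = n.id
    WHERE   HASHBYTES('SHA2_256', CONCAT(n.col1, '|', n.col2))
         <> HASHBYTES('SHA2_256', CONCAT(o.col1, '|', o.col2));
    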

    Refresh the Snapshot Table: Once the deltas are processed, replace or update your snapshot table so it's ready for the next CDC cycle.
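
    For example, on a dedicated SQL pool the refresh can be as simple as replacing the snapshot contents (on serverless you would typically re-create the snapshot files with CETAS instead); a sketch, reusing the old_snapshot and new_data names from above:

    
    -- Replace the previous snapshot with the data just loaded so the next
    -- CDC cycle compares against the current state.
    TRUNCATE TABLE old_snapshot;
    
    INSERT INTO old_snapshot
    SELECT * FROM new_data;
    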

    Orchestrate with Pipelines: You can wrap all of this in a Synapse pipeline to automate the workflow: load new files, run comparisons, and publish changes.

    If your team is open to it, consider storing your files in Delta Lake format instead of raw Parquet. Synapse Spark (or even Databricks) supports Delta, which gives you:

    • Native support for MERGE, UPDATE, DELETE (a sketch follows this list)
    • Time travel and built-in CDC features

    More info here: Delta Lake documentation
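
    For illustration, a Spark SQL sketch of a Delta MERGE you could run in a Synapse Spark notebook, assuming hypothetical Delta tables target_table (current state) and source_updates (incoming changes) with a key column id and an is_deleted flag:

    
    -- Applies deletes, updates, and inserts against the Delta table in one pass.
    MERGE INTO target_table AS t
    USING source_updates AS s
        ON t.id = s.id
    WHEN MATCHED AND s.is_deleted = 1 THEN DELETE
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *;
    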

    Hope this helps. If this answers your query, do click Accept Answer and Yes for "Was this answer helpful". And if you have any further questions, do let us know.

