Hi @Anand Deshpande,

Tracking inserts, updates, and deletes on raw Parquet files in ADLS Gen2 is a common scenario, especially since Parquet doesn't store historical changes out of the box. To handle CDC (Change Data Capture) effectively, you'll need to implement the comparison logic externally.
Practical Steps to Implement CDC in Synapse
Load the Latest Snapshot: Start by loading your latest Parquet file into Synapse. You can use serverless SQL with OPENROWSET, or go the Spark route if you're dealing with larger volumes or complex transformations.
SELECT *
FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/<container>/<path>/*.parquet',
    FORMAT = 'PARQUET'
) AS [new_data];
Compare Against Previous Snapshot: Keep the previous version of the data in a Synapse table, then compare it with the new load using SQL logic:
- EXCEPT works well for basic insert/delete detection.
- For updates, calculate a hash (using HASHBYTES) across key and value columns, then compare rows with the same ID but different hash values, as sketched below.
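For example, a minimal sketch of the hash comparison, assuming hypothetical tables dbo.old_snapshot and dbo.new_data, a key column Id, and value columns Col1-Col3 (swap in your actual schema):

-- Updates: same key in both snapshots, but the hashed value columns differ.
-- CONCAT_WS flattens the value columns into one string before hashing.
SELECT n.Id
FROM dbo.new_data AS n
JOIN dbo.old_snapshot AS o
    ON n.Id = o.Id
WHERE HASHBYTES('SHA2_256', CONCAT_WS('|', n.Col1, n.Col2, n.Col3))
   <> HASHBYTES('SHA2_256', CONCAT_WS('|', o.Col1, o.Col2, o.Col3));

Note that CONCAT_WS skips NULLs, so if the difference between NULL and empty string matters for your data, COALESCE and cast the columns explicitly before hashing.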
Identify Inserts, Deletes, Updates: Here's how you might break it down (full queries are sketched after this list):
- Inserts: new_data EXCEPT old_snapshot
- Deletes: old_snapshot EXCEPT new_data
- Updates: join on key columns and compare hash values
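Putting the three together, again using the hypothetical dbo.new_data / dbo.old_snapshot tables and Id key:

-- Inserts: rows in the new load that are not in the snapshot.
SELECT * FROM dbo.new_data
EXCEPT
SELECT * FROM dbo.old_snapshot;

-- Deletes: rows in the snapshot that are missing from the new load.
SELECT * FROM dbo.old_snapshot
EXCEPT
SELECT * FROM dbo.new_data;

-- Updates: same key, different content hash (see the HASHBYTES comparison above).
SELECT n.*
FROM dbo.new_data AS n
JOIN dbo.old_snapshot AS o
    ON n.Id = o.Id
WHERE HASHBYTES('SHA2_256', CONCAT_WS('|', n.Col1, n.Col2, n.Col3))
   <> HASHBYTES('SHA2_256', CONCAT_WS('|', o.Col1, o.Col2, o.Col3));

One caveat: EXCEPT compares every column, so an updated row also shows up in both EXCEPT results. If you want inserts and deletes strictly by key, project only the key column(s) in those two queries.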
Refresh the Snapshot Table: Once the deltas are processed, replace or update your snapshot table so it's ready for the next CDC cycle.
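A simple way to do the refresh in a dedicated SQL pool (serverless pools can't hold regular tables, so there you'd rewrite the snapshot files instead) is a full swap, still using the hypothetical table names from above:

-- Rebuild the snapshot from the latest load so the next cycle compares against it.
TRUNCATE TABLE dbo.old_snapshot;

INSERT INTO dbo.old_snapshot
SELECT * FROM dbo.new_data;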
Orchestrate with Pipelines: You can wrap all of this in a Synapse pipeline to automate the workflow of loading new files, running comparisons, and publishing changes.
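As a rough sketch of how the SQL side could be packaged for a pipeline, assuming a dedicated SQL pool and hypothetical delta tables dbo.cdc_inserts and dbo.cdc_deletes, you could bundle the comparison and refresh into one stored procedure and call it from a Stored procedure activity after the copy step:

-- Hypothetical procedure: capture deltas, then refresh the snapshot for the next run.
CREATE PROCEDURE dbo.usp_ProcessCdcCycle
AS
BEGIN
    -- Inserts and deletes land in staging tables that downstream activities can publish.
    INSERT INTO dbo.cdc_inserts
    SELECT * FROM dbo.new_data EXCEPT SELECT * FROM dbo.old_snapshot;

    INSERT INTO dbo.cdc_deletes
    SELECT * FROM dbo.old_snapshot EXCEPT SELECT * FROM dbo.new_data;

    -- Swap in the latest load as the new baseline.
    TRUNCATE TABLE dbo.old_snapshot;
    INSERT INTO dbo.old_snapshot SELECT * FROM dbo.new_data;
END;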
If your team is open to it, consider storing your files in Delta Lake format instead of raw Parquet. Synapse Spark (or even Databricks) supports Delta, which gives you:
- Native support for MERGE, UPDATE, and DELETE
- Time travel and built-in CDC features

More info here: Delta Lake documentation
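With Delta, the whole compare-and-apply step collapses into a single MERGE you can run from a Synapse Spark notebook. A Spark SQL sketch, assuming a Delta table named snapshot and a staged view named new_data keyed on Id:

-- Upsert the latest load into the Delta table in one statement.
MERGE INTO snapshot AS t
USING new_data AS s
    ON t.Id = s.Id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

Deletes can then be applied with a separate DELETE, or with a WHEN NOT MATCHED BY SOURCE THEN DELETE clause on newer Delta Lake versions.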
Hope this helps. If this answers your query, do click Accept Answer and Yes for "Was this answer helpful?". And if you have any further queries, do let us know.