Hi @Anand Deshpande,

Tracking inserts, updates, and deletes on raw Parquet files in ADLS Gen2 is a common scenario, especially since Parquet doesn't store historical changes out of the box. To handle CDC (Change Data Capture) effectively, you'll need to implement the comparison logic externally.
Practical Steps to Implement CDC in Synapse
Load the Latest Snapshot: Start by loading your latest Parquet file into Synapse. You can use serverless SQL with OPENROWSET, or go the Spark route if you're dealing with larger volumes or complex transformations.
SELECT *
FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/<container>/<path>/*.parquet',
    FORMAT = 'PARQUET'
) AS [new_data];
Compare Against Previous Snapshot: Keep the previous version of the data in a Synapse table, then compare it with the new load using SQL logic:
- EXCEPT works well for basic insert/delete detection.
- For updates, calculate a hash (using HASHBYTES) across key and value columns, then compare rows with the same ID but different hash values, as sketched below.
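For example, a minimal sketch of the hash comparison, assuming hypothetical tables dbo.old_snapshot and dbo.new_data, a key column Id, and value columns Col1-Col3 (swap in your actual schema):

-- Updates: same key in both snapshots, but the hashed value columns differ.
-- CONCAT_WS flattens the value columns into one string before hashing.
SELECT n.Id
FROM dbo.new_data AS n
JOIN dbo.old_snapshot AS o
    ON n.Id = o.Id
WHERE HASHBYTES('SHA2_256', CONCAT_WS('|', n.Col1, n.Col2, n.Col3))
   <> HASHBYTES('SHA2_256', CONCAT_WS('|', o.Col1, o.Col2, o.Col3));

Note that CONCAT_WS skips NULLs, so if the difference between NULL and empty string matters for your data, COALESCE and cast the columns explicitly before hashing.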
Identify Inserts, Deletes, Updates: Here's how you might break it down (full queries are sketched after this list):
- Inserts: new_data EXCEPT old_snapshot
- Deletes: old_snapshot EXCEPT new_data
- Updates: join on key columns and compare hash values
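Putting the three together, again using the hypothetical dbo.new_data / dbo.old_snapshot tables and Id key:

-- Inserts: rows in the new load that are not in the snapshot.
SELECT * FROM dbo.new_data
EXCEPT
SELECT * FROM dbo.old_snapshot;

-- Deletes: rows in the snapshot that are missing from the new load.
SELECT * FROM dbo.old_snapshot
EXCEPT
SELECT * FROM dbo.new_data;

-- Updates: same key, different content hash (see the HASHBYTES comparison above).
SELECT n.*
FROM dbo.new_data AS n
JOIN dbo.old_snapshot AS o
    ON n.Id = o.Id
WHERE HASHBYTES('SHA2_256', CONCAT_WS('|', n.Col1, n.Col2, n.Col3))
   <> HASHBYTES('SHA2_256', CONCAT_WS('|', o.Col1, o.Col2, o.Col3));

One caveat: EXCEPT compares every column, so an updated row also shows up in both EXCEPT results. If you want inserts and deletes strictly by key, project only the key column(s) in those two queries.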
Refresh the Snapshot Table: Once the deltas are processed, replace or update your snapshot table so it's ready for the next CDC cycle.
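A simple way to do the refresh in a dedicated SQL pool (serverless pools can't hold regular tables, so there you'd rewrite the snapshot files instead) is a full swap, still using the hypothetical table names from above:

-- Rebuild the snapshot from the latest load so the next cycle compares against it.
TRUNCATE TABLE dbo.old_snapshot;

INSERT INTO dbo.old_snapshot
SELECT * FROM dbo.new_data;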
Orchestrate with Pipelines: You can wrap all of this in a Synapse pipeline to automate the workflow of loading new files, running comparisons, and publishing changes.
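As a rough sketch of how the SQL side could be packaged for a pipeline, assuming a dedicated SQL pool and hypothetical delta tables dbo.cdc_inserts and dbo.cdc_deletes, you could bundle the comparison and refresh into one stored procedure and call it from a Stored procedure activity after the copy step:

-- Hypothetical procedure: capture deltas, then refresh the snapshot for the next run.
CREATE PROCEDURE dbo.usp_ProcessCdcCycle
AS
BEGIN
    -- Inserts and deletes land in staging tables that downstream activities can publish.
    INSERT INTO dbo.cdc_inserts
    SELECT * FROM dbo.new_data EXCEPT SELECT * FROM dbo.old_snapshot;

    INSERT INTO dbo.cdc_deletes
    SELECT * FROM dbo.old_snapshot EXCEPT SELECT * FROM dbo.new_data;

    -- Swap in the latest load as the new baseline.
    TRUNCATE TABLE dbo.old_snapshot;
    INSERT INTO dbo.old_snapshot SELECT * FROM dbo.new_data;
END;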
If your team is open to it, consider storing your files in Delta Lake format instead of raw Parquet. Synapse Spark (or even Databricks) supports Delta, which gives you:
- Native support for MERGE, UPDATE, and DELETE
- Time travel and built-in CDC features

More info here: Delta Lake documentation
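With Delta, the whole compare-and-apply step collapses into a single MERGE you can run from a Synapse Spark notebook. A Spark SQL sketch, assuming a Delta table named snapshot and a staged view named new_data keyed on Id:

-- Upsert the latest load into the Delta table in one statement.
MERGE INTO snapshot AS t
USING new_data AS s
    ON t.Id = s.Id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

Deletes can then be applied with a separate DELETE, or with a WHEN NOT MATCHED BY SOURCE THEN DELETE clause on newer Delta Lake versions.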
Hope this helps. If this answers your query, do click Accept Answer and Yes for "Was this answer helpful?". And if you have any further queries, do let us know.