If your Azure Data Factory's built-in CDC feature does not support Azure PostgreSQL (it can be related to versioning issues), you can still design a custom CDC mechanism.
PostgreSQL supports logical replication which can be used for capturing changes. Ensure that logical replication is enabled on your Azure PostgreSQL database.
Maintain a separate table in PostgreSQL to track the changes for each table you wish to capture. The tracking table could have columns such as table_name
, last_processed_id
, and last_processed_timestamp
.
As for the Azure Data Factory Pipeline, you can proceed the following :
- Lookup Activity: Look up the last processed ID/timestamp from the tracking table.
- Copy Activity: Copy the new records from the source PostgreSQL table using a query that filters records based on the last processed ID/timestamp. Use a watermark approach where you filter records that have an ID/timestamp greater than the values fetched from the tracking table.
- Stored Procedure Activity: After copying the data to the destination, update the tracking table with the new last processed ID/timestamp.
For the Incremental Loads, ensure your tables in PostgreSQL have either an auto-incrementing ID or a timestamp column. This way, you can always fetch records that have been inserted/updated after the last fetched ID/timestamp.
To capture deleted records, you might need a soft delete mechanism in your source system where records are marked as deleted but not actually removed.
A timestamp or version column can help detect changes since the last CDC process for updated records.