Incremental/Delta copy from Azure SQL Database to ADLS Gen 2 Delta Lake

Neal Lockhart 40 Reputation points
2025-05-19T20:47:55.9966667+00:00

Hi all!

I've been working on this for a couple of weeks and I would appreciate your input:

I'm attempting to build an incremental copy pipeline with ADF to get the data into a data lake. It seems like I need to use a data flow for this, so I've done that. I have the source set up as a single table in Azure SQL Database with the "incremental column" set to "lastUpdateDate", and the sink set up as an inline Delta table (it seems the dataset must be inline for Delta to work).

What I want is for the data in the lake to be a direct copy of what's in the db - meaning, if a row is deleted, I want it deleted from the delta lake too. So, I've enabled the options "allow upsert, delete, insert, and update" in the sink settings for Delta. Enabling these options requires an Alter Row transformation.

I am unsure of how to set up the logic for alter row - I'd assume I can just use upsert for the update & insert piece, so maybe upsert if true(), but I'm not sure how to handle deletes in that case. This mostly comes from a place of not knowing ADF expressions very well.

So, the bottom line question is: how should I set up the Alter Row logic so that the Delta Lake output mirrors the transactions in the SQL database, with an incremental column set?

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Accepted answer
  1. Dileep Raj Narayan Thumula 255 Reputation points Microsoft External Staff Moderator
    2025-05-21T10:33:49.78+00:00

    Hello @Neal Lockhart, can you try the ADF data flow below for the incremental load?

    To support incremental data loading, we create a table to store the watermark values for each source table. This watermark table includes the following columns:

    • WatermarkTableName: The name of the source table.
    • WatermarkColumn: The name of the column used as the watermark (e.g., a timestamp or incremental ID).
    • WatermarkValue: The last processed value of the watermark column.

    Watermark Table

    ```sql
    CREATE TABLE [dbo].[WatermarkTable] (
        [WatermarkTableName] nvarchar(255) NULL,
        [WatermarkColumn]    nvarchar(255) NULL,
        [WatermarkValue]     nvarchar(255) NULL
    ) ON [PRIMARY];
    GO
    ```

    Insert Initial Watermark Record

    Next, insert the initial watermark record into the WatermarkTable. This entry should include the name of the source table, the column used for tracking changes (e.g., a timestamp or ID), and the last loaded value. This insert is a one-time setup—moving forward, the WatermarkValue will be updated after each incremental load.

    In this case, the table will contain a single record for the source table.
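    For illustration, assuming the source table tracked here is [Sales].[Orders] with LastEditedWhen as its watermark column (the same names used later in this walkthrough; substitute your own table, column, and starting value), the one-time insert could look like this:

    ```sql
    -- One-time setup: seed a low watermark so the first run picks up every row.
    -- Table name, column name, and initial value are illustrative.
    INSERT INTO [dbo].[WatermarkTable] ([WatermarkTableName], [WatermarkColumn], [WatermarkValue])
    VALUES ('Orders', 'LastEditedWhen', '1900-01-01 00:00:00');
    ```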

    In this Data Flow, we aim to include only the records from the source table or query where the value is greater than the previously stored watermark value. This ensures that only new or updated data since the last load is processed.

    Source 1

    In the source, click on "Source Options", select "Query", and write the query. Click on "Import Schema", and finally we can preview the data.
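    As an illustration, assuming the source is the [Sales].[Orders] table referenced later in this walkthrough, the query can be a plain SELECT that includes the watermark column (the column list below is a placeholder for your own schema):

    ```sql
    -- Source 1 query: select the columns you need, including the watermark column.
    SELECT
          [OrderID]
        , [CustomerID]
        , [LastEditedWhen]
    FROM [Sales].[Orders];
    ```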

    Source 2: watermark table

    This source uses a simple query to retrieve data from the WatermarkTable. The configuration is similar to Source 1, but with a different query. Later, we can refine this by ensuring that only the relevant watermark value for the specific table is selected—using a join to match the correct table in the WatermarkTable.

    ```sql
    SELECT
          [WatermarkTableName]
        , [WatermarkColumn]
        , [WatermarkValue]
    FROM [dbo].[WatermarkTable];
    ```

    Use Derived Column:

    Watermark values can have different data types, which is why we store them as nvarchar in the WatermarkTable. In this case, the watermark is of type datetime, so we need to convert it back so that it can be compared against the source's datetime column later in the flow. This conversion is handled using the expression language within Mapping Data Flow.

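    If you prefer to keep the data flow simpler, the same conversion can instead be pushed into the Source 2 query with a CAST (an alternative to the derived column, shown here only for illustration):

    ```sql
    -- Alternative to the derived column: convert the stored nvarchar watermark
    -- back to a datetime directly in the watermark source query.
    SELECT
          [WatermarkTableName]
        , [WatermarkColumn]
        , CAST([WatermarkValue] AS datetime2) AS [WatermarkValueDT]
    FROM [dbo].[WatermarkTable];
    ```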

    Use Join Transformation:

    To combine data from different sources, we use a join operation. Options include 'Full Outer', 'Inner', 'Left Outer', 'Right Outer', or 'Cross' join. In this scenario, we want to ensure the correct watermark value is applied for the incremental load of a specific table. We're using a Left Outer Join, but an Inner Join would also work since all records refer to the same table. However, joining on the table name from the watermark table is a more future-proof approach—relying on a join based solely on the watermark column could lead to issues when multiple tables share the same watermark column name.


    Use Filter Transformation:

    Since the "Join" transformation in Data Flow only allows joining on columns with equal values, we use a "Filter" transformation afterward to include only the records where the value from the source table or query is greater than the latest watermark value.

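    Conceptually, the join plus the filter together reproduce what a single T-SQL statement like the following would return (purely illustrative; inside the data flow this logic is expressed with the Join and Filter transformations, and the names follow the walkthrough):

    ```sql
    -- Illustrative only: rows newer than the stored watermark for the Orders table.
    SELECT o.*
    FROM [Sales].[Orders] AS o
    LEFT OUTER JOIN [dbo].[WatermarkTable] AS w
        ON w.[WatermarkTableName] = 'Orders'
    WHERE o.[LastEditedWhen] > CAST(w.[WatermarkValue] AS datetime2);
    ```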

    Use Select Transformation:

    Here we use the "Select" transformation to keep only the relevant columns.


    Use Derived Column 2 Transformation:

    Before configuring the destination, we add a new "Derived Column" transformation to convert the LastEditedWhen column back to a date. This step is necessary to ensure proper mapping with the Orders_Incremental table, as it follows the original schema of the Orders table, where the LastEditedWhen column is defined with a date/time data type.


    Use Sink

    In Azure Data Factory (ADF), the destination is referred to as a "Sink". In this step, we select our target table named Orders_Incremental, which shares the same schema as the original Orders table. After creating the Sink dataset, the columns will be mapped automatically. If needed, you can disable Auto Mapping to manually configure the column mappings.

    Update Watermark

    Finally, we need to update the watermark value to reflect the most recent value—in this case, the latest LastEditedWhen date. To accomplish this, we’ll use a simple stored procedure that performs the update.

    ```sql
    CREATE PROCEDURE [dbo].[usp_UpdateWatermark]
        @tableName nvarchar(255)
    AS
    BEGIN
        DECLARE @watermarkValue nvarchar(255);

        SELECT @watermarkValue = MAX([LastEditedWhen])
        FROM [Sales].[Orders_Incremental] AS T;

        UPDATE [dbo].[WatermarkTable]
        SET [WatermarkValue] = @watermarkValue
        WHERE [WatermarkTableName] = @tableName;
    END
    GO
    ```
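    The pipeline then calls this procedure after the data flow completes (for example from a Stored procedure activity), passing a table name that matches the WatermarkTableName value seeded earlier. A quick manual check could look like this:

    ```sql
    -- Manually update and verify the watermark for the Orders table
    -- ('Orders' is the illustrative name used in the setup above).
    EXEC [dbo].[usp_UpdateWatermark] @tableName = 'Orders';

    SELECT *
    FROM [dbo].[WatermarkTable]
    WHERE [WatermarkTableName] = 'Orders';
    ```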


1 additional answer

  1. Alex Burlachenko 9,780 Reputation points
    2025-05-20T09:16:28.98+00:00

    Hi Neal Lockhart

    Thank you for posting your question on the Q&A portal. I'll try to explain this as clearly as I can.

    First, you’re on the right track with using a dataflow in Azure Data Factory for this. Setting the incremental column to lastUpdateDate is a good approach since it lets you pull only the new or changed data efficiently. For more details on incremental loading in ADF, you can check the official Microsoft documentation on incremental loading.

    Now, about the alter row logic—this is where things get interesting. Since you want to mirror the SQL database exactly (including deletes), you’ll need to handle inserts, updates, and deletes in your dataflow. Here’s how you can set it up:

    For inserts and updates, your idea of using upsert if true() is correct. This will handle new rows and changes to existing ones. The expression could be as simple as checking if the row exists in the delta lake already, but since you’re using an incremental column, ADF will handle this for you if configured properly.

    For deletes, it’s a bit more involved. You’ll need a way to identify rows that exist in the delta lake but no longer exist in the source. One way to do this is to use a join or lookup in your dataflow to compare the source and sink, then apply delete if true() for those missing rows. The Microsoft documentation on alter row transformations explains how to set up these conditions.
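    To make the delete side concrete, here is the set logic in plain T-SQL terms, purely as a conceptual sketch; in practice you would express it with a join or lookup inside the data flow rather than running SQL against the lake, and every name below is a placeholder:

    ```sql
    -- Conceptual sketch only: rows present in the lake copy but gone from the source.
    -- 'lake_copy' stands in for the current Delta table contents; [OrderID] is a placeholder key.
    SELECT lake.[OrderID]
    FROM lake_copy AS lake
    LEFT JOIN [Sales].[Orders] AS src
        ON src.[OrderID] = lake.[OrderID]
    WHERE src.[OrderID] IS NULL;   -- no match in the source => mark the row for delete
    ```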

    A small tip: make sure your delta lake sink is configured correctly for these operations. Since you’re using an inline dataset, the table structure must match the source, and the delta lake must support transactional writes.

    Lastly, don't worry if this feels overwhelming; ADF expressions take some getting used to! Start with simple conditions and test each part of your pipeline step by step. If you run into errors, the debug mode in ADF is super helpful for seeing what's happening at each stage.

    Hope this helps.

    Best regards,
    Alex
    P.S. If my answer helped you, please accept it.
    P.P.S. This is an answer, not a comment.
    https://ctrlaltdel.blog/
    
