Incremental/Delta copy from Azure SQL Database to ADLS Gen 2 Delta Lake

Neal Lockhart 40 Reputation points
2025-05-19T20:47:55.9966667+00:00

Hi all!

I've been working on this for a couple of weeks and I would appreciate your input:

I'm attempting to build an incremental copy pipeline with ADF to get the data into a data lake. It seems like I need to use a data flow for this, so I've done that. I have the source set up as a single table in Azure SQL Database with the "incremental column" set to "lastUpdateDate", and the sink set up as an inline Delta table (it seems the dataset must be inline for Delta to work).

What I want is for the data in the lake to be a direct copy of what's in the db - meaning, if a row is deleted, I want it deleted from the delta lake too. So, I've enabled the options "allow upsert, delete, insert, and update" in the sink settings for Delta. Enabling these options requires an Alter Row transformation.

I am unsure of how to set up the logic for alter row - I'd assume I can just use upsert for the update & insert piece, so maybe upsert if true(), but I'm not sure how to handle deletes in that case. This mostly comes from a place of not knowing ADF expressions very well.

So, the bottom line question is: how should I set up the Alter Row logic so that the Delta Lake output mirrors the transactions in the SQL database, with an incremental column set?

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Accepted answer
  1. Dileep Raj Narayan Thumula 255 Reputation points Microsoft External Staff Moderator
    2025-05-21T10:33:49.78+00:00

    Hello @Neal Lockhart, can you try the ADF data flow below for the incremental load?

    To support incremental data loading, we create a table to store the watermark values for each source table. This watermark table includes the following columns:

    • WatermarkTableName: The name of the source table.
    • WatermarkColumn: The name of the column used as the watermark (e.g., a timestamp or incremental ID).
    • WatermarkValue: The last processed value of the watermark column.

    Watermark Table

    ```sql
    CREATE TABLE [dbo].[WatermarkTable] (
        [WatermarkTableName] nvarchar(255) NULL,
        [WatermarkColumn]    nvarchar(255) NULL,
        [WatermarkValue]     nvarchar(255) NULL
    ) ON [PRIMARY];
    GO
    ```

    Insert Initial Watermark Record

    Next, insert the initial watermark record into the WatermarkTable. This entry should include the name of the source table, the column used for tracking changes (e.g., a timestamp or ID), and the last loaded value. This insert is a one-time setup—moving forward, the WatermarkValue will be updated after each incremental load.

    In this case, the table will contain a single record for the source table.
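    For illustration, assuming the source table tracked here is [Sales].[Orders] with LastEditedWhen as its watermark column (the same names used later in this walkthrough; substitute your own table, column, and starting value), the one-time insert could look like this:

    ```sql
    -- One-time setup: seed a low watermark so the first run picks up every row.
    -- Table name, column name, and initial value are illustrative.
    INSERT INTO [dbo].[WatermarkTable] ([WatermarkTableName], [WatermarkColumn], [WatermarkValue])
    VALUES ('Orders', 'LastEditedWhen', '1900-01-01 00:00:00');
    ```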

    In this Data Flow, we aim to include only the records from the source table or query where the value is greater than the previously stored watermark value. This ensures that only new or updated data since the last load is processed.

    Source 1

    In the source, click on "Source Options", select "Query", and write the query. Click on "Import Schema", and finally we can preview the data.
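    As an illustration, assuming the source is the [Sales].[Orders] table referenced later in this walkthrough, the query can be a plain SELECT that includes the watermark column (the column list below is a placeholder for your own schema):

    ```sql
    -- Source 1 query: select the columns you need, including the watermark column.
    SELECT
          [OrderID]
        , [CustomerID]
        , [LastEditedWhen]
    FROM [Sales].[Orders];
    ```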

    Source 2: watermark table

    This source uses a simple query to retrieve data from the WatermarkTable. The configuration is similar to Source 1, but with a different query. Later, we can refine this by ensuring that only the relevant watermark value for the specific table is selected—using a join to match the correct table in the WatermarkTable.

    ```sql
    SELECT
          [WatermarkTableName]
        , [WatermarkColumn]
        , [WatermarkValue]
    FROM [dbo].[WatermarkTable];
    ```

    Use Derived Column:

    Watermark values can have different data types, which is why we store them as nvarchar in the WatermarkTable. In this case, the watermark is of type datetime, so we need to convert it back so that it can be compared against the source's datetime column later in the flow. This conversion is handled using the expression language within Mapping Data Flow.

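    If you prefer to keep the data flow simpler, the same conversion can instead be pushed into the Source 2 query with a CAST (an alternative to the derived column, shown here only for illustration):

    ```sql
    -- Alternative to the derived column: convert the stored nvarchar watermark
    -- back to a datetime directly in the watermark source query.
    SELECT
          [WatermarkTableName]
        , [WatermarkColumn]
        , CAST([WatermarkValue] AS datetime2) AS [WatermarkValueDT]
    FROM [dbo].[WatermarkTable];
    ```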

    Use Join Transformation:

    To combine data from different sources, we use a join operation. Options include 'Full Outer', 'Inner', 'Left Outer', 'Right Outer', or 'Cross' join. In this scenario, we want to ensure the correct watermark value is applied for the incremental load of a specific table. We're using a Left Outer Join, but an Inner Join would also work since all records refer to the same table. However, joining on the table name from the watermark table is a more future-proof approach—relying on a join based solely on the watermark column could lead to issues when multiple tables share the same watermark column name.


    Use Filter Transformation:

    Since the "Join" transformation in Data Flow only allows joining on columns with equal values, we use a "Filter" transformation afterward to include only the records where the value from the source table or query is greater than the latest watermark value.

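    Conceptually, the join plus the filter together reproduce what a single T-SQL statement like the following would return (purely illustrative; inside the data flow this logic is expressed with the Join and Filter transformations, and the names follow the walkthrough):

    ```sql
    -- Illustrative only: rows newer than the stored watermark for the Orders table.
    SELECT o.*
    FROM [Sales].[Orders] AS o
    LEFT OUTER JOIN [dbo].[WatermarkTable] AS w
        ON w.[WatermarkTableName] = 'Orders'
    WHERE o.[LastEditedWhen] > CAST(w.[WatermarkValue] AS datetime2);
    ```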

    Use Select Transformation:

    Here we use the "Select" transformation to keep only the relevant columns.


    Use Derived Column 2 Transformation:

    Before configuring the destination, we add a new "Derived Column" transformation to convert the LastEditedWhen column back to a date. This step is necessary to ensure proper mapping with the Orders_Incremental table, as it follows the original schema of the Orders table, where the LastEditedWhen column is defined with a date/time data type.


    Use Sink

    In Azure Data Factory (ADF), the destination is referred to as a "Sink". In this step, we select our target table named Orders_Incremental, which shares the same schema as the original Orders table. After creating the Sink dataset, the columns will be mapped automatically. If needed, you can disable Auto Mapping to manually configure the column mappings.

    Update Watermark

    Finally, we need to update the watermark value to reflect the most recent value—in this case, the latest LastEditedWhen date. To accomplish this, we’ll use a simple stored procedure that performs the update.

    ```sql
    CREATE PROCEDURE [dbo].[usp_UpdateWatermark]
        @tableName nvarchar(255)
    AS
    BEGIN
        DECLARE @watermarkValue nvarchar(255);

        SELECT @watermarkValue = MAX([LastEditedWhen])
        FROM [Sales].[Orders_Incremental] AS T;

        UPDATE [dbo].[WatermarkTable]
        SET [WatermarkValue] = @watermarkValue
        WHERE [WatermarkTableName] = @tableName;
    END
    GO
    ```
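    The pipeline then calls this procedure after the data flow completes (for example from a Stored procedure activity), passing a table name that matches the WatermarkTableName value seeded earlier. A quick manual check could look like this:

    ```sql
    -- Manually update and verify the watermark for the Orders table
    -- ('Orders' is the illustrative name used in the setup above).
    EXEC [dbo].[usp_UpdateWatermark] @tableName = 'Orders';

    SELECT *
    FROM [dbo].[WatermarkTable]
    WHERE [WatermarkTableName] = 'Orders';
    ```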


1 additional answer

  1. Alex Burlachenko 9,780 Reputation points
    2025-05-20T09:16:28.98+00:00

    Hi Neal Lockhart

    Thank you for posting your question on the Q&A portal. I'll try to explain this as clearly as I can.

    First, you’re on the right track with using a dataflow in Azure Data Factory for this. Setting the incremental column to lastUpdateDate is a good approach since it lets you pull only the new or changed data efficiently. For more details on incremental loading in ADF, you can check the official Microsoft documentation on incremental loading.

    Now, about the alter row logic—this is where things get interesting. Since you want to mirror the SQL database exactly (including deletes), you’ll need to handle inserts, updates, and deletes in your dataflow. Here’s how you can set it up:

    For inserts and updates, your idea of using upsert if true() is correct. This will handle new rows and changes to existing ones. The expression could be as simple as checking if the row exists in the delta lake already, but since you’re using an incremental column, ADF will handle this for you if configured properly.

    For deletes, it’s a bit more involved. You’ll need a way to identify rows that exist in the delta lake but no longer exist in the source. One way to do this is to use a join or lookup in your dataflow to compare the source and sink, then apply delete if true() for those missing rows. The Microsoft documentation on alter row transformations explains how to set up these conditions.
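    To make the delete side concrete, here is the set logic in plain T-SQL terms, purely as a conceptual sketch; in practice you would express it with a join or lookup inside the data flow rather than running SQL against the lake, and every name below is a placeholder:

    ```sql
    -- Conceptual sketch only: rows present in the lake copy but gone from the source.
    -- 'lake_copy' stands in for the current Delta table contents; [OrderID] is a placeholder key.
    SELECT lake.[OrderID]
    FROM lake_copy AS lake
    LEFT JOIN [Sales].[Orders] AS src
        ON src.[OrderID] = lake.[OrderID]
    WHERE src.[OrderID] IS NULL;   -- no match in the source => mark the row for delete
    ```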

    A small tip: make sure your delta lake sink is configured correctly for these operations. Since you’re using an inline dataset, the table structure must match the source, and the delta lake must support transactional writes.

    Lastly, don't worry if this feels overwhelming; ADF expressions take some getting used to! Start with simple conditions and test each part of your pipeline step by step. If you run into errors, the debug mode in ADF is super helpful for seeing what's happening at each stage.

    Hope this helps.

    Best regards,
    Alex
    P.S. If my answer helped you, please accept it.
    P.P.S. This is an answer, not a comment.
    https://ctrlaltdel.blog/
    
