Azure Data Factory Upsert Azure SQL Database Sink

Deepak Lankapalli 0 Reputation points
2023-05-11T21:56:46.5633333+00:00

I am using Copy activity to upsert data into my desitnation table in Azure SQL Database based on primary keys.

ADF creates Interim tables in the specified schema which is dbo in this case. It also adds a column named BatchIdentifier to the interim table. Data is first loaded to the interim table followed by an update /insert into target table.

How does ADF determine the batch size when it does the insert and update . Is there a way to control the batchidentifier parameters in the insert and update query? I have table which has about 30 million records and it seems to be updating in a batch of 10 million records each (BatchIdentifier >= 1 and BatchIdentifier < 10000001) ,(BatchIdentifier >= 10000001 and BatchIdentifier < 20000001) and so on.

BatchIdentifier >= @start

and BatchIdentifier < @end

Azure SQL Database
Azure Data Factory
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.
11,639 questions
0 comments No comments
{count} votes

1 answer

Sort by: Most helpful
  1. Bhargava-MSFT 31,261 Reputation points Microsoft Employee Moderator
    2023-05-12T23:24:14.6733333+00:00

    Hello Deepak Lankapalli,

    Welcome to the MS Q&A platform.

    In ADF, the writeBatchSize property controls the number of rows inserted in a batch during the Copy Activity. You can adjust this value to control the batch size for the insert and update operations.

    WritebatchSize:
    Number of rows to insert into the SQL table per batch. The allowed value is integer (number of rows). By default, the service dynamically determines the appropriate batch size based on the row size.

    User's image

    The BatchIdentifier column is an internal mechanism used by ADF, and there is no direct way to control its parameters in the insert and update query.

    However, If you want to control the batch size and the BatchIdentifier parameter, consider using a stored procedure in the sink. You can define the stored procedure to take the BatchIdentifier parameter as input and use it to update or insert data into the target table.

    You can then configure Copy Activity to use the stored procedure as the sink.

    Example:

    CREATE PROCEDURE [dbo].[InsertOrUpdateData]
        @BatchIdentifier INT
    AS
    BEGIN
        SET NOCOUNT ON;
    
        MERGE INTO [dbo].[TargetTable] AS target
        USING [dbo].[InterimTable] AS source
        ON target.[PrimaryKey] = source.[PrimaryKey]
        WHEN MATCHED THEN
            UPDATE SET
                target.[Column1] = source.[Column1],
                target.[Column2] = source.[Column2],
                ...
        WHEN NOT MATCHED THEN
            INSERT ([PrimaryKey], [Column1], [Column2], ...)
            VALUES (source.[PrimaryKey], source.[Column1], source.[Column2], ...);
    END
    

    Please note: It is not recommended to modify this parameter as it is an internal parameter used by ADF to manage the interim table.

    As per your ask, this is how we can control the batchidentifier parameter in the insert and update query.

    I hope this helps. Please let me know if you have any further questions.

    Reference document:
    https://learn.microsoft.com/en-us/azure/data-factory/connector-azure-sql-database?tabs=data-factory#copy-activity-properties


Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.