How can I do an incremental copy from Azure SQL DB to Snowflake?

Oleg 0 Reputation points
2024-03-12T11:02:47.3866667+00:00

I need to upload a table from Azure SQL DB to Snowflake using Azure Data Factory. Currently, I am copying the entire Events table (about 2 GB) from the BlaBla database using a "Copy data" pipeline with a source and a sink. The table is copied directly, without creating an intermediate CSV file.

However, I am struggling to find a way to perform incremental loading so that only newly added rows are copied instead of the whole table. I want to set up a trigger that runs every 5 minutes so that rows newly added to the Events table are copied to the Snowflake table without any duplicates.

I am also wondering whether there is an API available to send the data directly to Snowflake. Could someone show me an example if possible?

Any help would be appreciated, as I am gradually running out of ideas.


Azure SQL Database
Azure Data Factory

1 answer

  1. Amira Bedhiafi 20,176 Reputation points
    2024-03-12T15:33:40.2766667+00:00

    I think you need a way to identify new or changed data since the last load:

    • If your Events table has a timestamp column that records the row insertion or last-update time, you can use it to identify new rows.
    • If the table has an auto-incrementing primary key, you can track the last copied key value.
    • Azure SQL Database supports Change Data Capture (CDC) and Change Tracking (CT), which can be used to track changes to the data.
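    As a sketch, Change Tracking can be enabled with T-SQL along these lines (the database name comes from the question; the retention settings are illustrative assumptions):

    ```sql
    -- Enable Change Tracking at the database level (retention values are illustrative)
    ALTER DATABASE BlaBla
    SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

    -- Enable Change Tracking for the Events table
    ALTER TABLE dbo.Events
    ENABLE CHANGE_TRACKING;
    ```

    That said, for a simple append-only Events table, the watermark approach below is usually enough and easier to wire into ADF.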

    For the Azure Data Factory configuration:

    If you're using a timestamp or an auto-incrementing primary key to track changes, create a watermark table in Azure SQL DB. This table stores the last copied timestamp or ID value.
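    A minimal watermark table might look like this (the table and column names are illustrative assumptions, not a fixed ADF convention):

    ```sql
    -- One row per source table being tracked; names are illustrative
    CREATE TABLE dbo.WatermarkTable (
        TableName      NVARCHAR(128) NOT NULL PRIMARY KEY,
        WatermarkValue DATETIME2     NOT NULL
    );

    -- Seed the watermark so the first run copies everything
    INSERT INTO dbo.WatermarkTable (TableName, WatermarkValue)
    VALUES ('Events', '1900-01-01');
    ```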

    Use a Lookup activity in your ADF pipeline to retrieve the last copied watermark value from the watermark table or a variable.

    Then, configure the source of your Copy Data activity to use a query that selects only the rows with a timestamp greater than the last watermark value or where the primary key is greater than the last copied ID.

    For example, if using a timestamp column:

    SELECT * FROM Events WHERE EventTimestamp > @lastWatermark

    Or, if using an auto-incrementing ID:

    SELECT * FROM Events WHERE EventID > @lastCopiedID

    After copying the new rows, use a Stored Procedure activity or another Lookup to update the watermark table with the new last copied timestamp or ID.
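    The update step can be sketched as a stored procedure called from the ADF Stored Procedure activity (the procedure name and the watermark-table schema are assumptions):

    ```sql
    -- Advance the watermark after a successful copy;
    -- @NewWatermark is typically MAX(EventTimestamp) from the rows just copied
    CREATE PROCEDURE dbo.usp_UpdateWatermark
        @TableName    NVARCHAR(128),
        @NewWatermark DATETIME2
    AS
    BEGIN
        UPDATE dbo.WatermarkTable
        SET WatermarkValue = @NewWatermark
        WHERE TableName = @TableName;
    END;
    ```

    Because the watermark only moves forward after the copy succeeds, a failed run simply retries the same window, which is what prevents duplicates.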

    To automate this process every 5 minutes, configure a Tumbling Window Trigger in ADF that initiates the pipeline at your specified frequency.

    For direct API access, Snowflake offers Snowpipe, a service for continuously loading data. However, integrating Snowpipe directly with Azure SQL DB might require an intermediary service or custom code to extract the data and pass it to Snowpipe's REST API. This approach would be more complex and outside the standard ADF capabilities.