How to split an Azure SQL Database Table based on Rows

Saksham Somani 20 Reputation points
2024-04-03T20:51:42.5833333+00:00

Hello,
I am new to Azure and currently working with Azure Data Factory. I came across a use case where I have a table in Azure SQL Database. This table has multiple columns, one of which is Region.

  1. I want to use an ADF pipeline to split the table into separate CSV files or SQL tables based on the value in the Region column.
    For example, if the Region column contains the values 1 through 10, the pipeline should create 10 CSV files (one for each region).
  2. I want to make sure that if the source table is altered (a record is inserted, updated, or deleted), the sink files are updated as well.
Azure SQL Database
Azure Data Factory

Accepted answer
  1. Amira Bedhiafi 20,176 Reputation points
    2024-04-03T23:28:14.0466667+00:00

    For CSV:

    • Create a Blob Storage dataset with dynamic content in the file path to incorporate the "Region" value, allowing for different files per region.
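    For instance (assuming the Blob Storage dataset defines a string parameter named RegionValue, a placeholder name used only for illustration), the file name portion of the path could be set with dynamic content such as:

        @concat('region_', string(dataset().RegionValue), '.csv')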

    For SQL Tables:

    • Create a dataset for your Azure SQL Database with dynamic content in the table name to include the "Region" value, creating separate tables per region.
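    Similarly (again assuming a dataset parameter named RegionValue, and using Sales_Region_ purely as a placeholder prefix), the table name on the Azure SQL dataset could be built as:

        @concat('Sales_Region_', string(dataset().RegionValue))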

    Then, use a Lookup activity in your pipeline to retrieve distinct "Region" values from your source table. This activity will output a list of regions that you can iterate over.
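    A minimal Lookup query (dbo.SourceTable is a placeholder, substitute your own table) might be:

        SELECT DISTINCT Region
        FROM dbo.SourceTable
        ORDER BY Region;

    Leave "First row only" unchecked on the Lookup so that all regions are returned.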

    Wrap a Copy Data activity within a ForEach activity.

    The ForEach should iterate over the distinct "Region" values returned by the Lookup activity. Pass each value as a parameter to both the source and sink datasets so that the data is filtered and written per region.
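    As a rough sketch (assuming the Lookup activity is named LookupRegions and the dataset parameter is RegionValue, both placeholder names), the ForEach Items property and the parameter passed on each iteration would look like:

        Items:        @activity('LookupRegions').output.value
        RegionValue:  @item().Region

    Inside the ForEach, the Copy activity's source can use a query that filters on the same value (for example, WHERE Region equals the current @item().Region), so each iteration copies only one region's rows.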

    To keep the sink files or tables in sync with the source table, consider a scheduled or tumbling-window trigger combined with change tracking on your Azure SQL Database. You can then use a stored procedure or a query to identify changes (insertions, updates, deletions) since the last pipeline run.
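    As a sketch (YourDatabase and dbo.SourceTable are placeholders), change tracking can be enabled with T-SQL like:

        ALTER DATABASE YourDatabase
        SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

        ALTER TABLE dbo.SourceTable
        ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = ON);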

    You can also configure the pipeline to perform incremental loads instead of full refreshes, so that only rows inserted, updated, or deleted since the last pipeline execution are processed.
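    For the incremental load itself, a typical pattern (assuming the table's primary key is Id and the last synced version is kept in a watermark table or pipeline variable, shown here as @last_sync_version) is to join CHANGETABLE back to the source:

        -- capture the current version to store for the next run
        SELECT CHANGE_TRACKING_CURRENT_VERSION();

        -- rows inserted, updated, or deleted since the last synced version
        SELECT t.*, ct.SYS_CHANGE_OPERATION
        FROM CHANGETABLE(CHANGES dbo.SourceTable, @last_sync_version) AS ct
        LEFT JOIN dbo.SourceTable AS t
            ON t.Id = ct.Id;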


0 additional answers
