Store data in ADLS by partitioning files based on Year, Month, and Day using ADF

Praveenraj R K 61 Reputation points Microsoft Employee
2023-05-23T11:01:26.1666667+00:00

Hi All,

I need to perform the steps below. Please advise how to achieve them.

  1. I need to query data from the source Azure SQL DB (extracting data from the SQL tables only, with no modifications to the source system such as creating stored procedures) and store it in ADLS using ADF, partitioning the files by Year, Month, and Day based on the timestamp column available in the tables, not on the UTC datetime. Could you please advise how to achieve this in ADF?
  2. I also need to truncate and load only the current and previous week's data in ADLS using ADF.
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. VasimTamboli 4,410 Reputation points
    2023-05-23T13:55:13.2066667+00:00

    To achieve the steps you mentioned using Azure Data Factory (ADF), you can follow these guidelines:

    Create Linked Services:

    • Create a Linked Service for your source Azure SQL DB to establish a connection.
    • Create a Linked Service for your target Azure Data Lake Storage (ADLS) to define the destination.
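
    For reference, the two linked services could be defined roughly as follows. This is a minimal sketch: the names AzureSqlSourceLS and AdlsGen2SinkLS, the server/database placeholders, and the storage URL are illustrative, and the authentication settings (account key, service principal, or managed identity) are omitted for brevity.

        {
            "name": "AzureSqlSourceLS",
            "properties": {
                "type": "AzureSqlDatabase",
                "typeProperties": {
                    "connectionString": "Data Source=<your-server>.database.windows.net;Initial Catalog=<your-db>;User ID=<user>;Password=<password>;Encrypt=True;"
                }
            }
        }

        {
            "name": "AdlsGen2SinkLS",
            "properties": {
                "type": "AzureBlobFS",
                "typeProperties": {
                    "url": "https://<your-storage-account>.dfs.core.windows.net"
                }
            }
        }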

    Create Datasets:

    • Create a dataset for your source Azure SQL DB, specifying the table(s) you want to extract data from.
    • Create a dataset for your target ADLS, defining the folder structure and partitioning scheme.
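
    The target ADLS dataset can expose a folderPath parameter so the Year/Month/Day path is supplied at copy time. A sketch, assuming delimited text output and a file system named raw (both are placeholders; Parquet works the same way):

        {
            "name": "AdlsPartitionedOutput",
            "properties": {
                "linkedServiceName": {
                    "referenceName": "AdlsGen2SinkLS",
                    "type": "LinkedServiceReference"
                },
                "parameters": {
                    "folderPath": { "type": "string" }
                },
                "type": "DelimitedText",
                "typeProperties": {
                    "location": {
                        "type": "AzureBlobFSLocation",
                        "fileSystem": "raw",
                        "folderPath": {
                            "value": "@dataset().folderPath",
                            "type": "Expression"
                        }
                    },
                    "columnDelimiter": ",",
                    "firstRowAsHeader": true
                }
            }
        }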

    Create Pipelines:

    • Create a pipeline to extract data from the source Azure SQL DB. Use the Copy Activity to copy data from the source dataset to the target ADLS dataset.
    • Configure the Copy Activity to use a dynamic folder path for partitioning based on Year, Month, and Day. An expression such as @concat('/year=', formatDateTime(utcnow(), 'yyyy'), '/month=', formatDateTime(utcnow(), 'MM'), '/day=', formatDateTime(utcnow(), 'dd')) partitions by the pipeline run date in UTC; since you need to partition on the timestamp column in the tables instead, derive the date values from the data itself, for example with a Lookup over the distinct dates followed by a ForEach, as in the sketch below.
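
    Because the partitioning has to follow the timestamp column in the source tables rather than the run time, one option is to look up the distinct dates and loop over them, writing one partition per date. A rough sketch, assuming a hypothetical table dbo.SalesOrders with timestamp column OrderTimestamp, a source dataset named SqlSourceTable, the sink dataset above, and a sales folder prefix:

        {
            "name": "CopySqlToAdlsPartitioned",
            "properties": {
                "activities": [
                    {
                        "name": "LookupDistinctDates",
                        "type": "Lookup",
                        "description": "One row per distinct calendar date present in the source timestamp column.",
                        "typeProperties": {
                            "source": {
                                "type": "AzureSqlSource",
                                "sqlReaderQuery": "SELECT DISTINCT CAST(OrderTimestamp AS date) AS OrderDate FROM dbo.SalesOrders"
                            },
                            "dataset": { "referenceName": "SqlSourceTable", "type": "DatasetReference" },
                            "firstRowOnly": false
                        }
                    },
                    {
                        "name": "ForEachDate",
                        "type": "ForEach",
                        "dependsOn": [ { "activity": "LookupDistinctDates", "dependencyConditions": [ "Succeeded" ] } ],
                        "typeProperties": {
                            "items": { "value": "@activity('LookupDistinctDates').output.value", "type": "Expression" },
                            "activities": [
                                {
                                    "name": "CopyOneDay",
                                    "type": "Copy",
                                    "description": "Copies one day's rows into the matching year=/month=/day= folder.",
                                    "inputs": [ { "referenceName": "SqlSourceTable", "type": "DatasetReference" } ],
                                    "outputs": [
                                        {
                                            "referenceName": "AdlsPartitionedOutput",
                                            "type": "DatasetReference",
                                            "parameters": {
                                                "folderPath": {
                                                    "value": "@concat('sales/year=', formatDateTime(item().OrderDate, 'yyyy'), '/month=', formatDateTime(item().OrderDate, 'MM'), '/day=', formatDateTime(item().OrderDate, 'dd'))",
                                                    "type": "Expression"
                                                }
                                            }
                                        }
                                    ],
                                    "typeProperties": {
                                        "source": {
                                            "type": "AzureSqlSource",
                                            "sqlReaderQuery": {
                                                "value": "SELECT * FROM dbo.SalesOrders WHERE CAST(OrderTimestamp AS date) = '@{item().OrderDate}'",
                                                "type": "Expression"
                                            }
                                        },
                                        "sink": {
                                            "type": "DelimitedTextSink",
                                            "storeSettings": { "type": "AzureBlobFSWriteSettings" },
                                            "formatSettings": { "type": "DelimitedTextWriteSettings", "fileExtension": ".csv" }
                                        }
                                    }
                                }
                            ]
                        }
                    }
                ]
            }
        }

    A Mapping Data Flow is an alternative: a derived column for the date parts plus folder naming from column data in the sink produces the same layout without an explicit loop, at the cost of spinning up a Data Flow cluster.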

    Add Activities for Truncation and Load:

    • To truncate and reload only the current and previous week's data, restrict the date window in the source query and clear the matching target folders before copying; a Delete Activity handles the clearing (see the sketch below).
    • Alternatively, add a Lookup Activity before the Copy Activity that queries the target ADLS, or a small watermark table, for the latest date already loaded.
    • Use the retrieved date to filter the source data in the Copy Activity's query with a WHERE clause, for example WHERE dateColumn >= '@{activity('LookupActivity').output.firstRow.latestDate}'.
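
    For the truncate part, one pattern is to restrict the Lookup query above to the two-week window, for example WHERE OrderTimestamp >= DATEADD(week, -1, DATEADD(day, 1 - DATEPART(weekday, CAST(GETDATE() AS date)), CAST(GETDATE() AS date))) (the weekday boundary depends on your DATEFIRST setting), and then, inside the ForEach, clear each day's folder with a Delete Activity before the Copy rewrites it. A sketch of that per-date Delete Activity, reusing the illustrative names from the pipeline above:

        {
            "name": "DeleteExistingDayFolder",
            "type": "Delete",
            "description": "Removes any previously loaded files for this date's partition folder before the Copy reloads it.",
            "typeProperties": {
                "dataset": {
                    "referenceName": "AdlsPartitionedOutput",
                    "type": "DatasetReference",
                    "parameters": {
                        "folderPath": {
                            "value": "@concat('sales/year=', formatDateTime(item().OrderDate, 'yyyy'), '/month=', formatDateTime(item().OrderDate, 'MM'), '/day=', formatDateTime(item().OrderDate, 'dd'))",
                            "type": "Expression"
                        }
                    }
                },
                "storeSettings": {
                    "type": "AzureBlobFSReadSettings",
                    "recursive": true
                },
                "enableLogging": false
            }
        }

    Place this before CopyOneDay inside the ForEach and make the Copy depend on it (dependsOn with a Succeeded condition) so every targeted partition is emptied before it is reloaded.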

    Schedule and Execute the Pipeline:

    • Set up a trigger or schedule for the pipeline to run at the desired frequency.
    • Execute the pipeline to start the data extraction, transformation, and loading process.
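
    A daily schedule trigger is one option; the trigger name, start time, and time zone below are placeholders:

        {
            "name": "DailyLoadTrigger",
            "properties": {
                "type": "ScheduleTrigger",
                "typeProperties": {
                    "recurrence": {
                        "frequency": "Day",
                        "interval": 1,
                        "startTime": "2023-05-24T02:00:00Z",
                        "timeZone": "UTC"
                    }
                },
                "pipelines": [
                    {
                        "pipelineReference": {
                            "referenceName": "CopySqlToAdlsPartitioned",
                            "type": "PipelineReference"
                        }
                    }
                ]
            }
        }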

    By following these steps, you can query data from the source Azure SQL DB, partition the files in ADLS based on the specified timestamp column, and truncate/load data for the current and previous week using Azure Data Factory.