How do I upsert data into a Data Lake via Azure Data Factory Data Flow based on partitions?

Netty 26 Reputation points
2021-07-15T17:26:18.873+00:00

Hello there,

How do I continually update the data in my Azure Data Lake, upserting new data via Azure Data Factory based on the partitions (client ID and date) of the container in the Data Lake? Data we collect over a rolling week can change throughout that week, so it's critical that on each of days 1 through 7 after the initial processing I can reprocess the updated data into the Data Lake. It seems as though the only option is "Clear the folder" on the Sink of a Data Flow writing to the Data Lake, but I only need to clear the folders for the partitions the Data Flow is running for. Without selecting "Clear the folder", my data gets duplicated within the Data Lake.

To give an example: if today (2021-07-15) I collect data for yesterday (2021-07-14), I can write that to the Data Lake no problem via an ADF Data Flow. But if in two days (2021-07-17) the data for 2021-07-14 gets updated for a given client, then I want to rewrite the data for 2021-07-14 to reflect the newly collected data.

This is essentially the "upsert" function if I were writing this data to an Azure SQL DB, or INSERT OVERWRITE if I were loading this data manually in Hive/Spark SQL.

Does anyone know if this is possible via Azure Data Factory? Let me know if you need me to provide more context or an additional example.

Best wishes,
Netty


Accepted answer
  1. ShaikMaheer-MSFT 38,456 Reputation points Microsoft Employee
    2021-07-20T07:42:09.767+00:00

    Hi @Netty,

    Thank you for posting your query on the Microsoft Q&A Platform.

    I have implemented a sample pipeline for your requirement. Please check the detailed explanation below and follow the same approach in your scenario.

    Step 1: Create two pipeline parameters called "client_id" and "dates".
    client_id --> holds the client ID for which you want to run the execution.
    dates --> an array of the date values for which you want to run the execution.
    [Screenshot: pipeline parameters (116080-parameters.gif)]
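
    As a reference, here is a minimal sketch of how these parameters might look in the pipeline's JSON definition. The pipeline name and the default dates are placeholders, not values from your environment:

    {
        "name": "CopyPartitionedData",
        "properties": {
            "parameters": {
                "client_id": {
                    "type": "String"
                },
                "dates": {
                    "type": "Array",
                    "defaultValue": [ "2021-07-14", "2021-07-15" ]
                }
            },
            "activities": []
        }
    }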

    Step 2: Add a ForEach activity and pass your "dates" array into it.
    [Screenshot: ForEach activity settings (116168-foreach.gif)]
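
    A rough sketch of that ForEach activity in JSON, assuming the "dates" parameter from Step 1. The activity name is illustrative, and the Copy activity from Step 3 goes inside the "activities" array:

    {
        "name": "ForEachDate",
        "type": "ForEach",
        "typeProperties": {
            "items": {
                "value": "@pipeline().parameters.dates",
                "type": "Expression"
            },
            "isSequential": false,
            "activities": []
        }
    }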

    Step 3: Inside the ForEach activity, use a Copy activity to copy the source data to your target storage. On the target we want the folder path in the format below, so I used a parameterized dataset as the Sink.
    my-data-lake/my-container/client_id=<client_id>/date=yyyy-MM-dd/*.json
    Expression used for the dynamic path: client_id=@{pipeline().parameters.client_id}/date=@{item()}
    [Screenshot: Copy activity sink configuration (116159-copy1.gif)]
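
    One possible way to set this up is a sink dataset with a single "folder_path" parameter, sketched below for an ADLS Gen2 JSON dataset. The dataset, linked service, and container names here are placeholders:

    {
        "name": "SinkPartitionedJson",
        "properties": {
            "linkedServiceName": {
                "referenceName": "MyDataLakeLinkedService",
                "type": "LinkedServiceReference"
            },
            "parameters": {
                "folder_path": { "type": "String" }
            },
            "type": "Json",
            "typeProperties": {
                "location": {
                    "type": "AzureBlobFSLocation",
                    "fileSystem": "my-container",
                    "folderPath": {
                        "value": "@dataset().folder_path",
                        "type": "Expression"
                    }
                }
            }
        }
    }

    In the Copy activity's sink you would then set the dataset's folder_path value to the expression above, client_id=@{pipeline().parameters.client_id}/date=@{item()}, so each ForEach iteration writes into its own partition folder.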

    Hope this will help. Thank you

