Traverse all files in a folder structure and automate copying only new records to PostgreSQL with ADF

Andres Esteban


Let me give you some background:

I have to copy data from ADLS Gen2 to an Azure PostgreSQL table. The data stored in ADLS Gen2 is AVRO-format data from an Event Hub. I have configured a data flow and a pipeline that work perfectly and copy the data without problems every time I run a trigger.

[Image: the data flow, using a single record as source for testing]

[Image: the pipeline]

At this point I have some doubts because I need to automate the process.

The files in ADLS Gen2 are stored in a folder structure as follows: store name/container name/0/year/month/day/hour/minute/20.avro

Each time a record from the Event Hub is saved, a new folder with this structure is created.

How can I make my Data Factory traverse all the folders, and how can I configure it to copy only new data into the PostgreSQL table?

Thank you in advance

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Accepted answer
  AnnuKumari-MSFT (Microsoft Employee)

    Hi @Andres Esteban ,
    Thank you for using the Microsoft Q&A platform and posting your query.

    As I understand your question, you are trying to copy data from ADLS to PostgreSQL, and the data stored in ADLS follows the format 'store name/container name/0/year/month/day/hour/minute/20.avro'. You want to process only the new data that arrives in a newly created folder and copy it to the PostgreSQL table.
    In other words, you want to know how to fetch the latest data from the newest folder. Please correct me if my understanding is incorrect.

    To achieve this, you can use one of the following approaches:


    1. Flatten the source ADLS folder structure to storeName/containerName/0/Year-Month-DayTHour:Minute_20.avro, so that each file name carries a sortable timestamp (for example, 2022-01-15T10:30_20.avro).

    Create an ADF pipeline and use a Get Metadata activity to fetch all the files present in the containerName/0 folder, then use a Lookup activity whose dataset points to PostgreSQL/Azure SQL and write a SQL query that fetches the maximum file name. Example:

    DECLARE @GetMetadataOutput NVARCHAR(MAX);
    SET @GetMetadataOutput = '@{activity('Get Metadata1').output.childItems}';

    -- The greatest file name is the newest file, because the
    -- timestamp-based names sort lexicographically.
    SELECT MAX(FileNames) AS LatestFileName
    FROM OPENJSON(@GetMetadataOutput)
         WITH (FileNames NVARCHAR(200) '$.name');
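    As a sanity check of the approach above (outside ADF), a minimal Python sketch with made-up file names shows why taking the maximum of the timestamp-based names returns the newest file:

```python
# Hypothetical file names following the Year-Month-DayTHour:Minute_20.avro pattern.
file_names = [
    "2021-12-01T10:30_20.avro",
    "2022-01-15T09:05_20.avro",
    "2021-12-31T23:59_20.avro",
]

# Zero-padded ISO-style timestamps sort lexicographically in
# chronological order, so max() returns the newest file.
latest = max(file_names)
print(latest)  # 2022-01-15T09:05_20.avro
```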


    2. Call the data flow after the Lookup activity, pass LatestFileName as a parameter to the data flow, and use it to build the file path in the source dataset.
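    Assuming the Lookup activity is named 'Lookup1' and the query above returns a single row (the activity and column names here are illustrative), the data flow parameter can be set with a dynamic-content expression such as:

```
@activity('Lookup1').output.firstRow.LatestFileName
```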


    3. Write custom code using an Azure Function or Azure Databricks to iterate over the recent folders, loop through the subfolders, and get the latest file, then pass it to your data flow. Please refer to this thread for more help:
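    For the custom-code route, here is a minimal Python sketch of the "pick the latest file" logic an Azure Function could run. The actual listing call (e.g. via the azure-storage-file-datalake SDK) is omitted, and the paths shown are hypothetical:

```python
# Sketch of the "latest file" selection an Azure Function could perform.
# Assume we already have relative paths shaped like
# 0/year/month/day/hour/minute/20.avro from a storage listing call.
def latest_file(paths):
    def sort_key(path):
        # Split e.g. 0/2022/1/15/9/5/20.avro into its folder components
        # and compare them numerically so that 10 sorts after 9.
        parts = path.split("/")
        year, month, day, hour, minute = (int(p) for p in parts[1:6])
        return (year, month, day, hour, minute)

    return max(paths, key=sort_key)

# Hypothetical paths for illustration.
paths = [
    "0/2021/12/1/10/30/20.avro",
    "0/2022/1/15/9/5/20.avro",
    "0/2021/12/31/23/59/20.avro",
]
print(latest_file(paths))  # 0/2022/1/15/9/5/20.avro
```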

    Hope this helps. Please let us know if you have any further queries.

