Sequential file processing in ADF

Bhushan Gawale 331 Reputation points
2021-06-30T07:46:44.19+00:00

Hi,

We have been working on a use case where we need to process files sequentially using ADF. Sharing the context and an overview of the use case below.

An upstream business process generates and pushes files to a storage account. The number of files and their sizes can vary, and the files need to be processed in the sequence of their arrival; that is where ADF comes into the picture, applying transformations before pushing the content to the sink.

To elaborate, assume the files below arrive in this sequence:
file1.csv, file2.csv, file3.csv, file4.csv and file5.csv

So file1.csv needs to be fully processed before processing of file2.csv starts, and so on. The sequencing also has to be maintained across failures. For example, if file1.csv and file2.csv were processed fine and an error is encountered while processing file3.csv, the pipeline should stop. When the upstream process uploads a rectified file3.csv, the pipeline needs to process only the pending files in the original sequence, i.e. the updated file3.csv, then file4.csv and file5.csv. That is where the problem lies: we do not see any orchestration built into ADF as a platform to handle such a scenario.

Currently, we are leveraging blob triggers and can see the processing ADF pipeline getting triggered multiple times as multiple files are dropped into the storage account. However, we have to write a lot of custom logic to maintain the sequence in some persistent store, look it up on every run, and handle failures and re-runs so that the original sequence is honored (as explained above).
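To illustrate the kind of custom logic involved, such a watermark lookup at the start of the pipeline could look roughly like the following (the dataset, table and column names are placeholders):

```json
{
  "name": "GetWatermark",
  "type": "Lookup",
  "description": "Reads the last successfully processed file from a control table (placeholder names).",
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource",
      "sqlReaderQuery": "SELECT TOP 1 LastProcessedFile FROM dbo.FileWatermark"
    },
    "dataset": {
      "referenceName": "ControlTableDataset",
      "type": "DatasetReference"
    },
    "firstRowOnly": true
  }
}
```

Every triggered run would have to start with something like this, compare the incoming file against the watermark, and update the control table on success.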

Looking for inputs on whether there is a better way to handle such orchestrations in ADF. Is there anything in mapping data flows that can be leveraged to address the original problem and help process files in a sequence?

Thanks in advance.

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

2 answers

  1. Nandan Hegde 36,146 Reputation points MVP Volunteer Moderator
    2021-06-30T08:43:42.58+00:00

    Hey,
    You can use a combination of the Get Metadata activity and a ForEach activity to achieve your goal.
    Via the Get Metadata activity, you can point at your folder, and the list of child items would act as the input for the ForEach activity (which can be set to sequential execution).
    Since ForEach would proceed through all iterations irrespective of a failed one, you can add a variable / custom logic at the beginning of each iteration to validate whether the previous file was processed properly or not.
    And at the end of the ForEach, you can have a file archival logic.
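    As a rough sketch of this pattern in pipeline JSON (dataset names and the variable name are placeholders, the Copy activity's source and sink are elided, and a String variable is used as the failure flag):

    ```json
    {
      "name": "ProcessFilesSequentially",
      "properties": {
        "variables": {
          "failureSeen": { "type": "String", "defaultValue": "false" }
        },
        "activities": [
          {
            "name": "GetFileList",
            "type": "GetMetadata",
            "description": "Lists the files in the landing folder (placeholder dataset).",
            "typeProperties": {
              "dataset": { "referenceName": "SourceFolderDataset", "type": "DatasetReference" },
              "fieldList": [ "childItems" ]
            }
          },
          {
            "name": "ForEachFile",
            "type": "ForEach",
            "dependsOn": [
              { "activity": "GetFileList", "dependencyConditions": [ "Succeeded" ] }
            ],
            "typeProperties": {
              "isSequential": true,
              "items": {
                "value": "@activity('GetFileList').output.childItems",
                "type": "Expression"
              },
              "activities": [
                {
                  "name": "IfNoPriorFailure",
                  "type": "IfCondition",
                  "description": "Skips the remaining files once the failure flag has been set.",
                  "typeProperties": {
                    "expression": {
                      "value": "@equals(variables('failureSeen'), 'false')",
                      "type": "Expression"
                    },
                    "ifTrueActivities": [
                      {
                        "name": "ProcessFile",
                        "type": "Copy",
                        "description": "Placeholder for the actual transformation/copy; source and sink elided.",
                        "typeProperties": { }
                      },
                      {
                        "name": "FlagFailure",
                        "type": "SetVariable",
                        "dependsOn": [
                          { "activity": "ProcessFile", "dependencyConditions": [ "Failed" ] }
                        ],
                        "typeProperties": {
                          "variableName": "failureSeen",
                          "value": "true"
                        }
                      }
                    ]
                  }
                }
              ]
            }
          }
        ]
      }
    }
    ```

    One caveat: Get Metadata does not guarantee that childItems comes back in arrival order, so you may still need to sort the list (e.g. via a sortable file-naming convention) before the ForEach.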


  2. Kalantri, Payal 0 Reputation points
    2024-02-08T06:52:19.1066667+00:00

    Hi, in support of the answer provided by @Nandan Hegde: if you are processing these blobs from folders, you can check the Sequential property of the ForEach loop in ADF, and all files will be processed sequentially instead of in parallel.
    P.S.: This will work only when all files are in different folders. It worked for me because we were using MFT to transfer files, which created a unique folder for each file transfer.
    Your flow should look like this:
    1. A Get Metadata activity to fetch the required folders.
    2. A ForEach loop (Sequential) containing the activities below:
       - another Get Metadata activity to fetch the file name from the first folder;
       - a Filter or If activity to check for your desired file name (wildcard);
       - a Copy activity to load the data into your table;
       - a Copy activity to archive the first file;
       - a Delete activity to delete the processed file from the first folder.
    Note: When you mark the ForEach loop as Sequential, all your files will be processed on a first come, first served basis, so the processing of the next file in its folder starts in exactly the same way as above.
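    A rough sketch of the inner Get Metadata, Filter and Delete steps under that assumption (dataset names are placeholders; the load and archival Copy activities are elided):

    ```json
    [
      {
        "name": "GetFolderFiles",
        "type": "GetMetadata",
        "description": "Lists the files inside the folder passed in by the outer sequential ForEach.",
        "typeProperties": {
          "dataset": { "referenceName": "FolderFilesDataset", "type": "DatasetReference" },
          "fieldList": [ "childItems" ]
        }
      },
      {
        "name": "FilterDesiredFiles",
        "type": "Filter",
        "dependsOn": [
          { "activity": "GetFolderFiles", "dependencyConditions": [ "Succeeded" ] }
        ],
        "typeProperties": {
          "items": {
            "value": "@activity('GetFolderFiles').output.childItems",
            "type": "Expression"
          },
          "condition": {
            "value": "@endswith(item().name, '.csv')",
            "type": "Expression"
          }
        }
      },
      {
        "name": "DeleteProcessedFile",
        "type": "Delete",
        "description": "Runs after the load and archival Copy activities (elided) to clear the folder.",
        "typeProperties": {
          "dataset": { "referenceName": "FolderFilesDataset", "type": "DatasetReference" },
          "enableLogging": false
        }
      }
    ]
    ```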

