Merge files into single file

ClarissaJacquline 200 Reputation points
2023-11-08T05:01:58.9666667+00:00

I have a folder called ‘adls’ with 10,000 files. I retrieve all the filenames using the ‘get metadata’ activity and pass them to a ‘for each’ loop. During the initial execution, I need to process 1,000 files and merge them into a single file. In subsequent runs, I want to process another batch of 2,000 files based on the total file count. Is there a way in Azure Data Factory to process these files in batches?

Azure Data Factory
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.
{count} votes

Answer accepted by question author
  1. Smaran Thoomu 32,525 Reputation points Microsoft External Staff Moderator
    2023-11-08T13:22:09.3666667+00:00

    Hi ClarissaJacquline,

    Thanks for using Microsoft Q&A.

    To Iterate and merge the batch of 1000 files you need to first group the file names/ path in a different file that need to be processed in a batch. the using list of files option in ADF copy activity merge the files.

    Follow the below process (I tried with total 5 files by grouping 2 files):

    1.Take a dataflow activity add new dataflow.

    • In source add the dataset where all 10000 files are located and in source options set Wildcard paths as * and with Column to store file name add file name column in data set. User's image
    • Take an aggregate transformation and in group by select filename column and in aggregates perform any aggregate task on any column.
      User's image
    • Then take a derived column activity to remove the first backslash from file name/path with expression dropLeft(filename,1)
      User's image
    • In sink add dataset where you need to store these files with unchecking the First row as header
    • Set Skip line count to 1
    • In setting set the File name option as Pattern and Pattern as file[1]000.csv
    • In mapping select only filename column.
    • In Optimize set the partitioning as dynamic range based on filename column and set the no of partition in your case 10 User's image
    • Output of the is as below:User's image
    1. Then use get metadata activity to get the list of files where each file contains list of files that need to be merged.
      User's image

    3.Then pass these files to foreach activity to loop over @activity('Get Metadata1').output.childItems.

    User's image

    1. In for each activity take copy activity
    • Set File path type is List of files in Path to files list set the dynamic way to get the files. User's image
    • In copy activity dataset add the data set that can tail files need to be merged. User's image
    • Add sink and set copy behaviour as merge files. User's image
    • Output: It will merge all the list of files which we got from dataflow User's image

    I hope this helps!


    If this answers your query, do click Accept Answer and Yes for was this answer helpful. And, if you have any further query do let us know.


0 additional answers

Sort by: Most helpful

Your answer

Answers can be marked as 'Accepted' by the question author and 'Recommended' by moderators, which helps users know the answer solved the author's problem.