Add multiple files from a file list into a single source in Data Flow

Nofar Hameiri 0 Reputation points
2025-02-02T18:25:37.6433333+00:00

I have a folder that holds all existing and new files that I get from a customer.

There are two types of files, each describing a different table: a Provider table and a Network table. Because the amount of data is large, each table is split across multiple files.
Every time the customer adds new data, they add a "control file" that indicates they are done adding data and lists the new files I should process.
I want to read that list, build each table from its new files, and process them in Data Flow.
Control file example:

FILENAME|ROWS
providers_01_01010122.txt|100
providers_02_010101322.txt|122
network_01_010101122.txt|200
network_02_11122.txt|111
network_03_210101122.txt|150

I want all the files starting with "providers" to construct the provider dataset and the ones starting with "network" to construct the network dataset.
I tried building a pipeline that uses a Lookup activity to read the control file, then a ForEach activity to iterate over the file names, with an If Condition activity to check whether each file name matches the pattern and a Copy activity to copy it into the correlated dataset, but it did not work properly.
How can I dynamically construct the desired datasets each time a control file is added, and process them in the data flow?

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

2 answers

  1. Pinaki Ghatak 5,570 Reputation points Microsoft Employee
    2025-02-03T10:51:15.8066667+00:00

    Hello @Nofar Hameiri

    To dynamically construct the desired datasets and process them in the data flow, you can use the following seven steps; a minimal JSON sketch follows the list:

    1. Create a pipeline with a Lookup activity to read the control file. Point the Lookup at the control file path, use the DelimitedText format, set the column delimiter to "|", treat the first row as a header, and disable "First row only" so that all rows are returned.
    2. Add two Filter activities after the Lookup activity to split the rows by file-name prefix: one with the condition @startswith(item().FILENAME, 'providers') for the provider files and one with @startswith(item().FILENAME, 'network') for the network files.
    3. Add a ForEach activity after each Filter activity to iterate over the filtered rows. Set its Items property to the Filter output, for example @activity('FilterProviders').output.value.
    4. Inside each ForEach activity, add a Copy activity to copy the current file into its correlated dataset. Parameterize the source dataset with a file name and resolve it with dynamic content (@item().FILENAME); the sink is the provider or network staging dataset for that branch.
    5. Optionally add a Wait activity after the Copy activity; note that the Copy activity is synchronous, so the pipeline continues only after the copy has finished.
    6. Add a Data Flow activity after the ForEach activities to process the staged data, using the staging datasets populated by the Copy activities as its sources.
    7. Run (or trigger) the pipeline each time a new control file arrives.
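
    A minimal, hand-trimmed sketch of the relevant activity JSON, assuming the files sit in blob/ADLS storage. All activity and dataset names (LookupControlFile, FilterProviders, SourceFileDataset, ProviderStagingDataset) are placeholders, and the source/sink settings are abbreviated, so treat this as a starting point rather than a complete pipeline export:

    ```json
    {
      "activities": [
        {
          "name": "LookupControlFile",
          "type": "Lookup",
          "description": "Reads every row of the control file (FILENAME|ROWS).",
          "typeProperties": {
            "source": { "type": "DelimitedTextSource" },
            "dataset": { "referenceName": "ControlFileDataset", "type": "DatasetReference" },
            "firstRowOnly": false
          }
        },
        {
          "name": "FilterProviders",
          "type": "Filter",
          "description": "Keeps only rows whose FILENAME starts with 'providers'.",
          "dependsOn": [ { "activity": "LookupControlFile", "dependencyConditions": [ "Succeeded" ] } ],
          "typeProperties": {
            "items": { "value": "@activity('LookupControlFile').output.value", "type": "Expression" },
            "condition": { "value": "@startswith(item().FILENAME, 'providers')", "type": "Expression" }
          }
        },
        {
          "name": "ForEachProviderFile",
          "type": "ForEach",
          "description": "Copies each provider file into the provider staging dataset.",
          "dependsOn": [ { "activity": "FilterProviders", "dependencyConditions": [ "Succeeded" ] } ],
          "typeProperties": {
            "items": { "value": "@activity('FilterProviders').output.value", "type": "Expression" },
            "activities": [
              {
                "name": "CopyProviderFile",
                "type": "Copy",
                "typeProperties": {
                  "source": { "type": "DelimitedTextSource" },
                  "sink": { "type": "DelimitedTextSink" }
                },
                "inputs": [
                  {
                    "referenceName": "SourceFileDataset",
                    "type": "DatasetReference",
                    "parameters": {
                      "fileName": { "value": "@item().FILENAME", "type": "Expression" }
                    }
                  }
                ],
                "outputs": [ { "referenceName": "ProviderStagingDataset", "type": "DatasetReference" } ]
              }
            ]
          }
        }
      ]
    }
    ```

    The network branch mirrors this with @startswith(item().FILENAME, 'network') and a network staging dataset; SourceFileDataset needs a fileName parameter that is used in its file path.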

    By following the steps above, you can dynamically construct the desired datasets and process them in the data flow each time a control file is added.


    I hope that this response has addressed your query and helped you overcome your challenges. If so, please mark this response as Answered. This will not only acknowledge our efforts, but also assist other community members who may be looking for similar solutions.


  2. phemanth 13,625 Reputation points Microsoft Vendor
    2025-02-03T10:52:44.58+00:00

    @Nofar Hameiri

    Welcome to the Microsoft Q&A forum.

    It seems like you're dealing with a complex data integration task. Here's a step-by-step approach to dynamically constructing and processing your datasets in Data Flow:

    Step 1: Read the Control File

    • Use a Lookup activity to read the control file and store the result in a dataset.

    (Screenshot: Lookup activity settings.)

    Step 2: Parse the Control File

    • In the Data Flow activity, use a Source transformation to read the control file dataset.

    (Screenshot: Derived Column settings.)

    • Use a Derived Column transformation to split each raw line into FILENAME and ROWS columns; an expression sketch follows.
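
    If the control file is read into the data flow as a single text column (say, a column named line) rather than as delimited text, the Derived Column expressions might look like this; the column names are assumptions, and note that array indexes in the data flow expression language are 1-based:

    ```
    FILENAME = split(line, '|')[1]
    ROWS     = toInteger(split(line, '|')[2])
    ```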

    Step 3: Filter Files by Type

    • Use a Conditional Split transformation to create two streams: one for files starting with providers and one for files starting with network.
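
    In data flow script form, the split might look like this (transformation and stream names are illustrative); rows matching neither prefix fall into the last stream:

    ```
    ParsedControlFile split(
        startsWith(FILENAME, 'providers'),
        startsWith(FILENAME, 'network'),
        disjoint: false) ~> SplitByPrefix@(providerFiles, networkFiles, otherFiles)
    ```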

    Step 4: ForEach Activity

    • Use a ForEach activity in the pipeline (ForEach is a pipeline activity, not a data flow transformation) to iterate over the list of files for each type.
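
    If the file list comes from the Lookup in Step 1, the Items setting of the ForEach would be an expression along these lines (the activity name is assumed):

    ```
    @activity('LookupControlFile').output.value
    ```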

    Step 5: Copy Activity

    • Inside the ForEach activity, use a Copy Data activity to copy each file to its respective dataset.
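
    A sketch of a parameterized source dataset the Copy activity could use, assuming the data files are pipe-delimited text in a blob container named incoming; the dataset, parameter, and container names are placeholders, and the linked service reference is omitted:

    ```json
    {
      "name": "SourceFileDataset",
      "properties": {
        "type": "DelimitedText",
        "parameters": { "fileName": { "type": "string" } },
        "typeProperties": {
          "location": {
            "type": "AzureBlobStorageLocation",
            "container": "incoming",
            "fileName": { "value": "@dataset().fileName", "type": "Expression" }
          },
          "columnDelimiter": "|"
        }
      }
    }
    ```

    The Copy activity then passes @item().FILENAME as the fileName parameter value.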

    Step 6: Union Activity

    • Use a Union transformation to combine all the provider files into a single dataset and all the network files into another dataset.

    Step 7: Sink to Destination

    • Use a Sink transformation to write the combined datasets to your desired destination.
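
    Steps 6 and 7 in data flow script form might look like the following, assuming two provider streams to combine and matching schemas; all names are illustrative:

    ```
    providerFiles1, providerFiles2 union(byName: true) ~> UnionProviders
    UnionProviders sink(
        allowSchemaDrift: true,
        validateSchema: false) ~> ProviderSink
    ```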

    I hope the above steps resolve the issue; please let us know if the issue persists. Thank you.

