How do I get the file names of Excel files, along with their content, and output them to a single CSV/Parquet file using Copy Data in Azure Data Factory?

Shemeer Badarudeen Khadeeja Beevi 20 Reputation points
2023-05-03T09:07:14.0133333+00:00

I have a collection of many files with the same data format, starting from a particular range, but with different, unique file names. Each of these files contains at least 8 sets of data. I have to read all of these files and write them into one output file, adding an extra column with the source file name for each row of data exported from these Excel files. I have to use the Copy Data activity for this purpose.

I have combined all of them into one file using the Copy Data activity but could not get the file names by any method. Could someone please guide me?

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Accepted answer
  1. HimanshuSinha-msft 19,386 Reputation points Microsoft Employee
    2023-05-03T21:51:43.47+00:00

    Hello @Shemeer Badarudeen Khadeeja Beevi

    Thanks for the question and using MS Q&A platform.

    You can use the Copy activity. While testing the implementation, I took the same dummy.csv, copied it, and renamed the copies. The schema remains the same (in my case the data also remains the same).

    I have four files with the same content.


    The key here is to enable the wildcard file path option on the source and use the "Additional columns" option to capture the file name.


    On the sink side, set the Copy behavior to "Merge files".

    This is the output on the sink side.

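    The setup above can be sketched as a Copy activity definition. This is a minimal, illustrative sketch only: the dataset names, the Blob storage connector, and the "input" folder path are assumptions, so adjust them to your own linked service and datasets. `$$FILEPATH` is the reserved variable that the Copy activity substitutes with each source file's path.

    ```json
    {
      "name": "MergeExcelFilesWithFileName",
      "type": "Copy",
      "typeProperties": {
        "source": {
          "type": "ExcelSource",
          "storeSettings": {
            "type": "AzureBlobStorageReadSettings",
            "wildcardFolderPath": "input",
            "wildcardFileName": "*.xlsx"
          },
          "additionalColumns": [
            { "name": "filename", "value": "$$FILEPATH" }
          ]
        },
        "sink": {
          "type": "DelimitedTextSink",
          "storeSettings": {
            "type": "AzureBlobStorageWriteSettings",
            "copyBehavior": "MergeFiles"
          }
        }
      },
      "inputs":  [ { "referenceName": "ExcelSourceDataset", "type": "DatasetReference" } ],
      "outputs": [ { "referenceName": "MergedCsvDataset", "type": "DatasetReference" } ]
    }
    ```

    With this configuration a single merged output file is produced, and every row carries an extra "filename" column identifying the Excel file it came from.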

    Hope this helps

    Thanks

    Himanshu

    Please click "Accept Answer" if the answer provided is useful, so that you can help others in the community looking for remediation for similar issues.

    1 person found this answer helpful.

1 additional answer

Sort by: Most helpful
  1. Sedat SALMAN 13,265 Reputation points
    2023-05-03T20:47:23.54+00:00

    Ref:

    https://learn.microsoft.com/en-us/azure/data-factory/tutorial-bulk-copy-portal

    https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-overview

    https://learn.microsoft.com/en-us/azure/data-factory/data-flow-derived-column

    Create a new pipeline and add a "Get Metadata" activity. Configure the "Source" dataset to point to the folder containing the Excel files. Set the "Field list" property to "Child Items".

    Add a "ForEach" activity and connect the output of the "Get Metadata" activity to the input of the "ForEach" activity. Set the "Items" property of the "ForEach" activity to @activity('Get Metadata').output.childItems.

    Inside the "ForEach" activity, add a "Copy Data" activity. Configure the source dataset to use the same folder and set the "File" property to @item().name. This will read each Excel file one by one.

    Create a new "Mapping Data Flow" activity inside the "ForEach" loop. Pipeline expressions such as @item().name cannot be used directly inside a data flow, so pass the file name in as a data flow parameter. Point the source dataset at the Excel files, use that parameter as the "File" property, and set the "Sheet" property to the sheet name containing the data.

    In the "Mapping Data Flow" activity, add a "Derived Column" transformation. Add a new column called "filename" and set its value to the file-name parameter. This adds a new column to your data containing the Excel file's name.

    Add a "Sink" transformation in the "Mapping Data Flow" activity. Configure the "Sink" dataset to point to the output location (CSV or Parquet file). Set the "Mapping" property to map the columns from the source data along with the additional "filename" column.

    Connect the output of the "Copy Data" activity to the input of the "Mapping Data Flow" activity.
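    As a rough sketch, the Get Metadata → ForEach wiring described above might look like the following pipeline fragment. The dataset name (ExcelFolderDataset), the data flow name (AddFileNameDataFlow), and the fileName parameter are placeholders I have assumed for illustration, not names from the original answer:

    ```json
    {
      "name": "LoopOverExcelFiles",
      "activities": [
        {
          "name": "Get Metadata",
          "type": "GetMetadata",
          "typeProperties": {
            "dataset": { "referenceName": "ExcelFolderDataset", "type": "DatasetReference" },
            "fieldList": [ "childItems" ]
          }
        },
        {
          "name": "ForEachFile",
          "type": "ForEach",
          "dependsOn": [
            { "activity": "Get Metadata", "dependencyConditions": [ "Succeeded" ] }
          ],
          "typeProperties": {
            "items": { "value": "@activity('Get Metadata').output.childItems", "type": "Expression" },
            "activities": [
              {
                "name": "AddFileNameAndWrite",
                "type": "ExecuteDataFlow",
                "typeProperties": {
                  "dataflow": {
                    "referenceName": "AddFileNameDataFlow",
                    "type": "DataFlowReference",
                    "parameters": {
                      "fileName": { "value": "'@{item().name}'", "type": "Expression" }
                    }
                  }
                }
              }
            ]
          }
        }
      ]
    }
    ```

    Note that if the data flow both reads the file and writes the sink, the separate "Copy Data" activity inside the loop becomes unnecessary; the data flow alone can read each file, add the filename column, and append to the output.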