How do I get the file names of Excel files, along with their content, and output them to a single CSV/Parquet file using Copy Data in Azure Data Factory?

Shemeer Badarudeen Khadeeja Beevi 20 Reputation points
2023-05-03T09:07:14.0133333+00:00

I have a collection of many files with the same data format, starting from a particular range, but with different, unique file names. Each of these files contains at least 8 sets of data. I have to read all of these files and write them into one output file, adding an extra column with the source file name for each row of data exported from these Excel files. I have to use the Copy Data activity for this purpose.

I have combined all of them into one file using the Copy Data activity but could not get the file names by any method. Could someone please guide me?

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Accepted answer
  1. HimanshuSinha-msft 19,386 Reputation points Microsoft Employee
    2023-05-03T21:51:43.47+00:00

    Hello @Shemeer Badarudeen Khadeeja Beevi

    Thanks for the question and using MS Q&A platform.

    You can use the Copy activity. While testing the implementation, I took the same dummy.csv, copied it, and renamed the copies. The schema remains the same (in my case the data also remains the same).

    I have four files with the same content.


    The key here is to enable the wildcard file path option on the source and use the "Additional columns" option to capture the file name.


    On the sink side, set the Copy behavior to "Merge files".

    This is the output on the sink side.

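    The setup above can be sketched as a Copy activity definition. This is a minimal, illustrative sketch only: the dataset names, the Blob storage connector, and the "input" folder path are assumptions, so adjust them to your own linked service and datasets. `$$FILEPATH` is the reserved variable that the Copy activity substitutes with each source file's path.

    ```json
    {
      "name": "MergeExcelFilesWithFileName",
      "type": "Copy",
      "typeProperties": {
        "source": {
          "type": "ExcelSource",
          "storeSettings": {
            "type": "AzureBlobStorageReadSettings",
            "wildcardFolderPath": "input",
            "wildcardFileName": "*.xlsx"
          },
          "additionalColumns": [
            { "name": "filename", "value": "$$FILEPATH" }
          ]
        },
        "sink": {
          "type": "DelimitedTextSink",
          "storeSettings": {
            "type": "AzureBlobStorageWriteSettings",
            "copyBehavior": "MergeFiles"
          }
        }
      },
      "inputs":  [ { "referenceName": "ExcelSourceDataset", "type": "DatasetReference" } ],
      "outputs": [ { "referenceName": "MergedCsvDataset", "type": "DatasetReference" } ]
    }
    ```

    With this configuration a single merged output file is produced, and every row carries an extra "filename" column identifying the Excel file it came from.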

    Hope this helps

    Thanks

    Himanshu

    Please click "Accept Answer" if the answer provided is useful, so that you can help others in the community looking for remediation for similar issues.

    1 person found this answer helpful.

1 additional answer

Sort by: Most helpful
  1. Sedat SALMAN 13,265 Reputation points
    2023-05-03T20:47:23.54+00:00

    Ref:

    https://learn.microsoft.com/en-us/azure/data-factory/tutorial-bulk-copy-portal

    https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-overview

    https://learn.microsoft.com/en-us/azure/data-factory/data-flow-derived-column

    Create a new pipeline and add a "Get Metadata" activity. Configure the "Source" dataset to point to the folder containing the Excel files. Set the "Field list" property to "Child Items".

    Add a "ForEach" activity and connect the output of the "Get Metadata" activity to the input of the "ForEach" activity. Set the "Items" property of the "ForEach" activity to @activity('Get Metadata').output.childItems.

    Inside the "ForEach" activity, add a "Copy Data" activity. Configure the source dataset to use the same folder and set the "File" property to @item().name. This will read each Excel file one by one.

    Create a new "Mapping Data Flow" activity inside the "ForEach" loop. Pipeline expressions such as @item().name cannot be used directly inside a data flow, so pass the file name in as a data flow parameter. Point the source dataset at the Excel files, use that parameter as the "File" property, and set the "Sheet" property to the sheet name containing the data.

    In the "Mapping Data Flow" activity, add a "Derived Column" transformation. Add a new column called "filename" and set its value to the file-name parameter. This adds a new column to your data containing the Excel file's name.

    Add a "Sink" transformation in the "Mapping Data Flow" activity. Configure the "Sink" dataset to point to the output location (CSV or Parquet file). Set the "Mapping" property to map the columns from the source data along with the additional "filename" column.

    Connect the output of the "Copy Data" activity to the input of the "Mapping Data Flow" activity.
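    As a rough sketch, the Get Metadata → ForEach wiring described above might look like the following pipeline fragment. The dataset name (ExcelFolderDataset), the data flow name (AddFileNameDataFlow), and the fileName parameter are placeholders I have assumed for illustration, not names from the original answer:

    ```json
    {
      "name": "LoopOverExcelFiles",
      "activities": [
        {
          "name": "Get Metadata",
          "type": "GetMetadata",
          "typeProperties": {
            "dataset": { "referenceName": "ExcelFolderDataset", "type": "DatasetReference" },
            "fieldList": [ "childItems" ]
          }
        },
        {
          "name": "ForEachFile",
          "type": "ForEach",
          "dependsOn": [
            { "activity": "Get Metadata", "dependencyConditions": [ "Succeeded" ] }
          ],
          "typeProperties": {
            "items": { "value": "@activity('Get Metadata').output.childItems", "type": "Expression" },
            "activities": [
              {
                "name": "AddFileNameAndWrite",
                "type": "ExecuteDataFlow",
                "typeProperties": {
                  "dataflow": {
                    "referenceName": "AddFileNameDataFlow",
                    "type": "DataFlowReference",
                    "parameters": {
                      "fileName": { "value": "'@{item().name}'", "type": "Expression" }
                    }
                  }
                }
              }
            ]
          }
        }
      ]
    }
    ```

    Note that if the data flow both reads the file and writes the sink, the separate "Copy Data" activity inside the loop becomes unnecessary; the data flow alone can read each file, add the filename column, and append to the output.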