question

tsaru72 asked:

Ideas to generate a "list of files" from ADLS Gen2 (CSV files) for an ADF Copy Data activity

The Data Factory/Synapse Copy Data activity source has a feature to point to a text file that lists each file we want to copy to the sink. The functionality works great, but I'm racking my brain over how to generate that text file in the first place from the files in blob storage. It worked because I created the file list manually and uploaded it to the blob, but that won't work in an end-to-end flow.

In the past, I've written a shell script to generate the file list and executed it before the session/mapping that does the actual load to staging tables, etc. (you know which ETL tool I'm talking about), but how can we do this in the Azure ADF landscape?

I'm thinking of leveraging a Get Metadata activity on the container, looping through each child item, and inserting the names into a database, then having a stored procedure group them into the respective file lists. But how can I make ADF create a blob storage file with the list of files in it? Another option is to merge all files using the same Get Metadata activity. This seems like a simple feature, and I don't mean to beat a dead horse, but I still don't have a clear design path for it.

Any guidance is greatly appreciated.
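
To illustrate, here is roughly what my old shell-script step did, sketched in Python against the azure-storage-blob SDK. This is only to show the intent, not a proposed design; the container name, connection string, and output file name are placeholders:

```python
# Sketch only: list the CSV files in a container and write their names
# to a file-list blob that the copy activity can point to.
# Assumes the azure-storage-blob package; all names below are placeholders.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("foo")

# Collect the CSV file names we want the copy activity to pick up.
names = [b.name for b in container.list_blobs() if b.name.endswith(".csv")]

# The copy activity's file list expects one relative path per line.
container.upload_blob("sales.txt", "\n".join(names), overwrite=True)
```

The question is how to get this kind of step into the ADF pipeline itself.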

Tags: azure-data-factory, azure-data-lake-storage

HarithaMaddi-MSFT answered:

Hi @tsaru72,

Welcome to Microsoft Q&A Platform. Thanks for posting the query.

One approach I can think of is below: use array and string variables to store the file names, which can later be copied into a blob file using a Copy activity.

[Animated GIF: 61363-filenameslistblob.gif — pipeline walkthrough]

Please let us know for further queries and we will be glad to assist.






tsaru72 commented:

Thanks, I much appreciate your approach and the GIF; it looks like this combination does work. Would it be possible to slow down the GIF, especially the Copy data1 part? I think you are pointing us in a good direction. Your approach is:

  1. Create two variables at the pipeline level (filelist of type String and array of type Array).

  2. In a Get Metadata activity, point to the ADLS/blob location and add "Child items" to the field list.

  3. Use the usual ForEach to loop over the activity's output.childItems.

  4. Inside the ForEach, create an Append Variable activity; set its Name to array and its Value to @item().name.

  5. After the ForEach, connect a Set Variable activity.

  6. Set its Name to filelist from the drop-down and its Value to @join(variables('array'), ',').

I just want to make sure that we document it well for others.
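
For documentation purposes, the variable handling above is equivalent to this plain-Python restatement (illustration only; the file names come from the example further down). Note that if the resulting blob is later used as a copy-activity file list, a newline separator, e.g. @join(variables('array'), decodeUriComponent('%0A')), may be more suitable than a comma:

```python
# Plain-Python restatement of the pipeline's variable logic (illustration only).
child_items = ["sales-01.csv", "sales-02.csv", "sales-03.csv"]  # Get Metadata "Child items"

array = []                   # the Array variable
for item in child_items:     # ForEach over output.childItems
    array.append(item)       # Append Variable: Value = @item().name

filelist = ",".join(array)   # Set Variable: @join(variables('array'), ',')
print(filelist)              # sales-01.csv,sales-02.csv,sales-03.csv
```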




tsaru72 added:

To give you the use case: say we have a container called foo with the files below:
sales-01.csv
sales-02.csv
sales-03.csv

Say the list file to be used in the Copy Data activity has to be sales.txt (in the same or a different container). The result will be a file called sales.txt containing:
sales-01.csv
sales-02.csv
sales-03.csv

Hope this example helps. To take it to the next level, say I also have product-01.csv and product-02.csv in the same foo container; then I should see another file called prod.txt containing:
product-01.csv
product-02.csv

I will configure the sales pipeline and the product pipeline to fire when sales.txt and prod.txt are created in the foo container. Hope this helps.
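
If the grouping step were done outside the pipeline, a sketch of it could look like the following, again assuming the azure-storage-blob SDK. The prefix rule (everything before the last "-") and the output file names are illustrative only; the sales.txt/prod.txt naming above would need a small extra mapping on top of this:

```python
# Sketch: group the CSV files in container "foo" by name prefix and
# write one list file per group (e.g. sales.txt, product.txt).
# Assumes azure-storage-blob; connection string and names are placeholders.
from collections import defaultdict
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("foo")

groups = defaultdict(list)
for blob in container.list_blobs():
    if blob.name.endswith(".csv"):
        prefix = blob.name.rsplit("-", 1)[0]   # "sales-01.csv" -> "sales"
        groups[prefix].append(blob.name)

# Creating these list files is what would fire the event-triggered pipelines.
for prefix, names in groups.items():
    container.upload_blob(f"{prefix}.txt", "\n".join(sorted(names)), overwrite=True)
```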

HarithaMaddi-MSFT replied:

Thanks @tsaru72 for writing out the GIF steps verbatim, as it will be helpful for the community reading this post. Please find configuration snapshots from the Copy activity below; I have also attached the JSON of the pipeline.

[Image: 61696-image.png — Copy activity configuration]

[Image: 61697-image.png — Copy activity configuration]

[Attachment: 61698-pipelinejson.txt — pipeline JSON]

Also, thanks for sharing the use case. Please let us know for further queries and we will be glad to assist.


tsaru72 replied to HarithaMaddi-MSFT:

Awesome! Would you also share screenshots of DelimitedText3 (the source dataset) and DelimitedText2 (the sink dataset)? That way we document all the artifacts. The more I think about it, this use case and your solution are a good candidate for the product documentation, under the how-to section. Let me try your solution and I will keep you posted.

tsaru72 answered:

My sincere apologies for the delayed response.
I ended up going with a data flow. The sink was of type ADLS Gen2 with schema drift enabled, and in the sink settings I chose the file name option "Name file as column data".
