question

tsaru72 asked:

Ideas to generate a "list of files" from ADLS Gen2 (CSV files) for an ADF Copy Data activity

The Data Factory/Synapse Copy Data activity source has a feature to point to a text file that lists each file we want to copy to the sink. The functionality works great, but I'm racking my brain over how to generate that text file in the first place from the files in blob storage. It worked because I created the file list manually and uploaded it to the blob, but that won't work in an end-to-end flow.

In the past, I've written a shell script to generate the file list and executed it before the session/mapping that does the actual load to staging tables, etc. (you know which ETL tool I'm talking about), but how can we do this in the Azure ADF landscape?

I'm thinking of leveraging a Get Metadata activity on the container, looping through each child item, and inserting the names into a database, then having a stored procedure group them into the respective file lists. But how can I make ADF create a blob storage file with the list of files in it? Another option is to merge all files using the same Get Metadata activity. This seems like a simple feature, and I don't mean to beat a dead horse, but I still don't have a clear design path for it.

Any guidance is greatly appreciated.
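
To illustrate, here is roughly what my old shell-script step did, sketched in Python against the azure-storage-blob SDK. This is only to show the intent, not a proposed design; the container name, connection string, and output file name are placeholders:

```python
# Sketch only: list the CSV files in a container and write their names
# to a file-list blob that the copy activity can point to.
# Assumes the azure-storage-blob package; all names below are placeholders.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("foo")

# Collect the CSV file names we want the copy activity to pick up.
names = [b.name for b in container.list_blobs() if b.name.endswith(".csv")]

# The copy activity's file list expects one relative path per line.
container.upload_blob("sales.txt", "\n".join(names), overwrite=True)
```

The question is how to get this kind of step into the ADF pipeline itself.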

Tags: azure-data-factory, azure-data-lake-storage

HarithaMaddi-MSFT answered:

Hi @tsaru72,

Welcome to Microsoft Q&A Platform. Thanks for posting the query.

One approach I can think of is below: use array and string variables to store the file names, which can later be copied into a blob file using a Copy activity.

[Animated GIF: 61363-filenameslistblob.gif — pipeline walkthrough]

Please let us know for further queries and we will be glad to assist.






tsaru72 commented:

Thanks, I much appreciate your approach and the GIF; it looks like this combination does work. Would it be possible to slow down the GIF, especially the Copy data1 part? I think you are pointing us in a good direction. Your approach is:

  1. Create two variables at the pipeline level (filelist of type String and array of type Array).

  2. In a Get Metadata activity, point to the ADLS/blob location and add "Child items" to the field list.

  3. Use the usual ForEach to loop over the activity's output.childItems.

  4. Inside the ForEach, create an Append Variable activity; set its Name to array and its Value to @item().name.

  5. After the ForEach, connect a Set Variable activity.

  6. Set its Name to filelist from the drop-down and its Value to @join(variables('array'), ',').

I just want to make sure that we document it well for others.
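
For documentation purposes, the variable handling above is equivalent to this plain-Python restatement (illustration only; the file names come from the example further down). Note that if the resulting blob is later used as a copy-activity file list, a newline separator, e.g. @join(variables('array'), decodeUriComponent('%0A')), may be more suitable than a comma:

```python
# Plain-Python restatement of the pipeline's variable logic (illustration only).
child_items = ["sales-01.csv", "sales-02.csv", "sales-03.csv"]  # Get Metadata "Child items"

array = []                   # the Array variable
for item in child_items:     # ForEach over output.childItems
    array.append(item)       # Append Variable: Value = @item().name

filelist = ",".join(array)   # Set Variable: @join(variables('array'), ',')
print(filelist)              # sales-01.csv,sales-02.csv,sales-03.csv
```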




tsaru72 added:

To give you the use case: say we have a container called foo with the files below:
sales-01.csv
sales-02.csv
sales-03.csv

Say the list file to be used in the Copy Data activity has to be sales.txt (in the same or a different container). The result will be a file called sales.txt containing:
sales-01.csv
sales-02.csv
sales-03.csv

Hope this example helps. To take it to the next level, say I also have product-01.csv and product-02.csv in the same foo container; then I should see another file called prod.txt containing:
product-01.csv
product-02.csv

I will configure the sales pipeline and the product pipeline to fire when sales.txt and prod.txt are created in the foo container. Hope this helps.
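
If the grouping step were done outside the pipeline, a sketch of it could look like the following, again assuming the azure-storage-blob SDK. The prefix rule (everything before the last "-") and the output file names are illustrative only; the sales.txt/prod.txt naming above would need a small extra mapping on top of this:

```python
# Sketch: group the CSV files in container "foo" by name prefix and
# write one list file per group (e.g. sales.txt, product.txt).
# Assumes azure-storage-blob; connection string and names are placeholders.
from collections import defaultdict
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("foo")

groups = defaultdict(list)
for blob in container.list_blobs():
    if blob.name.endswith(".csv"):
        prefix = blob.name.rsplit("-", 1)[0]   # "sales-01.csv" -> "sales"
        groups[prefix].append(blob.name)

# Creating these list files is what would fire the event-triggered pipelines.
for prefix, names in groups.items():
    container.upload_blob(f"{prefix}.txt", "\n".join(sorted(names)), overwrite=True)
```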

HarithaMaddi-MSFT replied:

Thanks @tsaru72 for writing out the GIF steps verbatim, as it will be helpful for the community reading this post. Please find configuration snapshots from the Copy activity below; I have also attached the JSON of the pipeline.

[Image: 61696-image.png — Copy activity configuration]

[Image: 61697-image.png — Copy activity configuration]

[Attachment: 61698-pipelinejson.txt — pipeline JSON]

Also, thanks for sharing the use case. Please let us know for further queries and we will be glad to assist.


tsaru72 replied to HarithaMaddi-MSFT:

Awesome! Would you also share screenshots of DelimitedText3 (the source dataset) and DelimitedText2 (the sink dataset)? That way we document all the artifacts. The more I think about it, this use case and your solution are a good candidate for the product documentation, under the how-to section. Let me try your solution and I will keep you posted.

tsaru72 answered:

My sincere apologies for the delayed response.
I ended up going with a data flow. The sink was of type ADLS Gen2 with schema drift enabled, and in the sink settings I chose the file name option "Name file as column data".
