Pull files from nested S3 bucket folders and save them with customized names in the data lake (case: files in different S3 bucket folders all have the same name)

Amar Agnihotri 926 Reputation points
2022-12-13T06:30:48.47+00:00

Hi,
I have this hierarchy of folders in an S3 bucket:
[screenshot: S3 folder hierarchy]

Each folder contains parquet files with the same names. For example, the month=1 folder has these files:
[screenshot: files in the month=1 folder]

and the month=2 folder has these files:
[screenshot: files in the month=2 folder]

I have built this pipeline so far:
[screenshots: pipeline layout]

First, my Get Metadata activity grabs the folder names:
[screenshot: Get Metadata output listing the folder names]

Then my ForEach activity iterates through each item coming from the Get Metadata activity to get inside each folder, and the Copy activity uses that folder name. I tried using a wildcard path to copy all the parquet files to a single folder in the data lake:
[screenshot: Copy activity source settings]
This is my dynamic expression inside the wildcard path:

@concat('athena/daily_cost_report/daily_cost_report/year=2022/', item().name, '/')
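
For reference, here is roughly how the pieces fit together (a sketch; the activity name Get Metadata1 and the *.snappy.parquet wildcard file name are assumptions):

ForEach items:        @activity('Get Metadata1').output.childItems
Wildcard folder path: @concat('athena/daily_cost_report/daily_cost_report/year=2022/', item().name)
Wildcard file name:   *.snappy.parquet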

My Copy activity works fine, but since the file names are the same across the different month folders, each run of the Copy activity overwrites the files already in the sink.

I want to save the files under new names. The file names coming from the S3 bucket look like this:

daily_cost_report-00001.snappy.parquet

Now I want to save these files with a suffix.

For example, for the 2022 January files:
daily_cost_report-00001_2022_01.snappy.parquet
daily_cost_report-00002_2022_01.snappy.parquet
daily_cost_report-00003_2022_01.snappy.parquet
For the 2022 February files:
daily_cost_report-00001_2022_02.snappy.parquet
daily_cost_report-00002_2022_02.snappy.parquet
and so on.

I am not able to get the file names inside the ForEach. How can we achieve this?

Please suggest @MartinJaffer-MSFT

1 answer

  1. MartinJaffer-MSFT 26,236 Reputation points
    2022-12-13T22:26:17.287+00:00

    Ahh, I was under the impression we were targeting the folders and leaving the filenames as-is, @Amar Agnihotri.

    Changing the filenames between source and sink requires copying each file individually, so that the names can be assigned individually. This is done by parameterizing the sink dataset. I don't think it can be done all at once using wildcards; note that the wildcard file path is available only in the source, not the sink.
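
    For illustration, here is a minimal sketch of a parameterized sink dataset, assuming a Parquet dataset on ADLS Gen2 (the parameter name sinkFileName and the container/folder names are hypothetical):

        "parameters": { "sinkFileName": { "type": "String" } },
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "datalake",
                "folderPath": "daily_cost_report",
                "fileName": { "value": "@dataset().sinkFileName", "type": "Expression" }
            }
        }

    The Copy activity's sink then passes a value for sinkFileName on each iteration.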

    Data Flow has more options when it comes to naming sink files.

    Using Get Metadata to get the list of filenames and iterating over that is something you probably want to avoid, but I'm not coming up with many workarounds so far.
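
    If you do go that route, the rename itself can be handled in the expression that feeds the sink file name. A sketch, assuming the inner loop iterates over file names returned by a second Get Metadata activity, and the month folder name (e.g. month=1) arrives as a hypothetical pipeline parameter monthFolder; note that ForEach activities cannot be nested directly, so the inner loop would live in a child pipeline invoked via Execute Pipeline:

        @replace(item().name, '.snappy.parquet',
            concat('_2022_',
                if(less(length(split(pipeline().parameters.monthFolder, '=')[1]), 2),
                    concat('0', split(pipeline().parameters.monthFolder, '=')[1]),
                    split(pipeline().parameters.monthFolder, '=')[1]),
                '.snappy.parquet'))

    For month=1 and daily_cost_report-00001.snappy.parquet this yields daily_cost_report-00001_2022_01.snappy.parquet.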

