ADF Copy Activity to merge files in single folder based on name pattern

Jatinder Luthra 130 Reputation points
2023-07-17T22:15:26.79+00:00

Hello Folks,

I have name-partitioned files in Azure Data Lake Storage that need to be merged. I believe the Copy activity is the right option to do so.

The challenge is that all the files sit in a single folder and follow a specific naming convention, and a single Copy activity should merge them by grouping files on that name pattern. Here is an example file list:

  • data_0_0_0.snappy.parquet_1
  • data_0_1_0.snappy.parquet_1
  • data_0_0_0.snappy.parquet_2
  • data_0_0_0.snappy.parquet_3
  • data_0_1_0.snappy.parquet_2
  • data_0_2_0.snappy.parquet_1
  • data_0_2_0.snappy.parquet_2

The expected merge outcome should be a single file for each pattern, as below:

  • data_0_0_0.snappy.parquet
  • data_0_1_0.snappy.parquet
  • data_0_2_0.snappy.parquet

I'd appreciate any advice on how to achieve this scenario.

Azure Data Lake Storage
Azure Data Factory

Accepted answer
  1. AnnuKumari-MSFT 34,556 Reputation points Microsoft Employee Moderator
    2023-07-19T05:59:21.8433333+00:00

    Hi Jatinder Luthra,

    Thank you for using the Microsoft Q&A platform and for posting your question here.

    As I understand your question, you are trying to merge a set of files into one file using a copy activity in an ADF pipeline. Please let me know if that's not the correct understanding of your query.

    You can leverage a wildcard file pattern together with a pipeline parameter to achieve this requirement. However, I would like to stress that the source file extensions, i.e. parquet_1, parquet_2, etc., are not valid file extensions. The system won't be able to render them correctly, since they are unrecognized extensions.

    That said, the following solution for merging the files should work for your scenario:

    • Create a pipeline parameter, say 'param'.
    • Drag a copy activity into your ADF pipeline.
    • In the source, select 'Wildcard file path' as the file path type and use this expression in the wildcard file path:
    data_0_@{pipeline().parameters.param}_0.snappy.parquet_*
    
    • In the sink dataset, create a parameter named 'outputfilename', point the dataset to the output container, and in the file name use the created parameter with this expression:
    @dataset().outputfilename
    
    • In the sink settings, provide this expression as the parameter value:
    data_0_@{pipeline().parameters.param}_0.snappy.parquet
    
    • Change the copy behaviour to 'Merge files' (see the illustrative JSON sketch after the note below).
    • Execute the pipeline, providing the param value as 0, 1, 2, 3 in each run.

    Note: Kindly make sure that the schema of the files that need to be merged together is the same. Column names, column order, and the number of columns must match for the merge to happen.
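
    For reference, here is a rough sketch of what the copy activity JSON could look like once the steps above are configured. This is only an illustration: the dataset names, the 'input' folder path, and the parameter name 'param' are placeholders, and the exact JSON that ADF generates for your pipeline may differ slightly.

    {
        "name": "MergeFilesByPattern",
        "type": "Copy",
        "typeProperties": {
            "source": {
                "type": "ParquetSource",
                "storeSettings": {
                    "type": "AzureBlobFSReadSettings",
                    "recursive": false,
                    "wildcardFolderPath": "input",
                    "wildcardFileName": {
                        "value": "data_0_@{pipeline().parameters.param}_0.snappy.parquet_*",
                        "type": "Expression"
                    }
                }
            },
            "sink": {
                "type": "ParquetSink",
                "storeSettings": {
                    "type": "AzureBlobFSWriteSettings",
                    "copyBehavior": "MergeFiles"
                }
            }
        },
        "inputs": [
            { "referenceName": "SourceDataset", "type": "DatasetReference" }
        ],
        "outputs": [
            {
                "referenceName": "SinkDataset",
                "type": "DatasetReference",
                "parameters": {
                    "outputfilename": {
                        "value": "data_0_@{pipeline().parameters.param}_0.snappy.parquet",
                        "type": "Expression"
                    }
                }
            }
        ]
    }

    (If you follow the CSV repro below instead, the source and sink types would be DelimitedTextSource and DelimitedTextSink rather than ParquetSource and ParquetSink.)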

    For the repro, I have taken CSV files instead of parquet. Kindly check the below video for your reference: (video: mergesetoffiles1)

    Next run:

    (video: mergesetoffiles2)

    Hope it helps. Kindly accept the answer by clicking on the 'Accept answer' button and take the survey to mark the answer as helpful. Thank you.

    3 people found this answer helpful.

1 additional answer

Sort by: Most helpful
  1. Jatinder Luthra 130 Reputation points
    2023-07-21T19:36:39.35+00:00

    Thanks @AnnuKumari-MSFT, this is a very helpful and clear answer.

    Appreciate it a lot.

