ADF Copy Activity to merge files in single folder based on name pattern

Jatinder Luthra 130 Reputation points
2023-07-17T22:15:26.79+00:00

Hello Folks,

I have name-partitioned files in Azure Data Lake Storage that need to be merged. I believe the Copy activity is the right option to do so.

The challenge is that all the files sit in a single folder and follow a specific naming convention, and a single Copy activity should merge them by grouping files on that name pattern. Here is an example file list:

  • data_0_0_0.snappy.parquet_1
  • data_0_1_0.snappy.parquet_1
  • data_0_0_0.snappy.parquet_2
  • data_0_0_0.snappy.parquet_3
  • data_0_1_0.snappy.parquet_2
  • data_0_2_0.snappy.parquet_1
  • data_0_2_0.snappy.parquet_2

The expected merge outcome should be a single file for each pattern, as below:

  • data_0_0_0.snappy.parquet
  • data_0_1_0.snappy.parquet
  • data_0_2_0.snappy.parquet

I'd appreciate any advice on how to achieve this scenario.

Azure Data Lake Storage
Azure Data Factory

Accepted answer
  1. AnnuKumari-MSFT 34,556 Reputation points Microsoft Employee Moderator
    2023-07-19T05:59:21.8433333+00:00

    Hi Jatinder Luthra,

    Thank you for using the Microsoft Q&A platform and for posting your question here.

    As I understand your question, you are trying to merge a set of files into one file using a copy activity in an ADF pipeline. Please let me know if that's not the correct understanding of your query.

    You can leverage a wildcard file pattern together with a pipeline parameter to achieve this requirement. However, I would like to stress that the source file extensions, i.e. parquet_1, parquet_2, etc., are not valid file extensions. The system won't be able to render them correctly, since they are unrecognized extensions.

    That said, the following solution for merging the files should work for your scenario:

    • Create a pipeline parameter, say 'param'.
    • Drag a copy activity into your ADF pipeline.
    • In the source, select 'Wildcard file path' as the file path type and use this expression in the wildcard file path:
    data_0_@{pipeline().parameters.param}_0.snappy.parquet_*
    
    • In the sink dataset, create a parameter named 'outputfilename', point the dataset to the output container, and in the file name use the created parameter with this expression:
    @dataset().outputfilename
    
    • In the sink settings, provide this expression as the parameter value:
    data_0_@{pipeline().parameters.param}_0.snappy.parquet
    
    • Change the copy behaviour to 'Merge files' (see the illustrative JSON sketch after the note below).
    • Execute the pipeline, providing the param value as 0, 1, 2, 3 in each run.

    Note: Kindly make sure that the schema of the files that need to be merged together is the same. Column names, column order, and the number of columns must match for the merge to happen.
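
    For reference, here is a rough sketch of what the copy activity JSON could look like once the steps above are configured. This is only an illustration: the dataset names, the 'input' folder path, and the parameter name 'param' are placeholders, and the exact JSON that ADF generates for your pipeline may differ slightly.

    {
        "name": "MergeFilesByPattern",
        "type": "Copy",
        "typeProperties": {
            "source": {
                "type": "ParquetSource",
                "storeSettings": {
                    "type": "AzureBlobFSReadSettings",
                    "recursive": false,
                    "wildcardFolderPath": "input",
                    "wildcardFileName": {
                        "value": "data_0_@{pipeline().parameters.param}_0.snappy.parquet_*",
                        "type": "Expression"
                    }
                }
            },
            "sink": {
                "type": "ParquetSink",
                "storeSettings": {
                    "type": "AzureBlobFSWriteSettings",
                    "copyBehavior": "MergeFiles"
                }
            }
        },
        "inputs": [
            { "referenceName": "SourceDataset", "type": "DatasetReference" }
        ],
        "outputs": [
            {
                "referenceName": "SinkDataset",
                "type": "DatasetReference",
                "parameters": {
                    "outputfilename": {
                        "value": "data_0_@{pipeline().parameters.param}_0.snappy.parquet",
                        "type": "Expression"
                    }
                }
            }
        ]
    }

    (If you follow the CSV repro below instead, the source and sink types would be DelimitedTextSource and DelimitedTextSink rather than ParquetSource and ParquetSink.)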

    For the repro, I have taken CSV files instead of parquet. Kindly check the below video for your reference: (video: mergesetoffiles1)

    Next run:

    (video: mergesetoffiles2)

    Hope it helps. Kindly accept the answer by clicking on the 'Accept answer' button and take the survey to mark the answer as helpful. Thank you.

    3 people found this answer helpful.

1 additional answer

Sort by: Most helpful
  1. Jatinder Luthra 130 Reputation points
    2023-07-21T19:36:39.35+00:00

    Thanks @AnnuKumari-MSFT, this is a very helpful and clear answer.

    Appreciate it a lot.

