Want to filter columns from multiple Parquet files and then combine them into a single file using mapping data flows in ADF

Amar Agnihotri 926 Reputation points
2022-12-01T16:34:59.587+00:00

Hi,
I have 11 Parquet files in the data lake:
[screenshot: list of the 11 Parquet files]

Each file has 260 columns and approximately 500k rows. I have to pull all 11 files, reduce each one to just 60 columns, and then combine them into a single file.
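For context on why trimming columns early should be cheap: Parquet is a columnar format, so a reader that requests only the needed columns never reads the rest from storage. A minimal pyarrow sketch, with hypothetical file and column names:

```python
import pyarrow.parquet as pq

# The 60 columns to keep; names here are hypothetical placeholders.
keep_cols = ["col_01", "col_02", "col_03"]  # ...plus the remaining 57

# Because Parquet is columnar, only the requested column chunks are
# read; the other ~200 columns are never pulled from storage.
table = pq.read_table("file_01.parquet", columns=keep_cols)
print(table.num_rows, table.num_columns)
```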

I am trying to achieve this with mapping data flows as follows.

I created 11 incoming streams, one per file, then used UNION to combine them all, selected only the 60 useful columns in the mapping section (I used manual mapping), and then used a SINK with the output-to-single-file option.

[screenshot: data flow with 11 sources, a UNION, and a sink]

These are my settings for the incoming streams
[screenshots: source settings]

These are the settings in UNION
[screenshots: union settings]

These are the settings in the SINK
[screenshots: sink settings]

Now when I run this data flow it takes an extremely long time, even with 64 cores on the Data Flow activity. I believe this is because the UNION of all the files happens first, producing a huge intermediate dataset of 260 columns and roughly 5.5M rows (11 files × ~500k rows each), and only then is the column selection applied to that gigantic dataset.

I want to change my logic now: filter the columns down to 60 on each of the 11 files first, and only then union the 11 narrowed streams.
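For reference, the reordered logic would look roughly like this outside of data flows, as a minimal PySpark sketch (paths and column names are hypothetical placeholders):

```python
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical ADLS paths and column names, standing in for the real ones.
paths = [f"abfss://container@account.dfs.core.windows.net/file_{i:02d}.parquet"
         for i in range(1, 12)]
keep_cols = ["col_01", "col_02", "col_03"]  # ...plus the remaining 57

# Project each file down to 60 columns first, so the union only ever
# sees narrow data; the other 200 columns are pruned at the source.
narrowed = [spark.read.parquet(p).select(*keep_cols) for p in paths]

# Union the already-narrowed DataFrames by column name.
combined = reduce(lambda a, b: a.unionByName(b), narrowed)
```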

Can anyone suggest how to achieve this in mapping data flows, or is there a better way to do it efficiently and cost-effectively?

Thanks


1 answer

  1. Nasreen Akter 10,811 Reputation points Volunteer Moderator
    2022-12-01T20:37:21.937+00:00

    Hi @Amar Agnihotri ,

    I was wondering if there is a reason to use 11 different datasets as sources?

    You can simply use one dataset for all the files and then have a DataFlow --> SELECT transformation to reduce the number of columns >> then do any other transformations if there are any >> then SINK. Hope this helps, thanks!

    [screenshot: data flow with a single source, a SELECT, and a sink]
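For comparison, the same pattern outside of data flows, as a minimal PySpark sketch (one wildcard read, one select, one single-file write; paths and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

keep_cols = ["col_01", "col_02", "col_03"]  # ...plus the remaining 57

# One wildcard source instead of 11 datasets; the select prunes the
# schema to 60 columns before anything wide is materialized.
df = (spark.read
      .parquet("abfss://container@account.dfs.core.windows.net/input/*.parquet")
      .select(*keep_cols))

# coalesce(1) mirrors the sink's "output to single file" option: all
# rows are funneled through one task to produce a single output file.
(df.coalesce(1)
   .write.mode("overwrite")
   .parquet("abfss://container@account.dfs.core.windows.net/output/"))
```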

