Hi @Dinesh Prajapati ,
Thanks for using Microsoft Q&A forum and posting your query.
In order to remove duplicate records yes you can utilize Aggregate transformation. But the trick part in this requirement is to create a dynamic/generic flow which can help process any CSV file irrespective of schema changes. Please follow below steps to achieve the same:
-
- First use a derived column transformation to create a column that holds the string array of drifted column names (expression:
columnNames('source1', true()))
- First use a derived column transformation to create a column that holds the string array of drifted column names (expression:
- Then use a select transformation to select on the newly created column which holds the array of column names.
- Then use
sink cache
to write data the array columnNames into Spark cache instead of data store so that we can use these values in a different stream where we implement the aggregation to remove duplicates.
- Next add a data source in a new stream pointing to your folder where you have the excel files and make sure to check box
Allow schema drift
as you are having multiple excel files which uses this data flow for transforming the data.
-
- Then have a derived column transformation to create a group by column for doing the aggregate in the subsequent transformation. In this column in dynamic expression we will refer to the cached Column names from previous stream using this expression:
array(byNames(sinkCacheColumnNames#output().src_ColumnNames, 'source2'))
-
- Now have an aggregate transformation where we will remove duplicate columns grouping by
AggregateColumn
created in previous derived column transformation as shown below, and then InAggregates
settings, add column pattern as below:
- Now have an aggregate transformation where we will remove duplicate columns grouping by
- Then finally have a sink transformation and point to your desired sink and configure rule based mapping as below under
Mapping
section of your sink.
Here is how my source and final sink data looks like:
Source Data:
Final Sink Data:
Hope this helps.
Please don’t forget to Accept Answer
and Yes
for "was this answer helpful" wherever the information provided helps you, this can be beneficial to other community members.