Remove duplicate rows from xlsx in Azure Data Factory

Dinesh Prajapati 126 Reputation points
2023-03-10T08:44:55.0566667+00:00

Hi team,

I want to remove duplicate rows from an xlsx file. If the data in all columns of one row matches the data in all columns of another row, we need to remove one of those rows from the xlsx.
I searched and learned about the Aggregate transformation, but there I need to map the column names manually. My pipeline is generic and around 30 xlsx files will go through it, so how can we handle this dynamically?

Azure Data Factory

3 answers

  1. KranthiPakala-MSFT 46,642 Reputation points Microsoft Employee Moderator
    2023-03-11T01:12:21.9833333+00:00

    Hi @Dinesh Prajapati ,

    Thanks for using Microsoft Q&A forum and posting your query.

    To remove duplicate records, yes, you can utilize the Aggregate transformation. But the tricky part of this requirement is to create a dynamic/generic flow that can process any Excel file irrespective of schema changes. Please follow the steps below to achieve this:

      1. First use a Derived Column transformation to create a column that holds the string array of drifted column names (expression: columnNames('source1', true())).
      2. Then use a Select transformation to select only the newly created column that holds the array of column names.
      3. Then use a cached sink to write the columnNames array into the Spark cache instead of a data store, so that these values can be used in a different stream where the aggregation that removes duplicates is implemented.
      4. Next, add a data source in a new stream pointing to the folder that holds your Excel files, and make sure to check Allow schema drift, since multiple Excel files with differing schemas will use this data flow.
      5. Then add a Derived Column transformation to create a group-by column for the aggregate in the subsequent transformation. In this column's dynamic expression, refer to the cached column names from the previous stream: array(byNames(sinkCacheColumnNames#output().src_ColumnNames, 'source2'))
      6. Now add an Aggregate transformation that removes duplicate rows by grouping on the AggregateColumn created in the previous Derived Column transformation; in the Aggregates settings, add a column pattern.
      7. Finally, add a Sink transformation pointing to your desired sink and configure rule-based mapping under the Mapping section of the sink.
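    The schema-agnostic deduplication the steps above build (match on every column, keep one copy) can be sketched outside ADF with pandas as a quick way to validate the expected output; the sample data here is hypothetical and stands in for one of the xlsx files:

    ```python
    import pandas as pd

    # Hypothetical sample standing in for one of the ~30 xlsx files; in
    # practice this would come from pd.read_excel("input.xlsx").
    df = pd.DataFrame({"id": [1, 1, 2], "name": ["a", "a", "b"]})

    # Keep the first occurrence of each fully-identical row; matching is on
    # all columns, so no column names need to be hard-coded.
    deduped = df.drop_duplicates(keep="first").reset_index(drop=True)
    print(len(deduped))  # 2 rows remain
    ```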

    Here is how my source and final sink data look:

    Source Data: [screenshot]

    Final Sink Data: [screenshot]

    Hope this helps.


    Please don’t forget to Accept Answer and Yes for "was this answer helpful" wherever the information provided helps you, this can be beneficial to other community members.


  2. KranthiPakala-MSFT 46,642 Reputation points Microsoft Employee Moderator
    2023-03-15T23:26:47.1366667+00:00

    @Dinesh Prajapati Actually, you can identify distinct rows just by using the Aggregate transformation. Sorry that I overlooked the requirement initially. Please add the code below to your data flow script; it creates an Aggregate transformation pointing to your source, which satisfies your requirement.

    aggregate(groupBy(mycols = sha2(256,columns())),
        each(match(true()), $$ = first($$))) ~> DistinctRows
    

    Here is the documentation related to it: Mapping data flow - Distinct row using all columns
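    The idea behind sha2(256, columns()) above, hashing every column value of a row and keeping the first row per hash, can be mimicked in pandas for a sanity check (data and column names here are hypothetical):

    ```python
    import hashlib

    import pandas as pd

    # Hypothetical data with one exact-duplicate row.
    df = pd.DataFrame({"id": [1, 1, 2], "val": ["x", "x", "y"]})

    # Mirror sha2(256, columns()): hash the concatenation of all column
    # values per row, then keep the first row for each hash.
    row_hash = df.astype(str).agg("|".join, axis=1).map(
        lambda s: hashlib.sha256(s.encode()).hexdigest()
    )
    distinct = df[~row_hash.duplicated()].reset_index(drop=True)
    ```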


    Here is a demo of how to add the above script to your data flow script:

    DataFlowDistinctRows

    Hope this helps. Let me know if you have any questions.



  3. Dinesh Prajapati 126 Reputation points
    2023-04-10T08:56:04.0266667+00:00

    Hi, I implemented this logic, but for some files that have UUIDs in the Excel, I am getting an issue. The issue is very weird. Can we try something else for this?

