Remove duplicate rows from xlsx in Azure Data Factory

Dinesh Prajapati 126 Reputation points
2023-03-10T08:44:55.0566667+00:00

Hi team,

I want to remove duplicate rows from an xlsx file. If the data in all columns of one row matches the data in all columns of another row, we need to remove one of those rows from the xlsx.
I searched and learned about the Aggregate transformation, but there I need to map the column names manually. My pipeline is generic and around 30 xlsx files will go through it, so how can we handle this dynamically?

Azure Data Factory

3 answers

  1. KranthiPakala-MSFT 46,642 Reputation points Microsoft Employee Moderator
    2023-03-11T01:12:21.9833333+00:00

    Hi @Dinesh Prajapati ,

    Thanks for using Microsoft Q&A forum and posting your query.

    To remove duplicate records, yes, you can utilize the Aggregate transformation. But the tricky part of this requirement is to create a dynamic/generic flow that can process any Excel file irrespective of schema changes. Please follow the steps below to achieve this:

      1. First use a Derived Column transformation to create a column that holds the string array of drifted column names (expression: columnNames('source1', true())).
      2. Then use a Select transformation to select only the newly created column that holds the array of column names.
      3. Then use a cached sink to write the columnNames array into the Spark cache instead of a data store, so that these values can be used in a different stream where the aggregation that removes duplicates is implemented.
      4. Next, add a data source in a new stream pointing to the folder that holds your Excel files, and make sure to check Allow schema drift, since multiple Excel files with differing schemas will use this data flow.
      5. Then add a Derived Column transformation to create a group-by column for the aggregate in the subsequent transformation. In this column's dynamic expression, refer to the cached column names from the previous stream: array(byNames(sinkCacheColumnNames#output().src_ColumnNames, 'source2'))
      6. Now add an Aggregate transformation that removes duplicate rows by grouping on the AggregateColumn created in the previous Derived Column transformation; in the Aggregates settings, add a column pattern.
      7. Finally, add a Sink transformation pointing to your desired sink and configure rule-based mapping under the Mapping section of the sink.
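    The schema-agnostic deduplication the steps above build (match on every column, keep one copy) can be sketched outside ADF with pandas as a quick way to validate the expected output; the sample data here is hypothetical and stands in for one of the xlsx files:

    ```python
    import pandas as pd

    # Hypothetical sample standing in for one of the ~30 xlsx files; in
    # practice this would come from pd.read_excel("input.xlsx").
    df = pd.DataFrame({"id": [1, 1, 2], "name": ["a", "a", "b"]})

    # Keep the first occurrence of each fully-identical row; matching is on
    # all columns, so no column names need to be hard-coded.
    deduped = df.drop_duplicates(keep="first").reset_index(drop=True)
    print(len(deduped))  # 2 rows remain
    ```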

    Here is how my source and final sink data look:

    Source Data: [screenshot]

    Final Sink Data: [screenshot]

    Hope this helps.


    Please don’t forget to Accept Answer and Yes for "was this answer helpful" wherever the information provided helps you, this can be beneficial to other community members.


  2. KranthiPakala-MSFT 46,642 Reputation points Microsoft Employee Moderator
    2023-03-15T23:26:47.1366667+00:00

    @Dinesh Prajapati Actually, you can identify distinct rows just by using the Aggregate transformation. Sorry that I overlooked the requirement initially. Please add the code below to your data flow script; it creates an Aggregate transformation pointing to your source, which satisfies your requirement.

    aggregate(groupBy(mycols = sha2(256,columns())),
        each(match(true()), $$ = first($$))) ~> DistinctRows
    

    Here is the documentation related to it: Mapping data flow - Distinct row using all columns
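    The idea behind sha2(256, columns()) above, hashing every column value of a row and keeping the first row per hash, can be mimicked in pandas for a sanity check (data and column names here are hypothetical):

    ```python
    import hashlib

    import pandas as pd

    # Hypothetical data with one exact-duplicate row.
    df = pd.DataFrame({"id": [1, 1, 2], "val": ["x", "x", "y"]})

    # Mirror sha2(256, columns()): hash the concatenation of all column
    # values per row, then keep the first row for each hash.
    row_hash = df.astype(str).agg("|".join, axis=1).map(
        lambda s: hashlib.sha256(s.encode()).hexdigest()
    )
    distinct = df[~row_hash.duplicated()].reset_index(drop=True)
    ```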


    Here is a demo of how to add the above script to your data flow script:

    DataFlowDistinctRows

    Hope this helps. Let me know if you have any questions.



  3. Dinesh Prajapati 126 Reputation points
    2023-04-10T08:56:04.0266667+00:00

    Hi, I implemented this logic, but for some files that have UUIDs in the Excel, I am getting an issue. The issue is very weird. Can we try something else for this?

