Synapse SQL Database distinct has duplicate results

Xia 0 Reputation points
2023-08-30T18:50:07.0966667+00:00

I have a table in Azure SQL Database with all the columns in format nvarchar(4000), here's an example of values

column names : col_A, col_B, col_C, col_D

values: a, b, c, d

values: a, b, c, d

I see that those two lines are duplicates.

query: select distinct * from table_name returns both two lines

query: select distinct col_A from table_name where col_B=b, col_C=c, col_D=d returns two lines of value a

query: select distinct col_A from table_name where col_B=b, col_C=c returns one line of value a

query: select distinct col_A from table_name where col_B=b, col_D=d returns one line of value a

I found this strange result because when I try to replace COPY ACTIVITY with Dataflow, it insert the duplicates instead of updating based on the primary key, so I run some queries to test.

Before, I read csv files and use copy activity to upsert (base on both 4 columns as composite primary key), there was no duplicates

Now, I replaced it with data flow to handle the same file and I did set all keys on sink, it creates duplicates.

Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
5,373 questions
{count} votes

1 answer

Sort by: Most helpful
  1. Amira Bedhiafi 33,071 Reputation points Volunteer Moderator
    2023-09-01T11:58:12.8333333+00:00

    Sometimes, hidden characters or trailing spaces can make values look identical, but they're actually different. Try trimming the values and see if that makes any difference:

    
        SELECT DISTINCT LTRIM(RTRIM(col_A)), 
    
                        LTRIM(RTRIM(col_B)), 
    
                        LTRIM(RTRIM(col_C)), 
    
                        LTRIM(RTRIM(col_D))
    
        FROM table_name
    
    

    SQL Server (and by extension Azure SQL Database) can be case sensitive or insensitive based on the collation setting. If the collation is case-sensitive, then 'A' and 'a' would be considered different values. To check this, you can convert all characters to lowercase (or uppercase) and then do the distinct:

    
    SELECT DISTINCT LOWER(col_A), LOWER(col_B), LOWER(col_C), LOWER(col_D)
    
    FROM table_name
    
    
    

    It's possible that the way Dataflow reads or writes data might be slightly different than the COPY ACTIVITY method. For example, if the data format has changed or if there are issues with data types, conversions, or the way it handles null values.

    When using Dataflow, it's crucial to ensure that the upsert operation is correctly configured based on the primary key. It might be helpful to double-check the sink's configuration and the mapping of columns. You might want to inspect the output of your Dataflow transformations before they get to the sink. By isolating the transformation, you can figure out at which step the duplicates are introduced.

    Make sure there aren't any concurrent operations running on the same dataset. For instance, if another operation inserts data at the same time as your Dataflow, you might see inconsistencies.

    If you're performing the operation in a transaction, be aware of the transaction's isolation level. Some isolation levels allow for 'dirty reads' which can result in reading uncommitted data, leading to perceived duplicates.


Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.