Synapse SQL Database distinct has duplicate results

Question

Xia 0

I have a table in Azure SQL Database in which every column is of type nvarchar(4000). Here is an example of the values:

column names : col_A, col_B, col_C, col_D

values: a, b, c, d

There are two rows with these values, and as far as I can see they are duplicates.

query: select distinct * from table_name returns both rows

query: select distinct col_A from table_name where col_B = 'b' and col_C = 'c' and col_D = 'd' returns two rows with value a

query: select distinct col_A from table_name where col_B = 'b' and col_C = 'c' returns one row with value a

query: select distinct col_A from table_name where col_B = 'b' and col_D = 'd' returns one row with value a

I came across this strange result when I tried to replace the Copy activity with a data flow: it inserts the duplicates instead of updating based on the primary key, so I ran some queries to test.

Before, I read CSV files and used the Copy activity to upsert (with all 4 columns as a composite primary key), and there were no duplicates.

Now I have replaced it with a data flow that handles the same file. I did set all the keys on the sink, but it creates duplicates.

KranthiPakala-MSFT 46,642 Reputation points Microsoft Employee Moderator

2023-09-11T20:13:27.5066667+00:00

@Xia Just checking in to see if the below information was helpful. If it answers your query, please do click Accept Answer and Yes for "was this answer helpful", as it might be beneficial to other community members reading this thread. If you have any further query, do let us know.

Thank you

Answer 1

Amira Bedhiafi 33,071 Volunteer Moderator

Sometimes, hidden characters or trailing spaces can make values look identical, but they're actually different. Try trimming the values and see if that makes any difference:


    SELECT DISTINCT LTRIM(RTRIM(col_A)),
                    LTRIM(RTRIM(col_B)),
                    LTRIM(RTRIM(col_C)),
                    LTRIM(RTRIM(col_D))
    FROM table_name
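
If the values still look identical, it may help to compare their lengths and raw bytes. The query below is only a diagnostic sketch using the placeholder names table_name and col_A from the question. LEN ignores trailing spaces, whereas DATALENGTH counts every byte, so two rows can report the same LEN and still differ. The varbinary(8000) size assumes the column is nvarchar(4000), as stated in the question.

```sql
-- Diagnostic sketch: table_name and col_A are the placeholder names from the question.
SELECT col_A,
       LEN(col_A)                      AS len_a,    -- character count, trailing spaces not counted
       DATALENGTH(col_A)               AS bytes_a,  -- byte count, trailing spaces included
       CONVERT(varbinary(8000), col_A) AS raw_a     -- raw bytes, exposes hidden characters
FROM table_name;
```

The same expressions can be repeated for col_B, col_C and col_D. Rows whose raw bytes differ are not true duplicates, even if they display the same.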

SQL Server (and by extension Azure SQL Database) can be case sensitive or insensitive based on the collation setting. If the collation is case-sensitive, then 'A' and 'a' would be considered different values. To check this, you can convert all characters to lowercase (or uppercase) and then do the distinct:


    SELECT DISTINCT LOWER(col_A), LOWER(col_B), LOWER(col_C), LOWER(col_D)
    FROM table_name
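
To see which collation is actually in effect, the database default and the per-column collations can be inspected. This is a minimal sketch, assuming the table is literally named table_name as in the question; a collation name containing _CS_ is case-sensitive, and one containing _CI_ is case-insensitive.

```sql
-- Default collation of the current database.
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS database_collation;

-- Collation of each column in the table (table_name is the placeholder from the question).
SELECT name, collation_name
FROM sys.columns
WHERE object_id = OBJECT_ID('table_name');
```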

It's possible that Data Flow reads or writes the data slightly differently from the Copy activity, for example because the data format has changed or because of differences in data types, conversions, or the handling of null values.

When using Dataflow, it's crucial to ensure that the upsert operation is correctly configured based on the primary key. It might be helpful to double-check the sink's configuration and the mapping of columns. You might want to inspect the output of your Dataflow transformations before they get to the sink. By isolating the transformation, you can figure out at which step the duplicates are introduced.
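
One way to check what the sink actually wrote is to look for key combinations that collide once surrounding spaces are removed. The query below is a sketch under the question's assumptions (table_name, with col_A to col_D as the composite key); it does not reproduce the data flow's own key matching, it only lists suspect groups.

```sql
-- List key combinations that occur more than once after trimming spaces.
SELECT LTRIM(RTRIM(col_A)) AS col_A,
       LTRIM(RTRIM(col_B)) AS col_B,
       LTRIM(RTRIM(col_C)) AS col_C,
       LTRIM(RTRIM(col_D)) AS col_D,
       COUNT(*)            AS row_count
FROM table_name
GROUP BY LTRIM(RTRIM(col_A)),
         LTRIM(RTRIM(col_B)),
         LTRIM(RTRIM(col_C)),
         LTRIM(RTRIM(col_D))
HAVING COUNT(*) > 1;
```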

Make sure there aren't any concurrent operations running on the same dataset. For instance, if another operation inserts data at the same time as your Dataflow, you might see inconsistencies.

If you're performing the operation in a transaction, be aware of the transaction's isolation level. Some isolation levels allow for 'dirty reads' which can result in reading uncommitted data, leading to perceived duplicates.
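
To see which isolation level the current session is using, something like the following can be run. This is a sketch that assumes Azure SQL Database, where sys.dm_exec_sessions is available; other engines may expose this information differently.

```sql
-- Isolation level of the current session (assumes sys.dm_exec_sessions is available).
SELECT CASE transaction_isolation_level
           WHEN 0 THEN 'Unspecified'
           WHEN 1 THEN 'ReadUncommitted'
           WHEN 2 THEN 'ReadCommitted'
           WHEN 3 THEN 'RepeatableRead'
           WHEN 4 THEN 'Serializable'
           WHEN 5 THEN 'Snapshot'
       END AS isolation_level
FROM sys.dm_exec_sessions
WHERE session_id = @@SPID;
```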

Xia 0 Reputation points

2023-09-13T09:57:54.9133333+00:00

Hello @Amira Bedhiafi

Thanks for your answer.

I tested DISTINCT with TRIM() and with LTRIM(RTRIM()); both correctly return only one row.

When I add LEN(), the query returns two rows with the same values, which means all the data has the same length. I think the TRIM function changes the type or something.

Don't you think the difference between the number of results of the second and third queries in my question looks strange?

In those two queries, the query with fewer constraints returns fewer rows than the query with more constraints.

Shouldn't it be the opposite?

Amira Bedhiafi 33,071 Reputation points Volunteer Moderator

2023-09-13T10:56:05.6666667+00:00

Can you share what you have tried, along with the results?

Xia 0 Reputation points

2023-09-13T13:54:46.17+00:00

The first query returns 2 identical results. If I comment out line 49, it returns 1 result, and when I remove all the criteria from line 47, it returns 1 result. Isn't that strange?

Amira Bedhiafi 33,071 Reputation points Volunteer Moderator

2023-09-13T14:48:34.5633333+00:00

Can you try a SELECT * and compare the results?

Xia 0 Reputation points

2023-09-13T14:58:42.64+00:00

Yes, SELECT * returns two rows with the same values.

If I SELECT DISTINCT all columns, the result depends on the criteria, as in the images I sent in my previous comment.

Amira Bedhiafi 33,071 Reputation points Volunteer Moderator

2023-09-13T15:24:54.8266667+00:00

Can you provide the table structure and the data at https://dbfiddle.uk/ER_9PaV- so I can write the queries on my side?
