
Database and Table Column level collation mismatch and explicit collation call in code

Parameswaran Serussery Narayanan 155 Reputation points
2025-07-16T22:03:20.3666667+00:00

 Hello,

I have the following ETL scenario and am looking for the optimal solution.

Source: SAP ERP -> SAP SLT (Change data capture)

Destination: Azure Synapse dedicated SQL Pool

Target Database Collation: SQL_Latin1_General_CP1_CI_AS

We have 200+ tables in the SAP ERP system that feed Synapse SQL pool regularly via the SAP SLT process. Of those, about 35 tables have case-sensitive data in a few columns. The Synapse database collation defaults to SQL_Latin1_General_CP1_CI_AS. My understanding is that when a table is created, this default collation is propagated to every character column unless a different collation is explicitly defined as part of the table creation.
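For reference, the database default and the per-column collations that were inherited (or explicitly set) at CREATE TABLE time can be verified from the catalog views; the table name below is a placeholder:

```sql
-- Database default collation
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS database_collation;

-- Effective collation of each character column in one table
SELECT c.name AS column_name,
       c.collation_name
FROM sys.columns AS c
JOIN sys.tables  AS t ON t.object_id = c.object_id
WHERE t.name = 'TableA'               -- placeholder table name
  AND c.collation_name IS NOT NULL;   -- non-character columns have no collation
```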

Now consider the scenario below, where each table has millions of rows, with more being added and updated daily, and the tables are defined with round-robin distribution.

Example:

Table A from SAP has Column 1 (primary join key for Table B), Column 2 and Column 3.

Out of these three columns in Table A, Column 1 has been defined with the case-sensitive collation SQL_Latin1_General_CP1_CS_AS because the data is case-sensitive at source.
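A minimal sketch of such a table definition in a dedicated SQL pool (column types are illustrative, not the actual SAP structures):

```sql
CREATE TABLE dbo.TableA
(
    Column1 varchar(20) COLLATE SQL_Latin1_General_CP1_CS_AS NOT NULL, -- explicit case-sensitive key
    Column2 varchar(50),  -- inherits SQL_Latin1_General_CP1_CI_AS from the database
    Column3 varchar(50)   -- inherits SQL_Latin1_General_CP1_CI_AS from the database
)
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX);
```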

Sample for Table A: (screenshot of sample data not reproduced)

Table B from SAP has Column 1, Column 2 and Column 3 (foreign key to Column 1 of Table A). All three columns in this table inherited the default database collation SQL_Latin1_General_CP1_CI_AS because the table was created without an explicit collation on Column 3.

Sample for Table B: (screenshot of sample data not reproduced)

Table C from SAP has Column 1, Column 2 and Column 3 (foreign key to Column 3 of Table B and Column 1 of Table A). All three columns in this table inherited the default database collation SQL_Latin1_General_CP1_CI_AS because the table was created without an explicit collation on Column 3.

Sample for Table C: (screenshot of sample data not reproduced)

Join Condition 1:

A simple join between Table A and Table B using Column 1 and Column 3 fails with a collation mismatch error.

Temporary workaround: explicitly apply COLLATE SQL_Latin1_General_CP1_CS_AS in the join predicate.
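The mismatch and the workaround look roughly like this (table and column names are illustrative):

```sql
-- Fails with Msg 468: "Cannot resolve the collation conflict between
-- SQL_Latin1_General_CP1_CS_AS and SQL_Latin1_General_CP1_CI_AS in the equal to operation."
SELECT a.Column1
FROM dbo.TableA AS a
JOIN dbo.TableB AS b
  ON a.Column1 = b.Column3;

-- Temporary workaround: force one side to the other's collation per query.
SELECT a.Column1
FROM dbo.TableA AS a
JOIN dbo.TableB AS b
  ON a.Column1 = b.Column3 COLLATE SQL_Latin1_General_CP1_CS_AS;
```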

Join Condition 2: A simple join between Table B and Table C on Column 3 of each table, with a filter of 'ABC' on Column 3 of Table B, will return both records from Table C, because the column is defined with the SQL_Latin1_General_CP1_CI_AS collation in both Table B and Table C.

Issue: There are business and processing needs that require only Table B and Table C to be linked for various reporting purposes, as shown in Join Condition 2. The result set will contain duplicate rows, because both 'ABC' and 'Abc' are treated as the same string under the case-insensitive collation on Table B and Table C. In this case no error is thrown, and it will be hard for the business to identify these duplicates unless the output is closely tested.
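Concretely, under the case-insensitive collation both 'ABC' and 'Abc' satisfy the filter and the join condition (sketch, illustrative names):

```sql
-- With SQL_Latin1_General_CP1_CI_AS on both sides, 'ABC' = 'Abc' evaluates TRUE,
-- so rows for both values come back even though only 'ABC' was requested.
SELECT b.Column3, c.Column1
FROM dbo.TableB AS b
JOIN dbo.TableC AS c
  ON b.Column3 = c.Column3
WHERE b.Column3 = 'ABC';   -- also matches 'Abc', 'abc', ...
```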

The solution recommended by the development team is to apply a case-sensitive COLLATE explicitly (the workaround) across all the joins (in this case Table B and Table C) and in any code written inside stored objects. The issue is also attributed to the wrong collation setup at the database level (SQL_Latin1_General_CP1_CI_AS). This is something business users might not be aware of until they end up reporting wrong records.

My understanding is that there are various performance bottlenecks associated with COLLATE being applied explicitly in join conditions when indexes exist on the tables. In our scenario, all the tables have a clustered columnstore index by default, and there are no explicit indexes such as unique indexes on these columns either. If COLLATE is applied in the predicate, an index seek is skipped and an index scan is performed on the table, hampering query performance; and with the tables defined with round-robin distribution, it only adds more shuffle-move operations on top of the full table scan.

I also understand that if the database collation itself is changed to case-sensitive (SQL_Latin1_General_CP1_CS_AS), there may be issues with data loads. For example, if the table and column names coming from the source are in a specific case (lower, upper or mixed) and a user defines the same table in a different case that doesn't match the source, the loads will fail with either an object-not-found or a schema-mismatch error. This is because a case-sensitive Synapse SQL pool treats table 'A' and table 'a' as two different objects, and requires the names to match the source exactly to load the data.
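Since the database collation also governs how identifiers are compared, a case-sensitive collation makes object names case-sensitive; a sketch of the failure mode (hypothetical table name):

```sql
-- In a database created with collation SQL_Latin1_General_CP1_CS_AS,
-- identifiers are compared case-sensitively:
CREATE TABLE dbo.MyTable (Column1 varchar(10));

SELECT * FROM dbo.MyTable;   -- succeeds
SELECT * FROM dbo.mytable;   -- fails: Invalid object name 'dbo.mytable'
```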

Solution recommended: The solution proposed by the data architect team is to define Column 3 in Table B and Column 3 in Table C with exactly the same collation as Column 1 of Table A, which in this case is SQL_Latin1_General_CP1_CS_AS. This can be achieved either by recreating the table and reloading the data, or by altering the table while it is empty and loading the data back into it. This ensures that all columns carrying the same data are aligned, and no explicit COLLATE needs to be written in SQL pool code. Business users need not worry about pulling duplicated records into reports. It also ensures that no index is skipped by the database when generating the plan (EXPLAIN) for each query. The basic principle is that the characteristics of the data in Column 1 of Table A are the same in Column 3 of Table B and Table C, so the collation should be defined consistently across all tables where the same data is loaded.
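In a dedicated SQL pool, the usual way to change a column's collation is the CTAS-and-rename pattern, since the table gets rebuilt anyway; a sketch with placeholder names (validate row counts before dropping the old table):

```sql
-- Rebuild Table B with the corrected collation on Column3.
CREATE TABLE dbo.TableB_new
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)
AS
SELECT Column1,
       Column2,
       Column3 COLLATE SQL_Latin1_General_CP1_CS_AS AS Column3
FROM dbo.TableB;

RENAME OBJECT dbo.TableB     TO TableB_old;
RENAME OBJECT dbo.TableB_new TO TableB;
-- DROP TABLE dbo.TableB_old;   -- only after validation
```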

Any advice on the above solution is highly appreciated, as this is a very critical piece before the database is used widely by the business.

Regards,

Param

 

Azure Synapse Analytics

An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.


Answer accepted by question author

  1. Venkat Reddy Navari 5,840 Reputation points Microsoft External Staff Moderator
    2025-07-17T03:39:01.0966667+00:00

Hi Parameswaran Serussery Narayanan, the collation-mismatch issue you're facing is a pretty common challenge, especially when dealing with large datasets like the one you're working with. From what you've described, the solution your data architect team proposed makes a lot of sense, but I wanted to share a few additional thoughts to help make sure you cover all your bases.

    Aligning Collation at the Column Level:

    You're right to focus on aligning the collation at the column level, especially for columns involved in joins like Column 1 in Table A and Column 3 in Tables B and C. That should definitely resolve the collation mismatch and avoid the errors you're seeing when performing joins. While changing collations at the column level might seem like a bit of a hassle, it ensures everything matches up properly and eliminates the need for explicit COLLATE calls in your queries, something that could really hurt performance if done repeatedly in joins.

    Recommendation:

    Just be sure to test this thoroughly in a non-production environment to ensure it doesn’t break any existing ETL or reporting processes. You don't want any surprises, especially if you have downstream dependencies that might rely on case-insensitive behavior.

    Performance Concerns with COLLATE:

    I totally get your concern about performance when using COLLATE in your queries. If the COLLATE clause is added to join conditions, it's going to bypass any index seeks and cause full table scans, which can be brutal when you’re working with large tables in Synapse, especially with round-robin distribution in place.

    Recommendation:

    • If possible, avoid using COLLATE in every query. If it's absolutely necessary, try to limit its use to a few columns and experiment with columnstore optimizations to mitigate some of the performance hit.
    • Note that dedicated SQL pools don't support computed columns, so the usual SQL Server trick of indexing a computed column with the desired collation isn't available here. An alternative is to maintain an extra column with the desired collation, populated during the load, so you don't pay the conversion overhead in every query that joins these tables.
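    Since dedicated SQL pools don't support computed columns, a hedged version of this idea is to materialize an extra case-sensitive column as part of the ETL (illustrative names):

    ```sql
    -- Add a case-sensitive shadow column so queries can join on it
    -- without a per-query COLLATE.
    ALTER TABLE dbo.TableB
    ADD Column3_cs varchar(50) COLLATE SQL_Latin1_General_CP1_CS_AS NULL;

    -- Keep it in sync as part of the load process.
    UPDATE dbo.TableB
    SET Column3_cs = Column3;
    ```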

    Avoiding a Global Database Collation Change:

    Changing the collation of the entire database to be case-sensitive might seem like an easy fix, but you’re right to be cautious about this. Changing database-level collation can break things in a big way, especially with how case-sensitive table and column names are treated. This could lead to issues with data loads where case mismatches happen, causing errors that could be hard to debug.

    Recommendation:

    I’d recommend sticking with the column-level collation adjustments. It’s a much more flexible solution that won't cause global issues in your database and will allow you to handle specific cases where case sensitivity matters.

    Conclusion:

    Overall, I think the approach you're taking aligning the column collations is the way to go. It ensures that you won't run into collation errors or duplicate data in reports, and you’ll avoid performance issues caused by excessive use of COLLATE in your queries. It’s a more sustainable solution in the long run.

    • Be sure to test this solution with realistic data in a staging environment before applying it to production.
    • Also, consider using stored procedures or views to manage the collation logic in one place, which can help keep your reporting processes clean and consistent.
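    For example, a view can apply the collation once so report queries never need to repeat it (sketch; names are placeholders):

    ```sql
    CREATE VIEW dbo.vw_TableB_cs
    AS
    SELECT Column1,
           Column2,
           Column3 COLLATE SQL_Latin1_General_CP1_CS_AS AS Column3
    FROM dbo.TableB;
    ```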

    I hope this information helps. Please do let us know if you have any further queries.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

