ODBC Driver 17 for SQL Server on Linux. Charset conversion problem with OTRS (bug?)

Question


Marc Fauser 1

I updated from ODBC Driver 13 for SQL Server to version 17.
With version 13, everything was working fine. With version 17, I cannot find what's wrong.

We have OTRS running which connects to a SQL Server database to retrieve data for customers.
For example, we transfer whether the customer has a maintenance contract or not.
Yes = thumb up emoji, which I cannot post in this forum (4-byte character)
No = thumb down emoji, which I cannot post in this forum (4-byte character)
This was working fine until version 17.
The select (simplified) was
SELECT N'<thumb up emoji>' as maintenance
I changed the select to
SELECT cast( N'<thumb up emoji>' as nvarchar(10) ) as maintenance
and it works again.
In my odbc.ini, I have no charset defined. In OTRS, I have defined UTF-8 as both the source and the destination charset.
Any hints on what could be wrong or is it a bug in the driver as it was working before?

Erland Sommarskog 121.4K Reputation points MVP Volunteer Moderator

2021-09-21T21:46:49.717+00:00

So what happened when things went wrong?


Answer 1

Olaf Helper 47,436

Yes = thumb up emoji which I cannot post in this forum ( 4 byte character )

4-byte characters are UTF-8, while MS SQL Server mainly supports 2-byte characters = Unicode.
Which data type do you use to store the text, and which SQL Server version are you using?

cast( N'<thumb up emoji>' as nvarchar(10) )

And the type nvarchar here is Unicode, not UTF-8.

Erland Sommarskog 121.4K Reputation points MVP Volunteer Moderator

2021-09-22T21:40:13.13+00:00

4-byte characters are UTF-8, while MS SQL Server mainly supports 2-byte characters = Unicode.

No, that is not correct. Emojis are in one of the supplementary planes; that is, the code point is above U+FFFF (65535). These characters are encoded with four bytes in both UTF-8 and UTF-16. In UTF-16, they are encoded as so-called surrogate pairs, where both 16-bit code units are in the range D800-DFFF.

SQL Server uses UTF-16 for the nvarchar data type, but it only handles surrogate pairs correctly with a collation that has _SC in the name (or version number 140). That does not mean that you cannot store emojis with other collations, but with those collations, SQL Server thinks the four bytes in the surrogate pair are two characters.
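
To make the byte counts above concrete, here is a small Python sketch (Python is used only for illustration; it is not part of OTRS or the driver). It encodes the thumbs-up emoji, U+1F44D, in both encodings and lists the UTF-16 code units. The helper name utf16_code_units is made up for this example.

```python
# Thumbs-up emoji, U+1F44D, written as an escape so the source stays ASCII.
thumbs_up = "\U0001F44D"


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


utf8_bytes = thumbs_up.encode("utf-8")
utf16_bytes = thumbs_up.encode("utf-16-le")
units = utf16_code_units(thumbs_up)

print(len(utf8_bytes), utf8_bytes.hex())    # byte count and bytes in UTF-8
print(len(utf16_bytes), utf16_bytes.hex())  # byte count and bytes in UTF-16 (little-endian)
print([hex(u) for u in units])              # the two surrogate code units
```

Both encodings need four bytes, and the UTF-16 form consists of two code units in the surrogate range. A collation without supplementary-character support counts those two code units as two characters.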

Answer 2

Hi @Marc Fauser ,

Upon connection, the driver detects the current locale of the process it is loaded in. If it uses one of the supported encodings, the driver uses that encoding for SQLCHAR data; otherwise, it defaults to UTF-8.
Since all processes start in the "C" locale by default (and cause the driver to default to UTF-8), if an application needs to use one of the encodings, it should use the setlocale function to set the locale appropriately before connecting.

In a typical Linux environment where the encoding is UTF-8, users of ODBC Driver 17 upgrading from 13 or 13.1 won't observe any differences. However, applications that use a non-UTF-8 encoding need to use that encoding for data to/from the driver instead of UTF-8.
https://learn.microsoft.com/en-us/sql/connect/odbc/linux-mac/programming-guidelines?view=sql-server-ver15

So you should check the character sets/encodings of ODBC 17.
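
As a rough illustration of "set the locale before connecting", the sketch below uses Python's locale module, which wraps the C setlocale function. It adopts the locale named by the environment (LANG / LC_*) and reports the resulting character encoding. The function name adopt_environment_locale is hypothetical, and the actual ODBC connection is deliberately left out. How OTRS (a Perl application) sets its locale is not covered here.

```python
import locale


def adopt_environment_locale():
    """Switch to the locale named by LANG/LC_* and return its codeset name."""
    try:
        # Equivalent to calling setlocale(LC_ALL, "") in C.
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # The environment names a locale that is not installed; keep the current one.
        pass
    return locale.nl_langinfo(locale.CODESET)


codeset = adopt_environment_locale()
print("Process locale encoding:", codeset)
# Only after this point would the application open its ODBC connection.
```

If the reported encoding is not UTF-8, then according to the documentation quoted above, the driver would be expected to use that encoding for SQLCHAR data.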

You should also know that UTF-8 is supported in SQL Server 2019, but not in previous versions; they support only ASCII and Unicode.

Erland Sommarskog 121.4K Reputation points MVP Volunteer Moderator

2021-09-22T21:41:45.67+00:00

You should also know that UTF-8 is supported in SQL Server 2019, but not in previous versions; they support only ASCII and Unicode.

Well, UTF-8 is Unicode, just a different encoding than UTF-16 that SQL Server uses for the nvarchar data type.

SQL Server also supports many more character sets than just plain ASCII, for instance CP-932 for Japanese. (If you have a Japanese collation.)

Answer 3

So to potentially save some random Googler a lot of time and research...

The Microsoft ODBC driver uses glibc gconv to convert character sets. However, the driver package declares no dependency on the gconv modules, and the driver simply fails with the error "[Microsoft][ODBC Driver 17 for SQL Server]Unicode conversion failed" if the necessary gconv modules are not installed!

On some operating systems, the glibc package already contains all the necessary gconv modules to convert from the typical character sets, however on other operating systems only ANSI C UTF-8 support is built-in and the other gconv modules are provided by the glibc-gconv-extra package.

TL;DR: the fix is to install glibc-gconv-extra.

This could be considered a bug in the Microsoft ODBC driver because of the missing dependency; however, it's true that the change in the packages is relatively recent.
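
One way to check whether a particular conversion is available on a machine is to ask the C library directly through iconv_open, the public interface in front of gconv. The sketch below does this from Python via ctypes. The helper name iconv_supports and the charset names probed are illustrative assumptions; this thread does not say which conversions the driver actually requests.

```python
import ctypes

# Load the symbols already linked into this process (on Linux this includes the C library).
_libc = ctypes.CDLL(None, use_errno=True)
_libc.iconv_open.restype = ctypes.c_void_p
_libc.iconv_open.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
_libc.iconv_close.restype = ctypes.c_int
_libc.iconv_close.argtypes = [ctypes.c_void_p]

# iconv_open signals failure by returning (iconv_t)-1.
_ICONV_FAILED = ctypes.c_void_p(-1).value


def iconv_supports(to_charset, from_charset):
    """Return True if the C library can open a converter between the two charsets."""
    cd = _libc.iconv_open(to_charset.encode("ascii"), from_charset.encode("ascii"))
    if cd == _ICONV_FAILED:
        return False
    _libc.iconv_close(cd)
    return True


# Charsets chosen only as examples; adjust to whatever the application uses.
for charset in ("UTF-16LE", "CP1252", "CP932"):
    print(charset, "<-> UTF-8:",
          iconv_supports(charset, "UTF-8") and iconv_supports("UTF-8", charset))
```

If a charset the application relies on reports False here, a missing gconv module is a plausible cause, and installing glibc-gconv-extra (on distributions that split the modules out) is the fix described above.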
