nvarchar and substring

Question

nvarchar and substring

Andreas Liedhegener 1

Hello

we have to implement a subset of utf-8 (DIN SPEC 91379). Problem is, some 'character' like C̨̆ is a combination of 3 codepoints (0043 0328 0306). If I had a string like

'Hello C̨̆, how are you'

and I make a substring (N'Hello C̨̆, how are you',1,7) I get

Hello C

with substring (N'Hello C̨̆, how are you',1,255) I get

Hello C̨̆, how are you

How could I avoid this problem

Kind regards
Andreas

EchoLiu-MSFT 14,621 Reputation points

2021-04-16T06:27:51.713+00:00

Do you have anyupdate?
Please also remember to accept the answers if they helped. Your action would be helpful to other users who encounter the same issue and read this thread.

5 answers

Your answer

EchoLiu-MSFT 14,621 Reputation points

2021-04-16T06:27:51.713+00:00

Do you have anyupdate?
Please also remember to accept the answers if they helped. Your action would be helpful to other users who encounter the same issue and read this thread.

Answer 1

cheong00 3,486 Volunteer Moderator

If you have SQL2019, you may try the following:

select substring (N'Hello C̨̆, how are you' COLLATE Latin1_General_100_CI_AI_SC,1,7)

Answer 2

Andreas Liedhegener 1

Hello

sorry, I didn't tell it, we have SQL 2017

Answer 3

Viorel 122.5K

If it is difficult to solve the problem in SQL, I think that it is possible to write an external function in C#. There are corresponding classes and functions.

Answer 4

Olaf Helper 47,436

utf-8 (DIN SPEC 91379). Problem is, some 'character' like C̨̆ is a combination of 3 codepoints (0043 0328 0306)

SQL Server 2017 don't support UTF-8 encoding, only UniCode, that UTF-8 character gets change to two Unicode characters.

Answer 5

Erland Sommarskog 121.4K MVP Volunteer Moderator

No, that is not UTF-8. You have UTF-16. The above is not a valid UTF-8 sequence. And as you can see below, I can convert the string to binary, and the binary to the string to get that character. (Keep in mind that in the binary value we need to swap bytes, because Wintel is a little-endian architecture.

DECLARE @str nvarchar(10) = N'C̨̆'
DECLARE @str2 nvarchar(10) = convert(nvarchar(10), 0x430028030603)
SELECT @str, @str2, len(@str), len(@str2)

And indeed, there are three characters in the string, as you will see if you paste the above into SSMS.

Here in my web browser, there is only one character it seems, but it is a capital C followed by two combining accents. You could of course argue that the combing accents count as part of the length, but SQL Server does not seem to be that smart.

cheong00 3,486 Reputation points Volunteer Moderator

2021-04-14T01:46:51.597+00:00
This documentation suggest that on SQL2016 or later, if you use any collations that end with "_SC", it will cater for surrogate characters in UTF-16.

You may check the collation support by the following SQL statement:

select * from sys.fn_helpcollations() where name like '%_SC'
Erland Sommarskog 121.4K Reputation points MVP Volunteer Moderator

2021-04-14T06:38:37.187+00:00

This documentation suggest that on SQL2016 or later, if you use any collations that end with "_SC", it will cater for surrogate characters in UTF-16.

There are no surrogates here. There are three characters, and they are all in the Unicode base plane. Surrogates are in the range D800-DFFFF.
cheong00 3,486 Reputation points Volunteer Moderator

2021-04-14T06:54:51.717+00:00

Oops, you're right. The latter 2 code points are "diacritical marks".
Erland Sommarskog 121.4K Reputation points MVP Volunteer Moderator

2021-04-14T19:13:18.953+00:00
And indeed, there are three characters in the string, as you will see if you paste the above into SSMS.

I realised later that this depends on the font.. In the SSMS I tried, I had Courier, and with this font the character calls apart. It works better with Lucida Console.

I made some tests on Postgres, and it appears like SQL Server. That is finds three characters for the C + the two combining accents, and substring chops the character in the middle of the sequence.

There was an N missing in my example above, which I have corrected.

Finally, here is an illustration that SQL Server does understand the combining accent. @a and @b are considered equal, even if the collation is accent-sensitive - although they have different lengths.

DECLARE @a nvarchar(10) = convert(nvarchar(10), 0x41000003), @b nvarchar(10) = 'À' SELECT @a, @b, IIF(@a = @b, 1, 0), IIF(len(@a) = len(@b), 1, 0)

Share via

nvarchar and substring

5 answers

Your answer