If you have SQL Server 2019, you can try the following:
select substring (N'Hello C̨̆, how are you' COLLATE Latin1_General_100_CI_AI_SC,1,7)
Hello
We have to implement a subset of UTF-8 (DIN SPEC 91379). The problem is that some 'characters', like C̨̆, are a combination of 3 code points (0043 0328 0306). If I have a string like
'Hello C̨̆, how are you'
and I call substring(N'Hello C̨̆, how are you',1,7), I get
Hello C
With substring(N'Hello C̨̆, how are you',1,255) I get
Hello C̨̆, how are you
How can I avoid this problem?
Kind regards
Andreas
Hello
Sorry, I forgot to mention that we have SQL Server 2017.
If it is difficult to solve the problem in T-SQL, I think it is possible to write an external (CLR) function in C#. .NET has classes and functions for working with combined characters, for example System.Globalization.StringInfo.
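As a rough illustration of what such an external function could do, here is a minimal sketch in Python (not C# or T-SQL; Python is used only to keep the example short and runnable). The name grapheme_substring is hypothetical, and the rule that a 'character' is a base code point plus any combining marks that follow it is a simplifying assumption. A complete solution would follow the Unicode grapheme cluster rules (UAX #29).

```python
import unicodedata


def grapheme_substring(text, start, length):
    """Return `length` user-perceived characters from 1-based position `start`.

    Simplifying assumption: a cluster is one base code point followed by
    zero or more combining marks (unicodedata.combining(ch) != 0).
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            # Attach the combining mark to the preceding base character.
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return "".join(clusters[start - 1:start - 1 + length])


# C followed by COMBINING OGONEK (U+0328) and COMBINING BREVE (U+0306)
s = "Hello C\u0328\u0306, how are you"

plain_cut = s[:7]                          # cuts by code point, drops the accents
cluster_cut = grapheme_substring(s, 1, 7)  # keeps the accents with the C

print(len(plain_cut), len(cluster_cut))
```

The plain slice behaves like substring(...,1,7) in the question, while the cluster-aware version returns the C together with both accents, so the result is 9 code points long instead of 7.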
UTF-8 (DIN SPEC 91379). The problem is that some 'characters', like C̨̆, are a combination of 3 code points (0043 0328 0306)
SQL Server 2017 doesn't support UTF-8 encoding, only Unicode, so that UTF-8 character gets changed into two Unicode characters.
No, that is not UTF-8. You have UTF-16. The above is not a valid UTF-8 sequence. And as you can see below, I can convert the string to binary, and the binary back to a string, to get that character. (Keep in mind that in the binary value we need to swap the bytes, because Wintel is a little-endian architecture.)
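DECLARE @str nvarchar(10) = N'C̨̆'
-- 0043 0328 0306 with the bytes of each code unit swapped: 4300 2803 0603
DECLARE @str2 nvarchar(10) = convert(nvarchar(10), 0x430028030603)
-- Compare the two strings and their lengths
SELECT @str, @str2, len(@str), len(@str2)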
And indeed, there are three characters in the string, as you will see if you paste the above into SSMS.
Here in my web browser there seems to be only one character, but it is a capital C followed by two combining accents. You could of course argue that the combining accents should be counted together with the base character rather than separately, but SQL Server does not seem to be that smart.
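The byte order and the length of three can also be checked outside SQL Server. A small Python sketch using only the standard library (the variable names are mine), which additionally shows what the UTF-8 encoding of the same three code points would look like for comparison:

```python
# C + COMBINING OGONEK (U+0328) + COMBINING BREVE (U+0306)
text = "C\u0328\u0306"

# UTF-16 little-endian: each 16-bit code unit is stored low byte first.
utf16le = text.encode("utf-16-le")
print(utf16le.hex())   # 430028030603

# The same code points encoded as UTF-8 give a different byte sequence.
utf8 = text.encode("utf-8")
print(utf8.hex())      # 43cca8cc86

# Decoding the little-endian bytes gives back the original string.
round_trip = bytes.fromhex("430028030603").decode("utf-16-le")
print(round_trip == text, len(text))   # True 3
```

The UTF-16LE bytes match the binary literal used in the T-SQL above, and the string length is 3 because each combining accent is a code point of its own.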