nvarchar and substring

Andreas Liedhegener 1 Reputation point
2021-04-13T09:06:53.147+00:00

Hello

we have to implement a subset of utf-8 (DIN SPEC 91379). Problem is, some 'character' like C̨̆ is a combination of 3 codepoints (0043 0328 0306). If I had a string like

'Hello C̨̆, how are you'

and I make a substring (N'Hello C̨̆, how are you',1,7) I get

Hello C

with substring (N'Hello C̨̆, how are you',1,255) I get

Hello C̨̆, how are you

How could I avoid this problem

Kind regards
Andreas

Developer technologies Transact-SQL
{count} votes

5 answers

Sort by: Most helpful
  1. cheong00 3,486 Reputation points Volunteer Moderator
    2021-04-13T09:16:27.183+00:00

    If you have SQL2019, you may try the following:

    select substring (N'Hello C̨̆, how are you' COLLATE Latin1_General_100_CI_AI_SC,1,7) 
    
    0 comments No comments

  2. Andreas Liedhegener 1 Reputation point
    2021-04-13T09:45:57.89+00:00

    Hello

    sorry, I didn't tell it, we have SQL 2017

    0 comments No comments

  3. Viorel 122.5K Reputation points
    2021-04-13T12:10:56.793+00:00

    If it is difficult to solve the problem in SQL, I think that it is possible to write an external function in C#. There are corresponding classes and functions.

    0 comments No comments

  4. Olaf Helper 47,436 Reputation points
    2021-04-13T13:16:50.647+00:00

    utf-8 (DIN SPEC 91379). Problem is, some 'character' like C̨̆ is a combination of 3 codepoints (0043 0328 0306)

    SQL Server 2017 don't support UTF-8 encoding, only UniCode, that UTF-8 character gets change to two Unicode characters.

    0 comments No comments

  5. Erland Sommarskog 121.4K Reputation points MVP Volunteer Moderator
    2021-04-13T22:11:59.573+00:00

    No, that is not UTF-8. You have UTF-16. The above is not a valid UTF-8 sequence. And as you can see below, I can convert the string to binary, and the binary to the string to get that character. (Keep in mind that in the binary value we need to swap bytes, because Wintel is a little-endian architecture.

    DECLARE @str nvarchar(10) = N'C̨̆'
    DECLARE @str2 nvarchar(10) = convert(nvarchar(10), 0x430028030603)
    SELECT @str, @str2, len(@str), len(@str2)
    

    And indeed, there are three characters in the string, as you will see if you paste the above into SSMS.

    Here in my web browser, there is only one character it seems, but it is a capital C followed by two combining accents. You could of course argue that the combing accents count as part of the length, but SQL Server does not seem to be that smart.


Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.