UTF-8 characters not getting converted to Unicode when using OPENROWSET in SQL Server

Vijayabhaskar R 21 Reputation points
2021-09-19T09:21:02.457+00:00

I am trying to import data from JSON using OPENROWSET. The file contains non-English characters, and they are not retrieved as-is; instead I get what look like junk characters, which are actually the raw UTF-8 bytes. When I convert them with an online tool and check, the desired output comes out.

Declare @Feed Varchar(MAX)

Select @Feed=
    BulkColumn
    from OPENROWSET(Bulk N'D:\Json_Samples\Test.json',SINGLE_BLOB,CODEPAGE = '65001') Json

        Select * from 
        OpenJson(@Feed,'$')
        With(
            NativeName Nvarchar(100) '$.NativeBranchName'
        )

When I execute this, I get the text as ধামৠরা বনৠদর, and if I convert this UTF-8 to Unicode I get ধামুরা বন্দর, which is what is expected.

How can I get ধামুরা বন্দর directly?

When I use OPENROWSET(Bulk N'D:\Json_Samples\Test.json',SINGLE_NCLOB,CODEPAGE = '65001') Json I get the error below:

SINGLE_NCLOB requires a UNICODE (widechar) input file. The file specified is not Unicode.

If I pass the JSON text directly, prefixed with N, I get the expected output. But when I pass the whole file, I cannot get the desired output.

Declare @Json nvarchar(Max)
Set @Json=
N'{
   "SchemaVersion":"2",
   "BranchName":"V-#$$&((*&*^&&%$Bhaskar",
   "NativeBranchName":"ধামুরা বন্দর"
}'
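The round trip described above — UTF-8 bytes displayed as junk, then recovered by a proper UTF-8 decode — can be reproduced outside SQL Server. A minimal Python sketch (illustration only; it assumes the wrong decode happens under a single-byte code page such as Latin-1, which may differ from the exact code page SQL Server applied):

```python
text = "ধামুরা বন্দর"

# The file on disk stores the UTF-8 encoding of the text.
raw = text.encode("utf-8")

# Decoding those bytes under a single-byte code page produces
# the kind of junk characters seen in the question.
garbled = raw.decode("latin-1")

# Round-tripping back through the same code page and decoding
# as UTF-8 recovers the expected string.
fixed = garbled.encode("latin-1").decode("utf-8")

assert garbled != text
assert fixed == text
```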

Accepted answer
  1. Erland Sommarskog 121.5K Reputation points MVP Volunteer Moderator
    2021-09-19T10:04:48.803+00:00

    I can get this to work - but only if I am on SQL 2019 and in a database with a UTF-8 collation. It seems that when you select any of the SINGLE_xLOB options, CODEPAGE is ignored.

    I was able to develop a workaround, though. For this to work, you need this format file:

       9.0  
       1  
       1 SQLCHAR 0 0 "\r\n" 2 json ""  
    

    When you save it, make sure that you remove the leading spaces added by the forum software.

    Here is a solution for SQL 2017 and up (replace the file names with your paths):

       CREATE TABLE #temp (ident  int IDENTITY,  
                           txt    nvarchar(MAX) NOT NULL)  
       BULK INSERT #temp FROM 'C:\temp\slask.json'  
       WITH (FORMATFILE ='C:\temp\slask.fmt', CODEPAGE=65001)  
       SELECT * FROM #temp  
       DECLARE @json nvarchar(MAX)  
       SELECT @json = string_agg(txt, '') WITHIN GROUP (ORDER BY ident)  
       FROM  #temp  
         
       SELECT * FROM   
       OpenJson(@json,'$')  
       With (  
          NativeName Nvarchar(100) '$.NativeBranchName'  
       )  
       go  
       DROP TABLE #temp  
    

    If you are on SQL 2016 or earlier, replace the SELECT with string_agg with this SELECT:

       SELECT @json =   
          (SELECT txt AS [text()]  
           FROM   #temp  
           ORDER  BY ident  
           FOR XML PATH(''), TYPE).value('.', 'nvarchar(MAX)')  
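The essence of this workaround — read the file's raw bytes and decode them as UTF-8 before handing the text to the JSON parser — can be sketched outside SQL Server. A Python illustration (file name and content are made up for the example; it mirrors what BULK INSERT with an honored CODEPAGE=65001 achieves):

```python
import json
import os
import tempfile

# Write a sample UTF-8 JSON file (a stand-in for D:\Json_Samples\Test.json).
doc = {"SchemaVersion": "2", "NativeBranchName": "ধামুরা বন্দর"}
path = os.path.join(tempfile.mkdtemp(), "Test.json")
with open(path, "wb") as f:
    f.write(json.dumps(doc, ensure_ascii=False).encode("utf-8"))

# Read the raw bytes and decode them as UTF-8 -- the step that
# CODEPAGE = 65001 is supposed to perform.
with open(path, "rb") as f:
    feed = f.read().decode("utf-8")

parsed = json.loads(feed)
assert parsed["NativeBranchName"] == "ধামুরা বন্দর"
```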
    

1 additional answer

  1. Andy Verezhak 1 Reputation point
    2022-09-20T07:33:51.497+00:00

    VARCHAR(x) is not UTF-8. It is 1-byte-encoded extended ASCII, using a code page (determined by the collation) as the character map.

    NVARCHAR(x) is not UTF-16 (but very close to it; it is UCS-2). This is a 2-byte-encoded string covering almost any known character (but exceptions exist).

    UTF-8 uses 1 byte for plain Latin characters, but 2 or more bytes to encode other scripts.

    A VARBINARY(x) will hold the UTF-8 as a meaningless chain of bytes.

    A simple CAST or CONVERT will not work: VARCHAR will take each single byte as a character. For sure this is not the result you would expect. NVARCHAR would take each chunk of 2 bytes as one character. Again, not the thing you need.
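The effect of such a naive reinterpretation can be illustrated in Python (a sketch, not SQL Server code; NVARCHAR stores 2-byte UTF-16LE/UCS-2 code units, so a straight binary cast treats every 2-byte chunk as one character):

```python
text = "ধামুরা"
raw = text.encode("utf-8")  # 18 bytes: 3 bytes per Bengali character

# A straight VARCHAR-style reinterpretation: every single byte
# becomes one character (Latin-1 stands in for a 1-byte code page).
as_varchar = raw.decode("latin-1")

# A straight NVARCHAR-style reinterpretation: every 2-byte chunk
# becomes one character (pad with a zero byte if the length is odd).
padded = raw + b"\x00" * (len(raw) % 2)
as_nvarchar = padded.decode("utf-16-le")

# Neither reinterpretation yields the original text;
# only a real UTF-8 decode does.
assert as_varchar != text
assert as_nvarchar != text
assert raw.decode("utf-8") == text
```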

    But with the right collation you can use this code:

    Declare @Feed nvarchar(MAX)  
      
    Select @Feed=  
             convert(nvarchar(max),  
                     coalesce(BulkColumn, '' collate Cyrillic_General_100_CI_AS_SC_UTF8)  -- any UTF-8 collation  
                    ) collate Cyrillic_General_CI_AS  -- your default database collation  
             from OPENROWSET(Bulk N'D:\Json_Samples\Test.json',SINGLE_BLOB,CODEPAGE = '65001') Json  
                  
                 Select * from   
                 OpenJson(@Feed,'$')  
                 With(  
                     NativeName Nvarchar(100) '$.NativeBranchName'  
                 )  
    

    More detail here: https://sqlquantumleap.com/2018/09/28/native-utf-8-support-in-sql-server-2019-savior-false-prophet-or-both/

