UTF-8 characters not getting converted to Unicode when using OPENROWSET in SQL Server

Vijayabhaskar R 21 Reputation points
2021-09-19T09:21:02.457+00:00

I am trying to import data from JSON using OPENROWSET. The file contains non-English characters, and they are not retrieved as-is; instead I get what look like junk characters, which are actually the raw UTF-8 bytes. When I convert them with an online tool and check, the desired output comes out.

Declare @Feed Varchar(MAX)

Select @Feed=
    BulkColumn
    from OPENROWSET(Bulk N'D:\Json_Samples\Test.json',SINGLE_BLOB,CODEPAGE = '65001') Json

        Select * from 
        OpenJson(@Feed,'$')
        With(
            NativeName Nvarchar(100) '$.NativeBranchName'
        )

When I execute this, I get the text as ধামৠরা বনৠদর, and if I convert this UTF-8 to Unicode I get ধামুরা বন্দর, which is what is expected.

How can I get ধামুরা বন্দর directly?

When I use OPENROWSET(Bulk N'D:\Json_Samples\Test.json',SINGLE_NCLOB,CODEPAGE = '65001') Json I get the error below:

SINGLE_NCLOB requires a UNICODE (widechar) input file. The file specified is not Unicode.

If I pass the JSON text directly, prefixed with N, I get the expected output. But when I pass the whole file, I cannot get the desired output.

Declare @Json nvarchar(Max)
Set @Json=
N'{
   "SchemaVersion":"2",
   "BranchName":"V-#$$&((*&*^&&%$Bhaskar",
   "NativeBranchName":"ধামুরা বন্দর"
}'
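The round trip described above — UTF-8 bytes displayed as junk, then recovered by a proper UTF-8 decode — can be reproduced outside SQL Server. A minimal Python sketch (illustration only; it assumes the wrong decode happens under a single-byte code page such as Latin-1, which may differ from the exact code page SQL Server applied):

```python
text = "ধামুরা বন্দর"

# The file on disk stores the UTF-8 encoding of the text.
raw = text.encode("utf-8")

# Decoding those bytes under a single-byte code page produces
# the kind of junk characters seen in the question.
garbled = raw.decode("latin-1")

# Round-tripping back through the same code page and decoding
# as UTF-8 recovers the expected string.
fixed = garbled.encode("latin-1").decode("utf-8")

assert garbled != text
assert fixed == text
```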

Accepted answer
  1. Erland Sommarskog 121.5K Reputation points MVP Volunteer Moderator
    2021-09-19T10:04:48.803+00:00

    I can get this to work - but only if I am on SQL 2019 and in a database with a UTF-8 collation. It seems that when you select any of the SINGLE_xLOB options, CODEPAGE is ignored.

    I was able to develop a workaround, though. For this to work, you need this format file:

       9.0  
       1  
       1 SQLCHAR 0 0 "\r\n" 2 json ""  
    

    When you save it, make sure that you remove the leading spaces added by the forum software.

    Here is a solution for SQL 2017 and up (replace the file names with your paths):

       CREATE TABLE #temp (ident  int IDENTITY,  
                           txt    nvarchar(MAX) NOT NULL)  
       BULK INSERT #temp FROM 'C:\temp\slask.json'  
       WITH (FORMATFILE ='C:\temp\slask.fmt', CODEPAGE=65001)  
       SELECT * FROM #temp  
       DECLARE @json nvarchar(MAX)  
       SELECT @json = string_agg(txt, '') WITHIN GROUP (ORDER BY ident)  
       FROM  #temp  
         
       SELECT * FROM   
       OpenJson(@json,'$')  
       With (  
          NativeName Nvarchar(100) '$.NativeBranchName'  
       )  
       go  
       DROP TABLE #temp  
    

    If you are on SQL 2016 or earlier, replace the SELECT with string_agg with this SELECT:

       SELECT @json =   
          (SELECT txt AS [text()]  
           FROM   #temp  
           ORDER  BY ident  
           FOR XML PATH(''), TYPE).value('.', 'nvarchar(MAX)')  
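The essence of this workaround — read the file's raw bytes and decode them as UTF-8 before handing the text to the JSON parser — can be sketched outside SQL Server. A Python illustration (file name and content are made up for the example; it mirrors what BULK INSERT with an honored CODEPAGE=65001 achieves):

```python
import json
import os
import tempfile

# Write a sample UTF-8 JSON file (a stand-in for D:\Json_Samples\Test.json).
doc = {"SchemaVersion": "2", "NativeBranchName": "ধামুরা বন্দর"}
path = os.path.join(tempfile.mkdtemp(), "Test.json")
with open(path, "wb") as f:
    f.write(json.dumps(doc, ensure_ascii=False).encode("utf-8"))

# Read the raw bytes and decode them as UTF-8 -- the step that
# CODEPAGE = 65001 is supposed to perform.
with open(path, "rb") as f:
    feed = f.read().decode("utf-8")

parsed = json.loads(feed)
assert parsed["NativeBranchName"] == "ধামুরা বন্দর"
```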
    

1 additional answer

  1. Andy Verezhak 1 Reputation point
    2022-09-20T07:33:51.497+00:00

    VARCHAR(x) is not UTF-8. It is 1-byte-encoded extended ASCII, using a code page (determined by the collation) as the character map.

    NVARCHAR(x) is not UTF-16 (but very close to it; it is UCS-2). This is a 2-byte-encoded string covering almost any known character (but exceptions exist).

    UTF-8 uses 1 byte for plain Latin characters, but 2 or more bytes to encode other scripts.

    A VARBINARY(x) will hold the UTF-8 as a meaningless chain of bytes.

    A simple CAST or CONVERT will not work: VARCHAR will take each single byte as a character. For sure this is not the result you would expect. NVARCHAR would take each chunk of 2 bytes as one character. Again, not the thing you need.
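The effect of such a naive reinterpretation can be illustrated in Python (a sketch, not SQL Server code; NVARCHAR stores 2-byte UTF-16LE/UCS-2 code units, so a straight binary cast treats every 2-byte chunk as one character):

```python
text = "ধামুরা"
raw = text.encode("utf-8")  # 18 bytes: 3 bytes per Bengali character

# A straight VARCHAR-style reinterpretation: every single byte
# becomes one character (Latin-1 stands in for a 1-byte code page).
as_varchar = raw.decode("latin-1")

# A straight NVARCHAR-style reinterpretation: every 2-byte chunk
# becomes one character (pad with a zero byte if the length is odd).
padded = raw + b"\x00" * (len(raw) % 2)
as_nvarchar = padded.decode("utf-16-le")

# Neither reinterpretation yields the original text;
# only a real UTF-8 decode does.
assert as_varchar != text
assert as_nvarchar != text
assert raw.decode("utf-8") == text
```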

    But with the right collation you can use this code:

    Declare @Feed nvarchar(MAX)  
      
    Select @Feed=  
             convert(nvarchar(max),  
                     coalesce(BulkColumn, '' collate Cyrillic_General_100_CI_AS_SC_UTF8)  -- any UTF-8 collation  
                    ) collate Cyrillic_General_CI_AS  -- your default database collation  
             from OPENROWSET(Bulk N'D:\Json_Samples\Test.json',SINGLE_BLOB,CODEPAGE = '65001') Json  
                  
                 Select * from   
                 OpenJson(@Feed,'$')  
                 With(  
                     NativeName Nvarchar(100) '$.NativeBranchName'  
                 )  
    

    More detail here: https://sqlquantumleap.com/2018/09/28/native-utf-8-support-in-sql-server-2019-savior-false-prophet-or-both/

