BULK insert: UTF8 file and field can be multi-line and having carriage return character

fouad fallouh 21 Reputation points
2021-05-12T13:22:57.487+00:00

Hello

I want to use BULK insert for a text file with UTF8 encoding.
Some of my field value can be : a string multi-line, ie contains carriage return character

this is the content of the file tst_tbl.txt: (in UTF8 and CRLF line end marker)

field_number|field_txt
0| word1, word2
word3, word4
0|line2

This is the test case:

Drop Table if exists tmp_owner.tst_tbl;

CREATE TABLE tmp_owner.tst_tbl(
field_number numeric(38, 8) NULL,
field_txt nvarchar(255) NULL
) ;

truncate table tmp_owner.tst_tbl;

BULK INSERT tmp_owner.tst_tbl FROM 'C:\tst_tbl.txt' WITH (CODEPAGE = 65001, FIELDTERMINATOR = '|', ROWTERMINATOR = '\n', batchsize = 5000000, FIRSTROW = 2)

==> error:
Msg 4864, Level 16, State 1, Line 22
Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 3, column 1 (field_number).

it is expected error but how can I resolve it?
I tried to make ROWTERMINATOR composed of many characters like "¤\n" , and I added at the end of each line the character ¤, the bulk insert is executed without error but does not import row.
BULK INSERT dxc_std_dev20400_aia_tmp.tst_tbl FROM 'C:\SQLTEMP\PDM_Import\PDM_Import testcases\bulk_file_folder\pdm_dump_data\one\tst_tbl.txt' WITH (CODEPAGE = 65001, FIELDTERMINATOR = '|', ROWTERMINATOR = '¤\n', batchsize = 5000000, FIRSTROW = 2)

(0 rows affected)

Completion time: 2021-05-12T15:19:09.5000775+02:00

I tried to make FIELDTERMINATOR and ROWTERMINATOR many character : this imply 0 row importing

Please help
Thanks

Transact-SQL
Transact-SQL
A Microsoft extension to the ANSI SQL language that includes procedural programming, local variables, and various support functions.
4,515 questions
{count} votes

Accepted answer
  1. Erland Sommarskog 98,636 Reputation points
    2021-05-17T21:13:30.113+00:00

    So I have done some research on the topic, including reading what I have written myself about it here: https://www.sommarskog.se/bulkload.html#UTF8 :-)

    BULK INSERT is not able to handle terminators outside the ASCII range when you have a UTF8 file. I have filed a bug for this that you can vote for: http://feedback.azure.com/forums/908035-sql-server/suggestions/32899759-bulk-insert-cannot-load-utf-8-files-with-a-non-asc

    If you want to stick with BULK INSERT, you need to use a separator that is in the ASCII range.

    If you want to use the Universal Currency Sign, you can load the file with BCP:

    bcp Somedatabase.tmp_owner.tst_tbl in tst_tbl.txt -c -t"|" -r"¤\n" -S Server -T -F2 -C 65001
    

    Note that the  before the ¤ is not a typo. You need to specify the separator as a byte sequence, so ¤ are not UTF-8 here, but Latin-1. Confusing, eh? (And to muddle the waters even more, \n actually stands for \r\n.)

    0 comments No comments

2 additional answers

Sort by: Most helpful
  1. fouad fallouh 21 Reputation points
    2021-05-17T15:24:31.233+00:00

    Hi again
    This is an example of the file:
    97169-tst-tbl.txt

    97139-code.txt

    The example utf8 encoding file is tst_tbl.txt (with multi-line, the code that I run is in code.txt

    The result of execution is:

    (0 rows affected)

    (0 rows affected)

    Completion time: 2021-05-17T17:16:03.7220937+02:00

    Thanks

    0 comments No comments

  2. fouad fallouh 21 Reputation points
    2021-05-18T07:51:14.087+00:00

    Hello
    Thank you very much for your answer it is very helpful

    I hope Microsoft fix this problem as you have already mention it in your bug declaration : BULK INSERT cannot load UTF-8 files with a non-ASCII field terminator

    Regards

    0 comments No comments