question

yurylafontaine-8042 avatar image
0 Votes"
yurylafontaine-8042 asked ErlandSommarskog commented

Bulk insert and custom delimiters - (quote Separator - þ ) - (Column - ¶ )

I have search around but have not found anything that has worked in bulk inserting records into a SQL table. I have tried different variation using characters, ASCII and HEX values without success. Each time generating an error. I usually perform alteration in Excel (no SQL experience_), but this file has over 5M records. This has to be possible, does anyone have a working solution or provide additional guidance ? Thank you in advance.

ERROR:

Msg 4866, Level 16, State 1, Line 110 The bulk load failed. The column is too long in the data file for row 1, column 1. Verify that the field terminator and row terminator are specified correctly. Msg 7399, Level 16, State 1, Line 110 The OLE DB provider "BULK" for linked server "(null)" reported an error. The provider did not give any information about the error. Msg 7330, Level 16, State 2, Line 110 Cannot fetch a row from OLE DB provider "BULK" for linked server "(null)".

Sample Data

84325-test.txt
84269-1.png


84332-2.png


 -- Linefeed - 0x0A
 -- Quote  -  0xFE
 -- Column - 0X14
    
    
    
 BULK INSERT AllTags
 FROM 'C:\Data\Swap Drive\Elegance Brands\01. Document Export\REL000001-REL296747\VOL0002.DAT'
 WITH (
     FIRSTROW = 2 ,
     FIELDTERMINATOR = '0x14',
     ROWTERMINATOR='\n' ,
     BATCHSIZE=250000 ,
     MAXERRORS= 2,
     KEEPNULLS
     );    
 GO




Appreciated any assistance, thank you in advance.

sql-server-generalsql-server-transact-sql
1.png (6.7 KiB)
2.png (8.7 KiB)
test.txt (12.6 KiB)
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

MelissaMa-msft avatar image
0 Votes"
MelissaMa-msft answered ErlandSommarskog commented

Hi @yurylafontaine-8042,

Welcome to Microsoft Q&A!

After checking, your DDL of AllTags provided is not working with your text provided.

It could be better to provide the complete DDL as below as a good starting:

 drop table if exists AllTags
 CREATE TABLE dbo.AllTags (
 REFERENCEID   varchar (100) ,
 [BEGBATES]    varchar (100) ,
 [ENDBATES]    varchar (100) ,
 [BEGATTACH]   varchar (100) ,
 [ENDATTACH]   varchar (100) ,
 [PARENTBATES] varchar (100) ,
 [ATTACHMENT]  varchar (1000),
 CUSTODIAN varchar(100), 
 DUPCUSTODIAN varchar(100), 
 [FROM] varchar(100), 
 [TO] varchar(100), 
 CC varchar(100), 
 BCC varchar(100), 
 SUBJECT varchar(100), 
 DATESENT varchar(100), 
 TIMESENT varchar(100), 
 DATERCVD varchar(100), 
 TIMERCVD varchar(100), 
 FILEEXT varchar(100), 
 AUTHOR varchar(100), 
 CREATEDATE varchar(100), 
 CREATETIME varchar(100), 
 DATELASTMOD varchar(100), 
 TIMELASTMOD varchar(100), 
 FILENAME varchar(100), 
 DUPFILENAME varchar(100), 
 FILELENGTH varchar(100), 
 PGCOUNT varchar(100), 
 DOCTYPE varchar(100), 
 FAMDATE varchar(100), 
 FAMTIME varchar(100), 
 TIMEZONE varchar(100), 
 PATH varchar(max), 
 DUPPATH varchar(max), 
 DEDUPHASH varchar(100), 
 NATIVEPATH varchar(100), 
 OCRPATH varchar(100), 
 TITLE varchar(100), 
 COMPANY varchar(100), 
 DATEACCESSED varchar(100), 
 TIMEACCESSED varchar(100), 
 DATEPRINTED varchar(100), 
 TIMEPRINTED varchar(100), 
 CONVDATE varchar(100), 
 CONVTIME varchar(100), 
 ATTACHLIST varchar(100), 
 FAMILYRANGE varchar(100), 
 ALLCUSTODIANS varchar(100), 
 ALLFILENAMES varchar(100), 
 ALLFILEPATHS varchar(max), 
 HASHMD5 varchar(100), 
 HASHSHA varchar(100), 
 TAGS varchar(100), 
 DOCNOTE varchar(100), 
 PRIVNOTE varchar(100), 
 REDACTRSNS varchar(100), 
 DISCOID varchar(100), 
 MESSAGEID varchar(100), 
 THREADID varchar(100), 
 ATTACHCOUNT varchar(100), 
 HIDDENTYPE varchar(100), 
 METAREDACTED varchar(100), 
 INREPLYTOID varchar(100), 
 OBJECTHASH varchar(100), 
 REVISION varchar(100), 
 HEADER varchar(100), 
 IMPORTANCE varchar(100), 
 DELIVERYRECEIPT varchar(100), 
 READRECEIPT varchar(100), 
 SENSITIVITY varchar(100), 
 LASTAUTHOR varchar(100), 
 ESUBJECT varchar(100), 
 DATEAPPTSTART varchar(100), 
 DATEAPPTEND varchar(100), 
 CALBEGDATE varchar(100), 
 CALENDDATE varchar(100), 
 CALBEGTIME varchar(100), 
 CALENDTIME varchar(100), 
 CALENDUR varchar(100), 
 RECORDTYPE varchar(100), 
 REVISIONNUMBER varchar(100), 
 Exception varchar(100), 
 ExceptionDetails varchar(100), 
 TextLimitExceeded varchar(100)
 ) 

If your version is SQL Server 2017 (14.x) CTP 1.1 and later, you could have a try with below:

 BULK INSERT AllTags
 FROM 'C:\Data\Swap Drive\REL000001-REL296747\VOL0002.DAT'
  WITH (
      FIRSTROW = 2 ,
      DATAFILETYPE ='char',
      FIELDTERMINATOR = '0x14',
      ROWTERMINATOR = '\n',
      BATCHSIZE=250000 ,
      CODEPAGE=65001,
      MAXERRORS= 2,
      FIELDQUOTE=N'þ' ,  ---adding this row
      KEEPNULLS
      );    
    
 select * from AllTags

Below is part of the output but is a good starting. Then you could try to deal with the data in TSQL side, like using REPLACE function.
84428-tag.png

Best regards
Melissa


If the answer is helpful, please click "Accept Answer" and upvote it.
Note: Please follow the steps in our documentation to enable e-mail notifications if you want to receive the related email notification for this thread.


tag.png (16.8 KiB)
· 1
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

Egads! You wrangled out all the eighty column names from that file. That's ambitious!

Alas, the attempt with FIELDQUOTE did not work out. As a start, I think you also need FORMAT='CSV'. But that does not really matter here. Because the file is a UTF-8 file, BULK INSERT does not find the terminator, because it is looks for the byte 0xFE, which does not appear in the file. So this is a kind of hopeless in that regard, but the þ has to be dealt with replace (or a find-replace of the file before loading it into SQL Server).

As an extra spice I found that when I ran you BULK INSERT statement,
on SQL 2019, I got an assertion error, so that looks like an SQL Server bug. (I should add that I have UTF8 collation.)

0 Votes 0 ·
ErlandSommarskog avatar image
0 Votes"
ErlandSommarskog answered yurylafontaine-8042 commented

The file you attached is encoded as UTF-8, so you would need to specify CODEPAGE=65001.

But that is not the sole solution. The file format is kind of difficult to read. Do you have a specification of the fields? And what is the definition of the table AllTags?

I'm sure that I understand the delimiter table. It seems that 0x14 is indeed the field terminator, but the glyph shown is the pilcrow which has a different character code. (0x14 is an invisible control character). The Reg-trade mark sign is said the be the line terminator, but the file has CR-LF like a regular Windows file, and you have specified \n as the field terminator.

As for the quote character, we may be table to deal with that with BCP. For the other two - no.

· 1
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

Here is what did. I took the header row and created a TempDatabase with every column.

Usually these files are UTF-8, I think this use \n for the end of line., but was trying each choice to see if i can narrow it down. Notepad ++ displays Cltr-linefeed.

The quote delimiter is "þ" which char(254) - or 0xFE in hex
the column delimiter is DC4 and displays as - char(020) or 0x14 in hex
end of line should be \n but could be CR LF



84353-haderrow.png
USE MASTER
Go
DROP DATABASE IF EXISTS TempDatabase
Create DATABASE [TempDatabase]
USE TempDatabase
GO
CREATE TABLE dbo.AllTags (
[BEGBATES] char (10) ,
[ENDBATES] char (10) ,
[BEGATTACH] char (10) ,
[ENDATTACH] char (10) ,
[PARENTBATES] char (10) ,
[ATTACHMENT] char (100)
) GO

Thank you for assistance! much appreciated.


0 Votes 0 ·
haderrow.png (5.7 KiB)
yurylafontaine-8042 avatar image
0 Votes"
yurylafontaine-8042 answered

Getting closer,

 BULK INSERT AllTags
 FROM 'C:\Data\Swap Drive\REL000001-REL296747\VOL0002.DAT'
 WITH (
     FIRSTROW = 2 ,
     DATAFILETYPE ='char',
     FIELDTERMINATOR = '0x14',
     ROWTERMINATOR = '\n',
     BATCHSIZE=250000 ,
     CODEPAGE=65001,
     MAXERRORS= 2,
     KEEPNULLS
     );    
 GO


Resulted in

Msg 4863, Level 16, State 1, Line 102
Bulk load data conversion error (truncation) for row 2, column 2 (ENDBATES).
Msg 4863, Level 16, State 1, Line 102
Bulk load data conversion error (truncation) for row 3, column 2 (ENDBATES).
Msg 4863, Level 16, State 1, Line 102
Bulk load data conversion error (truncation) for row 4, column 2 (ENDBATES).
Msg 4865, Level 16, State 1, Line 102
Cannot bulk load because the maximum number of errors (2) was exceeded.
Msg 7399, Level 16, State 1, Line 102
The OLE DB provider "BULK" for linked server "(null)" reported an error. The provider did not give any information about the error.
Msg 7330, Level 16, State 2, Line 102
Cannot fetch a row from OLE DB provider "BULK" for linked server "(null)".

5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

ErlandSommarskog avatar image
0 Votes"
ErlandSommarskog answered

I examined your file, this way. I first created a format file:

14.0
1
1 SQLCHAR 0 0 "\r\n" 2 StyleCode Finnish_Swedish_100_CS_AS_SC_UTF8

Yes, this format file says that it is all one line.

Then I ran this:

drop table if exists AllTags
go
CREATE TABLE dbo.AllTags (
   rowno int IDENTITY(1,1) NOT NULL PRIMARY KEY,
   longline nvarchar(MAX) NOT NULL
)
go
     BULK INSERT AllTags
     FROM 'C:\temp\slask.bcp'
     WITH (
         FORMATFILE = 'C:\temp\slask.fmt',
         CODEPAGE=65001
         );    
go
SELECT rowno, COUNT(*)
FROM   AllTags a
CROSS  APPLY string_split(a.longline, nchar(0x14))
GROUP BY rowno

The output said that were 84 fields on all rows with 0x14 as the delimiter. That is good news so far that this means that we may be able to process this file with BULK INSERT.

(The IDENTITY column is there to allow the grouping, and this IDENTITY column is the reason I have the format file.)

I don't know exactly what your ambition is, but your table has far too columns to accommodate all fields. Then again, you call the last column ATTACHMENT, so maybe you only care about the first columns and don't care about the rest.

In any case, the reason you get the truncation messages is simply that your column widths are too narrow. The longest line in your sample file is 1354 characters. I was able to load the file without errors with your BULK INSERT statement, when I set the table this way:

CREATE TABLE dbo.AllTags (
[BEGBATES]    varchar (20) ,
[ENDBATES]    varchar (20) ,
[BEGATTACH]   varchar (20) ,
[ENDATTACH]   varchar (20) ,
[PARENTBATES] varchar (20) ,
[ATTACHMENT]  varchar (2000)
)

When it comes to the thorns, I suspect that we are out of luck. I have found that BULK INSERT is not able to deal with non-ASCII delimiters in UTF-8 files. But you can strip them in a separate step.

5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

yurylafontaine-8042 avatar image
0 Votes"
yurylafontaine-8042 answered

These files are an export of an application with an SQL backend with various database. Each database will have any number of fields (which is why use the excel) to quickly create the table in SQL. I have to create various extractions to be able to import this into another database and i can only export this data from the front end. This one has almost 100 columns, others have more.

I usually do this in excel,

  • Replace 0xFE-0x14- 0xFE to /t (tab)

  • remove excessive 0xFE (beginning and end of line characters)

  • load to excel

*sometime depending on the data, columns get scattered, so it doesn't always work.

However, excel starts to choke after a million records or so. this one has 5M, I thought the next logical step here was to start seeing how i can do this in SQL. into the deep end of the pool without a life jacket and not knowing how to swim.

I will modify the able now to see what I get.

Much appreciated the assistance, than you!

5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

yurylafontaine-8042 avatar image
0 Votes"
yurylafontaine-8042 answered ErlandSommarskog commented

UREKA!

First - That you both for wisdom and tutelage. I would been pulling my hair out for a few more days.

I have been reviewing the documentation on BCP and how to created the format file. since I am new to this stuff I have to figure out every piece to deal each database.

I kept getting truncation error because of fields lengths. So I change all fields to (MAX) after I noticed @MelissaMan used one field type. I made the mistake of trying to keep field types as they appeared in the data and this makes it much easier. Since this is just a staging area for more field manipulation and the transposing of data, I don't think it affects anything.


@ErlandSommarskog - I also took your last suggestion and striped out the "þ" character before importing the data. I just used the sed command in ubuntu shell (Command below), one needs to be careful of the file encoding as that changes the syntax of the command. since this is a UTF-8 file the "þ" character is represented by xC3\xBE. I will need to do this in SQL.

 sed -i '/s\xC3\xBE//g' VOL001.dat
    
 https://unicode-table.com/en/00FE/

84505-done.png




Again, thank you, this was EXCELLENT!


done.png (21.3 KiB)
· 1
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

Great to hear that you got things working!

In case you need to fight BCP or BULK INSERT more in the future, I have an article on my web site that you may find useful: http://www.sommarskog.se/bulkload.html.

0 Votes 0 ·