Optimising BULK INSERT

JCE H 41 Reputation points
2021-12-09T10:41:37.27+00:00

Hi all,

I sometimes receive large text files, and loading them can take hours; I'd like to bring that load time down. The loads are part of a stored procedure that, under the covers, uses BULK INSERT. My first thought was that there's various talk about running BULK INSERT in parallel, so I wondered what the difference would be between loading a single file and splitting that file in 2 and loading the 2 halves simultaneously. To give myself enough time to catch waits and check the processors, I made a dummy table with 3 varchar(255) columns of dummy data and a total of 20,000,000 rows.
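
The test table was along these lines (a sketch; the column names here are just placeholders for illustration):

-- Dummy target table: 3 varchar(255) columns, loaded with 20,000,000 rows of dummy data
CREATE TABLE tempdb.dbo.ParallelBulkTest (
    col1 varchar(255),
    col2 varchar(255),
    col3 varchar(255)
);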

I suspect that is where my knowledge ends, but I tried 3 different combinations.

DECLARE @wert smalldatetime -- NB: smalldatetime only resolves to the minute, so DATEDIFF(SECOND,...) below comes out in 60-second steps
DECLARE @edt smalldatetime
DECLARE @bcpquery nvarchar(max)

PRINT '2 statements as 1 exec'
BEGIN

SET @wert = getdate()

SET @bcpquery = '
bulk insert tempdb.dbo.ParallelBulkTest from ''D:\Data\Temp\BulkLoad\DataIn1.txt'' with (fieldterminator = ''|'', rowterminator = ''\n'',FIRSTROW=1);
bulk insert tempdb.dbo.ParallelBulkTest from ''D:\Data\Temp\BulkLoad\DataIn2.txt'' with (fieldterminator = ''|'', rowterminator = ''\n'',FIRSTROW=1);
'
execute sp_executesql @bcpquery;

SET @edt = getdate()
SELECT '2 statements as 1 exec',DATEDIFF(SECOND,@wert ,@edt)

END

TRUNCATE TABLE ParallelBulkTest

PRINT '2 statements separate'
BEGIN

SET @wert = getdate()

bulk insert tempdb.dbo.ParallelBulkTest from 'D:\Data\Temp\BulkLoad\DataIn1.txt' with (fieldterminator = '|', rowterminator = '\n',FIRSTROW=1);
bulk insert tempdb.dbo.ParallelBulkTest from 'D:\Data\Temp\BulkLoad\DataIn2.txt' with (fieldterminator = '|', rowterminator = '\n',FIRSTROW=1);

SET @edt = getdate()
SELECT '2 statements separate',DATEDIFF(SECOND,@wert ,@edt)

END

TRUNCATE TABLE ParallelBulkTest

PRINT '1 big file'
BEGIN

SET @wert = getdate()

bulk insert tempdb.dbo.ParallelBulkTest from 'D:\Data\Temp\BulkLoad\DataIn0.txt' with (fieldterminator = '|', rowterminator = '\n',FIRSTROW=1);

SET @edt = getdate()
SELECT '1 big file',DATEDIFF(SECOND,@wert ,@edt)

END

All 3 methods took exactly 60 seconds each. As far as I could see, all processors lit up regardless of the method. I only ever saw SOS_SCHEDULER_YIELD waits, so I assume there were no real physical constraints on throughput.
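
For reference, I was watching the waits from a second session with something along these lines (a sketch, not my exact query):

-- BULK INSERT shows up with command = 'BULK INSERT' in sys.dm_exec_requests
SELECT session_id, status, command, wait_type, last_wait_type, cpu_time
FROM sys.dm_exec_requests
WHERE command LIKE 'BULK INSERT%';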

From this I can only conclude that BULK INSERT optimises itself in such a way that you might as well just chuck the biggest file you can at it and let it work it out for itself? Or is there more I can try?


Accepted answer
  1. Dan Guzman 9,401 Reputation points
    2021-12-09T11:28:16.387+00:00

    so I wondered what would be the difference between loading a single file or splitting that file in 2 and trying to load the 2 files simultaneously.

    The scripts in your question execute the bulk insert statements serially rather than in parallel. Note that T-SQL statements within a batch always execute serially.

    To test parallel bulk insert performance, you need 2 different connections:

    Session 1:

    DECLARE @sdt smalldatetime, @edt smalldatetime

    PRINT 'connection 1'

    SET @sdt = getdate()

    bulk insert tempdb.dbo.ParallelBulkTest from 'D:\Data\Temp\BulkLoad\DataIn1.txt' with (fieldterminator = '|', rowterminator = '\n',FIRSTROW=1);

    SET @edt = getdate()
    SELECT 'connection 1',DATEDIFF(SECOND,@sdt,@edt)
    

    Session 2:

    DECLARE @sdt smalldatetime, @edt smalldatetime

    PRINT 'connection 2'

    SET @sdt = getdate()

    bulk insert tempdb.dbo.ParallelBulkTest from 'D:\Data\Temp\BulkLoad\DataIn2.txt' with (fieldterminator = '|', rowterminator = '\n',FIRSTROW=1);

    SET @edt = getdate()
    SELECT 'connection 2',DATEDIFF(SECOND,@sdt,@edt)
    

    Consider specifying a smaller batch size (e.g. BATCHSIZE = 50000) with large files as mentioned in the performance considerations section of the BULK INSERT documentation.
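
    For example, a sketch using the same file and table as your test:

    -- Commit every 50,000 rows instead of loading the whole file as one transaction
    bulk insert tempdb.dbo.ParallelBulkTest
    from 'D:\Data\Temp\BulkLoad\DataIn1.txt'
    with (fieldterminator = '|', rowterminator = '\n', FIRSTROW = 1, BATCHSIZE = 50000);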


5 additional answers

  1. JCE H 41 Reputation points
    2021-12-09T20:04:05.767+00:00

    Hi Dan,

    Thanks for that. Is there an easy way to set up separate connections within a proc?

    I might be answering my own question here, but on a whim I thought I would try BCP IN. Now, I fully expected BCP IN to be slower than BULK INSERT, but so far it has matched BULK INSERT at 60 seconds for a single large file, or absolutely blitzed it in under 15 seconds. The latter was achieved by creating a batch file and then triggering that with xp_cmdshell. Unfortunately there's a load of stuff going on on the server, so it is hard to isolate exactly what might be affecting my longer run times.

    RESULTS OF TESTING (durations in seconds):

    ID  Description                         r1   r2   r3   r4   r5
    1   2 BULK INSERT statements as 1 exec  60   60   60   60   60
    2   2 BULK INSERT statements separate   120  60   60   60   60
    3   BULK INSERT of 1 big file           60   60   120  60   60
    4   BCP 1 big file                      0    60   0    0    0
    5   BCP 2 files simultaneously (?!)     0    0    0    0    60

    Total elapsed time:
    r1  ?!
    r2  ?!
    r3  3m 49s
    r4  3m 50s
    r5  3m 38s

    SCRIPT FOR BCP PARTS.

    DECLARE @results TABLE(id tinyint IDENTITY(1,1),d varchar(255),Duration int)
    DECLARE @wert smalldatetime, @edt smalldatetime -- NB: smalldatetime resolves to the minute, hence the 0/60/120 readings
    DECLARE @bcpquery nvarchar(max), @CMD varchar(8000)
    DECLARE @Testta sysname = @@SERVERNAME   -- server name for bcp's /S switch (assumed here)
    DECLARE @filepath varchar(255) = 'D:\Data\Temp\BulkLoad\'

    PRINT '1) 2 statements as 1 exec'
    BEGIN
    SET @wert = getdate()

    SET @bcpquery = '
    bulk insert tempdb.dbo.ParallelBulkTest from ''D:\Data\Temp\BulkLoad\DataIn1.txt'' with (fieldterminator = ''|'', rowterminator = ''\n'',FIRSTROW=1);
    bulk insert tempdb.dbo.ParallelBulkTest from ''D:\Data\Temp\BulkLoad\DataIn2.txt'' with (fieldterminator = ''|'', rowterminator = ''\n'',FIRSTROW=1);
    '
    execute sp_executesql @bcpquery;

    SET @edt = getdate()

    INSERT INTO @results(d,Duration)
    SELECT '2 BULK INSERT statements as 1 exec',DATEDIFF(SECOND,@wert ,@edt)

    END

    SELECT COUNT(*) as Records FROM ParallelBulkTest

    ------------------------------------------------------------------

    TRUNCATE TABLE ParallelBulkTest

    PRINT '2) 2 statements separate'
    BEGIN
    SET @wert = getdate()

    bulk insert tempdb.dbo.ParallelBulkTest from 'D:\Data\Temp\BulkLoad\DataIn1.txt' with (fieldterminator = '|', rowterminator = '\n',FIRSTROW=1);
    bulk insert tempdb.dbo.ParallelBulkTest from 'D:\Data\Temp\BulkLoad\DataIn2.txt' with (fieldterminator = '|', rowterminator = '\n',FIRSTROW=1);

    SET @edt = getdate()
    INSERT INTO @results(d,Duration)
    SELECT '2 BULK INSERT statements separate',DATEDIFF(SECOND,@wert ,@edt)

    END

    SELECT COUNT(*) as Records FROM ParallelBulkTest

    ------------------------------------------------------------------

    TRUNCATE TABLE ParallelBulkTest

    PRINT '3) 1 big file'

    BEGIN
    SET @wert = getdate()

    bulk insert tempdb.dbo.ParallelBulkTest from 'D:\Data\Temp\BulkLoad\DataIn0.txt' with (fieldterminator = '|', rowterminator = '\n',FIRSTROW=1);

    SET @edt = getdate()
    INSERT INTO @results(d,Duration)
    SELECT 'BULK INSERT of 1 big file',DATEDIFF(SECOND,@wert ,@edt)

    END

    SELECT COUNT(*) as Records FROM ParallelBulkTest

    ------------------------------------------------------------------

    TRUNCATE TABLE ParallelBulkTest

    PRINT '4) BCP 1 big file'

    BEGIN
    SET @wert = getdate()

    SET @CMD = 'bcp "tempdb.dbo.ParallelBulkTest" IN "D:\Data\Temp\BulkLoad\DataIn0.txt" -c -t "|" -F 1 -h "TABLOCK" /S "' + @Testta + '" -T'
    exec xp_cmdshell @CMD , no_output

    SET @edt = getdate()
    INSERT INTO @results(d,Duration)
    SELECT 'BCP 1 big file',DATEDIFF(SECOND,@wert ,@edt)

    END

    SELECT COUNT(*) as Records FROM ParallelBulkTest

    ------------------------------------------------------------------

    TRUNCATE TABLE ParallelBulkTest

    PRINT '5) BCP 2 files simultaneously'

    BEGIN
    SET @wert = getdate()

    SET @CMD = 'START /b bcp.exe "tempdb.dbo.ParallelBulkTest" IN "D:\Data\Temp\BulkLoad\DataIn1.txt" -c -t "|" -F 1 -h "TABLOCK" /S "' + @Testta + '" -T
    START /b bcp.exe "tempdb.dbo.ParallelBulkTest" IN "D:\Data\Temp\BulkLoad\DataIn2.txt" -c -t "|" -F 1 -h "TABLOCK" /S "' + @Testta + '" -T' ---b 10000
    --PRINT @CMD

    SELECT @CMD as c INTO ##A

    SET @CMD = 'bcp "##A" out "' + @filepath + 'Runbat.bat" /c /t "|" /S "' + @Testta + '" -T'
    exec xp_cmdshell @CMD , no_output

    DROP TABLE ##A

    SET @CMD = 'CALL "D:\Data\Temp\BulkLoad\Runbat.bat"'
    exec xp_cmdshell @CMD , no_output

    SET @edt = getdate()
    INSERT INTO @results(d,Duration)
    SELECT 'BCP 2 files simultaneously',DATEDIFF(SECOND,@wert ,@edt)

    END

    SELECT COUNT(*) as Records FROM ParallelBulkTest

    SELECT * FROM @results

    ------------------------------------------------------------------


  2. Dan Guzman 9,401 Reputation points
    2021-12-09T21:38:08.153+00:00

    Is there an easy way to setup separate connections within a proc?

    Although one can technically have multiple connections in a SQLCLR proc, it would also need to implement multi-threading to run the bulk inserts in parallel. But don't bother going there, since it would require an UNSAFE assembly and, as you've learned, a client application (BCP, SSIS, etc.) provides more options for scale-out and/or multi-threading.

    You could run multiple instances of BCP in parallel within a single batch file using START commands but those will run asynchronously and the batch file would return while they are still running unless you take special measures.
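
    One possible "special measure" (a sketch, assuming xp_cmdshell is enabled, @Testta holds the server name, and no unrelated bcp.exe processes are running on the box): have the generated batch file poll tasklist until every bcp.exe has exited, so the CALL only returns once both loads are done.

    -- Sketch: contents for Runbat.bat; write @CMD out to the .bat exactly as in
    -- your script above, then CALL it via xp_cmdshell. ping serves as a crude
    -- 1-second delay (timeout fails under xp_cmdshell's redirected input), and
    -- the tasklist check assumes no other bcp.exe is running on the machine.
    DECLARE @Testta sysname = @@SERVERNAME, @CMD varchar(8000);
    SET @CMD = 'START /b bcp.exe "tempdb.dbo.ParallelBulkTest" IN "D:\Data\Temp\BulkLoad\DataIn1.txt" -c -t "|" -h "TABLOCK" -S "' + @Testta + '" -T
    START /b bcp.exe "tempdb.dbo.ParallelBulkTest" IN "D:\Data\Temp\BulkLoad\DataIn2.txt" -c -t "|" -h "TABLOCK" -S "' + @Testta + '" -T
    :wait
    ping -n 2 127.0.0.1 >nul
    tasklist /FI "IMAGENAME eq bcp.exe" | find /I "bcp.exe" >nul && goto wait';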


  3. LiHong-MSFT 10,056 Reputation points
    2021-12-10T05:42:12.963+00:00

    Hi @JCE H,
    You can achieve parallel bulk insert with SSIS.
    SSIS has two properties, MaxConcurrentExecutables and EngineThreads. By setting these two properties in your SSIS package, the tasks in the package will be executed in parallel.
    Following are two blogs for your reference:
    Parallel Execution of Tasks in SSIS
    Parallel execution in SSIS

    Best regards,
    LiHong



  4. JCE H 41 Reputation points
    2021-12-10T10:31:20.687+00:00

    Yes, I know about SSIS. Don't go there, it is too rigid. I think Azure can handle ever-changing columns, but I've not had a chance to play with Azure yet.

    Moving on, I spotted I had missed the TABLOCK off the bulk insert statements. Adding that in makes it comparable to BCP. I tried changing the batchsize to 50k records but that made no difference.
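
    For anyone following along, from r7 onwards the statements looked like this (a sketch of run 3's statement with TABLOCK added):

    -- TABLOCK takes a bulk-update table lock, which is what lets the load
    -- qualify for the bulk-load optimisations
    bulk insert tempdb.dbo.ParallelBulkTest
    from 'D:\Data\Temp\BulkLoad\DataIn0.txt'
    with (fieldterminator = '|', rowterminator = '\n', FIRSTROW = 1, TABLOCK);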

    It looks like BULK INSERT of 1 large file is a) easy, b) efficient. It optimises itself pretty well unless you've got particularly odd data, and spreads itself across all processors nicely. Maybe older versions of SQL weren't so good, but from SQL 2016 onwards it really is pretty clever if you have your hardware set up correctly. Well done to whoever worked on that!

    RESULTS (durations in seconds; TABLOCK added from r7, BATCHSIZE 50000 added from r10):

    ID  Description                         r1   r2   r3   r4   r5   r6   r7   r8   r9   r10  r11  r12
    1   2 BULK INSERT statements as 1 exec  60   60   60   60   60   60   60   60   0    0    0    0
    2   2 BULK INSERT statements separate   120  60   60   60   60   60   0    0    0    0    60   60
    3   BULK INSERT of 1 big file           60   60   120  60   60   60   0    0    0    60   0    0
    4   BCP 1 big file                      0    60   0    0    0    0    0    0    60   0    0    0
    5   BCP 2 files simultaneously (?!)     0    0    0    0    60   0    60   60   0    0    60   60

    Total elapsed time:
    r1   ?!
    r2   ?!
    r3   3m 49s
    r4   3m 50s
    r5   3m 38s
    r6   3m 47s
    r7   1m 19s  -- TABLOCK added to the BULK INSERT statements here (forgot it before!)
    r8   1m 18s
    r9   1m 18s
    r10  1m 44s  -- BATCHSIZE 50000 added
    r11  1m 35s
    r12  1m 39s

    In short, that's 100m rows imported in under 2 minutes and that includes stopping to count 5 lots of 20 million rows after each insert and inserting the results into a table.

