Optimising BULK INSERT

JCE H 41 Reputation points
2021-12-09T10:41:37.27+00:00

Hi all,

I sometimes receive large text files, and loading them can take hours; I'd like to bring that load time down. The loads are part of a stored procedure that, under the covers, uses BULK INSERT. My first thought was that there's various talk about running BULK INSERT in parallel, so I wondered what the difference would be between loading a single file and splitting that file in 2 and loading the 2 halves simultaneously. To give myself enough time to catch waits and check the processors, I made a dummy table with 3 varchar(255) columns of dummy data and a total of 20,000,000 rows.
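
The test table was along these lines (a sketch; the column names here are just placeholders for illustration):

-- Dummy target table: 3 varchar(255) columns, loaded with 20,000,000 rows of dummy data
CREATE TABLE tempdb.dbo.ParallelBulkTest (
    col1 varchar(255),
    col2 varchar(255),
    col3 varchar(255)
);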

I suspect that is where my knowledge ends, but I tried 3 different combinations.

DECLARE @wert smalldatetime -- NB: smalldatetime only resolves to the minute, so DATEDIFF(SECOND,...) below comes out in 60-second steps
DECLARE @edt smalldatetime
DECLARE @bcpquery nvarchar(max)

PRINT '2 statements as 1 exec'
BEGIN

SET @wert = getdate()

SET @bcpquery = '
bulk insert tempdb.dbo.ParallelBulkTest from ''D:\Data\Temp\BulkLoad\DataIn1.txt'' with (fieldterminator = ''|'', rowterminator = ''\n'',FIRSTROW=1);
bulk insert tempdb.dbo.ParallelBulkTest from ''D:\Data\Temp\BulkLoad\DataIn2.txt'' with (fieldterminator = ''|'', rowterminator = ''\n'',FIRSTROW=1);
'
execute sp_executesql @bcpquery;

SET @edt = getdate()
SELECT '2 statements as 1 exec',DATEDIFF(SECOND,@wert ,@edt)

END

TRUNCATE TABLE ParallelBulkTest

PRINT '2 statements separate'
BEGIN

SET @wert = getdate()

bulk insert tempdb.dbo.ParallelBulkTest from 'D:\Data\Temp\BulkLoad\DataIn1.txt' with (fieldterminator = '|', rowterminator = '\n',FIRSTROW=1);
bulk insert tempdb.dbo.ParallelBulkTest from 'D:\Data\Temp\BulkLoad\DataIn2.txt' with (fieldterminator = '|', rowterminator = '\n',FIRSTROW=1);

SET @edt = getdate()
SELECT '2 statements separate',DATEDIFF(SECOND,@wert ,@edt)

END

TRUNCATE TABLE ParallelBulkTest

PRINT '1 big file'
BEGIN

SET @wert = getdate()

bulk insert tempdb.dbo.ParallelBulkTest from 'D:\Data\Temp\BulkLoad\DataIn0.txt' with (fieldterminator = '|', rowterminator = '\n',FIRSTROW=1);

SET @edt = getdate()
SELECT '1 big file',DATEDIFF(SECOND,@wert ,@edt)

END

All 3 methods took exactly 60 seconds each. As far as I could see, all processors lit up regardless of the method. I only ever saw SOS_SCHEDULER_YIELD waits, so I assume there were no real physical constraints on throughput.
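
For reference, I was watching the waits from a second session with something along these lines (a sketch, not my exact query):

-- BULK INSERT shows up with command = 'BULK INSERT' in sys.dm_exec_requests
SELECT session_id, status, command, wait_type, last_wait_type, cpu_time
FROM sys.dm_exec_requests
WHERE command LIKE 'BULK INSERT%';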

From this I can only conclude that BULK INSERT optimises itself in such a way that you might as well just chuck the biggest file you can at it and let it work it out for itself? Or is there more I can try?


Accepted answer
  1. Dan Guzman 9,401 Reputation points
    2021-12-09T11:28:16.387+00:00

    so I wondered what would be the difference between loading a single file or splitting that file in 2 and trying to load the 2 files simultaneously.

    The scripts in your question execute the bulk insert statements serially rather than in parallel. Note that T-SQL statements within a batch always execute serially.

    To test parallel bulk insert performance, you need 2 different connections:

    Session 1:

    DECLARE @sdt smalldatetime, @edt smalldatetime

    PRINT 'connection 1'

    SET @sdt = getdate()

    bulk insert tempdb.dbo.ParallelBulkTest from 'D:\Data\Temp\BulkLoad\DataIn1.txt' with (fieldterminator = '|', rowterminator = '\n',FIRSTROW=1);

    SET @edt = getdate()
    SELECT 'connection 1',DATEDIFF(SECOND,@sdt,@edt)
    

    Session 2:

    DECLARE @sdt smalldatetime, @edt smalldatetime

    PRINT 'connection 2'

    SET @sdt = getdate()

    bulk insert tempdb.dbo.ParallelBulkTest from 'D:\Data\Temp\BulkLoad\DataIn2.txt' with (fieldterminator = '|', rowterminator = '\n',FIRSTROW=1);

    SET @edt = getdate()
    SELECT 'connection 2',DATEDIFF(SECOND,@sdt,@edt)
    

    Consider specifying a smaller batch size (e.g. BATCHSIZE = 50000) with large files as mentioned in the performance considerations section of the BULK INSERT documentation.
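
    For example, a sketch using the same file and table as your test:

    -- Commit every 50,000 rows instead of loading the whole file as one transaction
    bulk insert tempdb.dbo.ParallelBulkTest
    from 'D:\Data\Temp\BulkLoad\DataIn1.txt'
    with (fieldterminator = '|', rowterminator = '\n', FIRSTROW = 1, BATCHSIZE = 50000);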


5 additional answers

  1. JCE H 41 Reputation points
    2021-12-09T20:04:05.767+00:00

    Hi Dan,

    Thanks for that. Is there an easy way to set up separate connections within a proc?

    I might be answering my own question here, but on a whim I thought I would try BCP IN. Now, I fully expected BCP IN to be slower than BULK INSERT, but so far it has matched BULK INSERT at 60 seconds for a single large file, or absolutely blitzed it in under 15 seconds. The latter was achieved by creating a batch file and then triggering that with xp_cmdshell. Unfortunately there's a load of stuff going on on the server, so it is hard to isolate exactly what might be affecting my longer run times.

    RESULTS OF TESTING (durations in seconds):

    ID  Description                         r1   r2   r3   r4   r5
    1   2 BULK INSERT statements as 1 exec  60   60   60   60   60
    2   2 BULK INSERT statements separate   120  60   60   60   60
    3   BULK INSERT of 1 big file           60   60   120  60   60
    4   BCP 1 big file                      0    60   0    0    0
    5   BCP 2 files simultaneously (?!)     0    0    0    0    60

    Total elapsed time:
    r1  ?!
    r2  ?!
    r3  3m 49s
    r4  3m 50s
    r5  3m 38s

    SCRIPT FOR BCP PARTS.

    DECLARE @results TABLE(id tinyint IDENTITY(1,1),d varchar(255),Duration int)
    DECLARE @wert smalldatetime, @edt smalldatetime -- NB: smalldatetime resolves to the minute, hence the 0/60/120 readings
    DECLARE @bcpquery nvarchar(max), @CMD varchar(8000)
    DECLARE @Testta sysname = @@SERVERNAME   -- server name for bcp's /S switch (assumed here)
    DECLARE @filepath varchar(255) = 'D:\Data\Temp\BulkLoad\'

    PRINT '1) 2 statements as 1 exec'
    BEGIN
    SET @wert = getdate()

    SET @bcpquery = '
    bulk insert tempdb.dbo.ParallelBulkTest from ''D:\Data\Temp\BulkLoad\DataIn1.txt'' with (fieldterminator = ''|'', rowterminator = ''\n'',FIRSTROW=1);
    bulk insert tempdb.dbo.ParallelBulkTest from ''D:\Data\Temp\BulkLoad\DataIn2.txt'' with (fieldterminator = ''|'', rowterminator = ''\n'',FIRSTROW=1);
    '
    execute sp_executesql @bcpquery;

    SET @edt = getdate()

    INSERT INTO @results(d,Duration)
    SELECT '2 BULK INSERT statements as 1 exec',DATEDIFF(SECOND,@wert ,@edt)

    END

    SELECT COUNT(*) as Records FROM ParallelBulkTest

    ------------------------------------------------------------------

    TRUNCATE TABLE ParallelBulkTest

    PRINT '2) 2 statements separate'
    BEGIN
    SET @wert = getdate()

    bulk insert tempdb.dbo.ParallelBulkTest from 'D:\Data\Temp\BulkLoad\DataIn1.txt' with (fieldterminator = '|', rowterminator = '\n',FIRSTROW=1);
    bulk insert tempdb.dbo.ParallelBulkTest from 'D:\Data\Temp\BulkLoad\DataIn2.txt' with (fieldterminator = '|', rowterminator = '\n',FIRSTROW=1);

    SET @edt = getdate()
    INSERT INTO @results(d,Duration)
    SELECT '2 BULK INSERT statements separate',DATEDIFF(SECOND,@wert ,@edt)

    END

    SELECT COUNT(*) as Records FROM ParallelBulkTest

    ------------------------------------------------------------------

    TRUNCATE TABLE ParallelBulkTest

    PRINT '3) 1 big file'

    BEGIN
    SET @wert = getdate()

    bulk insert tempdb.dbo.ParallelBulkTest from 'D:\Data\Temp\BulkLoad\DataIn0.txt' with (fieldterminator = '|', rowterminator = '\n',FIRSTROW=1);

    SET @edt = getdate()
    INSERT INTO @results(d,Duration)
    SELECT 'BULK INSERT of 1 big file',DATEDIFF(SECOND,@wert ,@edt)

    END

    SELECT COUNT(*) as Records FROM ParallelBulkTest

    ------------------------------------------------------------------

    TRUNCATE TABLE ParallelBulkTest

    PRINT '4) BCP 1 big file'

    BEGIN
    SET @wert = getdate()

    SET @CMD = 'bcp "tempdb.dbo.ParallelBulkTest" IN "D:\Data\Temp\BulkLoad\DataIn0.txt" -c -t "|" -F 1 -h "TABLOCK" /S "' + @Testta + '" -T'
    exec xp_cmdshell @CMD , no_output

    SET @edt = getdate()
    INSERT INTO @results(d,Duration)
    SELECT 'BCP 1 big file',DATEDIFF(SECOND,@wert ,@edt)

    END

    SELECT COUNT(*) as Records FROM ParallelBulkTest

    ------------------------------------------------------------------

    TRUNCATE TABLE ParallelBulkTest

    PRINT '5) BCP 2 files simultaneously'

    BEGIN
    SET @wert = getdate()

    SET @CMD = 'START /b bcp.exe "tempdb.dbo.ParallelBulkTest" IN "D:\Data\Temp\BulkLoad\DataIn1.txt" -c -t "|" -F 1 -h "TABLOCK" /S "' + @Testta + '" -T
    START /b bcp.exe "tempdb.dbo.ParallelBulkTest" IN "D:\Data\Temp\BulkLoad\DataIn2.txt" -c -t "|" -F 1 -h "TABLOCK" /S "' + @Testta + '" -T' ---b 10000
    --PRINT @CMD

    SELECT @CMD as c INTO ##A

    SET @CMD = 'bcp "##A" out "' + @filepath + 'Runbat.bat" /c /t "|" /S "' + @Testta + '" -T'
    exec xp_cmdshell @CMD , no_output

    DROP TABLE ##A

    SET @CMD = 'CALL "D:\Data\Temp\BulkLoad\Runbat.bat"'
    exec xp_cmdshell @CMD , no_output

    SET @edt = getdate()
    INSERT INTO @results(d,Duration)
    SELECT 'BCP 2 files simultaneously',DATEDIFF(SECOND,@wert ,@edt)

    END

    SELECT COUNT(*) as Records FROM ParallelBulkTest

    SELECT * FROM @results

    ------------------------------------------------------------------


  2. Dan Guzman 9,401 Reputation points
    2021-12-09T21:38:08.153+00:00

    Is there an easy way to setup separate connections within a proc?

    Although one can technically have multiple connections in a SQLCLR proc, it would also need to implement multi-threading to run the bulk inserts in parallel. But don't bother going there, since it would require an UNSAFE assembly and, as you've learned, a client application (BCP, SSIS, etc.) provides more options for scale-out and/or multi-threading.

    You could run multiple instances of BCP in parallel within a single batch file using START commands but those will run asynchronously and the batch file would return while they are still running unless you take special measures.
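
    One possible "special measure" (a sketch, assuming xp_cmdshell is enabled, @Testta holds the server name, and no unrelated bcp.exe processes are running on the box): have the generated batch file poll tasklist until every bcp.exe has exited, so the CALL only returns once both loads are done.

    -- Sketch: contents for Runbat.bat; write @CMD out to the .bat exactly as in
    -- your script above, then CALL it via xp_cmdshell. ping serves as a crude
    -- 1-second delay (timeout fails under xp_cmdshell's redirected input), and
    -- the tasklist check assumes no other bcp.exe is running on the machine.
    DECLARE @Testta sysname = @@SERVERNAME, @CMD varchar(8000);
    SET @CMD = 'START /b bcp.exe "tempdb.dbo.ParallelBulkTest" IN "D:\Data\Temp\BulkLoad\DataIn1.txt" -c -t "|" -h "TABLOCK" -S "' + @Testta + '" -T
    START /b bcp.exe "tempdb.dbo.ParallelBulkTest" IN "D:\Data\Temp\BulkLoad\DataIn2.txt" -c -t "|" -h "TABLOCK" -S "' + @Testta + '" -T
    :wait
    ping -n 2 127.0.0.1 >nul
    tasklist /FI "IMAGENAME eq bcp.exe" | find /I "bcp.exe" >nul && goto wait';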


  3. LiHong-MSFT 10,056 Reputation points
    2021-12-10T05:42:12.963+00:00

    Hi @JCE H,
    You can achieve parallel bulk insert with SSIS.
    SSIS has two properties, MaxConcurrentExecutables and EngineThreads. By setting these two properties in your SSIS package, the tasks in the package will be executed in parallel.
    Following are two blogs for your reference:
    Parallel Execution of Tasks in SSIS
    Parallel execution in SSIS

    Best regards,
    LiHong



  4. JCE H 41 Reputation points
    2021-12-10T10:31:20.687+00:00

    Yes, I know about SSIS. Don't go there, it is too rigid. I think Azure can handle ever-changing columns, but I've not had a chance to play with Azure yet.

    Moving on, I spotted I had missed the TABLOCK off the bulk insert statements. Adding that in makes it comparable to BCP. I tried changing the batchsize to 50k records but that made no difference.
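
    For anyone following along, from r7 onwards the statements looked like this (a sketch of run 3's statement with TABLOCK added):

    -- TABLOCK takes a bulk-update table lock, which is what lets the load
    -- qualify for the bulk-load optimisations
    bulk insert tempdb.dbo.ParallelBulkTest
    from 'D:\Data\Temp\BulkLoad\DataIn0.txt'
    with (fieldterminator = '|', rowterminator = '\n', FIRSTROW = 1, TABLOCK);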

    It looks like BULK INSERT of 1 large file is a) easy, b) efficient. It optimises itself pretty well unless you've got particularly odd data, and spreads itself across all processors nicely. Maybe older versions of SQL weren't so good, but from SQL 2016 onwards it really is pretty clever if you have your hardware set up correctly. Well done to whoever worked on that!

    RESULTS (durations in seconds; TABLOCK added from r7, BATCHSIZE 50000 added from r10):

    ID  Description                         r1   r2   r3   r4   r5   r6   r7   r8   r9   r10  r11  r12
    1   2 BULK INSERT statements as 1 exec  60   60   60   60   60   60   60   60   0    0    0    0
    2   2 BULK INSERT statements separate   120  60   60   60   60   60   0    0    0    0    60   60
    3   BULK INSERT of 1 big file           60   60   120  60   60   60   0    0    0    60   0    0
    4   BCP 1 big file                      0    60   0    0    0    0    0    0    60   0    0    0
    5   BCP 2 files simultaneously (?!)     0    0    0    0    60   0    60   60   0    0    60   60

    Total elapsed time:
    r1   ?!
    r2   ?!
    r3   3m 49s
    r4   3m 50s
    r5   3m 38s
    r6   3m 47s
    r7   1m 19s  -- TABLOCK added to the BULK INSERT statements here (forgot it before!)
    r8   1m 18s
    r9   1m 18s
    r10  1m 44s  -- BATCHSIZE 50000 added
    r11  1m 35s
    r12  1m 39s

    In short, that's 100m rows imported in under 2 minutes and that includes stopping to count 5 lots of 20 million rows after each insert and inserting the results into a table.

