Load 1TB in less than 1 hour

OVERVIEW

 

This project was done using SQL Server 2005 Enterprise Edition (with SP1 beta), using the BULK INSERT command with 60 parallel input files, each approximately 16.67GB.  The best time I achieved was 50 minutes.  Hewlett Packard was kind enough to loan me some time one night on the machine they use to run TPC-C tests.  It is an Itanium2 Superdome with 64 1.6GHz (Madison, 9MB cache) CPUs.  For the insert tests the system was equipped with 256GB of RAM.  The HP SAN configuration is rated at 14GB per second throughput.  You can find the specific configurations at https://tpc.org/tpcc/results/tpcc_result_detail.asp?id=103082701.  (Note: the configurations at this site may be updated periodically.)

BEST PRACTICES and LESSONS LEARNED

Here is a list of things we learned during this project that should be helpful in your projects.

  1. Run as many load processes as you have available CPUs.  If you have 32 CPUs, run 32 parallel loads.  If you have 8 CPUs, run 8 parallel loads. 
  2. If you have control over the creation of your input files, create one file per load thread and make the files roughly equal in size so that every thread finishes at about the same time.  Also make sure all records in a given file belong to a single partition if you want to use the partition-switch strategy.
  3. Use BULK INSERT instead of BCP if you are running the process on the SQL Server machine. 
  4. Use table partitioning to gain another 8-10%, but only if your input files are GUARANTEED to match your partitioning function, meaning that all records in one file must be in the same partition.
  5. Use TABLOCK to avoid row-at-a-time locking.
  6. Use ROWS_PER_BATCH = 2500, or something near this, if you are importing multiple streams into one table (a sketch of such a command follows this list).
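
To make items 5 and 6 concrete, here is a minimal sketch of the kind of BULK INSERT command each load thread runs.  The table name, file path, and exact option values are placeholders rather than the precise commands used in these tests; ROWS_PER_BATCH is the BULK INSERT spelling of the "ROWS PER BATCH = 2500" setting discussed in this article.

    BULK INSERT dbo.LINEITEM_STAGE_01
    FROM 'E:\flatfiles\lineitem.tbl.01'
    WITH (
        DATAFILETYPE    = 'char',     -- character-mode input files
        FIELDTERMINATOR = '|',        -- pipe-delimited columns
        TABLOCK,                      -- one table-level lock instead of row locks
        ROWS_PER_BATCH  = 2500        -- batch hint when multiple streams share one table
    );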

TEST SETUP

The input files were on the same SAN as the database, but on different drive arrays.  There were 12 files per drive array, so it took 5 separate disk arrays to hold all 60 input files.  The database had two filegroups and 96 database files.  The table used for the inserts was LINEITEM, which resided in its own filegroup placed on the outer tracks of the disks.

 

The database was using the BULK_LOGGED recovery model, so there was very little transaction log activity.  Extent allocations and related metadata are logged, but not the individual page changes or the individual record inserts.
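
For reference, putting the database into bulk-logged recovery for the load window looks like this (the database name is a placeholder):

    -- Minimal logging for bulk operations during the load window
    ALTER DATABASE LoadTest SET RECOVERY BULK_LOGGED;

    -- ... run the 60 parallel BULK INSERT jobs ...

    -- Afterwards, return to full recovery and take a log backup
    ALTER DATABASE LoadTest SET RECOVERY FULL;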

 

The input files were character mode, with the pipe (|) character as the column delimiter.  These flat files were generated using DBGen, a standard tool supplied by the TPC Council.  The flat files can be considered clean, i.e. no scrubbing or any other transformation was needed.  Rows were variable length, approximately 130 bytes on average.

 

The final setup included creating 60 jobs, each containing a single BULK INSERT command.  This was done so that the number of running commands could be controlled easily without keeping a lot of query windows open.  The sp_start_job procedure was used to start each individual job.
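
The job definitions themselves are not shown in this write-up; a sketch of one such job, built with the msdb job procedures and started with sp_start_job, might look like this (job, database, table, and file names are placeholders):

    USE msdb;
    GO
    -- One job per input file, each running a single BULK INSERT
    EXEC dbo.sp_add_job       @job_name = N'Load_LINEITEM_01';
    EXEC dbo.sp_add_jobstep   @job_name = N'Load_LINEITEM_01',
                              @step_name = N'bulk insert file 01',
                              @subsystem = N'TSQL',
                              @database_name = N'LoadTest',
                              @command = N'BULK INSERT dbo.LINEITEM_STAGE_01
                                           FROM ''E:\flatfiles\lineitem.tbl.01''
                                           WITH (FIELDTERMINATOR = ''|'', TABLOCK);';
    EXEC dbo.sp_add_jobserver @job_name = N'Load_LINEITEM_01';
    GO
    -- Start the job (repeated for each of the 60 jobs)
    EXEC dbo.sp_start_job @job_name = N'Load_LINEITEM_01';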

 

The first test loaded into a single table with ROWS_PER_BATCH = 2500.  Very little blocking was observed until 40 threads were running.  Because the batch size was 2,500, the locks were short-lived.  Querying one of the new Dynamic Management Views (DMVs) showed that the blocking was coming from waits on extent allocations in the single table.  Variations of this test consistently completed in about 65 minutes.  To avoid the extent allocation waits, the next tests were designed to use multiple tables.
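
The write-up does not record exactly which DMV was queried; one way to see this kind of blocking on SQL Server 2005 is to look at the waiting tasks, for example:

    -- What is each user session currently waiting on?
    SELECT session_id, wait_type, wait_duration_ms,
           blocking_session_id, resource_description
    FROM   sys.dm_os_waiting_tasks
    WHERE  session_id > 50;   -- skip system sessions
    -- Allocation contention typically shows up as PAGELATCH waits
    -- on the GAM/SGAM/PFS pages of the data files.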

 

The next round of tests used a separate table for each input file.  Each table had one partition, and at the end the partitions were switched into a single large table containing 60 partitions.  (Switching is a new operation introduced with SQL Server 2005 table partitioning.)  The switch is very fast because it is just a metadata change; our tests took 22 to 35 milliseconds per switch.  The difficulty was ensuring that every record in a given input file really belonged to that file's partition; a file could not contain records from any other range.  In this test the batch size made very little difference, since there was always exactly one BULK INSERT thread per table.  TABLOCK was important for avoiding lock-management overhead.

 

In the end, going to separate tables gave us about an 8-10% improvement in overall throughput.  This was the load scenario that took 50 minutes just to load the data into the tables.  There was not enough time to fully test the index build or the creation of the check constraints that validate the partitions; those steps are saved for a later test.

CPU BOTTLENECK

The limiting factor was CPU, not disk I/O.  The SAN configuration is rated at 14GB per second of throughput, and SQLIO was the tool used to measure the maximum throughput of the drives.  With all 60 BULK INSERT threads running, the maximum write activity was 333MB/sec, which is consistent with loading roughly 1TB in 50 minutes (about 1,000,000MB / 3,000 seconds ≈ 333MB/sec).  This gave a peak of slightly over 2.1 million bulk copy rows per second.  PERFMON was used to watch these counters.

 

One BULK INSERT thread would take over an idle CPU and consume 99% to 100% of it.  Subsequent jobs were very good at choosing another CPU and did not seem to schedule themselves on an already busy one.

 

TEST PROCESS

The first tests were run by starting one job at a time and letting it reach a steady state.  All measurements were then recorded and another job was started.  The increase in each of the counters being watched was very predictable, but not quite linear, meaning that each additional job ran at a rate just below the previous one.  In fact, the throughput curve turns out to be logarithmic and becomes very flat after about 60 concurrent jobs.

 

The first tests also used BCP, until I noticed that each BCP process was consuming a few percent of a CPU on its own; by 55 jobs, total CPU was at 100%.  I therefore switched to BULK INSERT so that the import threads ran inside the SQL Server process.  After switching to BULK INSERT, throughput kept increasing with each additional job up to 60 concurrent jobs, at which point total CPU was consistently over 95%.  From 61 to 64 jobs, throughput was roughly the same as at 60.
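
For comparison, a bcp invocation roughly equivalent to one of the BULK INSERT jobs is shown below (server, database, table, and file names are placeholders, and the exact switches used in the early tests were not recorded).  Each such invocation is a separate client process, which is where the extra CPU went.

    bcp LoadTest.dbo.LINEITEM_STAGE_01 in E:\flatfiles\lineitem.tbl.01 ^
        -S BIGBOX -T -c -t "|" -b 2500 -h "TABLOCK"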

 

During the first tests there were 96 files of approximately 8GB to 9GB each.  I could import the first 48 files in a very consistent 30 minutes and the other 48 files in another 30 minutes, as long as I did not run more than 60 jobs at a time.  Loading 1TB in 60 minutes is already a pretty good time. 

 

Once I saw that CPU was the limiting factor, I created 60 files of approximately 16.67GB each.  This gave the best run time of 50 minutes, but it was somewhat of a manufactured test because not every customer will have this fine-grained control over their input files.  As mentioned earlier, these input files exactly matched the final partitioning scheme designed for the large table.

 

The final test was to start all 60 jobs at once to make sure the CPU balancing behaved as desired.  It did: no two BULK INSERT threads ever ran on the same CPU.  The developers on the SQL product team tell me that they specifically check the UMS scheduler to try to keep only one bulk command per scheduler.  Of course, if you run more commands than you have CPUs, multiple bulk commands will be scheduled per CPU.

 

NOT FINISHED YET

At this point we had all the data in 60 separate tables.  This is a good scenario for anyone wanting to use local partitioned views, which are defined by creating a single view that does a UNION ALL over the tables.  In that case, we would be done with the load. 
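
For anyone stopping here, the local partitioned view is just a UNION ALL over the staging tables, roughly as follows (table names are placeholders); for the optimizer to eliminate partitions, each underlying table also needs a CHECK constraint on the partitioning column.

    CREATE VIEW dbo.LINEITEM_ALL
    AS
    SELECT * FROM dbo.LINEITEM_STAGE_01
    UNION ALL
    SELECT * FROM dbo.LINEITEM_STAGE_02
    UNION ALL
    -- ... one branch per staging table ...
    SELECT * FROM dbo.LINEITEM_STAGE_60;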

 

Customers that would like to use table partitioning have to take several more steps to get all these partitions into one table.  The steps are as follows:

  1. Create a partition function.
  2. Create a partition scheme on that function.
  3. Create a clustered index on each staging table.
  4. Create the check constraint that will be used for partitioning.
  5. Switch the partitions into the final large table.

There is a reason to create the clustered index before the check constraint: done the other way around, a full table scan would be needed to validate every row against the constraint.  With the clustered index already in place, the constraint validation only has to check the first and last key values to confirm that all rows qualify.
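
Below is a hedged sketch of the five steps for one of the 60 staging tables, assuming a hypothetical partitioning key (L_ORDERKEY, declared NOT NULL) and placeholder names and boundary values; the final table dbo.LINEITEM is assumed to have been created on the partition scheme with the same clustered index.

    -- 1. Partition function: 59 boundary values define 60 ranges
    CREATE PARTITION FUNCTION pf_lineitem (bigint)
    AS RANGE LEFT FOR VALUES (100000000, 200000000 /* ... remaining boundaries ... */);

    -- 2. Partition scheme: map every partition to the LINEITEM filegroup
    CREATE PARTITION SCHEME ps_lineitem
    AS PARTITION pf_lineitem ALL TO ([LINEITEM_FG]);

    -- 3. Clustered index on the staging table, matching the target table's index
    CREATE CLUSTERED INDEX cix_stage_01
    ON dbo.LINEITEM_STAGE_01 (L_ORDERKEY);

    -- 4. Check constraint proving every row fits partition 1 (RANGE LEFT: <= first boundary)
    ALTER TABLE dbo.LINEITEM_STAGE_01 WITH CHECK
    ADD CONSTRAINT ck_stage_01 CHECK (L_ORDERKEY <= 100000000);

    -- 5. Metadata-only switch of the staging table in as partition 1
    ALTER TABLE dbo.LINEITEM_STAGE_01
    SWITCH TO dbo.LINEITEM PARTITION 1;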

 

NEXT TESTS

There are plenty of other ideas for what to try next, but since I was just borrowing time at night between the TPC-H runs, I will have to wait until the machines are available again.  However, given system availability, Hewlett Packard has offered to continue with this line of tests.  The patterns are so predictable that the tests could be run on machines with fewer CPUs to observe the same patterns and timings.  Of course, you won't be able to load 1TB in less than an hour unless you have a lot of CPUs. 

 

  • Run BCP or BULK INSERT from other client machines to remove the need for SQL Agent and SQL Server to read the input files directly.
  • Move the input files to another computer to see what the load pattern looks like with the network involved.
  • Run using the SSIS Bulk Insert task, from other machines, to lower the CPU burden on the SQL Server machine.  Running it on the SQL Server machine would most likely have the same effect as using BCP. 
  • Create indexes on the loaded tables.
  • Run tests on smaller machines since we don’t expect all customers to have a 64 CPU machine.
  • Presort the input files and create the clustered index before loading the records; with presorted input, this should shorten the index-creation step.
  • See if NUMA settings have any impact.  It may seem that this would not help given a CPU bottleneck, but if any foreign memory is assigned to the nodes, access will be slower.  There is a way to avoid getting foreign memory; see https://blogs.msdn.com/slavao

Credits:   Thanks to Sunil Agarwal and his team from the Microsoft SQL Server product group and Mike Fitzner from Hewlett Packard for their assistance.

 

enjoy -- Kevin