Two Fast Ways to Bulk Insert Client Generated Data to SQL Database

 

Bulk loading client-side data into a SQL database is a common task in software development. Sometimes the data is client generated and cached. When the data accumulates to a specific size or over a specific period of time, it's then bulk loaded into the database; web server logs or system events are typical examples.

 

On the Microsoft platform, several approaches can be used to achieve this goal very effectively. Here are two of them: using the ADO.Net DataTableReader class, and using ADO.Net to construct dynamic TSQL. Both approaches use a SQL Server 2008 feature, the table-valued parameter (TVP), as the vehicle to transmit data to SQL Server. Both are fairly easy to implement and have good performance results, which are detailed later.

 

First is the setup on the SQL side that will be used by the client ADO.Net code. We will create a database, a table, a TVP type and a procedure. The procedure takes a TVP parameter passed in from the client and populates the table. Here's the script:

 

CREATE database DemoDB;

GO

CREATE table dbo.MyTvpTargetTable

(

   c1 int

  ,c2 int

)

GO

CREATE TYPE dbo.MyTVP AS TABLE

(

   c1 int

  ,c2 int

)

GO

Create proc MyProc (@tvp dbo.MyTVP readonly)

as

begin

   insert into MyTvpTargetTable (c1,c2)

   select c1, c2 from @tvp

end

GO

 

 

Now for the first approach: using an ADO.Net DataTableReader with a TVP.

 

In a nutshell, a DataTable is used to cache our accumulated data. Once it reaches a certain size, we use a DataTableReader to read it. We also create a SqlCommand whose CommandType is StoredProcedure. The SqlCommand has a SqlParameter whose value is the DataTableReader. The code goes like this:

 

    static void InsertWithDataTableReader(int rowCount)

    {

      // used to time how fast this function runs later

      DateTime dt = DateTime.Now;

     

      // Create a new DataTable

      DataTable table = new DataTable("tvp");

      // Declare variables for DataColumn and DataRow objects.

      DataColumn column;

      DataRow row;

      // Create new DataColumn, set DataType,

      // ColumnName and add to DataTable.

      column = new DataColumn();

      column.DataType = System.Type.GetType("System.Int32");

      column.ColumnName = "c1";

      column.ReadOnly = true;

      // Add the Column to the DataColumnCollection.

      table.Columns.Add(column);

      // Create second column.

      column = new DataColumn();

      column.DataType = System.Type.GetType("System.Int32");

      column.ColumnName = "c2";

      column.ReadOnly = true;

      // Add the Column to the DataColumnCollection.

      table.Columns.Add(column);

      // Imagine how you can accumulate your data in this DataTable.

      // When it reaches a size, we will bulk load to SQL

      for (int i = 1; i <= rowCount; i++)

      {

        // Use a DataRow object and add

        // it to the DataTable.

        row = table.NewRow();

        row["c1"] = i;

        row["c2"] = -i;

        table.Rows.Add(row);

      }

      // To read the DataTable and to be passed in as value of the SqlParameter

      DataTableReader dtReader = new DataTableReader(table);

      SqlConnection sqlconnection = new SqlConnection("Persist Security Info=False;Integrated Security=true;Initial Catalog=DemoDB;Server=(local)");

      sqlconnection.Open();

      // Configure the SqlCommand and table-valued parameter.

      SqlCommand insertCommand = new SqlCommand("MyProc", sqlconnection);

      insertCommand.CommandType = CommandType.StoredProcedure;

      SqlParameter tvpParam =

         insertCommand.Parameters.AddWithValue(

         "@tvp", dtReader);

      tvpParam.SqlDbType = SqlDbType.Structured;

      // Execute the command.

      insertCommand.ExecuteNonQuery();

      // Close the connection when done.
      sqlconnection.Close();

      //Time how fast we run this function

      Console.WriteLine(DateTime.Now - dt);

    }

This function can be used in a console app. You will also need these using directives:

 

using System.Data.SqlClient;

using System.Data.Sql;

using System.Data.Common;

using System.Data;

using System.Text;   // for the StringBuilder used in the dynamic SQL approach below
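
As a side note, a parameter of SqlDbType.Structured also accepts the DataTable itself (or an IEnumerable of SqlDataRecord), so the DataTableReader step is optional. Here's a minimal sketch of that variation; the InsertWithDataTable helper is mine and not part of the original sample:

    // A variation, not part of the original sample: pass the DataTable directly
    // as the table-valued parameter instead of wrapping it in a DataTableReader.
    static void InsertWithDataTable(DataTable table)
    {
      using (SqlConnection conn = new SqlConnection("Persist Security Info=False;Integrated Security=true;Initial Catalog=DemoDB;Server=(local)"))
      using (SqlCommand cmd = new SqlCommand("MyProc", conn))
      {
        cmd.CommandType = CommandType.StoredProcedure;
        SqlParameter tvpParam = cmd.Parameters.AddWithValue("@tvp", table);
        tvpParam.SqlDbType = SqlDbType.Structured;
        conn.Open();
        cmd.ExecuteNonQuery();
      }
    }

Either form works with the MyProc procedure defined earlier; the rest of this post sticks with the DataTableReader version.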

My test method calls InsertWithDataTableReader ten times, and each call "accumulates" 1,000 rows, so we have a test load of 10,000 rows:

 

    static void Main(string[] args)

    {

      DateTime dt = DateTime.Now;

      for (int i = 0; i < 10; i++)

          InsertWithDataTableReader(1000);

           // InsertWithDynamicSQL(1000);

      Console.Write("total run time: ");

      Console.WriteLine(DateTime.Now - dt);

      Console.Read();

    }

 

It took 236 ms on my box to insert 10,000 rows (10 batches x 1,000 rows). That's 42,373 rows per second. Pretty fast. This approach scales very well too. I tried a few different batch sizes over a few rounds; the performance is consistent, and throughput improves as the batch size increases. Since the data is streamed, as the data volume grows, the overhead of the various client-side .Net classes is diluted. The throughput ramps up after reaching 500,000 rows and stays above 110,000 rows per second. See the comparison below.

 

 

Rows       Seconds   Rows / s
10000      0.236     42372.88
50000      0.578     86505.19
100000     0.972     102880.66
500000     4.4       113636.36
1000000    8.7       114942.53
5000000    44.5      112359.55

 

 

On the SQL side, it's pretty calm. There's no CPU time spent parsing and compiling; this is a proc call, after all. Most CPU time is used to insert the data. Inserting one million rows takes 4,413 ms of CPU time. Not a whole lot of ticks, and that seems very reasonable to me.

 

One thing worth mentioning in the profiler trace is the TextData of the RPC:Completed event of the proc call. It looks like this and is misleading:

 

declare @p1 dbo.myTVP

insert into @p1 values(1,-1)

insert into @p1 values(2,-2)

insert into @p1 values(3,-3)

insert into @p1 values(4,-4)

insert into @p1 values(5,-5)

......

insert into @p1 values(1000,-1000)

exec MyProc @tvp=@p1

 

This looks like a bunch of singleton calls, but it's not. As my colleague Keith Elmore at Microsoft CSS pointed out, that is just the convention used to allow the data to be displayed or copied/pasted into a query window and run. The TVP isn't actually implemented that way; the rows are streamed across just like a bulk insert. The trace text simply gives the impression that it does simple inserts.

 

Now for the second approach: using ADO.Net to construct dynamic TSQL.

 

This approach is often frowned upon because of SQL injection concerns. Its performance is okay compared to the first approach, but it puts more weight on the SQL Server side and does not scale well. I list it here as a side-by-side comparison with the first approach.

 

I am not going to elaborate much. Basically, in ADO.Net we construct a dynamic SQL batch to be executed on the SQL Server side. There's not much work on the client except the string concatenation. So what kind of string should we build? Obviously, you don't want to build a bunch of singleton inserts like the ones rendered in the profiler trace above; that would run very slowly.

 

So how about a single VALUES set with all the rows like this?

 

declare @p1 dbo.myTVP;

insert into @p1

select c1, c2 from (values(1,-1),(2,-2),(3,-3),......,(3000,-3000))

as alias (c1,c2)

;exec myproc @tvp=@p1

 

It works but doesn't scale. When I try a batch size of 10,000 rows, my console app times out. So a compromise is to build batches of 1,000 rows and UNION ALL them together:

 

declare @p1 dbo.myTVP;

insert into @p1

select c1, c2 from (values(1,-1),(2,-2),(3,-3),......,(1000,-1000))

as alias (c1,c2)

union all

select c1, c2 from (values(1001,-1001),(1002,-1002),(1003,-1003),......,(2000,-2000))

as alias (c1,c2)

union all

select c1, c2 from (values(2001,-2001),(2002,-2002),(2003,-2003),......,(3000,-3000))

as alias (c1,c2)

;exec myproc @tvp=@p1

 

The C# function to build such a string to be passed to SQL can be seen here:

 

    static void InsertWithDynamicSQL(int rowCount)

    {

      DateTime dt = DateTime.Now;

      StringBuilder sql = new StringBuilder("declare @p1 dbo.myTVP; insert into @p1 select c1, c2 from (values");

      for (int i = 1; i <= rowCount; i++)

      {

        sql.Append("(");

        sql.Append(i.ToString() + "," + (-i).ToString());

        sql.Append(")");

        if (i == rowCount)

        {

          sql.Append(") as alias (c1,c2)");

          break;

        }

        if (i % 1000 == 0)

          sql.Append(") as a1 (c1,c2) union all select c1, c2 from (values");

        else

          sql.Append(",");

      }

      sql.Append(";exec myproc @tvp=@p1");

      SqlConnection sqlconnection = new SqlConnection("Persist Security Info=False;Integrated Security=true;Initial Catalog=DemoDB;Server=(local)");

      sqlconnection.Open();

      // Configure the SqlCommand and table-valued parameter.

      SqlCommand insertCommand = new SqlCommand(sql.ToString(), sqlconnection);

      insertCommand.CommandType = CommandType.Text;

     

      // Execute the command.

      insertCommand.ExecuteNonQuery();

      // Close the connection when done.
      sqlconnection.Close();

      Console.WriteLine(DateTime.Now - dt);

    }

This approach puts a lot of pressure on the SQL side in the form of high CPU usage. Recall that in the first approach, the CPU time used to insert one million rows was a mere 4.4 seconds. In this dynamic SQL approach, just parsing and compiling a 10,000-row batch takes 4.4 seconds. We are not even executing the batch yet!

 

Here's the parse and compile time (no execution time) for different batch sizes. I can't go any higher because the command times out. You get the idea of how poorly this scales.

               

Rows     Parse + compile time
1000     317 ms
10000    4400 ms
30000    13860 ms
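
As an aside, if you don't have a Profiler trace handy, one way to see parse/compile and execution CPU numbers from the client is to turn on SET STATISTICS TIME for the batch and listen for the informational messages SQL Server sends back. This is a sketch of mine, not part of the original test code, and the RunWithStatisticsTime helper name is made up:

    // A sketch, not part of the original test harness: run a T-SQL batch with
    // SET STATISTICS TIME ON and print the parse/compile and execution CPU
    // messages that SQL Server returns as informational messages.
    static void RunWithStatisticsTime(string batch)
    {
      using (SqlConnection conn = new SqlConnection("Persist Security Info=False;Integrated Security=true;Initial Catalog=DemoDB;Server=(local)"))
      {
        // STATISTICS TIME output arrives through the InfoMessage event.
        conn.InfoMessage += (sender, e) => Console.WriteLine(e.Message);
        conn.Open();

        using (SqlCommand cmd = new SqlCommand("SET STATISTICS TIME ON; " + batch, conn))
        {
          cmd.CommandType = CommandType.Text;
          cmd.ExecuteNonQuery();
        }
      }
    }

Passing one of the dynamic SQL batches shown above to this helper prints the "parse and compile time" and "Execution Times" messages with their CPU milliseconds.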

 

While compilation is a resource (CPU) intensive operation, the compiled execution plans here are mostly used just once. Two batches must match character for character for an execution plan to be reused, and since our real-world data is constantly changing, it's virtually impossible to get two identical batches.
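
If you want to see this plan-cache churn for yourself, a rough check is to count single-use ad hoc plans after a run of the dynamic SQL approach. This is my own sketch using the standard sys.dm_exec_cached_plans DMV (it requires VIEW SERVER STATE permission):

    // A sketch, not from the original post: count single-use ad hoc plans in the
    // plan cache, which is where each unique dynamic SQL batch ends up.
    static void PrintSingleUsePlanCount()
    {
      using (SqlConnection conn = new SqlConnection("Persist Security Info=False;Integrated Security=true;Initial Catalog=DemoDB;Server=(local)"))
      using (SqlCommand cmd = new SqlCommand(
        "SELECT COUNT(*) FROM sys.dm_exec_cached_plans WHERE objtype = 'Adhoc' AND usecounts = 1;", conn))
      {
        conn.Open();
        Console.WriteLine("Single-use ad hoc plans: {0}", cmd.ExecuteScalar());
      }
    }

Running it before and after a test and comparing the counts gives a rough sense of how many one-off plans the dynamic batches left behind.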

 

Summary

 

I think we have a clear winner here from all perspectives.

 

The first approach, with DataTable, DataTableReader and a TVP, is robust, fast and scales well. It uses far fewer resources and puts less strain on SQL Server. This is important for scaling out: SQL Server can have multiple clients all streaming data in without bringing down the database.

 

If you have any better solution, show it off! Otherwise, leave a comment and let me know if this post is helpful.

 

Happy coding!

 

Mike Chang (Chien-Shen)

SQL SDE, MSDN