Does ADO.NET update batching really do something?

I’ve gotten this same question several times in the last few weeks, so I thought it would be a good idea to put together some details on the topic. Update batching is a new ADO.NET 2.0 feature that, when enabled, introduces an optimization to the process of executing a DataAdapter.Update call (which basically takes the changes from a DataTable and pushes them into a database using the INSERT/UPDATE/DELETE commands).

Without batching, Update walks through each change in the DataTable and for each one of them it executes the corresponding command (obtained through the InsertCommand, UpdateCommand and DeleteCommand properties of the adapter). Each execution is separated from the next one, so your application keeps going back and forth over the network between the client and the server.

With batching, we can take a bunch of these operations and send them in a single round trip. This optimizes the operation in a number of ways, the most notorious one being having less network round-trips, which saves quite a bit of overhead in most (although not all) cases.

Batching implementation strategy #1
One way of implementing batching is to concatenate a bunch of SQL statements together. For example, if you have to do an insert and an update, you can simply put a batch together like this (in T-SQL):

INSERT INTO Person (id, name) VALUES (1, 'John');
UPDATE Person SET name = 'Juan' WHERE id=2

That will execute both statements in a single shot. However, the thing gets tricky if you want to do it in a generic way. For starters, we have to use parameters, because the data-adapter needs to replace the values for INSERT/UPDATE/DELETE for each particular DataRow. So this becomes:

INSERT INTO Person (id, name) VALUES (@id, @name);
UPDATE Person SET name = @name WHERE id=@id

Now this doesn’t really work, because we’re using the same parameter names, but we actually want different values (the INSERT line wants the values from one DataRow, and the UPDATE line from another DataRow). So we have to rename the parameters so they are unique. Renaming parameters means that before the adapter executes any command, it has to parse the SQL statement (at least enough parsing to isolate parameter markers) and replace the original parameter names with new ones so their names don’t clash. That’s bad, parsing SQL statements is problematic and the operation consumes CPU and causes more allocations.

There are also two more subtle issues with this approach:
1. It turns out that some databases such as SQL Server have an upper limit to the number of parameters that you can pass to a single parametrized statement invocation. In the case of SQL Server this is around 2100. That means that your batch cannot sum up more than those parameters total (not per statement). Since each (INSERT/UPDATE/DELETE) command can have a different number of parameters, it’s hard to predict how many rows you’ll be able to process.

2. The second side-effect has to do with plan caching. Database servers usually cache query plans for efficiency. Batching can cause trouble here. For example, let’s say you have the two commands we used above for INSERT and UPDATE setup in a data-adapter. In one call, you have 1 updated row and 2 inserted rows, so your batch becomes (simplified version):

UPDATE Person SET name = @name1 WHERE id=@id1;
INSERT INTO Person (id, name) VALUES (@id2, @name2);
INSERT INTO Person (id, name) VALUES (@id3, @name3);

Now, if right after that you get another request with the exact same statements, but with a table that has 1 insert followed by 1 update, the batch will be:
INSERT INTO Person (id, name) VALUES (@id1, @name1);
UPDATE Person SET name = @name2 WHERE id=@id2;

That is, for every permutation of changes in a DataTable we’ll produce a different batch. That will bloat the server’s procedure cache introducing inefficiencies. This is the approach that we used for batching in for both SqlClient and OracleClient during Whidbey beta 1. In beta 2 OracleClient still uses this technique (mapped to its PL-SQL equivalent), but SqlClient doesn’t.

Batching implementation strategy #2
The other way of doing batching is “simpler”, but happens at a lower level. I’m going to describe the SQL Server implementation here, but other database servers might have a similar feature. At the client-server protocol level (TDS), each command is executed against the server using a “request”. For parametrized and batched statements, this request is an “RPC request” (nothing to do with the DCE RPC, just same name). It turns out that in SQL Server we can put one RPC request on the network, and right before putting the “we’re done” mark on the wire, we can actually put a “here comes another RPC” mark. This way, we can keep sending RPCs over the wire all together. Once we sent all of them, we switch to listen for responses, and we get the responses for each RPC in the order we sent them. The protocol has the appropriate infrastructure that allows us to distinguish when a response for a given RPC ends and the next starts.

This means that we get rid of the following problems:
- We can send multiple request for statement execution in a single network round-trip, which was the original goal
- No need to parse statements and rename parameters, because each statement goes in a different request
- No procedure cache bloat, because regardless of the order of execution, each row has an RPC of its own
- No limit to the number of rows that can be updated with a single batch, because we don’t depend on lots of parameters for a single RPC

There are some minor specific drawbacks to RPC-level batching, such as paying a little bit of extra cost in the server per-RPC execution, but the benefits above in general out-weigh those.

Sure, but is it really batching?
With all this background in place, now I can tackle the initial question: does batching really work? How can I see it working?
If you have a SqlDataAdapter and set the UpdateBatchSize property to 0 (meaning all changes in a single round-trip) or a value greater than 1 (indicating the number of rows per batch), we’ll enable batching. The first thing you’ll want to do is use SQL Profiler to see how the statements are being batched. However, you’ll find out that you still see one RPC execution for each DataRow that needs an INSERT/UPDATE/DELETE.

For example, let’s take this trivial code snippet that uses batching:

DataTable t = new DataTable();
t.Columns.Add("a", typeof(int));
// don't hardcode connection strings in your apps, this is just a quick sample :)
// pre-create the table with "CREATE TABLE t (a INT)"
using(SqlConnection conn = new SqlConnection("server=tcp:localhost; integrated security=true; database=test")) {
SqlDataAdapter da = new SqlDataAdapter(null, conn);
da.InsertCommand = new SqlCommand("INSERT INTO t VALUES (@a)", conn);
da.InsertCommand.Parameters.Add("@a", SqlDbType.Int, 0, "a");
da.InsertCommand.UpdatedRowSource = UpdateRowSource.None;
da.UpdateBatchSize = 10;

The filtered SQL profiler output will look like this:

SQL Profiler Trace

What’s happening there is that you effectively have a separate RPC for each statement, so the profiler is accurate. Batching is actually happening at a lower level in SQL Server. If you really want to see batching working, you’ll have to use a lower-level tool such as “netmon” or some other network monitor. Using netmon you would be able to see that all the RPCs are effectively sent together, and then all the responses are also sent together by the server. For example, for this trivial program that does 2 inserts in a single batch, here is the netmon trace for the packet that contains the execution:
00000000 00 06 5B BD 30 E5 00 30 6E 4C 31 63 08 00 45 00 ..[.0..0nL1c..E.
00000010 01 20 B9 EC 40 00 80 32 11 13 9D 37 79 43 9D 37 <....@..2...7yC.7>
00000020 7A FA E9 BA 57 A2 00 00 01 0D 12 1F 05 99 83 57 z...W..........W
00000030 47 89 AF 31 33 AF 50 18 42 C8 36 94 00 00 03 01 G..13.P.B.6.....
00000040 00 DB 00 00 01 00 FF FF 0A 00 00 00 00 00 E7 32 ...............2
00000050 00 09 04 D0 00 34 32 00 49 00 4E 00 53 00 45 00 .....42.I.N.S.E.
00000060 52 00 54 00 20 00 49 00 4E 00 54 00 4F 00 20 00 R.T...I.N.T.O...
00000070 74 00 20 00 56 00 41 00 4C 00 55 00 45 00 53 00 t...V.A.L.U.E.S.
00000080 20 00 28 00 40 00 61 00 29 00 00 00 E7 0C 00 09 ..(.@.a.).......
00000090 04 D0 00 34 0C 00 40 00 61 00 20 00 69 00 6E 00 <...4..@.a...i.n>.
000000A0 74 00 02 40 00 61 00 00 26 04 04 01 00 00 00 80 t..@.a..&.......
000000B0 FF FF 0A 00 00 00 00 00 E7 32 00 09 04 D0 00 34 .........2.....4
000000C0 32 00 49 00 4E 00 53 00 45 00 52 00 54 00 20 00 2.I.N.S.E.R.T...
000000D0 49 00 4E 00 54 00 4F 00 20 00 74 00 20 00 56 00 I.N.T.O...t...V.
000000E0 41 00 4C 00 55 00 45 00 53 00 20 00 28 00 40 00 A.L.U.E.S...(.@.
000000F0 61 00 29 00 00 00 E7 0C 00 09 04 D0 00 34 0C 00 a.)..........4..
00000100 40 00 61 00 20 00 69 00 6E 00 74 00 02 40 00 61 @.a...i.n.t..@.a
00000110 00 00 26 04 04 02 00 00 00 01 02 03 04 05 06 07 ..&.............
00000120 07 06 90 CE 9B 4E 42 21 15 E4 56 DD 44 E4 .....NB!..V.D.

As you can see, both inserts are in the same network packet. No need to understand the whole thing – it’s enough to notice that there are two insert statements in the payload.I hope this helps clarify doubts around the topic. If you have further questions feel free to post them here and we’ll take a look.

Pablo Castro
Program Manager - ADO.NET Team
Microsoft Corp.

Disclaimer: This posting is provided "AS IS" with no warranties, and confers