Performance Testing the Microsoft OLE DB Provider for DB2 (Part 1)
Technorati Tags: DB2,SQL Server,SSIS,Microsoft OLE DB Provider for DB2,Host Integration Server,Data Providers,Performance,OLEDB
There are many factors that can affect pulling data from DB2 into SQL Server (or other applications): network speed, hard drive throughput, CPU speed, memory, number of CPUs, and the load on the DB2 or SQL Server machines. There are far too many factors to test exhaustively.
One thing we offer is the ability to pre-fetch data from the DB2 server before the consumer asks for it. This is normally only used by SSIS or SSAS, but other consumers can use it as well. It should ONLY be used when moving large amounts of data (we recommend millions of rows), though there are times when it can be useful in DQP applications (such as SQL Server’s Query Analyzer). The pre-fetch is controlled by a connection string parameter, “Rowset Cache Size”. The default for this setting is 0 (off), but it can be set to almost any number (we do not recommend going over 300 at this point). The parameter acts as a multiplier for pre-fetching data from DB2 before the consumer actually requests it. For example, if SSIS asks for 10,000 rows of data and this parameter is set to 30, we will ask DB2 to send us 300,000 rows and store them internally, avoiding the delay of constantly asking DB2 for smaller chunks of data.
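For example, with a shortened version of the connection string used in the scripts below, turning pre-fetch on is just a matter of appending the parameter (USERNAME, password, and DB2400 are placeholders):

Provider=DB2OLEDB;User ID=USERNAME;Password=password;Network Address=DB2400;Network Port=446;Rowset Cache Size=30;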
But how do you know what to set it to? We normally recommend 30 as a starting point, but finding the best value for your workload can be cumbersome to test.
Recently I worked a case that involved testing multiple queries across multiple linked servers, each with a different Rowset Cache Size setting. I did not care about the actual data, only about the performance at each setting. Manually creating the queries and linked servers and gathering the timing numbers was cumbersome at best, and I decided there had to be a better way.
So I came up with 4 SQL scripts that do all the work for me. They are generic enough for anyone to use, and only one script file needs modification. Since the scripts are generic in nature, I named them GenericPart1.sql, GenericPart2.sql, GenericPart3.sql, and GenericPart4.sql. The scripts assume you have a database called DB2TestDatabases.
So, what do these do?
GenericPart1.sql
This script file creates 3 tables. Two of them are used to build the third.
To begin with, you need to make a few changes.
Ignoring the declare section, skip down to this section:
-- ********************* BEGIN USER DEFINED VARIABLES *********************
-- ***** first, need your connection string, WITHOUT ROWSET CACHE SIZE
set @constr = N'Provider=DB2OLEDB;User ID=USERNAME;Password=password;Initial Catalog=DB2400;Network Transport Library=TCP;Host CCSID=37;PC Code Page=1252;Network Address=DB2400;Network Port=446;Package Collection=SCHEMA;Default Schema=SCHEMA;Process Binary as Character=False;Units of Work=RUW;Default Qualifier=SCHEMA;DBMS Platform=DB2/AS400;Defer Prepare=False;Persist Security Info=True;Connection Pooling=False;Derive Parameters=False;'
-- ***** supply your servername prefix, I use the name of system I'm connecting to
set @srvprefix = N'DB2400_'
-- ***** supply your SQL Querys
set @sqlquery1 = N'select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY'
set @sqlquery2 = N'select * from SCHEMA.TABLE FETCH FIRST 100000 ROWS ONLY'
set @sqlquery3 = N'select * from SCHEMA.TABLE FETCH FIRST 1000000 ROWS ONLY'
set @sqlquery4 = N'select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY'
set @sqlquery5 = N'select * from SCHEMA.TABLE FETCH FIRST 100000 ROWS ONLY'
set @sqlquery6 = N'select * from SCHEMA.TABLE FETCH FIRST 1000000 ROWS ONLY'
set @sqlquery7 = N'select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY'
set @sqlquery8 = N'select * from SCHEMA.TABLE FETCH FIRST 100000 ROWS ONLY'
set @sqlquery9 = N'select * from SCHEMA.TABLE FETCH FIRST 1000000 ROWS ONLY'
-- ********************* END USER DEFINED VARIABLES *********************
-- create QueryTable
-- this table is used for looping purposes
-- when creating the trials table later
-- it can contain as many queries as you wish to test with
create table [dbo].[QueryTable] (query nVarChar(2000))
insert into [dbo].[QueryTable] values (@sqlquery1)
insert into [dbo].[QueryTable] values (@sqlquery2)
insert into [dbo].[QueryTable] values (@sqlquery3)
insert into [dbo].[QueryTable] values (@sqlquery4)
insert into [dbo].[QueryTable] values (@sqlquery5)
insert into [dbo].[QueryTable] values (@sqlquery6)
insert into [dbo].[QueryTable] values (@sqlquery7)
insert into [dbo].[QueryTable] values (@sqlquery8)
insert into [dbo].[QueryTable] values (@sqlquery9)
Only a few changes are needed here to make this usable for your needs.
The variable @constr is set to your connection string, without the “Rowset Cache Size=x” parameter. That parameter is appended later, when the linked servers are created.
The variable @srvprefix can be anything you want; I use the name of the DB2 server I am connecting to. It becomes the prefix of each linked server name, followed by an incrementing numeric suffix (more on that later).
Next, enter the SQL queries you want to run. In my case I wanted to run 3 queries, pulling in 10,000, 100,000, and 1,000,000 rows of data. Nice and simple, but you can use any query you want here. I also wanted each one to run 3 times, so I could average the times and get a better idea of performance over time.
Once you have the queries, make sure the insert statements match up; in my case, 9 inserts. This piece of code populates a table that is read in a loop later, when building the list of tests to run.
After the above is completed, you are done with any modifications you need to make to the script files.
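For reference, here is roughly what the declare section skipped over earlier needs to provide. This is a sketch, not the script’s exact text; the types are reasonable guesses, so match them to your copy of the scripts:

declare @constr nvarchar(2000)      -- base connection string
declare @realconstr nvarchar(2000)  -- connection string + Rowset Cache Size
declare @srvprefix nvarchar(255)    -- linked server name prefix
declare @sqlquery1 nvarchar(2000), @sqlquery2 nvarchar(2000), @sqlquery3 nvarchar(2000)
declare @sqlquery4 nvarchar(2000), @sqlquery5 nvarchar(2000), @sqlquery6 nvarchar(2000)
declare @sqlquery7 nvarchar(2000), @sqlquery8 nvarchar(2000), @sqlquery9 nvarchar(2000)
declare @i int, @j int              -- loop counter and suffix length
declare @istr nvarchar(3)           -- @i converted to a string
declare @rowstr nvarchar(3)         -- zero-padded 3-digit suffix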
In the next part of the script, we create another table that contains the linked server names and the final connection string for each linked server:
-- Create Linked Server names in a table
-- tables contains the name of the linked server
-- we also construct the data provider connection string here
-- appending the Rowset Cache Size setting, from 0 to 300 in increments of 10
create table [dbo].[LinkServerNames] (servername Nvarchar(255), constr Nvarchar(2000))
set @i = 0
while @i < 301 -- creating linked server names for rowset cache sizes 0, 10, 20 ... 300
begin
set @istr = cast(@i as nvarchar(3))
-- set connection string property now
set @realconstr = @constr + 'Rowset Cache Size=' + @istr + ';'
set @j = len(@istr)
-- and now build istr up to 3 characters
-- for linked server name
set @rowstr = '000'
set @rowstr = stuff(@rowstr,4-@j,@j,@istr)
-- insert data
insert into [dbo].[LinkServerNames] values(@srvprefix + @rowstr, @realconstr )
set @i = @i + 10 -- increment value
end
As I mentioned, the table contains 2 columns: the linked server name and the connection string. These are built in a loop which, in this case, creates 31 rows. The linked server names use the prefix you entered earlier plus a 3-digit number; in my case, DB2400_000, DB2400_010 ... DB2400_300. The final connection string is also set up here, by appending the “Rowset Cache Size” parameter to the connection string entered earlier.
After the above is run, the next part of the script creates the linked servers. I will leave that part of the script out here, but it builds and executes SQL commands similar to what you would use if creating the linked servers manually (the commands are similar to what you would get if you scripted a linked server from SSMS to a query window).
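As a sketch of what that section amounts to (not the script’s exact text), it loops over LinkServerNames and calls sp_addlinkedserver and sp_addlinkedsrvlogin for each row. The credentials travel in the provider string, since the connection string includes User ID, Password, and Persist Security Info=True:

DECLARE @servername nvarchar(255), @lsconstr nvarchar(2000)
DECLARE linksrv_cursor CURSOR FOR
    SELECT servername, constr FROM [dbo].[LinkServerNames]
OPEN linksrv_cursor
FETCH NEXT FROM linksrv_cursor INTO @servername, @lsconstr
WHILE @@FETCH_STATUS = 0
BEGIN
    -- create the linked server against the DB2 provider
    EXEC master.dbo.sp_addlinkedserver
          @server = @servername
        , @srvproduct = N'Microsoft OLE DB Provider for DB2'
        , @provider = N'DB2OLEDB'
        , @provstr = @lsconstr
    -- no local-to-remote login mapping; the provider string carries the credentials
    EXEC master.dbo.sp_addlinkedsrvlogin
          @rmtsrvname = @servername
        , @useself = N'False'
    FETCH NEXT FROM linksrv_cursor INTO @servername, @lsconstr
END
CLOSE linksrv_cursor
DEALLOCATE linksrv_cursor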
Finally, this script creates the last table. This table will be used by both GenericPart2.sql and GenericPart4.sql.
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[TRIALS]') AND type in (N'U'))
DROP TABLE [dbo].[TRIALS]
CREATE table [dbo].[TRIALS]
(trialname Nvarchar(255) -- test name
,servername Nvarchar(255) -- linked server name
,command Nvarchar(2000) -- command to run
,StartTime DateTime -- start time
,EndTime DateTime -- end time
,durationMs int -- run time in milliseconds
)
The table contains a test name (built on loop numbers), the linked server name, and the command which will be run. It also contains columns for the start and end times of the test, and the duration (in milliseconds) of the test.
Here are a few sample rows from the table:
trialname | servername | command | StartTime | EndTime | durationMs
Trial #: 001-001 | DB2400_000 | select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY | NULL | NULL | 0
Trial #: 001-002 | DB2400_010 | select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY | NULL | NULL | 0
Trial #: 001-003 | DB2400_020 | select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY | NULL | NULL | 0
In the above, you can see that the 1st ‘run’ uses the 10,000-row query against multiple linked servers; the linked servers shown have Rowset Cache Size set to 0, 10, and 20. In my case, the table holds 279 rows: 3 separate queries, repeated 3 times each, against 31 linked servers (3 x 3 x 31 = 279).
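The rows above are generated by a nested loop over QueryTable and LinkServerNames. The following is a sketch reconstructed from the sample data, not the script’s exact text; the trial name is built from the two loop counters, zero-padded to 3 digits:

DECLARE @run int, @seq int, @query nvarchar(2000), @server nvarchar(255)
SET @run = 1
SET @seq = 1
DECLARE query_cursor CURSOR FOR
    SELECT query FROM [dbo].[QueryTable]
OPEN query_cursor
FETCH NEXT FROM query_cursor INTO @query
WHILE @@FETCH_STATUS = 0
BEGIN
    DECLARE server_cursor CURSOR FOR
        SELECT servername FROM [dbo].[LinkServerNames]
    OPEN server_cursor
    FETCH NEXT FROM server_cursor INTO @server
    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- times stay NULL and duration 0 until GenericPart2.sql runs the trial
        INSERT INTO [dbo].[TRIALS] (trialname, servername, command, durationMs)
        VALUES (N'Trial #: ' + RIGHT(N'000' + CAST(@run AS nvarchar(3)), 3)
                + N'-' + RIGHT(N'000' + CAST(@seq AS nvarchar(3)), 3)
              , @server, @query, 0)
        SET @seq = @seq + 1
        FETCH NEXT FROM server_cursor INTO @server
    END
    CLOSE server_cursor
    DEALLOCATE server_cursor
    SET @run = @run + 1
    FETCH NEXT FROM query_cursor INTO @query
END
CLOSE query_cursor
DEALLOCATE query_cursor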
GenericPart2.sql
This is the workhorse of the test scripts provided here.
This script opens a cursor against the TRIALS table (created above), reads in the linked server name and query, and builds an exec statement, then executes it as follows:
DECLARE trial_cursor CURSOR FOR
SELECT trialname, servername, command
FROM [dbo].[TRIALS]
Open trial_cursor
FETCH NEXT FROM trial_cursor into @trial, @server, @command
WHILE @@FETCH_STATUS = 0
BEGIN
PRINT @trial + N' ' + @server
-- append server location to the command
SET @command= N'exec (''' +@command+ ''') at ' + @server
-- run it
SET @starttime = getdate()
exec @success=sp_executesql @command
Once the exec statement returns, the script updates the TRIALS table with the start and end times + the duration of the test, then loops back, until all tests in the table are completed.
-- set the end time only if the exec succeeded; on failure it stays NULL
If @success=0 SET @endtime = getdate()
-- save the times and durations
Update [dbo].[TRIALS]
Set StartTime = @starttime, EndTime=@endtime, DurationMs=DateDiff(MS, @starttime, @endtime)
WHERE trialname =@trial
-- just in case of failure
SET @starttime =NULL
SET @endtime =NULL
FETCH NEXT FROM trial_cursor into @trial,@server, @command
END
CLOSE trial_cursor
DEALLOCATE trial_cursor
If an error happens, the script will stop running. There is no error checking, as I could not find a way to catch DB2 errors if something happened. It is possible to catch SQL errors using a TRY-CATCH, but that requires a bit more work.
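If you want to experiment with that, a minimal sketch (using the same variables as above) would wrap the exec like this; note that some distributed-query errors abort the whole batch and will still stop the script:

BEGIN TRY
    SET @starttime = getdate()
    exec @success = sp_executesql @command
    -- set the end time only on success
    IF @success = 0 SET @endtime = getdate()
END TRY
BEGIN CATCH
    -- leave @endtime NULL so the TRIALS row records the failure
    PRINT N'Trial ' + @trial + N' failed: ' + ERROR_MESSAGE()
END CATCH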
After the script runs, the TRIALS table will contain data similar to the below:
trialname | servername | command | StartTime | EndTime | durationMs
Trial #: 004-123 | DB2400_290 | select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY | 5/11/2009 10:47:17 PM | 5/11/2009 10:47:27 PM | 9736
Trial #: 004-124 | DB2400_300 | select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY | 5/11/2009 10:47:27 PM | 5/11/2009 10:47:37 PM | 9743
Trial #: 005-125 | DB2400_000 | select * from SCHEMA.TABLE FETCH FIRST 100000 ROWS ONLY | 5/11/2009 10:47:37 PM | 5/11/2009 10:49:10 PM | 92610
Trial #: 005-126 | DB2400_010 | select * from SCHEMA.TABLE FETCH FIRST 100000 ROWS ONLY | 5/11/2009 10:49:10 PM | 5/11/2009 10:50:58 PM | 108460
Trial #: 005-127 | DB2400_020 | select * from SCHEMA.TABLE FETCH FIRST 100000 ROWS ONLY | 5/11/2009 10:50:58 PM | 5/11/2009 10:52:45 PM | 107063
Trial #: 005-128 | DB2400_030 | select * from SCHEMA.TABLE FETCH FIRST 100000 ROWS ONLY | 5/11/2009 10:52:45 PM | 5/11/2009 10:54:29 PM | 104143
In the above table, you can see where test 4 (reading 10,000 rows) switches to test 5 (reading 100,000 rows). The trial numbers show these were the 123rd through 128th queries executed. The last 3 columns contain the start and stop times, plus the duration in milliseconds. Also, the linked server names show that these tests used Rowset Cache Size settings of 290, 300, 0, 10, 20, and 30.
I would also highly recommend running this script from a command line, piping the output to nul so that printing each trial name does not slow the run.
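For example, with sqlcmd (the server name and Windows authentication here are placeholders; adjust to your environment):

sqlcmd -S localhost -E -d DB2TestDatabases -i GenericPart2.sql > nul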
GenericPart3.sql
This is a simple script that cleans up most of the Part 1 objects. It does not delete the TRIALS table, but it removes the linked servers that were created and drops the LinkServerNames and QueryTable tables.
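The cleanup amounts to something like this (a sketch, not the script’s exact text); passing 'droplogins' to sp_dropserver removes the login mappings along with each linked server:

DECLARE @servername nvarchar(255)
DECLARE drop_cursor CURSOR FOR
    SELECT servername FROM [dbo].[LinkServerNames]
OPEN drop_cursor
FETCH NEXT FROM drop_cursor INTO @servername
WHILE @@FETCH_STATUS = 0
BEGIN
    -- drop the linked server and its login mappings
    EXEC master.dbo.sp_dropserver @server = @servername, @droplogins = 'droplogins'
    FETCH NEXT FROM drop_cursor INTO @servername
END
CLOSE drop_cursor
DEALLOCATE drop_cursor
DROP TABLE [dbo].[LinkServerNames]
DROP TABLE [dbo].[QueryTable]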
GenericPart4.sql
Now, after running GenericPart2.sql, you have a table full of performance data. But the data is intermingled. You know you should have 3 timings of the 10,000-row query against each linked server, but how do you get that out? And, even better, can you average the 3 test times? Well, yes you can, with this script.
This script builds a pivot SQL query against the TRIALS table, using the actual query text. To get just the distinct queries, it first opens a cursor against the TRIALS table, reading in the DISTINCT commands that were issued. These are saved in a string to be used by the pivot query (the command strings become the column headers).
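Building that column list can look roughly like this (a sketch, not the script’s exact text):

DECLARE @cols nvarchar(max), @cmd nvarchar(2000)
SET @cols = N''
DECLARE cmd_cursor CURSOR FOR
    SELECT DISTINCT command FROM [dbo].[TRIALS]
OPEN cmd_cursor
FETCH NEXT FROM cmd_cursor INTO @cmd
WHILE @@FETCH_STATUS = 0
BEGIN
    -- bracket each command so it can serve as a pivot column name
    -- (QUOTENAME would also escape embedded brackets, but it only accepts
    --  inputs up to 128 characters, so plain brackets are used here)
    SET @cols = @cols + CASE WHEN @cols = N'' THEN N'' ELSE N', ' END
              + N'[' + @cmd + N']'
    FETCH NEXT FROM cmd_cursor INTO @cmd
END
CLOSE cmd_cursor
DEALLOCATE cmd_cursor
-- @cols is then spliced into the select list and the PIVOT ... IN (...) clause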
Next, the script builds the query. The query also takes the average of the durationMS column from the TRIALS table. The final query (in my case) would look similar to this:
Select servername
, [select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY]
, [select * from SCHEMA.TABLE FETCH FIRST 100000 ROWS ONLY]
, [select * from SCHEMA.TABLE FETCH FIRST 1000000 ROWS ONLY]
from (select servername
, command
, durationms
from dbo.trials) AS rf
PIVOT ( avg(durationms)
FOR command IN (
[select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY]
, [select * from SCHEMA.TABLE FETCH FIRST 100000 ROWS ONLY]
, [select * from SCHEMA.TABLE FETCH FIRST 1000000 ROWS ONLY])
) AS p
Once the script is executed, you have a nice table that can easily be reviewed to see which setting may work best for a particular query:
servername | select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY | select * from SCHEMA.TABLE FETCH FIRST 100000 ROWS ONLY | select * from SCHEMA.TABLE FETCH FIRST 1000000 ROWS ONLY
DB2400_000 | 10798 | 103306 | 940229
DB2400_010 | 12305 | 113839 | 1175765
DB2400_100 | 10297 | 93475 | 925285
DB2400_110 | 10418 | 93156 | 941389
DB2400_120 | 10374 | 93995 | 938454
DB2400_130 | 10635 | 95111 | 940769
DB2400_140 | 10527 | 93925 | 936573
DB2400_150 | 10566 | 93308 | 944626
DB2400_160 | 10516 | 94326 | 941008
DB2400_170 | 10760 | 94126 | 938562
DB2400_180 | 10503 | 93899 | 935958
DB2400_190 | 10623 | 93555 | 922817
DB2400_020 | 11847 | 110398 | 1080601
DB2400_200 | 10894 | 93490 | 904607
DB2400_210 | 10658 | 93555 | 905882
DB2400_220 | 10799 | 93133 | 919812
DB2400_230 | 10687 | 95160 | 923053
DB2400_240 | 10848 | 96003 | 919006
DB2400_250 | 10623 | 96033 | 931469
DB2400_260 | 10486 | 98468 | 927744
DB2400_270 | 10663 | 96406 | 941534
DB2400_280 | 11876 | 96829 | 936327
DB2400_290 | 11311 | 98304 | 931216
DB2400_030 | 11903 | 105999 | 1068744
DB2400_300 | 11307 | 98747 | 927451
DB2400_040 | 11594 | 103962 | 1053893
DB2400_050 | 11363 | 103201 | 1032705
DB2400_060 | 11171 | 100775 | 1012864
DB2400_070 | 10931 | 98275 | 994035
DB2400_080 | 10623 | 96421 | 981286
DB2400_090 | 10338 | 94936 | 953186
(values are average durations in milliseconds; rows appear in the text-sorted order of the server names)
I hope this helps. In parts 2 and 3, I will address performance testing SSIS and C# queries using the Microsoft OLE DB Provider for DB2.