What you can and can't do with the Messagebox Database server.

This question has come up a lot in internal and external discussions. It got to the point where we had to put out a KB article as so many DBAs out there thought that it was okay to go and "tweak" the settings (https://support.microsoft.com/kb/912262/en-us). So here is an attempt on my part to explain what you can and can't do in relatively simple terms and why.

What can't you do?

You may not make any changes which could effect how our queries are executed (ie their execution plan). I will list some specific things which can affect the query plans, but they are just examples. If you know of a setting which I don't list and you aren't sure, then don't change it. Here are a couple of things to understand around this:

1) Think of the sql code (stored procedures) which we ship as uncompiled C# (or whatever your favorite language is) code. Every customer who installs our product gets their own compiled version of this code courtesy of the SQL Server engine. You can read up on documentation on how the sql engine performs this compilation (or go to one of Paul Randall's talks at Tech Ed or some other conference) but it is important to realize that it is being done. SQL even has a performance counter which you can monitor called recompiles which will show you how often "code" is being recompiled potentially while your applicatin is running at full throttle. SQL Server uses various statistics about the table layout and cardinality of indexes and such to attempt to chose the best plan to reduce IO and provide the best performance. One thing that the optimizer cannot anticipate is the locking side-effects certain plan choices could have. It is therefore possible that in an environment in which lots of threads are accessing the database at very high rates, a plan which might be optimized for potential IO access could cause tremendous locking and hence be a much worse overall plan choice.

2) The next very, very important thing to understand is that while the database is installed on your server, it is not "your" database. It is our (or if you catch me on the wrong day, not thinking, "my") database. When you have performance problems with it, the buck stops over here, not at your DBAs desk. For that reason, it is unbelievably important that the system runs the same on every installation. If every customer who called in with a question was forced to gather a dump of the plan cache from within SQL Server and send it to us so that we could reconstruct exactly what sql server is doing, it would make support pretty much impossible.

3) Access to the messagebox is very controlled and very limited. Via our Admin MMC (in bts 2006) you can construct somewhat flexible queries against the data, but even that is somewhat limited. From the runtime, there are a set of precanned generic stored procedures and some per-host templated stored procedures which provide all the access that the runtime needs. Since we have locked down access to the database, we can understand all of the potential queries. With that in mind, we have done everything we possibly can to "hardcode" the query plans for all of our stored procedures. This means using index hints, force order hints, occassionally join hints, norecompile hints (why recompile if you are just going to generate the same plan) and every trick we can come up with the tie the hands of the optimizer (part of the sql engine which calculates the best query plan). The sql server team is well aware of what we are doing. Technically in SQL 2005 there is support for USE PLAN where we could really hardcode the plan completely, but (in order of importance): (a) we support SQL 2000, (b) it is only supported on SELECT statements which only ~30% of our code and (c) it is non-trivial for generated stored procedures like the ones we use for per host access. Along with attempting to force the optimizer to chose the plan we want, we also disable some of the features of SQL Server which "help" the optimizer chose the best plan. This includes statistics generation and updating. Statistics provide information about the layout of a table so that the optimizer might decide it is more efficient to scan the clustered index than to seek over the non-clustered index (as an example). Since we are hardcoding the plan, statistics are not so important. We also disable parallelism (setting at the server level MAX DOP to 1). Parallelism provides more options for the optimizer to chose amongst for plans. Typically it is very useful for queries over very large databases (warehouse type applications). We are a OLTP (online transaction processing) style app which executes the same sets of sprocs on multiple threads at very high rates. We already provide our own parallelism. I "apologize" that we chose to set this at the server level which could cause it to effect any other database you have installed on that server, but you may not change this setting or else we will simply push back on any support until you switch it back (no matter how large a customer you are). If you have only one server, install an alernate instance of sql server on the same machine and put us on the separate instance. In our testing, when SQL Server choses a plan with parallelism in it for one of our queries it can cause orders of magnitude performance degradataion.

4) We are not perfect (but we try :). Occassionally we do not include enough hints so that the optimizer can chose another plan which is worse. I have personally fixed two bugs in which we needed to add hints or slightly change the structure of our query because the optimizer chose a less than optimal plan. It can happen, but we do our darned'ist and I have only had I think 3 cases of it in about 3 years which is not bad. It also not fair to blame this on the optimizer. As I said above, since we are hardcoding the plans for all of our queries, we turn off a lot of the features which might help the optimizer (but add overhead to the overall system).

A quick list of things which can effect the query plans:

Statistics (don't enable these)

Parallelism (don't turn this on )

table structure (don't add indexes, columns, triggers, ... If you do you will hear silence when you call for help)

Stored procedures (don't change them. You can look all you want, but no touching)

 

So what can you do?

You can change the underlying storage model for our data on the physical disks. By this I mean stuff like file groups and files and such. By default all of our tables are setup on a single (PRIMARY) filegroup which has one file. This was admittedly perhaps not the best decision. You are more than welcome to create multiple files for a given file group and also to create multiple filegroups and move our tables around. This can give potentially significantly better IO performance and hence overall system performance. We are currently working on a paper (and perhaps tool to automate) which will give recommendations on how to layout out tables given as many phsical disks as you have access to. You can also change setings related to how sql pre-allocates space for the files and for some of their internal structures. None of this effects how our plans are executed, it simply allows for faster IO access and better overall performance. One interesting settings which we do not enable by default but have been playing with for a little while now is "TEXT IN ROW". This table level option is used to tell sql server whether to store the image column data in the same data page as the clustered index or in another data page. By default this is set to off. If you think about an application like Outlook / Exchange this makes a lot of sense, as the image data (you email message) is potentially only loaded when the user clicks on a specific email to read it (disregarding preview pane). Storing the image data in a separate page allows more rows to be stored in one page and hence less IO is needed to read a large set of data. However, in BizTalk, when a message is delivered to a subscribing service, that message is going to be "processed" and hence loaded every time (we do special handling for large messages to fragment them already). So instead of an optimization, we incur an extra IO to lookup the image data in the separate data page. I don't gaurantee it will always provide a performance benefit, but we have seen some decent gains when processing small messages. As always you should test this out in a QA environment first. You can read more about text in row simply by searching for "SQL SERVER text in row" and picking one of the links. Text in row is enabled at the table level, so you need to have some understanding of our table structures. Really, there are only two tables we need to worry about. The first table is the Spool. This table contains for every message, its MessageID, a timestamp, some other properites, and the blob form of the message context. The second table you need to worry about is called the parts table. This table contains both the part property bag (for things like content type and charset) as well as the first fragment for each part. Enabling text in row on these tables could provide a benefit for high throughput systems which are processing smaller messages. Since the size of a page can only be 8K, if your messages are greater than this, enabling text in row probably won't help since sql will be forced to push the data to a separate page anyways (although we do compress your data so it is smaller in storage than you might think). You can also play around with the max size for the image column before it is offloaded to a separate page (https://msdn2.microsoft.com/en-us/library/ms173530.aspx). As always, have someone with database experience read over this and understand it before testing anything out. A quick script to turn on the option would look like:

exec sp_tableoption N'Spool', 'text in row', 'ON'

exec sp_tableoption N'Parts', 'text in row', 'ON‘

 

 

So the conclusion is that you should never do anything which effects the way our queries are executed but you are more than welcome to try and optimize the IO access to improve performance. Hope this helps some of you out there.

 

Lee

 

Addendum: 

I was asked about Index Rebuilds. First, we do not support online index defragmentation. That process involves page locks which we explicitly disable as part of our installation on all of our tables. Page locks make lock acquisition ordering almost impossible and will cause us to deadlock all over the place. We are very carefull about how we acquire locks and expect that the granularity is at the row level so that our ordering is enforced. Having said that, offline rebuilding of indexes is perfectly okay and supported. In our labs, we have not seen significant performance gains from doing this but it is true that our indexes will fragment since many of them are on guids (most). SQL Server team is not a huge fan of guid based indexing, but we have done numerous tests which show that as long as you don't ever scan the table, they can perform better than identity based indexes (for our specific workloads). They will of course get fragmented, but usually our tables don't get too big as data flows in and out of them at a relatively steady pace. If you have large amounts of data which is expected to build up in the messagebox, you are more than welcome to periodically do index rebuilds during scheduled downtime. The same applies to the tracking database where you can rebuild indexes during downtime.