Conor vs. Finding a Misbehaving Query

(well, one case of it – the topic is too large for a blog post, so I will focus)

I will paraphrase a question I received over the weekend from a SQL Server user:

“My server is doing fine.  All of a sudden, CPU use goes to 100% and stays there.  If I do DBCC FREEPROCCACHE, things get better right away.”

The fun part about problems like this is that there isn’t quite enough information to know for sure what is wrong.  So, we get to go hunting a bit.  This blog post will cover how I would hunt down a problem like this. 

Let’s start with a few basic assumptions:

* Many applications have very regular query patterns – the queries are all fixed in stored procedures, and ad hoc SQL is minimal.  Even the ad hoc SQL may be fairly regular, such that the query plans are consistent.

* For OLTP applications, most of the operations do not scale linearly with the size of the data.  So, you are not doing a batch job to load some variable amount of data into a machine each day.  Instead, you are running many, many small transactions and your load factor is reasonably stable (such that you can make statements like “CPU was good today” and “CPU is bad now”)

With applications in this category, often you are also in a position to take a performance baseline, meaning that you capture the query plans + dm_exec_query_stats data over a period of time when the application is performing well.  Yes, you’d have to go do this by hand, but it is not too hard to query DMVs and store them off somewhere.  If I were a DBA, I would have this on every system I was supposed to maintain.  (You can capture similar information from SQL Profiler as well, but the key bits of information are the query plans and some information about the runtimes of each query.  The second part is easier with the DMV approach).
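
To make this concrete, here is a rough sketch of what such a capture script could look like – run it on a schedule while the application is healthy.  The table name dbo.PlanBaseline and the column choices are just my illustration, so adapt them to your environment:

-- Hypothetical baseline table (create once); query_hash requires SQL 2008 or later
CREATE TABLE dbo.PlanBaseline
(
    capture_time       DATETIME NOT NULL DEFAULT GETDATE(),
    query_hash         BINARY(8),
    plan_handle        VARBINARY(64),
    execution_count    BIGINT,
    total_worker_time  BIGINT,   -- CPU, in microseconds
    total_elapsed_time BIGINT,
    query_text         NVARCHAR(MAX),
    query_plan         XML
);
GO
-- Capture the current contents of the query stats DMV, plus text and plans
INSERT INTO dbo.PlanBaseline
    (query_hash, plan_handle, execution_count,
     total_worker_time, total_elapsed_time, query_text, query_plan)
SELECT qs.query_hash, qs.plan_handle, qs.execution_count,
       qs.total_worker_time, qs.total_elapsed_time, st.text, qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp;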

Now, when CPU goes to 100%, I would run my script again and capture the data for the system while it is misbehaving.  Then you just need to find the thing that has changed.  If DBCC FREEPROCCACHE fixes things, then you likely have a query plan choice issue, and flushing the cache forced a recompile that happened to pick a better plan.  However, it would be really nice to KNOW.
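
With a good capture and a bad capture in hand, a comparison query along these lines (again using my hypothetical dbo.PlanBaseline table, with placeholder capture times) can surface the statement whose CPU cost per execution jumped:

-- Compare average CPU per execution between a good and a bad capture window
SELECT bad.query_text,
       good.total_worker_time / NULLIF(good.execution_count, 0) AS avg_cpu_good,
       bad.total_worker_time  / NULLIF(bad.execution_count, 0)  AS avg_cpu_bad
FROM dbo.PlanBaseline AS good
JOIN dbo.PlanBaseline AS bad
    ON bad.query_hash = good.query_hash
WHERE good.capture_time = '20110101 09:00'  -- placeholder: known-good capture
  AND bad.capture_time  = '20110101 14:00'  -- placeholder: capture during the problem
ORDER BY bad.total_worker_time / NULLIF(bad.execution_count, 0)
       - good.total_worker_time / NULLIF(good.execution_count, 0) DESC;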

 

OK, so let’s assume that you don’t have this information.  Say you are sent in cold to look at a system that has no history.  The best you can do is to look for the slow parts and to examine the plans for those slow queries.  (I personally have the additional benefit of seeing lots of systems, so I can often just guess the problem based on how frequently I see each issue).  I will try guessing.
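
The usual way I start is to sort the query stats DMV by total CPU and eyeball the top entries – something like this (the SUBSTRING arithmetic just extracts the statement text, and the divisions convert CPU from microseconds to milliseconds):

-- Top CPU consumers currently in the plan cache
SELECT TOP (20)
       qs.total_worker_time / 1000 AS total_cpu_ms,
       qs.execution_count,
       qs.total_worker_time / NULLIF(qs.execution_count, 0) / 1000 AS avg_cpu_ms,
       SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
                 ((CASE qs.statement_end_offset
                       WHEN -1 THEN DATALENGTH(st.text)
                       ELSE qs.statement_end_offset
                   END - qs.statement_start_offset) / 2) + 1) AS statement_text,
       qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
ORDER BY qs.total_worker_time DESC;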

Often, when I see plan flips that go away when the plan is recompiled, it is an issue with parameter sniffing.  I’ve blogged about the quirks of the existing product’s parameter sniffing behavior before, and this can cause problems for some applications.  The most common plan change I see looks something like this:

(inside of a stored procedure where @p is passed as an argument to the sproc)

SELECT * FROM MyTable WHERE col = @p

The parameter @p will be sniffed, meaning the optimizer compiles the plan using the specific value of @p that was passed in on the call that triggered the initial compilation of the stored procedure.  Additionally, suppose the column col has some very frequent values.  We now have the pieces to hit a parameter-sensitive plan issue.
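
To make that concrete, here is a minimal sketch (the procedure name, the INT parameter type, and the call values are all placeholders of mine):

-- The plan is compiled using whatever value of @p arrives on the FIRST call,
-- then cached and reused for every subsequent caller
CREATE PROCEDURE dbo.GetRowsByCol @p INT
AS
    SELECT * FROM MyTable WHERE col = @p;
GO
EXEC dbo.GetRowsByCol @p = 42;  -- rare value: a seek plan is compiled and cached
EXEC dbo.GetRowsByCol @p = 1;   -- frequent value: reuses the cached seek plan anyway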

Sometimes, the value passed to @p is highly selective, and the optimizer will sniff the value and decide to generate a plan that seeks on an index on MyTable (perhaps also doing a fetch back to the base table to get the columns not covered by the index).  This works great if the WHERE clause selects only a very small number of rows. 

Other times, the value passed to @p will be non-selective.  When the optimizer sniffs such a value, it determines that the number of rows returned by the filter is high enough that doing bookmark lookups is not the best plan choice for this value (bookmark lookups are random IOs on rotational media, and these are slower than sequential scans of pages off the disk).  So, there is a tipping point beyond which the optimizer will pick a scan plan and just filter the rows coming off the scan.

 

The optimizer does not really have a good notion of the calling pattern for future values of the parameter.  So, it caches the plan and just uses it.  It does not have logic in SQL 2008 R2 or below to do anything when the calling pattern differs from the value sniffed during compilation.  So, if you get the seek plan and you often query for parameter values that return many rows, this can cause the query to run more slowly (either with more IO for a cold cache or with more CPU for a warm cache).  The scan plan can cause CPU to go higher based on the calling pattern as well. 

So, my guess is that this is a parameter sniffing problem where the calling pattern differs greatly from the value sniffed during compilation.  (I may be completely wrong, but this is probably the most common issue I see when people start a conversation like this with me).

The last part is to find the smoking gun – let’s prove it.  I like to do this by running the query with each parameter value and recording runtimes, if possible.  This can reproduce the 100% CPU issue if you do it with enough calls.  Once you have found the query, you need to get the SHOWPLAN XML out and look at its text.  In there, the QP records the sniffed parameter values used during compilation.  This should be enough to track down the query plan flip-flopping.
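
If the plan is still in cache, you can even pull the sniffed values out with XQuery rather than reading the XML by hand.  This is a sketch – MyTable is my placeholder search string, and the DMVs are the same ones used above:

-- Extract the compiled ("sniffed") parameter values from cached showplan XML
WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT st.text,
       prm.p.value('@Column', 'NVARCHAR(128)') AS parameter_name,
       prm.p.value('@ParameterCompiledValue', 'NVARCHAR(4000)') AS sniffed_value
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
CROSS APPLY qp.query_plan.nodes('//ParameterList/ColumnReference') AS prm(p)
WHERE st.text LIKE '%MyTable%';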

Once you have found the guilty query plan, I would usually hint it.  Most customers have a specific plan they like – often the “seek” plan.  While there may be some cases where the seek plan is suboptimal, it is often less bad than the scan plan (especially for big tables).  So, if you are willing to accept the semantics of the seek plan, then force the seek plan (either with a plan guide, an OPTIMIZE FOR @p=<value> hint, or even the FORCESEEK hint if you have a single-column index seek).  Note that FORCESEEK does not cover every kind of seek you might want, so please research it before blindly applying it.
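
For example, inside the sproc (the value 42 is a placeholder for whatever selective value gives you the plan you want):

-- Always compile this statement as if a selective value were passed
SELECT * FROM MyTable WHERE col = @p
OPTION (OPTIMIZE FOR (@p = 42));

-- Or, force a seek directly (requires an index that can support the seek)
SELECT * FROM MyTable WITH (FORCESEEK) WHERE col = @p;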

If you actually need two query plans, then you may need to write T-SQL code to split the cases based on the values of @p such that there are two cache entries and each one gets the plan shape you want.  I see this very infrequently, but it does happen. 
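
One sketch of that pattern, reusing my hypothetical procedure from earlier: route the known frequent values to one child procedure and everything else to another, so each child gets its own cached plan.

-- Placeholder values: 1 and 2 stand in for the known frequent values of col
CREATE PROCEDURE dbo.GetRowsByCol_Seek @p INT
AS
    SELECT * FROM MyTable WHERE col = @p;
GO
CREATE PROCEDURE dbo.GetRowsByCol_Scan @p INT
AS
    SELECT * FROM MyTable WHERE col = @p
    OPTION (OPTIMIZE FOR (@p = 1));
GO
ALTER PROCEDURE dbo.GetRowsByCol @p INT
AS
BEGIN
    IF @p IN (1, 2)
        EXEC dbo.GetRowsByCol_Scan @p = @p;
    ELSE
        EXEC dbo.GetRowsByCol_Seek @p = @p;
END;
GO

The split into child procedures matters: if both SELECT statements lived in the same procedure, they would both be compiled together with the first sniffed value, defeating the purpose of the branch.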

Happy Querying!

Conor