Wide vs. Narrow Plans

Here's another case where you might see intermittently poor performance that is "by design".  Suppose you see that a delete, insert, or update query in a stored proc usually runs quickly, but occasionally the query takes much longer to complete. You captured a detailed profiler trace of both the good and bad periods, including the Showplan Statistics event. The offending delete is in a stored proc, and it couldn't be simpler: "DELETE FROM t1 WHERE c1 = @p1". One of the plans you see for this statement looks like this:

DELETE FROM t1 WHERE c1 = @p1
|--Clustered Index Delete(OBJECT:([mydb].[dbo].[t1].[idx1]), WHERE:([t1].[c1]=[@p1]))

And the other plan looks something like this:

DELETE FROM t1 WHERE c1 = @p1
|--Sequence
|--Index Delete(OBJECT:([mydb].[dbo].[t1].[idx2]))
| |--Sort(ORDER BY:([t1].[c2] ASC, [t1].[c1] ASC, [Bmk1000] DESC))
| |--Table Spool
| |--Clustered Index Delete(OBJECT:([mydb].[dbo].[t1].[idx1]), WHERE:([t1].[c1]=[@p1]))
|--Index Delete(OBJECT:([mydb].[dbo].[t1].[idx3]))
| |--Sort(ORDER BY:([t1].[c3] ASC, [t1].[c1] ASC, [Bmk1000] DESC))
| |--Table Spool
|--Index Delete(OBJECT:([mydb].[dbo].[t1].[idx4]))
|--Sort(ORDER BY:([t1].[c2] ASC, [t1].[c3] ASC, [t1].[c1] ASC, [Bmk1000] DESC))
|--Table Spool

(The number of branches may vary -- there will be one for each nonclustered index.)

 

These are two very different-looking plans for the same, very simple (DELETE FROM t1 WHERE c1 = @p1) query. What's going on here? 

 

When a SQL query plan needs to delete a clustered index row, it can let the storage engine take care of cleaning up (deleting) the corresponding rows in all the nonclustered indexes. (Similarly, in the case of an insert or update, the storage engine is capable of automatically inserting or updating rows in the affected nonclustered indexes.)  This is a simple approach, but the problem with this is that the nonclustered index rows are deleted in clustered index key order, not nonclustered key order, which means that SQL has to jump all around the nonclustered index in a more or less random fashion deleting one row at a time here, a second one way over there, etc. If there are a large number of rows being deleted, the disk seeks to reposition the head back and forth across the nonclustered indexes can be very expensive. In these cases (when SQL knows that it has to modify a lot of rows from a clustered index), the query optimizer has the option of choosing to include the logic needed to delete all the corresponding nonclustered index rows in the query plan as explicit Index Delete operators (see the second plan above for an example). In this case the QO can do fancier things than the storage engine knows how to do, like sorting the data set in nonclustered index key order before the Index Delete; this means that all the nonclustered index keys can be removed in a single, ordered pass over the nonclustered index, which means less drive head movement and therefore a smaller amount of disk wait time. This is a neat trick, but the overhead of setting up and performing all the sorts and having the NC index deletes happen at a higher level in the engine has a certain amount of overhead, so it only makes sense for the QO to handle index maintenance itself if a large number of rows will be deleted. (There are other factors that also come into play like the number of nonclustered indexes on the table, but for a fixed schema and query, the key variable is the number of rows that SQL estimates will be affected by the query.) 

 

In this particular case, we have a parameterized DELETE query. The number of rows that will be deleted depends entirely on the specific parameter value that is passed into the query. Whether the QO will choose to compile a plan that does the index maintenance in the plan (this is called the "wide plan") or leaves this up to the storage engine (a "narrow plan") depends on the number of rows that will be deleted. If a parameter value that only qualifies a small number of rows is passed in when the plan for the proc isn't in cache, the QP will select a narrow plan. This is most likely the most efficient plan for that particular parameter value. Unfortunately it may not be an efficient plan for other parameter values that may delete thousands of rows, but because the query is in a stored proc, the plan will be reused for subsequent executions until something kicks the plan out of cache (that something could be a reindex, a schema change to a table, a DBCC FREEPROCCACHE, or just normal cache maintenance to free up memory for new plans). Plan dependency on the parameters that are passed in for the execution that triggers the compile is called parameter sniffing. (Param sniffing can lead to a host of common perf issues -- I may talk more about some of these in future entries.)

So, the basic problem here is the result of parameter sniffing combined with the choice of wide vs. narrow plans. A few possible solutions (definitely not an exhaustive list):

  • Use local variables instead of parameters in the INSERT/DELETE/UPDATE query so that the plan selected doesn't depend on param value.
  • Define the proc WITH RECOMPILE (or use an OPTION(RECOMPILE) query hint if you're on SQL 2005) so that everyone gets a plan tailored to their particular parameter.
  • Use an "OPTIMIZE FOR (@p1=<a "typical" param value>)" query hint to tell the optimizer to always optimize for a parameter value that is typical of the type that will most commonly be passed in
  • Use a USE PLAN hint, possibly with a plan guide, to force one of the two plan possibilities

Below is a simple script that shows the two types of index maintenance plans; you can play around with this to see them in action. After looking at the plans, try omitting the DBCC FREEPROCCACHE to see how perf is affected when a wide plan is reused to modify a small number of rows, or when a narrow plan is reused to modify a large number of rows.

USE tempdb
GO
IF @@TRANCOUNT > 0 ROLLBACK TRAN
IF OBJECT_ID ('t1') IS NOT NULL DROP TABLE t1
IF OBJECT_ID ('myproc') IS NOT NULL DROP PROC myproc
GO
CREATE TABLE t1 (c1 int, c2 int, c3 int)
GO
SET NOCOUNT ON
GO
DECLARE @x int
SET @x = 1
BEGIN TRAN
WHILE @x <= 100000
BEGIN
IF @x % 250 < 100
INSERT INTO t1 (c1, c2, c3)

    VALUES (0, @x % 400, @x % 1000)
ELSE
INSERT INTO t1 (c1, c2, c3)

    VALUES (@x % 1000, @x % 400, @x % 1000)
IF @x % 5000 = 0
BEGIN
RAISERROR ('Inserted %d rows...', 0, 1, @x) WITH NOWAIT
COMMIT TRAN
BEGIN TRAN
END
SET @x = @x + 1
END
WHILE @@TRANCOUNT > 0 COMMIT TRAN
GO
CREATE CLUSTERED INDEX idx1 ON t1 (c1)
CREATE INDEX idx2 ON t1 (c2)
CREATE INDEX idx3 ON t1 (c3)
CREATE INDEX idx4 ON t1 (c2, c3)
GO
CREATE PROC myproc @p1 int AS
UPDATE t1 SET c1 = c1 WHERE c1 = @p1
GO
DBCC FREEPROCCACHE
GO
-- First delete a small number of rows: let storage

-- engine take care of index maintenance
SET STATISTICS PROFILE ON
EXEC myproc @p1 = 126
SET STATISTICS PROFILE OFF
GO
-- Be sure to flush the proc cache or we will reuse the plan

-- that was compiled for a much smaller number of rows on the

-- last execution.
DBCC FREEPROCCACHE
GO
-- Delete a larger number of rows: handle index maintenance

-- explicitly in query plan
SET STATISTICS PROFILE ON
EXEC myproc @p1 = 0
SET STATISTICS PROFILE OFF
GO

If you look closely at the wide plan you might notice something curious: only one of the Table Spool operators in the plan has a child. The purpose of a table spool is to store a set of rows in an intermediate place so that they can be used more than once in the plan without having to execute the same query subtree over and over. Spools are typically used either for a perf boost or for Halloween protection; in this example, the spool fulfills both of these needs. Table spools require a child -- remember that the purpose of a spool is to cache rows, so they must have an input that provides the rows to cache. Any time you see a spool without a visible child, it's a spool that was created elsewhere in the plan, and is being reused. In this plan, only the first appearance of the first spool shows the spool's child -- the content of the spool is then reused by the subsequent index delete operators. (Unfortunately, if there are multiple spools in a plan, there's no obvious indicator about which one is referenced by each of the "childless" spools.)

 

If you examine the wide update plan from the script above, you might notice a Split operator. This guy is poorly documented, but pretty cool -- he provides both Halloween protection and a performance boost (the benefits may sound similar to those of the Table Spool, but the mechanism is completely different). If I have time, in a future post I may explain how Split works to provide these benefits.

 

A final footnote: In the comments above I mention that the storage engine is responsible for index maintenance when executing a narrow plan. Technically, this is only true in SQL 2000. In SQL 2005, both wide and narrow plans are performed by layers above the storage engine. The different plan types still exist in SQL 2005, however, and the concepts and terms "wide plan" and "narrow plan" still apply to SQL 2005.