Subqueries in BETWEEN and CASE Statements
Consider the following query:
CREATE TABLE T1 (A INT, B1 INT, B2 INT)
CREATE TABLE T2 (A INT, B INT)SELECT *
FROM T1
WHERE (SELECT SUM(T2.B) FROM T2 WHERE T2.A = T1.A) BETWEEN T1.B1 AND T1.B2
Observe that the subquery in this query only needs to be evaluated once for each row of T1. Indeed running on SQL Server 2000, we get the following plan (with the portion of the plan corresponding to the subquery in bold):
|--Nested Loops(Inner Join, OUTER REFERENCES:([T2].[A], [Expr1004]))
** |--Compute Scalar(DEFINE:([Expr1004]=If ([Expr1014]=0) then NULL else [Expr1015]))
| |--Stream Aggregate(GROUP BY:([T2].[A]) DEFINE:([Expr1014]=COUNT_BIG([T2].[B]), [Expr1015]=SUM([T2].[B])))
| |--Sort(ORDER BY:([T2].[A] ASC))
| |--Table Scan(OBJECT:([T2]))
** |--Filter(WHERE:([Expr1004]<=[T1].[B2]))
|--Index Spool(SEEK:([T1].[A]=[T2].[A] AND [T1].[B1] <= [Expr1004]))
|--Table Scan(OBJECT:([T1]))
Now let's look at the query plan we get (with SQL Server 2005 or SQL Server 2008):
|--Nested Loops(Inner Join, OUTER REFERENCES:([T2].[A], [Expr1008], [Expr1014]))
|--Merge Join(Inner Join, MERGE:([T2].[A])=([T2].[A]), RESIDUAL:([T2].[A]=[T2].[A]))
** | |--Compute Scalar(DEFINE:([Expr1008]=CASE WHEN [Expr1026]=(0) THEN NULL ELSE [Expr1027] END))
| | |--Stream Aggregate(GROUP BY:([T2].[A]) DEFINE:([Expr1026]=COUNT_BIG([T2].[B]), [Expr1027]=SUM([T2].[B])))
| | |--Sort(ORDER BY:([T2].[A] ASC))
| | |--Table Scan(OBJECT:([T2]))
| |--Compute Scalar(DEFINE:([Expr1014]=CASE WHEN [Expr1028]=(0) THEN NULL ELSE [Expr1029] END))
| |--Stream Aggregate(GROUP BY:([T2].[A]) DEFINE:([Expr1028]=COUNT_BIG([T2].[B]), [Expr1029]=SUM([T2].[B])))
| |--Sort(ORDER BY:([T2].[A] ASC))
| |--Table Scan(OBJECT:([T2]))
** |--Filter(WHERE:([Expr1014]<=[T1].[B2]))
|--Index Spool(SEEK:([T1].[A]=[T2].[A] AND [T1].[B1] <= [Expr1008]))
|--Table Scan(OBJECT:([T1]))
Notice that the subquery is actually evaluated twice. There are two scans of T2, two sorts, and two stream aggregates. I've highlighted both sets of operators in bold.
Why is SQL Server evaluating the subquery twice? The answer is that SQL Server transforms "X BETWEEEN Y AND Z" into "X <= Y AND X >= Z". If as in this query, X is a subquery, the subquery is repeated:
SELECT *
FROM T1
WHERE (SELECT SUM(T2.B) FROM T2 WHERE T2.A = T1.A) >= T1.B1 AND
(SELECT SUM(T2.B) FROM T2 WHERE T2.A = T1.A) <= T1.B2
This transformation occurs very early in the processing of the query. Unfortunately, after that point, SQL Server 2005 and SQL Server 2008 do not realize that the subquery is a common subexpression and they evaluate it as if there were two completely different subqueries.
You may also have noticed that, instead of scanning T1 first and then evaluating the subquery for each row of T1, this plan actually scans T2 first. This join order results from the optimizer decorrelating the subqueries.
So, how can we get SQL Server to evaluate the subquery only once? There are actually multiple solutions and all involve rewriting the query to calculate the subquery separately from the BETWEEN clause so that when SQL Server transforms the BETWEEN clause, it does not also duplicate the subquery. Here are some examples:
SELECT Q.A, Q.B1, Q.B2
FROM
(
SELECT *, (SELECT SUM(T2.B) FROM T2 WHERE T2.A = T1.A) SUM_B
FROM T1
) Q
WHERE SUM_B BETWEEN Q.B1 AND Q.B2SELECT T1.*
FROM T1 CROSS APPLY (SELECT SUM(T2.B) SUM_B FROM T2 WHERE T2.A = T1.A) Q
WHERE Q.SUM_B BETWEEN T1.B1 AND T1.B2SELECT T1.*
FROM T1, (SELECT T2.A, SUM(T2.B) SUM_B FROM T2 GROUP BY T2.A) Q
WHERE T1.A = Q.A AND Q.SUM_B BETWEEN T1.B1 AND T1.B2
All three of these rewrites produce the same plan:
|--Nested Loops(Inner Join, OUTER REFERENCES:([T2].[A], [Expr1008]))
** |--Compute Scalar(DEFINE:([Expr1008]=CASE WHEN [Expr1016]=(0) THEN NULL ELSE [Expr1017] END))
| |--Stream Aggregate(GROUP BY:([T2].[A]) DEFINE:([Expr1016]=COUNT_BIG([T2].[B]), [Expr1017]=SUM([T2].[B])))
| |--Sort(ORDER BY:([T2].[A] ASC))
| |--Table Scan(OBJECT:([T2]))
** |--Filter(WHERE:([Expr1008]<=[T1].[B2]))
|--Index Spool(SEEK:([T1].[A]=[T2].[A] AND [T1].[B1] <= [Expr1008]))
|--Table Scan(OBJECT:([T1]))
SQL Server also transforms a CASE statement of the form:
CASE X
WHEN Y1 THEN Z1
WHEN Y2 THEN Z2
...
ELSE ZN
END
Into:
CASE
WHEN X = Y1 THEN Z1
WHEN X = Y2 THEN Z2
...
ELSE ZN
END
Thus, CASE statements can yield the same problematic behavior if X is a subquery. Unfortunately, with a CASE statement, the number of times that the subquery is reevaluated depends on the number WHEN clauses and can be quite large. Here is a simple query that illustrates the problem:
SELECT *,
CASE (SELECT SUM(T2.B) FROM T2 WHERE T2.A = T1.A)
WHEN T1.B1 THEN 'B1'
WHEN T1.B2 THEN 'B2'
ELSE NULL
END CASE_B
FROM T1
Here is the SQL Server 2000 plan:
|--Compute Scalar(DEFINE:([Expr1007]=If ([Expr1004]=[T1].[B1]) then 'B1' else If ([Expr1004]=[T1].[B2]) then 'B2' else NULL))
|--Nested Loops(Left Outer Join, OUTER REFERENCES:([T1].[A]))
|--Table Scan(OBJECT:([T1]))
|--Hash Match(Cache, HASH:([T1].[A]), RESIDUAL:([T1].[A]=[T1].[A]))
** |--Compute Scalar(DEFINE:([Expr1004]=If ([Expr1020]=0) then NULL else [Expr1021]))
|--Stream Aggregate(DEFINE:([Expr1020]=COUNT_BIG([T2].[B]), [Expr1021]=SUM([T2].[B])))
|--Table Scan(OBJECT:([T2]), WHERE:([T2].[A]=[T1].[A]))**
And here is the SQL Server 2005 and SQL Server 2008 plan:
|--Compute Scalar(DEFINE:([Expr1016]=CASE WHEN [Expr1008]=[T1].[B1] THEN 'B1' ELSE CASE WHEN [Expr1014]=[T1].[B2] THEN 'B2' ELSE NULL END END))
|--Nested Loops(Inner Join, PASSTHRU:([Expr1008]=[T1].[B1]), OUTER REFERENCES:([T1].[A]))
|--Nested Loops(Left Outer Join, OUTER REFERENCES:([T1].[A]))
| |--Table Scan(OBJECT:([T1]))
** | |--Compute Scalar(DEFINE:([Expr1008]=CASE WHEN [Expr1030]=(0) THEN NULL ELSE [Expr1031] END))
| |--Stream Aggregate(DEFINE:([Expr1030]=COUNT_BIG([T2].[B]), [Expr1031]=SUM([T2].[B])))
| |--Table Scan(OBJECT:([T2]), WHERE:([T2].[A]=[T1].[A]))
|--Compute Scalar(DEFINE:([Expr1014]=CASE WHEN [Expr1032]=(0) THEN NULL ELSE [Expr1033] END))
|--Stream Aggregate(DEFINE:([Expr1032]=COUNT_BIG([T2].[B]), [Expr1033]=SUM([T2].[B])))
|--Table Scan(OBJECT:([T2]), WHERE:([T2].[A]=[T1].[A]))**
Finally, the same solutions once again apply:
SELECT Q.A, Q.B1, Q.B2,
CASE Q.SUM_B
WHEN Q.B1 THEN 'B1'
WHEN Q.B2 THEN 'B2'
ELSE NULL
END CASE_B
FROM
(
SELECT *, (SELECT SUM(T2.B) FROM T2 WHERE T2.A = T1.A) SUM_B
FROM T1
) QSELECT T1.*,
CASE Q.SUM_B
WHEN T1.B1 THEN 'B1'
WHEN T1.B2 THEN 'B2'
ELSE NULL
END CASE_B
FROM T1 CROSS APPLY (SELECT SUM(T2.B) SUM_B FROM T2 WHERE T2.A = T1.A) QSELECT T1.*,
CASE Q.SUM_B
WHEN T1.B1 THEN 'B1'
WHEN T1.B2 THEN 'B2'
ELSE NULL
END CASE_B
FROM T1, (SELECT T2.A, SUM(T2.B) SUM_B FROM T2 GROUP BY T2.A) Q
WHERE T1.A = Q.A
Comments
Anonymous
June 27, 2008
The comment has been removedAnonymous
June 28, 2008
The comment has been removedAnonymous
July 02, 2008
I forwarded your feedback to the appropriate individuals. Nevertheless, I would still strongly recommend that you submit your feedback directly at http://connect.microsoft.com/. See also http://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=336002.CraigAnonymous
July 02, 2008
The comment has been removedAnonymous
July 08, 2008
Great blog with excellent information. I agree that this is "hard to believe" behavior but it is great to know about it so that we can address things appropriately. Now if only I could find a way to programatically find where in code these things would be occurring. : )I will certainly put my vote in on the Connect site.Anonymous
November 08, 2010
Very usefull blog, thank youAnonymous
December 30, 2012
Can any of these work arounds be regarded as guaranteed behaviour or is using a variable the only sure fire way to prevent re-evaluation as per the earlier Connect Item?Anonymous
January 02, 2013
The BETWEEN clause and CASE statements are expanded fairly early in the processing of these queries so the workarounds should be fairly safe. However, I'm reluctant to make any absolute guarantee as plans can and do change based on a variety of factors. HTH Craig