Optimize Recursive CTE Query
Author: Shaun Tinline-Jones
Reviewers: Lubor Kollar; Conor Cunningham; Steve Howard; Kun Cheng; James Podgorski; Jimmy May; Joseph Sack; Welly Lee; Neeraj Joshi; Keith Bauer
Introduction
We recently assisted a global ISV in addressing a performance issue caused by a poorly performing recursive CTE (Common Table Expression). The ISV wanted a query that was running in excess of 3 minutes to run in less than 15 seconds on their servers. The end result of our efforts was a 3,600% performance improvement.
Note: Validation for this post was performed in the SQL CAT Customer Lab on an HP ProLiant DL380 G7: Westmere-EP X5680 3.33GHz, 2 sockets, 6 physical cores per socket (12 physical cores, 24 logical cores total); 144GB RAM.
A CTE (https://msdn.microsoft.com/en-us/library/ms186243.aspx), among other things, is a very effective single statement for handling hierarchical data, and more specifically, parent-child relationships. A CTE is a reasonable choice when the intention is to iterate through the data, particularly in cases where the schema cannot be changed but recursive queries are required. If you can design a different schema layout, there are often much faster ways to execute queries over hierarchies. CTEs are also a natural choice when converting a CONNECT BY statement from alternative RDBMS platforms (https://download.oracle.com/docs/cd/B19306_01/server.102/b14200/queries003.htm).
There is a scenario, however, in which a CTE is significantly less efficient than the traditional approach to traversing a parent-child structure. This post describes how and when a WHILE loop performs better than the CTE. The characteristics that can contribute to a CTE being a less-than-optimal choice are:
- Non-Unique Parent/Child Key
- Complex Predicates
In our scenario, the ISV was attempting to extract all the items that matched a certain set of characteristics. An item could exist on its own or be a component (child) within a larger item (parent), and there was no constraint on the number of parents a child could have. As a result, the query recursively traversed the hierarchy in a non-unique fashion.
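The effect of that non-unique traversal can be sketched with a miniature, made-up hierarchy (the keys and shape below are illustrative, not the ISV's data). Because a child can be reached through more than one parent, a path-by-path recursion, which is what a recursive CTE performs, emits the same item once per path:

```python
# Hypothetical miniature hierarchy: parent -> children.
# Both anchors "A" and "B" share the children "C" and "D",
# which in turn share "E".
edges = {
    "A": ["C", "D"],
    "B": ["C", "D"],
    "C": ["E"],
    "D": ["E"],
    "E": [],
}

def recurse_all_paths(node):
    """Mimics the recursive CTE: a shared child is emitted once per
    path that reaches it, not once overall."""
    rows = [node]
    for child in edges[node]:
        rows.extend(recurse_all_paths(child))
    return rows

# Two anchor items, as with the CTE's anchor member.
rows = recurse_all_paths("A") + recurse_all_paths("B")
print(len(rows), len(set(rows)))  # 10 rows emitted for only 5 distinct items
```

Even in this tiny graph the recursion produces twice as many rows as there are distinct items; with deeper sharing the duplication compounds multiplicatively, which is what drove the millions of rows seen later in the query plan.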
Below is the original CTE code:
WITH CTE(ItemID, StartPt, StartID, EndPt, EndID, level)
AS
( -- Anchor member definition
SELECT anc.ItemID, anc.StartPt, anc.StartID, anc.EndPt, anc.EndID,0
FROM dbo.CTEIssue AS anc
WHERE anc.StartPt = -1512034855
AND anc.StartID = 1809069910
AND anc.Category IN(-379491138,-379518300,-379520626,472917509,-379494005,-379506005,-2134852210,-379515627,-379515134,-379484415)
UNION ALL
-- Recursive member definition
SELECT chi.ItemID, chi.StartPt, chi.StartID, chi.EndPt, chi.EndID, par.level+1
FROM
dbo.CTEIssue AS chi
INNER JOIN
CTE AS par ON(par.EndPt = chi.StartPt AND par.EndID = chi.StartID)
WHERE
chi.Category IN(-379491138,-379518300,-379520626,472917509,-379494005,-379506005,-2134852210,-379515627,-379515134,-379484415)
)
SELECT DISTINCT cti.*
FROM
dbo.CTEIssue AS cti
INNER JOIN
CTE AS cte ON(cti.ItemID = cte.ItemID)
WHERE (cti.EndPt <> -1512034855 OR cti.Category = -379484415)
GO
Identifying the issue
Firstly, we understand that there are cases where an inefficient query is acceptable, provided that it meets the intended objectives; it only becomes an issue when those objectives are not attained. In this case, the inefficient CTE met the expected response times for small data sets and simple predicates, and only became an issue once the data set grew and throughput increased. Another concept to appreciate is that an approach that is slow for small data sets can be significantly faster when dealing with larger ones. For example, nested loop joins are great for small sets of data but are typically no match for hash joins when the data is large.
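The join tradeoff can be illustrated with a deliberately simplified operation-count model (an assumption for illustration only; SQL Server's actual costing also accounts for index seeks, memory grants, and per-row overheads that let nested loops win on tiny inputs):

```python
# Simplified cost model (illustrative assumption, not SQL Server costing):
# a naive nested loops join compares every outer row against every inner
# row, while a hash join pays a one-time build plus one probe per row.
def nested_loops_ops(outer_rows, inner_rows):
    return outer_rows * inner_rows  # quadratic in input size

def hash_join_ops(outer_rows, inner_rows):
    return outer_rows + inner_rows  # linear in input size

# Small inputs: both counts are trivial, and in practice nested loops'
# low startup overhead (not modeled here) makes it the cheaper plan.
print(nested_loops_ops(10, 10), hash_join_ops(10, 10))
# Large inputs: the quadratic term dominates by orders of magnitude.
print(nested_loops_ops(100_000, 100_000), hash_join_ops(100_000, 100_000))
```

This is why a plan chosen from row estimates in the single digits can be catastrophic when the actual row counts reach the millions, as happened here.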
It's relatively easy to recognize that a CTE is not the most efficient approach when the statement immediately following the CTE expression relies on a DISTINCT across all the columns. Essentially, this implies that the recursive logic has no primary key in the anchor, so each recursive member is not required to be unique. This creates a scenario where a huge number of duplicate rows are generated.
Another way to recognize that a CTE is inefficient is to compare the estimated query plan statistics to the actual statistics. The table below highlights the differences between the query plan estimates and the actuals.
Estimate Rows  Estimate Executions  Rows     Executes  TruncatedOperatorText (First 60 characters)
183.53         NULL                 3        1         WITH CTE(ItemID, StartPt, StartID, EndPt, EndID, level) AS
35.22541       1                    3        1         |--Nested Loops(Inner Join, OUTER REFERENCES:([Recr1014]
35.22541       1                    299197   1         |--Hash Match(Aggregate, HASH:([Recr1014]), RESIDUAL:
1240.829       1                    4129317  1         | |--Index Spool(WITH STACK)
1240.829       1                    4129317  1         | |--Concatenation
1              1241.829             0        0         | |--Compute Scalar(DEFINE:([Expr1021]=(
1237.773       1                    0        0         | | |--Compute Scalar(DEFINE:([Expr10
1237.773       1                    1723     1         | | |--Nested Loops(Inner Join,
1237.773       1                    1723     1         | | |--Index Seek(OBJECT:([
1              1237.773             1723     1723      | | |--Clustered Index Seek
1.002469       1241.829             4127594  1         | |--Assert(WHERE:(CASE WHEN [Expr1023]>
1.002469       1241.829             4127594  1         | |--Nested Loops(Inner Join, OUTER
1              1241.829             0        0         | |--Compute Scalar(DEFINE:([E
1              1241.829             4129317  1         | | |--Table Spool(WITH STA
3.055948       1240.829             0        0         | |--Compute Scalar(DEFINE:([E
3.055948       1240.829             4127594  4129317   | |--Nested Loops(Inner J
3.055948       1240.829             4127594  4129317   | |--Index Seek(OBJE
1              3791.91              4127594  4127594   | |--Clustered Index
1              35.22541             3        299197    |--Clustered Index Seek(OBJECT:([MCC2].[dbo].[CTEIssu
As you can see, in some cases the estimated rows and actual rows were significantly skewed.
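The skew can be quantified by comparing total estimated rows (estimated rows per execution multiplied by estimated executions) against actual rows. The figures below are taken directly from the plan table above; the operator labels are shortened for readability:

```python
# (estimated rows per execution, estimated executions, actual rows)
# for two operators on the recursive side of the plan above.
operators = {
    "Index Spool (WITH STACK)":       (1240.829,  1,        4_129_317),
    "Nested Loops (recursive join)":  (3.055948,  1240.829, 4_127_594),
}

for name, (est_rows, est_execs, actual) in operators.items():
    # Total estimated rows across all executions of the operator.
    estimated_total = est_rows * est_execs
    ratio = actual / estimated_total
    print(f"{name}: actual is ~{ratio:,.0f}x the estimate")
```

Both operators processed over a thousand times more rows than the optimizer estimated, which explains why the single plan chosen up front was so badly suited to the actual data.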
Improving the performance
The objective of that query was to identify unique rows and remove them from the recursive iteration. If this can be done in the CTE, that is ideal; otherwise, a more performant alternative is to convert the query into a WHILE loop. The CTE is far more efficient than a WHILE loop when there is a unique parent/child key. However, let me emphasize that when you have non-unique parent/child keys, the estimates used to create the query plan are likely to be significantly skewed.
Below is the code that meets the objective of being more restrictive during each iteration. It is followed by an explanation of the reasoning behind the construction of the query, and why it performs better:
SET NOCOUNT ON
DECLARE @StartPt int = -1512034855;
DECLARE @StartID int = 1809069910;
DECLARE @Counter tinyint = 0;
DECLARE @MaxRecursion tinyint = 50;
DECLARE @CategoryList table(Category int NOT NULL)
DECLARE @PtIDPair table
(
Pt int NOT NULL
,ID int NOT NULL
,lvl tinyint NOT NULL
,PRIMARY KEY CLUSTERED (lvl, Pt, ID)
)
CREATE TABLE #Item
(
ItemID int NOT NULL
,Category int NOT NULL
,StartPt int NOT NULL
,StartID int NOT NULL
,EndPt int NOT NULL
,EndID int NOT NULL
,lvl tinyint NOT NULL
)
CREATE CLUSTERED INDEX CDX_#Item_Lvl ON #Item ([lvl])
CREATE NONCLUSTERED INDEX NDX_#Item_ItemID ON #Item (ItemID)
INSERT INTO @CategoryList
VALUES
(-379491138)
,(-379518300)
,(-379520626)
,(472917509)
,(-379494005)
,(-379506005)
,(-2134852210)
,(-379515627)
,(-379515134)
,(-379484415);
--Gather the anchors
-- There are a set of ItemID values, which happens to contain a common
-- Pt/Id combination. These are level 0 (Anchors)
INSERT INTO #Item
(ItemID, Category, StartPt, StartID, EndPt, EndID, lvl)
SELECT
ItemID, cti.Category, StartPt, StartID, EndPt, EndID, @Counter
FROM
dbo.CTEIssue AS cti
INNER JOIN
@CategoryList AS lst ON(lst.Category = cti.Category)
WHERE
StartPt = @StartPt
AND StartID = @StartID
-- We want to remove the duplicates, from the final result set, so we do our filtering to create a unique set of anchor pairs
INSERT INTO @PtIDPair(Pt, ID, lvl)
SELECT EndPt, EndID, @Counter
FROM #Item
WHERE lvl = @Counter
GROUP BY EndPt, EndID
WHILE @@ROWCOUNT > 0
BEGIN
SET @Counter += 1
IF @Counter = @MaxRecursion
BEGIN
RAISERROR('Max Recursion reached', 16, 1)
BREAK
END
INSERT INTO #Item(ItemID, Category, StartPt, StartID, EndPt, EndID, lvl)
SELECT cti.ItemID, cti.Category, cti.StartPt, cti.StartID, cti.EndPt, cti.EndID, @Counter
FROM
dbo.CTEIssue AS cti
INNER JOIN
@CategoryList AS lst ON(lst.Category = cti.Category)
INNER JOIN
@PtIDPair AS pr ON(cti.StartPt = pr.Pt AND cti.StartID = pr.ID)
LEFT OUTER JOIN
#Item AS itm ON(itm.ItemID = cti.ItemID)
WHERE
itm.ItemID IS NULL
AND pr.lvl = @Counter - 1
INSERT INTO @PtIDPair(Pt, ID, lvl)
SELECT itm.EndPt, itm.EndID, @Counter
FROM
#Item AS itm
LEFT OUTER JOIN
@PtIDPair AS pr ON(itm.EndPt = pr.Pt AND itm.EndID = pr.ID)
WHERE
pr.ID IS NULL
AND itm.lvl = @Counter
GROUP BY
itm.EndPt, itm.EndID
END
SELECT
cti.*
FROM
dbo.CTEIssue AS cti
INNER JOIN
#Item AS itm ON(cti.ItemID = itm.ItemID)
WHERE
cti.EndPt <> -1512034855
OR cti.Category = -379484415
GO
Side Note: Our investigations revealed that in this case table variables significantly outperformed local temp tables.
The first step was to declare variables and create a temporary table to store the equivalent of the CTE result set. A table variable was used to store the unique join-key pairs generated during the iterations. Just before entering the WHILE loop, we created the first records (the equivalent of the anchor member). Inside the loop we found all the children based on the join criteria, further filtered by the predicates, and inserted them into the list of unique records, so the next iteration would not collect the same children again. This was the most important distinction.
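The loop's strategy can be sketched in a few lines of Python (illustrative only; the function and variable names are made up): expand the hierarchy one level at a time, and use a "seen" set so that an item discovered at an earlier level is never re-expanded. This is exactly what the LEFT OUTER JOIN ... IS NULL filters against #Item and @PtIDPair accomplish in the T-SQL above.

```python
def expand_levels(children_of, anchors, max_levels=50):
    """Level-by-level expansion with de-duplication.

    seen     mirrors #Item (every item ever collected)
    frontier mirrors @PtIDPair rows at lvl = @Counter
    max_levels mirrors @MaxRecursion and the RAISERROR guard.
    """
    seen = set(anchors)
    frontier = set(anchors)
    level = 0
    while frontier:
        level += 1
        if level >= max_levels:
            raise RuntimeError("Max Recursion reached")
        nxt = set()
        for node in frontier:
            for child in children_of.get(node, ()):
                if child not in seen:   # each item is expanded only once
                    seen.add(child)
                    nxt.add(child)
        frontier = nxt                  # loop ends when no new items appear
    return seen

# The same shared-parent graph explodes under naive recursion,
# but here every item is visited exactly once.
graph = {"A": ["C", "D"], "B": ["C", "D"], "C": ["E"], "D": ["E"]}
print(sorted(expand_levels(graph, {"A", "B"})))
```

Each pass touches only the frontier discovered in the previous pass, so total work is proportional to the number of distinct items rather than the number of paths, which is the source of the speedup.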
Improving the CTE Query Plan
Compared to the CTE, the WHILE loop is a fair amount of coding. Prior to re-writing the CTE as a WHILE loop, we attempted to optimize the query plan itself, within the constraint of not being able to generate unique values in the join. We modified the CTE to use a table variable rather than an IN clause, created indexes, and updated the statistics. This achieved a 50% performance improvement, but still fell well short of the objective of improving by at least a factor of 12.
Below are the STATISTICS IO and TIME outputs demonstrating the improvement from adding the table variable and indexes:
Original
(3 row(s) affected)
Table 'CTEIssue'. Scan count 41,592,377, logical reads 137,303,190, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 2, logical reads 24,576,797, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 209197ms, elapsed time = 217344ms.
with TVP
(3 row(s) affected)
Table 'CTEIssue'. Scan count 4,428,524, logical reads 41,930,102, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 2, logical reads 24,575,641, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#6EF57B66'. Scan count 1, logical reads 19,007,602, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 143646ms, elapsed time = 151268ms.
Side Note: Table variables do not maintain statistics and are generally not recommended when cost-based plan choices can result in significant performance variability; a local temporary table is typically the better choice in that case. We used a table variable here because the ISV application was generating an IN clause that could reach hundreds of values, and the table variable approach was more robust and, in some cases, more efficient.
When we compared the query plans of the CTE and the WHILE loop, we realized that each iteration of the WHILE loop has the luxury of a different query plan. This is unlike the CTE, which executes against a single query plan throughout.
A significant contributor to the long-running nature of the CTE query is the portion of the query plan that estimates fewer than 10 rows on both the outer and inner inputs, but ends up iterating through millions of rows on each side. For a table that has only a few hundred thousand rows, this is a direct consequence of the non-unique join criteria.
In Conclusion…
The Common Table Expression (CTE) is a powerful syntax for addressing the historically challenging objective of writing recursive queries in T-SQL. It is a natural, and often appropriate, choice when migrating from the CONNECT BY syntax. Recursive CTE queries do, however, rely on unique parent/child keys to achieve the best performance. If that is not possible, a WHILE loop is potentially a much more efficient approach to handling the recursive query.
Our final result reduced this frequently run query from 144 seconds to 4 seconds, a 3,600% performance improvement!
Comments
Anonymous, January 15, 2013
Nice article ....

Anonymous, October 16, 2013
Great article. Congratulations

Anonymous, November 19, 2014
My recursive CTE doesn't have distinct. Still it take more time. How can I improve the performace of it?

Anonymous, March 04, 2015
My cte's parent/child are in unique fashion only. Still it takes 75 secs. I need it less than 30 secs. What's the solution.

Anonymous, March 27, 2015
If only the above was entirely true... I was facing slow performance of a textbook hierarchical CTE querying organizational structure using reports_to_id = id (of the supervisor) where id was a PK and it was much slower, than a solution with a temporary table populated in a loop while the subordinate records to be inserted did not yet exist. I gained 5x performance after rewriting the CTE as insert into/while loop. This was on 2008R2. Shame on the optimizer developers.

Anonymous, December 03, 2015
Now only if we had similar functionality in MySQL.

Anonymous, February 28, 2017
Impressive work 113 lines of code to get the same result as using "connect by" - Mindblowing indeed! Good Job SQL-Server, not that this would be the most basic thing one could do!