Common Table Expressions
Code download available at:DataPoints2007_10.exe(150 KB)
Views, Derived Tables, and CTEs
Structure of a CTE
Rules of Recursion
Many projects that developers work on involve writing complex SQL statements that deviate from basic SELECT/FROM/WHERE types of statements. One such scenario involves writing Transact-SQL (T-SQL) queries that use derived tables (also known as inline views) inside a FROM clause. This common practice allows a developer to grab a rowset and immediately join that rowset to other tables, views, and user-defined functions in a SELECT statement. Another option is to use a view instead of a derived table. Both of these options have their advantages and disadvantages.
When working with SQL Server™ 2005, I prefer a third option of using Common Table Expressions (CTEs). CTEs can help improve the readability (and thus the maintainability) of the code without compromising performance. They also make writing recursive code in T-SQL significantly easier than it was in previous versions of SQL Server.
In this month's column, I am focusing on using CTEs to address common development scenarios. I will begin by describing how CTEs work and what scenarios they can be used to address. Then I will discuss the advantages of using CTEs versus using traditional T-SQL constructs, such as derived tables, views, and custom procedures. Throughout the column, I will give examples and explain how and where they can be used. I will also demonstrate how CTEs handle recursive logic and define how a recursive CTE operates. Note that all the code I discuss in this column is available for download from the MSDN® Magazine Web site and uses the Northwind and the AdventureWorks databases that come with SQL Server 2005.
Views, Derived Tables, and CTEs
CTEs can be useful when queries need to select from a set of data that does not exist as a table within the database. For example, you might want to write a query against a set of aggregated data that calculates values based on customers and their orders. The aggregated data might join the Customers, Orders, and Order Details tables together and calculate the sum and average values of the orders. And you may want to query against that aggregated rowset. One solution is to create a view that gathers the aggregated data first and then write a query against that view. Another option is to query against the aggregated data using a derived table. You can do this by moving the SQL statement into the FROM clause and querying against it.
Views are generally used to break down large queries so they can be queried against in a much more readable manner. For instance, a view can represent a SELECT statement that joins 10 tables together, selects dozens of columns, and filters out rows based on an involved set of logic. The view can then be queried against by other SELECT statements throughout the database. This abstraction provides easier access to the rowset being represented by the view, and there's no need to duplicate the data or store it in a temp table.
The view can also be reused throughout the database, assuming permissions allow this. In Figure 1, for example, a view is created and then used by another T-SQL statement. However, in situations where you want to gather data and only use it a single time, views may not be the best solution. Since a view is a database object that exists and is available throughout the database to all batches, creating a view that is only used in a single T-SQL batch is overkill.
Figure 1 View Being Queried
CREATE VIEW vwMyView AS SELECT EmployeeID, COUNT(*) AS NumOrders, MAX(OrderDate) AS MaxDate FROM Orders GROUP BY EmployeeID GO SELECT e.EmployeeID, oe.NumOrders, oe.MaxDate, e.ReportsTo AS ManagerID, om.NumOrders, om.MaxDate FROM Employees AS e INNER JOIN vwMyView AS oe ON e.EmployeeID = oe.EmployeeID INNER JOIN vwMyView AS om ON e.ReportsTo = om.EmployeeID
Another option is to create a derived table (also known as an inline view). A derived table can be created by simply moving a SELECT statement inside of a FROM clause surrounded by parenthesis. It can then be queried against or joined to just like a table or view. The code in Figure 2 solves the same query that Figure 1 solves, but instead of using a view it uses derived tables. While derived tables are only accessible within the statement in which they exist, tables generally make a query more difficult to read and thus to maintain. This problem can be multiplied if you want to use the derived table multiple times within the same batch, as you must then copy and paste the derived table to reuse it.
Figure 2 Query Using Derived Tables
SELECT e.EmployeeID, oe.NumOrders, oe.MaxDate, e.ReportsTo AS ManagerID, om.NumOrders, om.MaxDate FROM Employees AS e INNER JOIN (SELECT EmployeeID, COUNT(*), MAX(OrderDate) FROM Orders GROUP BY EmployeeID) AS oe(EmployeeID, NumOrders, MaxDate) ON e.EmployeeID = oe.EmployeeID LEFT JOIN (SELECT EmployeeID, COUNT(*), MAX(OrderDate) FROM Orders GROUP BY EmployeeID) AS om(EmployeeID, NumOrders, MaxDate) ON e.ReportsTo = om.EmployeeID
A CTE is a nice fit for this type of scenario since it makes the T-SQL much more readable (like a view), yet it can be used more than once in a query that immediately follows in the same batch. Of course, it is not available beyond that scope. Additionally, the CTE is a language-level construct—meaning that SQL Server does not internally create temp or virtual tables. The CTE's underlying query will be called each time it is referenced in the immediately following query.
Therefore, this same scenario can also be written using a CTE, as shown in Figure 3. The EmpOrdersCTE gathers the aggregated data and is then used by the query that immediately follows the CTE. Using the CTE, the code in Figure 3 makes the query very readable (like a view), yet does not create a system object to store the metadata.
Figure 3 Querying with a CTE
; WITH EmpOrdersCTE (EmployeeID, NumOrders, MaxDate) AS (SELECT EmployeeID, COUNT(*), MAX(OrderDate) FROM Orders GROUP BY EmployeeID) SELECT e.EmployeeID, oe.NumOrders, oe.MaxDate, e.ReportsTo AS ManagerID, om.NumOrders, om.MaxDate FROM Employees AS e INNER JOIN EmpOrdersCTE oe ON e.EmployeeID = oe.EmployeeID LEFT JOIN EmpOrdersCTE om ON e.ReportsTo = om.EmployeeID
Structure of a CTE
Now I'll demonstrate how to construct a CTE using a simple CTE as an example. A CTE begins with the WITH keyword. However, if the CTE is not the first statement in the batch, you must precede the WITH keyword with a semicolon. As a best practice, I prefer to prefix all of my CTEs with a semicolon—I find this consistent approach easier than having to remember whether I need a semicolon or not.
The WITH keyword is followed by the name of the CTE, which is then followed by an optional list of column aliases. The column aliases correspond to the columns that are returned by the SELECT statement inside the CTE. Following the optional column aliases is the AS keyword—this is required. And the AS keyword is then followed by the query expression that defines the CTE enclosed within parenthesis.
Take a look at this example:
; WITH myCTE (CustID, Co) AS (SELECT CustomerID, CompanyName FROM Customers) SELECT CustID, Co FROM myCTE
The CustomerID and CompanyName columns are aliased with CustID and Co. Then immediately following the CTE is a SELECT statement that references the CTE using its column aliases.
Before you design a CTE, you must understand how it works and what rules it follows. This section describes where CTEs can be used as well as what can and cannot be used inside a CTE. For starters, CTEs can be created and used inside of a T-SQL batch, a user-defined function, a stored procedure, a trigger, or a view.
A CTE can only be referenced by the statement that immediately follows the CTE. This means that if you want to use a CTE, you must write the query that refers to the CTE immediately after the CTE in the T-SQL batch. For example, the following batch will produce an error:
; WITH myCTE (CustID, Co) AS (SELECT CustomerID, CompanyName FROM Customers) SELECT CompanyName FROM Customers WHERE CustomerID = 'ALFKI' SELECT CustID, Co FROM myCTE
In this code, myCTE is only available for the first query that follows it. When the second query refers to myCTE, the CTE is not in scope and an exception is thrown (Invalid object name 'myCTE').
Also note that since a CTE is intended to be referred to by another query, which may then reprocess the data anyway, a CTE's query cannot contain statements such as ORDER and COMPUTE. However, complex statements such as FOR XML can still be used to define and operate on a CTE. For example, you could query a CTE and return its results using the FOR XML clause, as shown here.
; WITH myCTE AS (SELECT c.CustomerID, c.CompanyName, o.OrderID, o.OrderDate FROM Customers c INNER JOIN Orders o ON c.CustomerID = o.CustomerID) SELECT CustomerID, CompanyName, OrderID, OrderDate FROM myCTE FOR XML AUTO
Once a CTE is defined, it can be referenced multiple times by the first query that follows it. This is especially useful when a query needs to refer to the CTE more than once. The code sample in Figure 3 demonstrates how the EmpOrdersCTE is referred to twice by the query so it can grab the emplo yees and their supervisors. This is very useful when you need to reference the same rowset more than once; it is simpler to refer to the CTE twice than it is to duplicate the query.
A CTE does not have to be used by a SELECT statement; it is available for use by any statement that refers to the rowset that the CTE generates. This means a CTE can be followed by a SELECT, INSERT, UPDATE, or DELETE statement that uses the CTE. You can also use forward only and snapshot cursors on queries that use CTEs.
Furthermore, a CTE can be followed by another CTE. This technique can be used to build a CTE off of another CTE when you want to gather intermediate results into a rowset. When creating a CTE that builds off another CTE, you separate the CTE definitions with a comma.
The example shown in Figure 4 defines the EmpOrdersCTE, which gathers a list of employees and the total number of orders for each employee. The second CTE, named MinMaxOrdersCTE, queries the first EmpOrdersCTE and performs the aggregate functions on the rowset to determine the average, minimum, and maximum number of orders for the employees.
Figure 4 CTE Referencing Another CTE
; WITH EmpOrdersCTE (EmployeeID, NumOrders) AS (SELECT EmployeeID, COUNT(*) FROM Orders GROUP BY EmployeeID), MinMaxOrdersCTE (Mn, Mx, Diff) AS (SELECT MIN(NumOrders), MAX(NumOrders), AVG(NumOrders) FROM EmpOrdersCTE) SELECT Mn, Mx, Diff FROM MinMaxOrdersCTE
Multiple CTEs can be defined after the WITH keyword by separating them with a comma. Each CTE can be referenced by the CTE that follows it, building off of each other. The data manipulation language (DML) statement that follows the CTE definitions can also refer to any of the CTEs defined within the WITH clause.
Rules of Recursion
CTEs can also be used to implement recursive algorithms. Recursive logic is useful when you need to write an algorithm that calls itself—this is often done to traverse a nested set of data. Writing recursive logic can be complex, especially in a language like T-SQL. However, this is one of the specific problems that CTEs were designed to address. The basic formula for creating a recursive CTE is as follows:
- Create the query that returns the top level (this is the anchor member).
- Write a recursive query (this is the recursive member).
- UNION the first query with the recursive query.
- Make sure you have a case where no rows will be returned (this is your termination check).
Your recursive CTE will look something like this:
; WITH myRecursiveCTE(col1, col2, ... coln) AS ( -- Anchor Member Query UNION ALL -- Recursive Member Query that references myRecursiveCTE )
When writing a custom recursion procedure that does not involve CTEs, you must include an explicit termination clause. The termination clause is responsible for making sure that the recursive algorithm eventually terminates and bubbles back up the recursive call stack. Without this clause, your code could end up in an infinite loop.
CTEs, however, have two aspects that help with handling termination clauses. The first is an implicit termination clause that occurs when the recursive member returns zero records. When this happens, the recursive member query does not invoke the CTE recursively but instead bubbles back up the call stack. The other is that you can explicitly set the MAXRECURSION level.
The MAXRECURSION level can be set explicitly in a batch that contains the CTE or through a server-side setting (the server-wide setting defaults to 100 unless you change it). This setting limits the number of times a CTE can recursively call itself. If the limit is reached, an exception is thrown. The syntax for setting the MAXRECURSION level is to use the OPTION clause in your SELECT statement that follows the CTE, like so:
-- DEFINE YOUR CTE HERE SELECT * FROM EmpCTE OPTION (MAXRECURSION 7)
There are some other rules you should also keep in mind when designing recursive CTEs. A recursive CTE must contain both an anchor member and a recursive member. Both members must have the same number of columns and the columns belonging to both members must have matching datatypes. The recursive member can only refer to the CTE once and the members cannot use the following clauses or keywords:
- SELECT DISTINCT
- GROUP BY
- LEFT/RIGHT OUTER JOIN
In terms of data and rowsets, recursion is used to resolve a problem when you need to perform the same logic repeatedly against the same set of data, under different terms. For example, say you are asked to find all of the salespeople and who they work for and return the data in hierarchical order. Figure 5 demonstrates a solution using a CTE that uses recursion to gather the list of employees that work for the VP of Sales.
Figure 5 Recursively Gathering Salespeople
; WITH EmpCTE(EmployeeID, EmployeeFirstName, EmployeeLastName, MgrID, SalesLevel) AS (-- Anchor Member SELECT EmployeeID, FirstName, LastName, ReportsTo, 0 FROM Employees WHERE EmployeeID = 2 -- Start with the VP of Sales UNION ALL -- Recursive Member SELECT e.EmployeeID, e.FirstName, e.LastName, e.ReportsTo, m.SalesLevel+1 FROM Employees AS e INNER JOIN EmpCTE m ON e.ReportsTo = m.EmployeeID) -- Using the CTE SELECT EmployeeID, EmployeeFirstName, EmployeeLastName, MgrID, SalesLevel FROM EmpCTE
Except for a few additional aspects, the recursive CTE shown in Figure 5 looks very much like a standard CTE. Where a standard CTE contains a query that defines a rowset, a recursive CTE defines two query definitions. The first query definition, the anchor member, defines a query that will be executed when the CTE is called. The second query definition, the recursive member, defines a query that returns the same columns and datatypes as the anchor member. The recursive member also retrieves the values that will then be used to call back into the CTE recursively. The results of the queries are pulled together using a UNION statement.
The EmpCTE in Figure 5 shows an anchor member that grabs the employee record for the VP of Sales (EmployeeID = 2). The last column of the anchor member's query returns a value of 0, which represents the 0th level of the hierarchy, which is the top. The recursive member's query grabs the list of employees that report to the previous employee. This is accomplished by joining the Employees table to the EmpCTE.
The same columns are retrieved from the recursive member, but the SalesLevel column is calculated by taking the current employee's supervisor, grabbing the supervisor's SalesLevel, and incrementing it by 1. The expression m.SalesLevel+1 assigns a SalesLevel of 1 to all people who report directly to the VP of Sales (retrieved from our anchor member). Any employees that report to those people then have a SalesLevel of 2. The SalesLevel continues to increment in this way for each subsequent level of the sales organization's hierarchy.
CTEs provide a way to make writing T-SQL much more readable when compared to scenarios that use complex derived tables within a query or referencing a view whose definition is external to the T-SQL batch. CTEs also provide a much improved tool to address the struggles involved in using recursive algorithms. Whether you are using non-recursive or recursive CTEs, you'll find CTEs can help you address many common development scenarios and improve readability without sacrificing performance.
Send your questions and comments for John to firstname.lastname@example.org.
John Papa is a Senior .NET Consultant with ASPSOFT (aspsoft.com) and a baseball fanatic who spends most of his summer nights rooting for the Yankees with his family and his faithful dog, Kadi. John, a C# MVP, has authored several books on ADO, XML, and SQL Server. He can often be found speaking at industry conferences, such as VSLive.