SQL Server 2005 Beta 2 Transact-SQL Enhancements
Itzik Ben-Gan
Solid Quality Learning
December 2004
Applies to:
Transact-SQL
Microsoft SQL Server 2005 Beta 2
Summary: This white paper introduces several of the new enhancements to Transact-SQL in Microsoft SQL Server 2005 Beta 2. These new features can increase your expressive power, the performance of your queries, and your error management capabilities. This paper focuses mainly on relational enhancements that are conceptually new and demonstrates these through practical examples. This paper does not cover every new Transact-SQL feature. (91 printed pages)
Introduction and Scope
Increasing the Expressive Power of Queries and DRI Support
Partitioning
Single-Parent Environment: Employees Organizational Chart
Multiparent Environment: Bill of Materials
Table-Valued Functions in Correlated Subqueries
Enhancements for Performance and Error Handling
Other SQL Server 2005 Beta 2 Capabilities That Affect Transact-SQL
Conclusion
This white paper introduces several of the new enhancements to Transact-SQL in Microsoft SQL Server 2005 Beta 2. These new features can increase your expressive power, the performance of your queries, and your error management capabilities. This paper focuses mainly on relational enhancements that are conceptually new and demonstrates these through practical examples. This paper does not cover every new Transact-SQL feature.
Assumed knowledge: The target audience is expected to be skilled at using Transact-SQL for ad hoc queries and as components of applications in Microsoft SQL Server 2000.
This section introduces the following new relational features and enhancements:
- New ranking functions
- New recursive queries based on common table expressions (CTEs)
- New PIVOT and APPLY relational operators
- Declarative referential integrity (DRI) enhancements
SQL Server 2005 introduces four new ranking functions: ROW_NUMBER, RANK, DENSE_RANK and NTILE. The new functions allow you to efficiently analyze data and provide ranking values to result rows of a query. Typical scenarios where you may find the new functions helpful include: assigning sequential integers to result rows for presentation purposes, paging, scoring, and histograms.
Speaker Statistics Scenario
The following Speaker Statistics scenario will be used to discuss and demonstrate the different functions and their clauses. A large computing conference included three tracks: database, development, and system administration. Eleven speakers spoke at the conference and received scores in the range 1 through 9 for their sessions. The results were summarized and stored in the following SpeakerStats table:
USE tempdb -- or your own test database
CREATE TABLE SpeakerStats
(
speaker VARCHAR(10) NOT NULL PRIMARY KEY,
track VARCHAR(10) NOT NULL,
score INT NOT NULL,
pctfilledevals INT NOT NULL,
numsessions INT NOT NULL
)
SET NOCOUNT ON
INSERT INTO SpeakerStats VALUES('Dan', 'Sys', 3, 22, 4)
INSERT INTO SpeakerStats VALUES('Ron', 'Dev', 9, 30, 3)
INSERT INTO SpeakerStats VALUES('Kathy', 'Sys', 8, 27, 2)
INSERT INTO SpeakerStats VALUES('Suzanne', 'DB', 9, 30, 3)
INSERT INTO SpeakerStats VALUES('Joe', 'Dev', 6, 20, 2)
INSERT INTO SpeakerStats VALUES('Robert', 'Dev', 6, 28, 2)
INSERT INTO SpeakerStats VALUES('Mike', 'DB', 8, 20, 3)
INSERT INTO SpeakerStats VALUES('Michele', 'Sys', 8, 31, 4)
INSERT INTO SpeakerStats VALUES('Jessica', 'Dev', 9, 19, 1)
INSERT INTO SpeakerStats VALUES('Brian', 'Sys', 7, 22, 3)
INSERT INTO SpeakerStats VALUES('Kevin', 'DB', 7, 25, 4)
Each speaker has one row in the table with the speaker's name, track, average score, percentage of attendees who filled out evaluations relative to the number of attendees at the speaker's sessions, and the number of sessions delivered by the speaker. This section demonstrates how to analyze the speaker statistics data to generate useful information using the new ranking functions.
Semantics
All four ranking functions follow a similar syntax pattern:
Ranking Function
<function_name>() OVER(
[PARTITION BY <partition_by_list>]
ORDER BY <order_by_list>)
The function can be specified only in two clauses of a query—in the SELECT clause or in the ORDER BY clause. The following sections discuss the different functions in detail.
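For example, here is a minimal sketch of the less common form, a ranking function specified directly in the ORDER BY clause (all of the examples that follow specify the functions in the SELECT clause):
-- Sort speakers by their rank position without returning the rank itself
SELECT speaker, track, score
FROM SpeakerStats
ORDER BY RANK() OVER(ORDER BY score DESC), speaker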
The ROW_NUMBER function allows you to provide sequential integer values to result rows of a query. For example, suppose you wanted to return the speaker, track, and score of all speakers, assigning sequential values starting from 1 to the result rows according to descending score order. The following query generates the desired results by using the ROW_NUMBER function, specifying OVER (ORDER BY score DESC):
SELECT ROW_NUMBER() OVER(ORDER BY score DESC) AS rownum,
speaker, track, score
FROM SpeakerStats
ORDER BY score DESC
Here is the result set:
rownum speaker track score
------ ---------- ---------- -----------
1 Jessica Dev 9
2 Ron Dev 9
3 Suzanne DB 9
4 Kathy Sys 8
5 Michele Sys 8
6 Mike DB 8
7 Kevin DB 7
8 Brian Sys 7
9 Joe Dev 6
10 Robert Dev 6
11 Dan Sys 3
The speaker with the highest score got row number 1, and the speaker with the lowest score got row number 11. ROW_NUMBER always assigns distinct row numbers to different rows according to the requested sort. Note that if the ORDER BY list specified within the OVER() option is not unique, the result is nondeterministic. This means that there's more than one correct result to the query; in different invocations of the query you might get different results. For example, in our case three different speakers got the same highest score (9): Jessica, Ron, and Suzanne. Since SQL Server has to assign different row numbers to the different speakers, you should assume that the values 1, 2, and 3 assigned to Jessica, Ron, and Suzanne, respectively, were assigned in arbitrary order among those speakers. The result would have been just as correct if the values 1, 2, and 3 were assigned to Ron, Suzanne, and Jessica, respectively.
If you specify a unique ORDER BY list, the result is always deterministic. For example, suppose that in the case of a tie between speakers based on score you want to use the highest pctfilledevals value as the tiebreaker. If there's still a tie, use the highest numsessions value as the tiebreaker. Finally, if there's still a tie, use the lowest dictionary-order speaker name as a tiebreaker. Since the ORDER BY list—score, pctfilledevals, numsessions, and speaker—is unique, the result is deterministic:
SELECT ROW_NUMBER() OVER(ORDER BY score DESC, pctfilledevals DESC,
numsessions DESC, speaker) AS rownum,
speaker, track, score, pctfilledevals, numsessions
FROM SpeakerStats
ORDER BY score DESC, pctfilledevals DESC, numsessions DESC, speaker
Here is the result set:
rownum speaker track score pctfilledevals numsessions
------ ---------- ---------- ----------- -------------- -----------
1 Ron Dev 9 30 3
2 Suzanne DB 9 30 3
3 Jessica Dev 9 19 1
4 Michele Sys 8 31 4
5 Kathy Sys 8 27 2
6 Mike DB 8 20 3
7 Kevin DB 7 25 4
8 Brian Sys 7 22 3
9 Robert Dev 6 28 2
10 Joe Dev 6 20 2
11 Dan Sys 3 22 4
One of the important benefits of the new ranking functions is their efficiency. SQL Server's optimizer needs to scan the data only once in order to calculate the values. It does this either by using an ordered scan of an index placed on the sort columns or by scanning the data once and sorting it if an appropriate index was not created.
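For example, the following sketch (the index name is illustrative) creates an index on the sort column of SpeakerStats so that the optimizer can compute the row numbers with an ordered scan rather than an explicit sort:
-- Index on the sort column; the optimizer can return rows in score order
-- directly from the index instead of sorting the data
CREATE INDEX idx_score ON SpeakerStats(score DESC)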
Another benefit is the simplicity of the syntax. To give you a sense of how difficult and inefficient it is to calculate ranking values by using the set-based approach used in earlier releases of SQL Server, consider the following SQL Server 2000 query, which returns the same results as the previous query:
SELECT
(SELECT COUNT(*)
FROM SpeakerStats AS S2
WHERE S2.score > S1.score
OR (S2.score = S1.score
AND S2.pctfilledevals > S1.pctfilledevals)
OR (S2.score = S1.score
AND S2.pctfilledevals = S1.pctfilledevals
AND S2.numsessions > S1.numsessions)
OR (S2.score = S1.score
AND S2.pctfilledevals = S1.pctfilledevals
AND S2.numsessions = S1.numsessions
AND S2.speaker < S1.speaker)) + 1 AS rownum,
speaker, track, score, pctfilledevals, numsessions
FROM SpeakerStats AS S1
ORDER BY score DESC, pctfilledevals DESC, numsessions DESC, speaker
This query is obviously much more complex than the SQL Server 2005 query. Furthermore, for each base row in the SpeakerStats table, SQL Server has to scan all matching rows in another instance of the table. On average, about half (at minimum) of the table's rows need to be scanned for each row in the base table. The cost of the SQL Server 2005 query grows linearly with the table size, while the cost of the SQL Server 2000 query grows quadratically. Even in fairly small tables the performance difference is significant. For example, test the performance of the following queries, which query the SalesOrderHeader table in the AdventureWorks database to calculate row numbers for sales orders according to SalesOrderID order. The SalesOrderHeader table has 31,465 rows. The first query uses the SQL Server 2005 ROW_NUMBER function, while the second query uses the SQL Server 2000 subquery technique:
-- SQL Server 2005 query
SELECT SalesOrderID,
ROW_NUMBER() OVER(ORDER BY SalesOrderID) AS rownum
FROM Sales.SalesOrderHeader
-- SQL Server 2000 query
SELECT SalesOrderID,
(SELECT COUNT(*)
FROM Sales.SalesOrderHeader AS S2
WHERE S2.SalesOrderID <= S1.SalesOrderID) AS rownum
FROM Sales.SalesOrderHeader AS S1
I ran this test on my laptop (Compaq Presario X1020U, CPU: Centrino 1.4 GHz, RAM: 1 GB, local hard drive). The SQL Server 2005 query finished in only 1 second, while the SQL Server 2000 query finished in about 12 minutes.
A typical application for row numbers is to page through the results of a query. Given a page size in terms of number of rows, and a page number, you need to return the rows that belong to the given page. For example, suppose you want to return the second page of rows from the SpeakerStats table, assuming a page size of three rows, according to score DESC, speaker order. The following query first calculates row numbers according to the specified sort in the derived table D, then it filters only rows with the row numbers 4 through 6, which belong to the second page:
SELECT *
FROM (SELECT ROW_NUMBER() OVER(ORDER BY score DESC, speaker) AS rownum,
speaker, track, score
FROM SpeakerStats) AS D
WHERE rownum BETWEEN 4 AND 6
ORDER BY score DESC, speaker
Here is the result set:
rownum speaker track score
------ ---------- ---------- -----------
4 Kathy Sys 8
5 Michele Sys 8
6 Mike DB 8
In more generic terms, given a page number in the @pagenum variable, and a page size in the @pagesize variable, the following query returns the rows that belong to the desired page:
DECLARE @pagenum AS INT, @pagesize AS INT
SET @pagenum = 2
SET @pagesize = 3
SELECT *
FROM (SELECT ROW_NUMBER() OVER(ORDER BY score DESC, speaker) AS rownum,
speaker, track, score
FROM SpeakerStats) AS D
WHERE rownum BETWEEN (@pagenum-1)*@pagesize+1 AND @pagenum*@pagesize
ORDER BY score DESC, speaker
The above approach is adequate for ad hoc requests when you're only interested in one specific page of the rows. However, this approach is not adequate when the user issues multiple requests because each invocation of the query costs you a complete scan of the table in order to calculate the row numbers. For more efficient paging when the user might repeatedly request different pages, first populate a temporary table with all of the base table rows including calculated row numbers and index the column containing the row numbers:
SELECT ROW_NUMBER() OVER(ORDER BY score DESC, speaker) AS rownum, *
INTO #SpeakerStatsRN
FROM SpeakerStats
CREATE UNIQUE CLUSTERED INDEX idx_uc_rownum ON #SpeakerStatsRN(rownum)
And then for each requested page issue the following query:
SELECT rownum, speaker, track, score
FROM #SpeakerStatsRN
WHERE rownum BETWEEN (@pagenum-1)*@pagesize+1 AND @pagenum*@pagesize
ORDER BY score DESC, speaker
Only the rows that belong to the desired page will be scanned.
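For example, assuming the variables are declared in the same batch as the query, returning the second page looks like this:
DECLARE @pagenum AS INT, @pagesize AS INT
SET @pagenum = 2
SET @pagesize = 3
-- Seek on the clustered index of #SpeakerStatsRN; only the page's rows are scanned
SELECT rownum, speaker, track, score
FROM #SpeakerStatsRN
WHERE rownum BETWEEN (@pagenum-1)*@pagesize+1 AND @pagenum*@pagesize
ORDER BY score DESC, speaker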
Ranking values can be calculated within groups of rows independently as opposed to being calculated for all table rows as one group. To do this, use the PARTITION BY clause and specify a list of expressions that identify the groups of rows for which ranking values should be calculated independently. For example, the following query assigns row numbers within each track separately according to score DESC, speaker order:
SELECT track,
ROW_NUMBER() OVER(
PARTITION BY track
ORDER BY score DESC, speaker) AS pos,
speaker, score
FROM SpeakerStats
ORDER BY track, score DESC, speaker
Here is the result set:
track pos speaker score
---------- --- ---------- -----------
DB 1 Suzanne 9
DB 2 Mike 8
DB 3 Kevin 7
Dev 1 Jessica 9
Dev 2 Ron 9
Dev 3 Joe 6
Dev 4 Robert 6
Sys 1 Kathy 8
Sys 2 Michele 8
Sys 3 Brian 7
Sys 4 Dan 3
Specifying the track column in the PARTITION BY clause causes row numbers to be calculated independently for each group of rows with the same track.
The RANK and DENSE_RANK functions are very similar to the ROW_NUMBER function in the sense that they also provide ranking values according to a specified sort, optionally within groups (partitions) of rows. However, unlike ROW_NUMBER, RANK and DENSE_RANK assign the same ranks to rows with the same values in the sort columns. RANK and DENSE_RANK are useful when you don't want to assign different ranks for rows with the same values in the ORDER BY list when the ORDER BY list is not unique. The purpose of RANK and DENSE_RANK and the difference between the two is best explained with an example. The following query calculates row number, rank, and dense rank values to the different speakers according to score DESC order:
SELECT speaker, track, score,
ROW_NUMBER() OVER(ORDER BY score DESC) AS rownum,
RANK() OVER(ORDER BY score DESC) AS rnk,
DENSE_RANK() OVER(ORDER BY score DESC) AS drnk
FROM SpeakerStats
ORDER BY score DESC
Here is the result set:
speaker track score rownum rnk drnk
---------- ---------- ----------- ------ --- ----
Jessica Dev 9 1 1 1
Ron Dev 9 2 1 1
Suzanne DB 9 3 1 1
Kathy Sys 8 4 4 2
Michele Sys 8 5 4 2
Mike DB 8 6 4 2
Kevin DB 7 7 7 3
Brian Sys 7 8 7 3
Joe Dev 6 9 9 4
Robert Dev 6 10 9 4
Dan Sys 3 11 11 5
As discussed earlier, the score column is not unique, so different speakers might have the same score. Row numbers do represent a descending score order, but speakers with the same score still get different row numbers. However, notice in the result that all speakers with the same scores get the same rank and dense rank values. In other words, ROW_NUMBER is not deterministic when the ORDER BY list is not unique, while RANK and DENSE_RANK are always deterministic. The difference between the two is that the rank is the number of rows with a higher score, plus one, while the dense rank is the number of distinct higher scores, plus one. From what you've learned so far you can deduce that ROW_NUMBER, RANK, and DENSE_RANK produce the exact same values when the ORDER BY list is unique.
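To relate the definitions to code, here is a sketch of how the same rank and dense rank values could be computed with SQL Server 2000-style subqueries, counting higher scores and distinct higher scores:
SELECT speaker, track, score,
  -- rnk: number of rows with a higher score, plus one
  (SELECT COUNT(*) FROM SpeakerStats AS S2
   WHERE S2.score > S1.score) + 1 AS rnk,
  -- drnk: number of distinct higher scores, plus one
  (SELECT COUNT(DISTINCT score) FROM SpeakerStats AS S2
   WHERE S2.score > S1.score) + 1 AS drnk
FROM SpeakerStats AS S1
ORDER BY score DESC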
NTILE allows you to separate the result rows of a query into a specified number of groups (tiles) according to a specified order. Each group of rows gets a different number starting with 1 for the first group, 2 for the second, and so on. You specify the requested number of groups in the parentheses following the function's name, and the requested sort in the ORDER BY clause of the OVER option. The number of rows in a group is calculated as total_num_rows / num_groups. If there's a remainder n, the first n groups get an additional row. So not all groups necessarily get the same number of rows, but group sizes differ by at most one row. For example, the following query assigns three group numbers to the different speaker rows according to descending score order:
SELECT speaker, track, score,
ROW_NUMBER() OVER(ORDER BY score DESC) AS rownum,
NTILE(3) OVER(ORDER BY score DESC) AS tile
FROM SpeakerStats
ORDER BY score DESC
Here is the result set:
speaker track score rownum tile
---------- ---------- ----------- ------ ----
Jessica Dev 9 1 1
Ron Dev 9 2 1
Suzanne DB 9 3 1
Kathy Sys 8 4 1
Michele Sys 8 5 2
Mike DB 8 6 2
Kevin DB 7 7 2
Brian Sys 7 8 2
Joe Dev 6 9 3
Robert Dev 6 10 3
Dan Sys 3 11 3
There are 11 speakers in the SpeakerStats table. Dividing 11 by 3 gives you a group size of 3 with a remainder of 2, meaning that the first 2 groups will get an additional row (4 rows in each group), and the third group won't (3 rows in the group). The group number (tile number) 1 is assigned to rows 1 through 4, group number 2 is assigned to rows 5 through 8, and group number 3 is assigned to rows 9 through 11. This information allows you to generate a histogram with an even distribution of items for each step. In our case, the first step represents the speakers with the highest scores, the second represents the speakers with the medium scores, and the third represents the speakers with the lowest scores. You can use a CASE expression to provide descriptive meaningful alternatives to the group numbers:
SELECT speaker, track, score,
CASE NTILE(3) OVER(ORDER BY score DESC)
WHEN 1 THEN 'High'
WHEN 2 THEN 'Medium'
WHEN 3 THEN 'Low'
END AS scorecategory
FROM SpeakerStats
ORDER BY track, speaker
Here is the result set:
speaker track score scorecategory
---------- ---------- ----------- -------------
Kevin DB 7 Medium
Mike DB 8 Medium
Suzanne DB 9 High
Jessica Dev 9 High
Joe Dev 6 Low
Robert Dev 6 Low
Ron Dev 9 High
Brian Sys 7 Medium
Dan Sys 3 Low
Kathy Sys 8 High
Michele Sys 8 Medium
This section explores the subtleties of recursive CTEs and applies them as solutions to common problems in a way that greatly simplifies traditional approaches.
A common table expression (CTE) is a temporary named result set that can be referred to by the defining statement. In their simple form, you can think of CTEs as an improved version of derived tables that more closely resemble a nonpersistent type of view. You refer to a CTE in the FROM clause of a query similar to the way you refer to derived tables and views. You define the CTE only once and can refer to it several times in your query. In the definition of a CTE, you can refer to variables that are defined in the same batch. You can even use CTEs in INSERT, UPDATE, DELETE, and CREATE VIEW statements, similar to the way you use views. The real power of CTEs, however, is in their recursive capabilities, in which CTEs contain references to themselves. In this paper, CTEs are described first in their simple form and later in their recursive form. This paper covers SELECT queries with CTEs.
You use derived tables when you want to refer to a query result as if it were a table, but you do not want to create a persistent view in the database. Derived tables, however, have a limitation not found in CTEs: You cannot define a derived table once in your query and use it several times. Instead, you must define several derived tables with the same query. You can, however, define a CTE once and use it several times in a query without persisting it in the database.
Before providing a practical example of CTEs, the basic syntax of CTEs is compared to derived tables and views. The following is a general form of a query within a view, derived table, and CTE:
View
CREATE VIEW <view_name>(<column_aliases>)
AS
<view_query>
GO
SELECT *
FROM <view_name>
Derived Table
SELECT *
FROM (<derived_table_query>) AS <derived_table_alias>(<column_aliases>)
CTE
WITH <cte_alias>(<column_aliases>)
AS
(
<cte_query>
)
SELECT *
FROM <cte_alias>
You provide the CTE with an alias and an optional list of aliases for its result columns following the keyword WITH; write the body of the CTE; and refer to it from the outer query.
Note that if the WITH clause for a CTE is not the first statement in the batch, you should delimit it from the preceding statement by placing a semicolon (;) in front of it. The semicolon is used to avoid ambiguity with other uses of the WITH clause (for example, for table hints). Although you may find that including a semicolon is not necessary in all cases, it is recommended that you use it consistently.
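For example, the following sketch (hypothetical CTE name DevSpeakersCTE, using the SpeakerStats table from the ranking functions section) terminates the preceding statement with a semicolon and also shows a CTE referring to a variable defined in the same batch:
DECLARE @track AS VARCHAR(10)
SET @track = 'Dev';  -- the semicolon delimits the statement preceding WITH
WITH DevSpeakersCTE(speaker, score)
AS
(
SELECT speaker, score
FROM SpeakerStats
WHERE track = @track
)
SELECT * FROM DevSpeakersCTE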
As a practical example, consider the HumanResources.Employee and Purchasing.PurchaseOrderHeader tables in the AdventureWorks database. Each employee reports to a manager specified in the ManagerID column. Each employee in the Employee table might have related orders in the PurchaseOrderHeader table. Suppose you want to return, for each employee, their count of orders and last order date and, in the same row, similar details for the manager. The following example shows how you can implement a solution using views, derived tables, and CTEs:
View
CREATE VIEW VEmpOrders(EmployeeID, NumOrders, MaxDate)
AS
SELECT EmployeeID, COUNT(*), MAX(OrderDate)
FROM Purchasing.PurchaseOrderHeader
GROUP BY EmployeeID
GO
SELECT E.EmployeeID, OE.NumOrders, OE.MaxDate,
E.ManagerID, OM.NumOrders, OM.MaxDate
FROM HumanResources.Employee AS E
JOIN VEmpOrders AS OE
ON E.EmployeeID = OE.EmployeeID
LEFT OUTER JOIN VEmpOrders AS OM
ON E.ManagerID = OM.EmployeeID
Derived Tables
SELECT E.EmployeeID, OE.NumOrders, OE.MaxDate,
E.ManagerID, OM.NumOrders, OM.MaxDate
FROM HumanResources.Employee AS E
JOIN (SELECT EmployeeID, COUNT(*), MAX(OrderDate)
FROM Purchasing.PurchaseOrderHeader
GROUP BY EmployeeID) AS OE(EmployeeID, NumOrders, MaxDate)
ON E.EmployeeID = OE.EmployeeID
LEFT OUTER JOIN
(SELECT EmployeeID, COUNT(*), MAX(OrderDate)
FROM Purchasing.PurchaseOrderHeader
GROUP BY EmployeeID) AS OM(EmployeeID, NumOrders, MaxDate)
ON E.ManagerID = OM.EmployeeID
CTE
WITH EmpOrdersCTE(EmployeeID, NumOrders, MaxDate)
AS
(
SELECT EmployeeID, COUNT(*), MAX(OrderDate)
FROM Purchasing.PurchaseOrderHeader
GROUP BY EmployeeID
)
SELECT E.EmployeeID, OE.NumOrders, OE.MaxDate,
E.ManagerID, OM.NumOrders, OM.MaxDate
FROM HumanResources.Employee AS E
JOIN EmpOrdersCTE AS OE
ON E.EmployeeID = OE.EmployeeID
LEFT OUTER JOIN EmpOrdersCTE AS OM
ON E.ManagerID = OM.EmployeeID
The CTE's definition must be followed by an outer query, which may or may not refer to it. You cannot refer to the CTE later in the batch after other intervening statements.
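The following sketch (hypothetical CTE name TopSpeakersCTE, again using the SpeakerStats table) illustrates the scope rule; the second SELECT fails because the CTE is visible only to the statement that immediately follows its definition:
WITH TopSpeakersCTE(speaker, score)
AS
(
SELECT speaker, score
FROM SpeakerStats
WHERE score >= 8
)
SELECT * FROM TopSpeakersCTE -- valid: this is the CTE's defining statement
SELECT * FROM TopSpeakersCTE -- error: the CTE is out of scope here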
You can define several CTEs in the same WITH clause, each referring to previously defined CTEs. Commas are used to delimit the CTEs. For example, suppose you wanted to calculate the minimum, maximum, and difference of counts of employee orders:
WITH
EmpOrdersCTE(EmployeeID, Cnt)
AS
(
SELECT EmployeeID, COUNT(*)
FROM Purchasing.PurchaseOrderHeader
GROUP BY EmployeeID
),
MinMaxCTE(MN, MX, Diff)
AS
(
SELECT MIN(Cnt), MAX(Cnt), MAX(Cnt)-MIN(Cnt)
FROM EmpOrdersCTE
)
SELECT * FROM MinMaxCTE
Here is the result set:
MN MX Diff
----------- ----------- -----------
160 400 240
In EmpOrdersCTE, you calculate the number of orders from each employee. In MinMaxCTE, you refer to EmpOrdersCTE to calculate the minimum, maximum and difference of counts.
Note Within a CTE, you are not limited to referring only to the CTE defined directly before it; rather you can refer to all previously defined CTEs. Note that forward references are not allowed: A CTE can refer to CTEs defined before it and to itself (see Recursive Queries later in this paper) but not to CTEs defined after it. For example, if you define CTEs C1, C2, C3 in the same WITH statement, C2 can refer to C1 and C2 but not to C3.
In another example, the following code generates a histogram that calculates the number of employees that fall within each of four ranges of order counts between the minimum and maximum. If the calculations appear complicated to you, do not spend time trying to figure them out. The purpose of this example is to use a practical scenario to demonstrate the declaration of multiple CTEs in the same WITH statement, where each might refer to previous CTEs.
WITH
EmpOrdersCTE(EmployeeID, Cnt)
AS
(
SELECT EmployeeID, COUNT(*)
FROM Purchasing.PurchaseOrderHeader
GROUP BY EmployeeID
),
MinMaxCTE(MN, MX, Diff)
AS
(
SELECT MIN(Cnt), MAX(Cnt), MAX(Cnt)-MIN(Cnt)
FROM EmpOrdersCTE
),
NumsCTE(Num)
AS
(
SELECT 1 AS Num
UNION ALL SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 4
),
StepsCTE(Step, Fromval, Toval)
AS
(
SELECT
Num,
CAST(MN + ROUND((Num-1)*((Diff+1)/4.), 0) AS INT),
CAST(MN + ROUND((Num)*((Diff+1)/4.), 0) - 1 AS INT)
FROM MinMaxCTE CROSS JOIN NumsCTE
),
HistogramCTE(Step, Fromval, Toval, Samples)
AS
(
SELECT S.Step, S.Fromval, S.Toval, COUNT(EmployeeID)
FROM StepsCTE AS S
LEFT OUTER JOIN EmpOrdersCTE AS OE
ON OE.Cnt BETWEEN S.Fromval AND S.Toval
GROUP BY S.Step, S.Fromval, S.Toval
)
SELECT * FROM HistogramCTE
Here is the result set:
Step Fromval Toval Samples
----------- ----------- ----------- -----------
1 160 219 2
2 220 280 0
3 281 340 0
4 341 400 10
Notice that the second CTE (MinMaxCTE) refers to the first (EmpOrdersCTE); the third (NumsCTE) does not refer to any CTE. The fourth (StepsCTE) refers to the second and third CTEs, and the fifth (HistogramCTE) refers to the first and fourth CTEs.
Nonrecursive CTEs increase your expressive power. But for each piece of code that uses nonrecursive CTEs, you can usually achieve the same results by using other Transact-SQL constructs, such as derived tables. The case is different with recursive CTEs. This section describes the semantics of recursive queries and provides practical implementations for a hierarchy of employees in an organizational chart, and for a bill of materials (BOM) scenario.
Semantics
When a CTE refers to itself, it is considered to be recursive. Recursive CTEs are constructed from at least two queries (or members in recursive query parlance). One is a nonrecursive query, also referred to as the anchor member (AM). The other is the recursive query, also referred to as the recursive member (RM). The queries are separated by a UNION ALL operator. The following example shows a simplified generic form of a recursive CTE:
WITH RecursiveCTE(<column_list>)
AS
(
-- Anchor Member:
-- SELECT query that does not refer to RecursiveCTE
SELECT ...
FROM <some_table(s)>
...
UNION ALL
-- Recursive Member
-- SELECT query that refers to RecursiveCTE
SELECT ...
FROM <some_table(s)>
JOIN RecursiveCTE
...
)
-- Outer Query
SELECT ...
FROM RecursiveCTE
...
Logically you can think of the algorithm implementing the recursive CTE as:
1. The anchor member is activated. Set R0 (R for Results) is generated.
2. The recursive member is activated, getting set Ri (i = step number) as input when referring to RecursiveCTE. Set Ri+1 is generated.
3. The logic of Step 2 is run repeatedly (incrementing the step number in each iteration) until an empty set is returned.
4. The outer query is executed, getting the cumulative (UNION ALL) result of all of the previous steps when referring to RecursiveCTE.
You can have more than two members in the CTE, but only a UNION ALL operator is allowed between a recursive member and another member (recursive or nonrecursive). Other operators, such as UNION, are allowed only between nonrecursive members. Unlike regular UNION and UNION ALL operators that support implicit conversion, recursive CTEs require an exact match of the columns in all members, including the same data type, length, and precision.
There are similarities between recursive CTEs and classic recursive routines (not necessarily specific to SQL Server). Recursive routines usually consist of three important elements—the first invocation of the routine, a recursive termination check, and a recursive call to the same routine. The anchor member in a recursive CTE corresponds to the first invocation of the routine in a classic recursive routine. The recursive member corresponds to the recursive invocation of the routine. The termination check, which is usually explicit in recursive routines (for example, by means of an IF statement), is implicit in a recursive CTE—recursion stops when no rows are returned from the previous invocation.
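To make the analogy concrete, here is a sketch of a classic recursive routine in Transact-SQL (hypothetical procedure usp_CountDown); note the explicit termination check, which recursive CTEs make implicit:
CREATE PROC usp_CountDown
  @n AS INT  -- the first invocation receives the starting value
AS
DECLARE @next AS INT
PRINT @n
IF @n > 0  -- explicit recursive termination check
BEGIN
  SET @next = @n - 1
  EXEC usp_CountDown @next  -- recursive call to the same routine
END
GO
EXEC usp_CountDown 3  -- prints 3, 2, 1, 0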
The following sections present practical examples and uses of recursive CTEs in single-parent and multiparent environments.
For a single-parent hierarchy scenario, an employees organizational chart is used.
Note The examples in this section use a table called Employees that has a structure that is different from the HumanResources.Employee table in AdventureWorks. You should run the code in your own test database or in tempdb, not in AdventureWorks.
The following code generates the Employees table and populates it with sample data:
USE tempdb -- or your own test database
CREATE TABLE Employees
(
empid int NOT NULL,
mgrid int NULL,
empname varchar(25) NOT NULL,
salary money NOT NULL,
CONSTRAINT PK_Employees PRIMARY KEY(empid),
CONSTRAINT FK_Employees_mgrid_empid
FOREIGN KEY(mgrid)
REFERENCES Employees(empid)
)
CREATE INDEX idx_nci_mgrid ON Employees(mgrid)
SET NOCOUNT ON
INSERT INTO Employees VALUES(1 , NULL, 'Nancy' , $10000.00)
INSERT INTO Employees VALUES(2 , 1 , 'Andrew' , $5000.00)
INSERT INTO Employees VALUES(3 , 1 , 'Janet' , $5000.00)
INSERT INTO Employees VALUES(4 , 1 , 'Margaret', $5000.00)
INSERT INTO Employees VALUES(5 , 2 , 'Steven' , $2500.00)
INSERT INTO Employees VALUES(6 , 2 , 'Michael' , $2500.00)
INSERT INTO Employees VALUES(7 , 3 , 'Robert' , $2500.00)
INSERT INTO Employees VALUES(8 , 3 , 'Laura' , $2500.00)
INSERT INTO Employees VALUES(9 , 3 , 'Ann' , $2500.00)
INSERT INTO Employees VALUES(10, 4 , 'Ina' , $2500.00)
INSERT INTO Employees VALUES(11, 7 , 'David' , $2000.00)
INSERT INTO Employees VALUES(12, 7 , 'Ron' , $2000.00)
INSERT INTO Employees VALUES(13, 7 , 'Dan' , $2000.00)
INSERT INTO Employees VALUES(14, 11 , 'James' , $1500.00)
Each employee reports to a manager whose ID is stored in the mgrid column. A foreign key is defined on the mgrid column referencing the empid column, meaning that a manager ID must either correspond to a valid employee ID within the table or be NULL. Nancy, the boss, has NULL in the mgrid column. Manager-employee relationships are shown in Figure 1.
Figure 1. Employees organizational chart
The following are some common requests that might be run on the Employees table:
- Show me details about Robert (empid=7) and all of his subordinates in all levels.
- Show me details about all employees that are two levels under Janet (empid=3).
- Show me the chain of management leading to James (empid=14).
- Show me how many employees report to each manager directly or indirectly.
- Show me all of the employees in such a way that it will be easy to see their hierarchical dependencies.
Recursive CTEs provide the means to deal with these requests, which are recursive in nature, without the need to maintain additional information in the database about the hierarchy.
The first request is probably the most common one: returning an employee (for example, Robert whose empid=7) and his/her subordinates in all levels. The following CTE provides a solution to this request:
WITH EmpCTE(empid, empname, mgrid, lvl)
AS
(
-- Anchor Member (AM)
SELECT empid, empname, mgrid, 0
FROM Employees
WHERE empid = 7
UNION ALL
-- Recursive Member (RM)
SELECT E.empid, E.empname, E.mgrid, M.lvl+1
FROM Employees AS E
JOIN EmpCTE AS M
ON E.mgrid = M.empid
)
SELECT * FROM EmpCTE
Here is the result set:
empid empname mgrid lvl
----------- ------------------------- ----------- -----------
7 Robert 3 0
11 David 7 1
12 Ron 7 1
13 Dan 7 1
14 James 11 2
Following the recursive CTE logic described previously, this CTE is processed as follows:
1. The anchor member is activated, returning Robert's row from the Employees table. Notice the constant 0 that is returned in the lvl result column.
2. The recursive member is activated repeatedly, returning direct subordinates of the previous result by means of a join operation between Employees and EmpCTE. Employees represents the subordinates, and EmpCTE (which contains the result from the previous invocation) represents managers:
   - First, Robert's subordinates are returned: David, Ron, and Dan.
   - Then David's, Ron's, and Dan's subordinates are returned: only James.
   - Finally, James' subordinates are returned: none, in which case recursion is terminated.
3. The outer query returns all rows from EmpCTE.
Notice that the lvl value is repeatedly incremented with each recursive invocation.
Using this level counter you can limit the number of iterations in the recursion. For example, the following CTE is used to return all employees who are two levels below Janet:
WITH EmpCTEJanet(empid, empname, mgrid, lvl)
AS
(
SELECT empid, empname, mgrid, 0
FROM Employees
WHERE empid = 3
UNION ALL
SELECT E.empid, E.empname, E.mgrid, M.lvl+1
FROM Employees as E
JOIN EmpCTEJanet as M
ON E.mgrid = M.empid
WHERE lvl < 2
)
SELECT empid, empname
FROM EmpCTEJanet
WHERE lvl = 2
Here is the result set:
empid empname
----------- -------------------------
11 David
12 Ron
13 Dan
Two filters were added in this code example compared to the previous one. The filter WHERE lvl < 2 in the recursive member is used as a recursion termination check—no rows are returned when lvl = 2, thus recursion stops. The filter WHERE lvl = 2 in the outer query is used to remove all levels before level 2. Note that logically the filter in the outer query (lvl = 2) is sufficient by itself to return only the desired rows. The filter in the recursive member (lvl < 2) is added for performance reasons—to stop the recursion early, as soon as two levels below Janet are returned.
As mentioned earlier, CTEs can refer to local variables that are defined within the same batch. For example, to make the query more generic, you can use variables instead of constants for employee ID and level:
DECLARE @empid AS INT, @lvl AS INT
SET @empid = 3 -- Janet
SET @lvl = 2 -- two levels
WITH EmpCTE(empid, empname, mgrid, lvl)
AS
(
SELECT empid, empname, mgrid, 0
FROM Employees
WHERE empid = @empid
UNION ALL
SELECT E.empid, E.empname, E.mgrid, M.lvl+1
FROM Employees as E
JOIN EmpCTE as M
ON E.mgrid = M.empid
WHERE lvl < @lvl
)
SELECT empid, empname
FROM EmpCTE
WHERE lvl = @lvl
You can use a hint to force termination of the query after a certain number of recursive iterations have been invoked. You do that by adding OPTION(MAXRECURSION value) at the end of the outer query, as shown in the following example:
WITH EmpCTE(empid, empname, mgrid, lvl)
AS
(
SELECT empid, empname, mgrid, 0
FROM Employees
WHERE empid = 1
UNION ALL
SELECT E.empid, E.empname, E.mgrid, M.lvl+1
FROM Employees as E
JOIN EmpCTE as M
ON E.mgrid = M.empid
)
SELECT * FROM EmpCTE
OPTION (MAXRECURSION 2)
Here is the result set:
empid empname mgrid lvl
----------- ------------------------- ----------- -----------
1 Nancy NULL 0
2 Andrew 1 1
3 Janet 1 1
4 Margaret 1 1
10 Ina 4 2
7 Robert 3 2
8 Laura 3 2
9 Ann 3 2
.Net SqlClient Data Provider: Msg 530, Level 16, State 1, Line 1
Statement terminated. Maximum recursion 2 has been exhausted before statement completion
Results generated thus far might be returned (but are not guaranteed to be), and error 530 is generated. You might think of using the MAXRECURSION hint, instead of the filter in the recursive member, to implement the request to return employees who are two levels below Janet:
WITH EmpCTE(empid, empname, mgrid, lvl)
AS
(
SELECT empid, empname, mgrid, 0
FROM Employees
WHERE empid = 3
UNION ALL
SELECT E.empid, E.empname, E.mgrid, M.lvl+1
FROM Employees as E
JOIN EmpCTE as M
ON E.mgrid = M.empid
)
SELECT empid, empname
FROM EmpCTE
WHERE lvl = 2
OPTION (MAXRECURSION 2)
Here is the result set:
empid empname
----------- -------------------------
11 David
12 Ron
13 Dan
.Net SqlClient Data Provider: Msg 530, Level 16, State 1, Line 1
Statement terminated. Maximum recursion 2 has been exhausted before statement completion
But keep in mind that, besides the fact that there is no guarantee that results will be returned, your client will get an error. It is not good programming practice to use code that returns errors in valid situations. It is recommended that you use the filter presented earlier and, if you want, the MAXRECURSION hint as a safeguard against infinite loops.
When this hint is not specified, SQL Server defaults to a value of 100. This value can be used as a safeguard when you suspect cyclic recursive calls. If you do not want to limit the number of recursive calls, set MAXRECURSION to 0 in the hint.
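For example, the following sketch reruns the organizational chart query with no limit on the number of recursive iterations; do this only when you are confident that the data contains no cycles:
WITH EmpCTE(empid, empname, mgrid, lvl)
AS
(
SELECT empid, empname, mgrid, 0
FROM Employees
WHERE empid = 1
UNION ALL
SELECT E.empid, E.empname, E.mgrid, M.lvl+1
FROM Employees AS E
JOIN EmpCTE AS M
ON E.mgrid = M.empid
)
SELECT * FROM EmpCTE
OPTION (MAXRECURSION 0) -- no limit on the number of recursive iterations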
As an example of a cyclic relationship, suppose you had a bug in your data and Nancy's manager was accidentally changed to James (instead of no manager):
UPDATE Employees SET mgrid = 14 WHERE empid = 1
The following cycle is introduced: 1->3->7->11->14->1. If you try running code that returns Nancy and her direct and indirect subordinates at all levels, you get an error indicating that the default maximum recursion of 100 was exhausted before the statement completed:
WITH EmpCTE(empid, empname, mgrid, lvl)
AS
(
SELECT empid, empname, mgrid, 0
FROM Employees
WHERE empid = 1
UNION ALL
SELECT E.empid, E.empname, E.mgrid, M.lvl+1
FROM Employees AS E
JOIN EmpCTE AS M
ON E.mgrid = M.empid
)
SELECT * FROM EmpCTE
Msg 530, Level 16, State 1, Line 1
Statement terminated. Maximum recursion 100 has been exhausted before statement completion
Of course it is good to have a safety measure that prevents infinite recursive calls, but MAXRECURSION doesn't help you much in isolating the cycle and resolving the bug in the data. In order to isolate the cycle, you can use a CTE that will construct, for each employee, an enumerated path of all of the employee IDs leading to the employee. Call this result column path. In the recursive member, use a CASE expression to check whether the current employee ID already appears in the manager's path using a LIKE predicate. If it does, this means you found a cycle. If a cycle is found, return 1 in a result column called cycle, otherwise return 0. Also, add a filter to the recursive member that ensures that only subordinates of managers for which a cycle was not detected will be returned. Finally, add a filter to the outer query that returns only employees for which a cycle was found (cycle = 1):
WITH EmpCTE(empid, path, cycle)
AS
(
SELECT empid,
CAST('.' + CAST(empid AS VARCHAR(10)) + '.' AS VARCHAR(900)),
0
FROM Employees
WHERE empid = 1
UNION ALL
SELECT E.empid,
CAST(M.path + CAST(E.empid AS VARCHAR(10)) + '.' AS VARCHAR(900)),
CASE
WHEN M.path LIKE '%.' + CAST(E.empid AS VARCHAR(10)) + '.%' THEN 1
ELSE 0
END
FROM Employees AS E
JOIN EmpCTE AS M
ON E.mgrid = M.empid
WHERE M.cycle = 0
)
SELECT path FROM EmpCTE
WHERE cycle = 1
Here is the result set:
path
---------------
.1.3.7.11.14.1.
Note that corresponding columns in both the anchor member and the recursive member must have the same data type, length, and precision. That's why the expression generating the path value is converted to varchar(900) in both members. Once the cycle is detected, you can fix the bug in your data by changing Nancy's manager back to no manager:
UPDATE Employees SET mgrid = NULL WHERE empid = 1
The recursive examples provided up to this point have an anchor member that is a manager and a recursive member that retrieves subordinates. Some requests require the opposite; for example, a request to return James' management path (James and all of his managers at all levels). The following code provides an answer for this request:
WITH EmpCTE(empid, empname, mgrid, lvl)
AS
(
SELECT empid, empname, mgrid, 0
FROM Employees
WHERE empid = 14
UNION ALL
SELECT M.empid, M.empname, M.mgrid, E.lvl+1
FROM Employees as M
JOIN EmpCTE as E
ON M.empid = E.mgrid
)
SELECT * FROM EmpCTE
Here is the result set:
empid empname mgrid lvl
----------- ------------------------- ----------- -----------
14 James 11 0
11 David 7 1
7 Robert 3 2
3 Janet 1 3
1 Nancy NULL 4
The anchor member returns James' row. The recursive member returns the managers of the previously returned employees (a single manager in each iteration, because this is a single-parent hierarchy and the request starts with a single employee).
You can also use recursive queries to calculate aggregations, such as the number of subordinates that report to each manager directly or indirectly:
WITH MgrCTE(mgrid, lvl)
AS
(
SELECT mgrid, 0
FROM Employees
WHERE mgrid IS NOT NULL
UNION ALL
SELECT M.mgrid, lvl + 1
FROM Employees AS M
JOIN MgrCTE AS E
ON E.mgrid = M.empid
WHERE M.mgrid IS NOT NULL
)
SELECT mgrid, COUNT(*) AS cnt
FROM MgrCTE
GROUP BY mgrid
Here is the result set:
mgrid cnt
----------- -----------
1 13
2 2
3 7
4 1
7 4
11 1
The anchor member returns a row with the manager ID for each employee. NULL in the manager ID column is excluded because it represents no specific manager. The recursive member returns the manager IDs of the managers of the previously returned managers; again, NULLs are excluded. Eventually, the CTE contains, for each manager, as many occurrences as their direct or indirect number of subordinates. The outer query is left with the tasks of grouping the result by manager ID and returning the count of occurrences.
As another example of a request against a single-parent hierarchy, suppose you want to return Nancy's subordinates sorted and indented according to hierarchical dependencies. The following code does just that, sorting siblings according to their employee IDs:
WITH EmpCTE(empid, empname, mgrid, lvl, sortcol)
AS
(
SELECT empid, empname, mgrid, 0,
CAST(empid AS VARBINARY(900))
FROM Employees
WHERE empid = 1
UNION ALL
SELECT E.empid, E.empname, E.mgrid, M.lvl+1,
CAST(sortcol + CAST(E.empid AS BINARY(4)) AS VARBINARY(900))
FROM Employees AS E
JOIN EmpCTE AS M
ON E.mgrid = M.empid
)
SELECT
REPLICATE(' | ', lvl)
+ '(' + (CAST(empid AS VARCHAR(10))) + ') '
+ empname AS empname
FROM EmpCTE
ORDER BY sortcol
Here is the result set:
(1) Nancy
| (2) Andrew
| | (5) Steven
| | (6) Michael
| (3) Janet
| | (7) Robert
| | | (11) David
| | | | (14) James
| | | (12) Ron
| | | (13) Dan
| | (8) Laura
| | (9) Ann
| (4) Margaret
| | (10) Ina
To sort siblings according to the empid value, form a binary string called sortcol for each employee. The string is made of concatenated employee IDs in the chain of management leading to each employee, converted to binary values. The anchor member is the starting point. It generates a binary value with the empid of the root employee. In each iteration, the recursive member appends the current employee ID converted to a binary value to the manager's sortcol. The outer query then sorts the result by sortcol. Remember that corresponding columns in both the anchor member and the recursive member must have the same data type, length, and precision. That's why the expression generating the sortcol value is converted to varbinary(900), even though an integer requires only 4 bytes in its binary representation: 900 bytes cover 225 levels, which seems like a more than reasonable limit. If you want support for more levels, you can increase this length, but make sure you do so in both members; otherwise, you will get an error.
Hierarchical indentation is achieved by replicating a character string (' | ' in this case) as many times as the number of levels of the employee. To that, the employee ID itself is appended within parentheses, and finally the employee name is also appended.
A similar technique can be used to sort siblings by other attributes that can be converted to small fixed-length binary values; for example, the employee's hire date stored in a smalldatetime column. If you want to sort siblings by attributes that are not convertible to small fixed-sized binary values, such as by employee name, you can first produce integer row numbers (for details on row numbers, see the section "Ranking Functions" earlier in this paper) partitioned by manager ID representing the desired sort like so:
SELECT empid, empname, mgrid,
ROW_NUMBER() OVER(PARTITION BY mgrid ORDER BY empname) AS pos
FROM Employees
Then, instead of concatenating employee IDs converted to binary values, concatenate the employee positions converted to binary values:
WITH EmpPos(empid, empname, mgrid, pos)
AS
(
SELECT empid, empname, mgrid,
ROW_NUMBER() OVER(PARTITION BY mgrid ORDER BY empname) AS pos
FROM Employees
),
EmpCTE(empid, empname, mgrid, lvl, sortcol)
AS
(
SELECT empid, empname, mgrid, 0,
CAST(pos AS VARBINARY(900))
FROM EmpPos
WHERE empid = 1
UNION ALL
SELECT E.empid, E.empname, E.mgrid, M.lvl+1,
CAST(sortcol + CAST(E.pos AS BINARY(4)) AS VARBINARY(900))
FROM EmpPos AS E
JOIN EmpCTE AS M
ON E.mgrid = M.empid
)
SELECT
REPLICATE(' | ', lvl)
+ '(' + (CAST(empid AS VARCHAR(10))) + ') '
+ empname AS empname
FROM EmpCTE
ORDER BY sortcol
Here is the result set:
(1) Nancy
| (2) Andrew
| | (6) Michael
| | (5) Steven
| (3) Janet
| | (9) Ann
| | (8) Laura
| | (7) Robert
| | | (13) Dan
| | | (11) David
| | | | (14) James
| | | (12) Ron
| (4) Margaret
| | (10) Ina
To sort siblings by any other attribute or combination of attributes, simply specify the desired attributes in the ORDER BY list of the ROW_NUMBER function's OVER option instead of empname.
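For example, to sort siblings by salary, highest first, using the employee name as a tiebreaker, compute the positions as follows (a sketch based on the salary column of the Employees table):
SELECT empid, empname, mgrid,
  ROW_NUMBER() OVER(PARTITION BY mgrid
                    ORDER BY salary DESC, empname) AS pos
FROM Employees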
In the previous section, CTEs are used to handle hierarchies in which each node in the tree has only a single parent. A more complex scenario of relationships is a graph in which each node might have more than one parent. This section describes the use of CTEs in a bill of materials (BOM) scenario. The BOM is a directed acyclic graph, meaning that each node can have more than one parent; a node cannot be a parent of itself, directly or indirectly; and the relationship between two nodes is not bidirectional (for example, A contains C, but C does not contain A). Figure 2 shows the relationships between items in a BOM scenario.
Figure 2. Multiparent environment
Item A contains items D, B and C; item C contains B and E; item B is contained in items A and C, and so on. The following code creates the Items and BOM tables and populates them with sample data:
CREATE TABLE Items
(
itemid VARCHAR(5) NOT NULL PRIMARY KEY,
itemname VARCHAR(25) NOT NULL
/* other columns, e.g., unit_price, measurement_unit */
)
CREATE TABLE BOM
(
itemid VARCHAR(5) NOT NULL REFERENCES Items,
containsid VARCHAR(5) NOT NULL REFERENCES Items,
qty INT NOT NULL,
/* other columns */
PRIMARY KEY(itemid, containsid),
CHECK (itemid <> containsid)
)
SET NOCOUNT ON
INSERT INTO Items(itemid, itemname) VALUES('A', 'Item A')
INSERT INTO Items(itemid, itemname) VALUES('B', 'Item B')
INSERT INTO Items(itemid, itemname) VALUES('C', 'Item C')
INSERT INTO Items(itemid, itemname) VALUES('D', 'Item D')
INSERT INTO Items(itemid, itemname) VALUES('E', 'Item E')
INSERT INTO Items(itemid, itemname) VALUES('F', 'Item F')
INSERT INTO Items(itemid, itemname) VALUES('G', 'Item G')
INSERT INTO Items(itemid, itemname) VALUES('H', 'Item H')
INSERT INTO Items(itemid, itemname) VALUES('I', 'Item I')
INSERT INTO Items(itemid, itemname) VALUES('J', 'Item J')
INSERT INTO Items(itemid, itemname) VALUES('K', 'Item K')
INSERT INTO BOM(itemid, containsid, qty) VALUES('E', 'J', 1)
INSERT INTO BOM(itemid, containsid, qty) VALUES('C', 'E', 3)
INSERT INTO BOM(itemid, containsid, qty) VALUES('A', 'C', 2)
INSERT INTO BOM(itemid, containsid, qty) VALUES('H', 'C', 4)
INSERT INTO BOM(itemid, containsid, qty) VALUES('C', 'B', 2)
INSERT INTO BOM(itemid, containsid, qty) VALUES('B', 'F', 1)
INSERT INTO BOM(itemid, containsid, qty) VALUES('B', 'G', 3)
INSERT INTO BOM(itemid, containsid, qty) VALUES('A', 'B', 2)
INSERT INTO BOM(itemid, containsid, qty) VALUES('A', 'D', 2)
INSERT INTO BOM(itemid, containsid, qty) VALUES('H', 'I', 1)
The Items table contains a row for each item. The BOM table contains the relationships between the nodes in the graph. Each relationship is made up of the parent item ID (itemid), the child item ID (containsid), and the quantity of containsid within itemid (qty).
A common request in a BOM scenario is to "explode" an item: that is, to traverse the graph, starting with the given item and return all the items that it contains, directly or indirectly. This might sound familiar because it's similar to returning a subtree out of a tree as in the Employees Organizational Chart. In a directed graph, however, the request is a bit more complex conceptually, because one contained item can be reached from several different containing items through different paths. For example, suppose you want to explode item A. Notice that two different paths lead from it to item B: A->B and A->C->B. This means that item B would be reached twice, which means that all of the items that B contains (F and G) would be reached twice. Fortunately, with CTEs, the implementation of such a request is as simple as implementing a request to get a subtree out of a tree:
WITH BOMCTE
AS
(
SELECT *
FROM BOM
WHERE itemid = 'A'
UNION ALL
SELECT BOM.*
FROM BOM
JOIN BOMCTE
ON BOM.itemid = BOMCTE.containsid
)
SELECT * FROM BOMCTE
Here is the result set:
itemid containsid qty
------ ---------- -----------
A B 2
A C 2
A D 2
C B 2
C E 3
E J 1
B F 1
B G 3
B F 1
B G 3
The anchor member returns all the direct items that A contains from BOM. For each contained item returned by the previous iteration of the CTE, the recursive member returns the items that it contains by joining BOM to BOMCTE. Logically (though not necessarily in the order of the output), (A, B), (A, C), (A, D) are returned first; then (B, F), (B, G), (C, B), (C, E); and lastly (B, F), (B, G), (E, J). Note that most requests from a BOM do not require you to show an item more than once in the final results. You can use the DISTINCT clause to eliminate duplicates if you want to show only "which" items were involved in the explosion:
WITH BOMCTE
AS
(
SELECT *
FROM BOM
WHERE itemid = 'A'
UNION ALL
SELECT BOM.*
FROM BOM
JOIN BOMCTE
ON BOM.itemid = BOMCTE.containsid
)
SELECT DISTINCT containsid FROM BOMCTE
Here is the result set:
containsid
----------
B
C
D
E
F
G
J
To help you understand the process of parts explosion, visualize its intermediate result as a tree in which all items are expanded to their contained items. Figure 3 shows the trees formed by exploding parts A and H along with the item quantities.
Figure 3. Parts explosion
Taking the original request a step further, you might be more interested in getting the cumulative quantities of each item rather than getting the items themselves. For example, A contains 2 units of C. C contains 3 units of E. E contains one unit of J. The total number of units of J required for A is a product of the quantities along the path leading from A to J: 2*3*1 = 6. Figure 4 shows the cumulative quantities of each item that makes A before aggregating the items.
Figure 4. Parts explosion—calculated quantities
The following CTE calculates the cumulative product of quantities:
WITH BOMCTE(itemid, containsid, qty, cumulativeqty)
AS
(
SELECT *, qty
FROM BOM
WHERE itemid = 'A'
UNION ALL
SELECT BOM.*, BOM.qty * BOMCTE.cumulativeqty
FROM BOM
JOIN BOMCTE
ON BOM.itemid = BOMCTE.containsid
)
SELECT * FROM BOMCTE
Here is the result set:
itemid containsid qty cumulativeqty
------ ---------- ----------- -------------
A B 2 2
A C 2 2
A D 2 2
C B 2 4
C E 3 6
E J 1 6
B F 1 4
B G 3 12
B F 1 2
B G 3 6
This CTE adds the cumulativeqty column to the previous CTE. The anchor member returns the quantities of the contained items as cumulativeqty. For each contained item at the next level, the recursive member multiplies the item's quantity by its containing item's cumulative quantity. Note that items that were reached through multiple paths appear multiple times in the results, each time with the cumulative quantity for the relevant path. Such an output is not very meaningful by itself, but it is a helpful intermediate step toward the final results, in which each item appears only once. To get the total quantities of each item in A, have the outer query group the result by containsid:
WITH BOMCTE(itemid, containsid, qty, cumulativeqty)
AS
(
SELECT *, qty
FROM BOM
WHERE itemid = 'A'
UNION ALL
SELECT BOM.*, BOM.qty * BOMCTE.cumulativeqty
FROM BOM
JOIN BOMCTE
ON BOM.itemid = BOMCTE.containsid
)
SELECT containsid AS itemid, SUM(cumulativeqty) AS totalqty
FROM BOMCTE
GROUP BY containsid
Here is the result set:
itemid totalqty
------ -----------
B 6
C 2
D 2
E 6
F 6
G 18
J 6
PIVOT and UNPIVOT are new relational operators that you specify in the FROM clause of a query. They perform some manipulation on an input table-valued expression and produce an output table as a result. The PIVOT operator rotates rows into columns, possibly performing aggregations along the way. It widens the input table expression based on a given pivot column, generating an output table with a column for each unique value in the pivot column. The UNPIVOT operator performs the opposite operation of that performed by the PIVOT operator; it rotates columns into rows. It narrows the input table expression based on a pivot column.
The PIVOT operator is useful for handling open-schema scenarios and for generating crosstab reports.
In an open-schema scenario, you maintain entities with sets of attributes that are either not known ahead of time or that differ for each entity type. The users of your application define the attributes dynamically. Instead of predefining many columns and storing many null values in your tables, you split the attributes into different rows and store only the relevant attributes for each entity instance.
PIVOT allows you to generate crosstab reports for open-schema and other scenarios in which you rotate rows into columns, possibly calculating aggregations along the way and presenting the data in a useful form.
An example of an open-schema scenario is a database that keeps track of items put up for auction. Some attributes are relevant for all auction items, such as the item type, when it was made, and its initial price. Only the attributes that are relevant for all items are stored in the AuctionItems table:
CREATE TABLE AuctionItems
(
itemid INT NOT NULL PRIMARY KEY NONCLUSTERED,
itemtype NVARCHAR(30) NOT NULL,
whenmade INT NOT NULL,
initialprice MONEY NOT NULL
/* other columns */
)
CREATE UNIQUE CLUSTERED INDEX idx_uc_itemtype_itemid
ON AuctionItems(itemtype, itemid)
INSERT INTO AuctionItems VALUES(1, N'Wine', 1822, 3000)
INSERT INTO AuctionItems VALUES(2, N'Wine', 1807, 500)
INSERT INTO AuctionItems VALUES(3, N'Chair', 1753, 800000)
INSERT INTO AuctionItems VALUES(4, N'Ring', -501, 1000000)
INSERT INTO AuctionItems VALUES(5, N'Painting', 1873, 8000000)
INSERT INTO AuctionItems VALUES(6, N'Painting', 1889, 8000000)
Other attributes are specific to the item type, and new items of different types are continuously being added. Such attributes can be stored in a different ItemAttributes table in which each item attribute is stored in a different row. Each row contains the item ID, attribute name, and attribute value:
CREATE TABLE ItemAttributes
(
itemid INT NOT NULL REFERENCES AuctionItems,
attribute NVARCHAR(30) NOT NULL,
value SQL_VARIANT NOT NULL,
PRIMARY KEY (itemid, attribute)
)
INSERT INTO ItemAttributes
VALUES(1, N'manufacturer', CAST(N'ABC' AS NVARCHAR(30)))
INSERT INTO ItemAttributes
VALUES(1, N'type', CAST(N'Pinot Noir' AS NVARCHAR(15)))
INSERT INTO ItemAttributes
VALUES(1, N'color', CAST(N'Red' AS NVARCHAR(15)))
INSERT INTO ItemAttributes
VALUES(2, N'manufacturer', CAST(N'XYZ' AS NVARCHAR(30)))
INSERT INTO ItemAttributes
VALUES(2, N'type', CAST(N'Porto' AS NVARCHAR(15)))
INSERT INTO ItemAttributes
VALUES(2, N'color', CAST(N'Red' AS NVARCHAR(15)))
INSERT INTO ItemAttributes
VALUES(3, N'material', CAST(N'Wood' AS NVARCHAR(15)))
INSERT INTO ItemAttributes
VALUES(3, N'padding', CAST(N'Silk' AS NVARCHAR(15)))
INSERT INTO ItemAttributes
VALUES(4, N'material', CAST(N'Gold' AS NVARCHAR(15)))
INSERT INTO ItemAttributes
VALUES(4, N'inscription', CAST(N'One ring ...' AS NVARCHAR(50)))
INSERT INTO ItemAttributes
VALUES(4, N'size', CAST(10 AS INT))
INSERT INTO ItemAttributes
VALUES(5, N'artist', CAST(N'Claude Monet' AS NVARCHAR(30)))
INSERT INTO ItemAttributes
VALUES(5, N'name', CAST(N'Field of Poppies' AS NVARCHAR(30)))
INSERT INTO ItemAttributes
VALUES(5, N'type', CAST(N'Oil' AS NVARCHAR(30)))
INSERT INTO ItemAttributes
VALUES(5, N'height', CAST(19.625 AS NUMERIC(9,3)))
INSERT INTO ItemAttributes
VALUES(5, N'width', CAST(25.625 AS NUMERIC(9,3)))
INSERT INTO ItemAttributes
VALUES(6, N'artist', CAST(N'Vincent Van Gogh' AS NVARCHAR(30)))
INSERT INTO ItemAttributes
VALUES(6, N'name', CAST(N'The Starry Night' AS NVARCHAR(30)))
INSERT INTO ItemAttributes
VALUES(6, N'type', CAST(N'Oil' AS NVARCHAR(30)))
INSERT INTO ItemAttributes
VALUES(6, N'height', CAST(28.75 AS NUMERIC(9,3)))
INSERT INTO ItemAttributes
VALUES(6, N'width', CAST(36.25 AS NUMERIC(9,3)))
Note that the sql_variant data type is used for the value column because different attribute values may be of different data types. For example, the size attribute stores an integer value, and the name attribute stores a character string.
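When you later need such a value in its original type (for computations or comparisons), you typically cast the sql_variant back to the expected type. A minimal sketch:
-- Retrieve the painting heights as numeric values rather than sql_variant
SELECT itemid, CAST(value AS NUMERIC(9,3)) AS height
FROM ItemAttributes
WHERE attribute = N'height'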
Suppose you want to present the data from the ItemAttributes table with a row for each item that is a painting (items 5,6) and a column for each attribute. Without the PIVOT operator, you must write a query such as:
SELECT
itemid,
MAX(CASE WHEN attribute = 'artist' THEN value END) AS [artist],
MAX(CASE WHEN attribute = 'name' THEN value END) AS [name],
MAX(CASE WHEN attribute = 'type' THEN value END) AS [type],
MAX(CASE WHEN attribute = 'height' THEN value END) AS [height],
MAX(CASE WHEN attribute = 'width' THEN value END) AS [width]
FROM ItemAttributes AS ATR
WHERE itemid IN(5,6)
GROUP BY itemid
Here is the result set:
itemid artist name type height width
------ ---------------- ---------------- ---------- ------ ------
5 Claude Monet Field of Poppies Oil 19.625 25.625
6 Vincent Van Gogh The Starry Night Oil 28.750 36.250
The PIVOT operator allows you to maintain shorter and more readable code to achieve the same results:
SELECT *
FROM ItemAttributes AS ATR
PIVOT
(
MAX(value)
FOR attribute IN([artist], [name], [type], [height], [width])
) AS PVT
WHERE itemid IN(5,6)
As with most new features, understanding the PIVOT operator comes with experimentation and use. Some elements of the PIVOT syntax are apparent and map directly to parts of the equivalent query that does not use the new operator. Others are implied and do not appear in the syntax at all.
You may find the following terms helpful in understanding the semantics of the PIVOT operator:
table_expression
The virtual table on which the PIVOT operator works (the part in the query between the FROM clause and the PIVOT operator): ItemAttributes AS ATR in this case.
pivot_column
The column from table_expression whose values you want to rotate into result columns: attribute in this case.
column_list
The list of values from pivot_column that you want to present as result columns (in the parentheses followed by the IN clause). These must be expressed as legal identifiers: [artist], [name], [type], [height], [width] in this case.
aggregate_function
The aggregate function that you use to generate the data or column values in the result: MAX() in this case.
value_column
The column from table_expression that you use as the argument for aggregate_function: value in this case.
group_by_list
The hidden part—ALL columns from table_expression excluding pivot_column and value_column which are used to group the result: itemid in this case.
select_list
The list of columns following the SELECT clause that might include any column(s) from group_by_list and column_list. Aliases can be used to change the name of the result columns: * in this case returns all columns from group_by_list and column_list.
The PIVOT operator returns one row for each unique value in group_by_list as if you had a query with a GROUP BY clause and specified those columns. Notice that group_by_list is implied; it is not specified explicitly anywhere in the query. It contains all columns from table_expression excluding pivot_column and value_column. Understanding this is probably the key to understanding why queries you write with the PIVOT operator work as they do, and why you might get errors in some cases.
Possible result columns include values from group_by_list and <column_list>. If you specify an asterisk (*), the query returns both lists. The data part of the result columns or the result column values are calculated by aggregate_function with value_column as the argument.
The following code illustrates the different elements in the query using the PIVOT operator; the comment shows the columns to which the asterisk expands:
SELECT * -- itemid, [artist], [name], [type], [height], [width]
FROM ItemAttributes AS ATR
PIVOT
(
MAX(value)
FOR attribute IN([artist], [name], [type], [height], [width])
) AS PVT
WHERE itemid IN(5,6)
And the following relates the different elements to the query that does not use the PIVOT operator:
SELECT
itemid,
MAX(CASE WHEN attribute = 'artist' THEN value END) AS [artist],
MAX(CASE WHEN attribute = 'name' THEN value END) AS [name],
MAX(CASE WHEN attribute = 'type' THEN value END) AS [type],
MAX(CASE WHEN attribute = 'height' THEN value END) AS [height],
MAX(CASE WHEN attribute = 'width' THEN value END) AS [width]
FROM ItemAttributes AS ATR
WHERE itemid IN(5,6)
GROUP BY itemid
Note that you must explicitly specify the values in <column_list>. The PIVOT operator does not provide an option to derive those dynamically from pivot_column in a static query. You can use dynamic SQL to construct the query string yourself to achieve this.
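For example, the following code is a minimal sketch of a dynamic PIVOT; it builds the IN list from the distinct attribute values for items 5 and 6 (delimited with QUOTENAME) and executes the constructed query with sp_executesql:
DECLARE @cols AS NVARCHAR(MAX), @sql AS NVARCHAR(MAX)
-- Build a comma-separated list of delimited attribute names
SELECT @cols = COALESCE(@cols + N', ', N'') + QUOTENAME(attribute)
FROM (SELECT DISTINCT attribute
      FROM ItemAttributes
      WHERE itemid IN(5,6)) AS A
-- Construct and execute the PIVOT query
SET @sql = N'SELECT *
FROM ItemAttributes AS ATR
PIVOT
(
MAX(value)
FOR attribute IN(' + @cols + N')
) AS PVT
WHERE itemid IN(5,6)'
EXEC sp_executesql @sql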
Taking the previous PIVOT query a step further, suppose you want to return, for each auction item, all attributes relevant to paintings. You want to include those attributes that appear in AuctionItems and those that appear in ItemAttributes. You might try the following query, which returns an error:
SELECT *
FROM AuctionItems AS ITM
JOIN ItemAttributes AS ATR
ON ITM.itemid = ATR.itemid
PIVOT
(
MAX(value)
FOR attribute IN([artist], [name], [type], [height], [width])
) AS PVT
WHERE itemtype = 'Painting'
Here is the error message:
.Net SqlClient Data Provider: Msg 8156, Level 16, State 1, Line 1
The column 'itemid' was specified multiple times for 'PVT'.
Remember that PIVOT works on table_expression, which is the virtual table returned by the section in the query between the FROM clause and the PIVOT clause. In this query, the virtual table contains two instances of the itemid column—one originating from AuctionItems and the other from ItemAttributes. You might be tempted to revise the query as follows, but you will also get an error:
SELECT ITM.itemid, itemtype, whenmade, initialprice,
[artist], [name], [type], [height], [width]
FROM AuctionItems AS ITM
JOIN ItemAttributes AS ATR
ON ITM.itemid = ATR.itemid
PIVOT
(
MAX(value)
FOR attribute IN([artist], [name], [type], [height], [width])
) AS PVT
WHERE itemtype = 'Painting'
Here is the error message:
.Net SqlClient Data Provider: Msg 8156, Level 16, State 1, Line 1
The column 'itemid' was specified multiple times for 'PVT'.
.Net SqlClient Data Provider: Msg 107, Level 15, State 1, Line 1
The column prefix 'ITM' does not match with a table name or alias name used in the query.
As mentioned earlier, the PIVOT operator works on the virtual table returned by table_expression and not on the columns in the select_list. The select_list is evaluated after the PIVOT operator performs its manipulations and can refer only to group_by_list and column_list. That is why the ITM alias is no longer recognized in the select_list. If you understand this, you realize that you should provide PIVOT with a table_expression that contains only the columns you want to work on. This includes the grouping columns (only one occurrence of itemid plus itemtype, whenmade and initialprice), the pivot column (attribute), and the value column (value). You can achieve this by using CTEs or derived tables. Here is an example using a CTE:
WITH PNT
AS
(
SELECT ITM.*, ATR.attribute, ATR.value
FROM AuctionItems AS ITM
JOIN ItemAttributes AS ATR
ON ITM.itemid = ATR.itemid
WHERE ITM.itemtype = 'Painting'
)
SELECT * FROM PNT
PIVOT
(
MAX(value)
FOR attribute IN([artist], [name], [type], [height], [width])
) AS PVT
Here is the result set:
itemid itemtype whenmade initialprice artist name type height width
------ -------- -------- ------------ ---------------- ---------------- ---- ------ -----
5 Painting 1873 8000000.0000 Claude Monet Field of Poppies Oil 19.62 25.62
6 Painting 1889 8000000.0000 Vincent Van Gogh The Starry Night Oil 28.75 36.25
Here is an example using a derived table:
SELECT *
FROM (SELECT ITM.*, ATR.attribute, ATR.value
FROM AuctionItems AS ITM
JOIN ItemAttributes AS ATR
ON ITM.itemid = ATR.itemid
WHERE ITM.itemtype = 'Painting') AS PNT
PIVOT
(
MAX(value)
FOR attribute IN([artist], [name], [type], [height], [width])
) AS PVT
You can also use PIVOT when you want to generate a crosstab report to summarize data. For example, using the Purchasing.PurchaseOrderHeader table in the AdventureWorks database, suppose you want to return the number of orders each employee made using each ship method, pivoting the ship method IDs to columns. Keeping in mind that you should provide the PIVOT operator with only the relevant data, you use a derived table and write the following query:
SELECT EmployeeID, [1] AS SM1, [2] AS SM2,
[3] AS SM3, [4] AS SM4, [5] AS SM5
FROM (SELECT PurchaseOrderID, EmployeeID, ShipMethodID
FROM Purchasing.PurchaseOrderHeader) ORD
PIVOT
(
COUNT(PurchaseOrderID)
FOR ShipMethodID IN([1], [2], [3], [4], [5])
) AS PVT
Here is the result set:
EmployeeID SM1 SM2 SM3 SM4 SM5
----------- ----------- ----------- ----------- ----------- -----------
164 56 62 12 89 141
198 24 27 6 45 58
223 56 67 17 98 162
231 50 67 12 81 150
233 55 62 12 106 125
238 53 58 13 102 134
241 50 59 13 108 130
244 55 47 17 93 148
261 58 54 11 120 117
264 50 58 15 86 151
266 58 68 14 116 144
274 24 26 6 41 63
The COUNT(PurchaseOrderID) function counts the number of rows for each ship method in the list. Note that PIVOT disallows the use of COUNT(*). Column aliases are used to provide more descriptive names to the result columns. Using PIVOT to show an order count for each ship method in a different column is reasonable when you have a small number of ship methods whose IDs are known ahead of time.
You can also pivot values that are derived from expressions. For example, suppose you want to return the total freight value for each employee in each order year, pivoting the years to columns. The order year is derived from the OrderDate column:
SELECT EmployeeID, [2001] AS Y2001, [2002] AS Y2002,
[2003] AS Y2003, [2004] AS Y2004
FROM (SELECT EmployeeID, YEAR(OrderDate) AS OrderYear, Freight
FROM Purchasing.PurchaseOrderHeader) AS ORD
PIVOT
(
SUM(Freight)
FOR OrderYear IN([2001], [2002], [2003], [2004])
) AS PVT
Here is the result set:
EmployeeID Y2001 Y2002 Y2003 Y2004
----------- ----------- ----------- ----------- ------------
164 509.9325 14032.0215 34605.3459 105087.7428
198 NULL 5344.4771 14963.0595 45020.9178
223 365.7019 12496.0776 37489.2896 117599.4156
231 6.8025 9603.0502 37604.3258 75435.8619
233 1467.1388 9590.7355 32988.0643 98603.745
238 17.3345 9745.1001 37836.583 100106.3678
241 221.1825 6865.7299 35559.3883 114430.983
244 5.026 5689.4571 35449.316 74690.3755
261 NULL 10483.27 32854.9343 73992.8431
264 NULL 10337.3207 37170.1957 82406.4474
266 4.2769 9588.8228 38533.9582 115291.2472
274 NULL 1877.2665 13708.9336 41011.3821
Crosstab reports are common in data warehouse scenarios. Consider the following OrdersFact table, which you populate with sales orders and sales order details data from AdventureWorks:
CREATE TABLE OrdersFact
(
OrderID INT NOT NULL,
ProductID INT NOT NULL,
CustomerID INT NOT NULL,
OrderYear INT NOT NULL,
OrderMonth INT NOT NULL,
OrderDay INT NOT NULL,
Quantity INT NOT NULL,
PRIMARY KEY(OrderID, ProductID)
)
INSERT INTO OrdersFact
SELECT O.SalesOrderID, OD.ProductID, O.CustomerID,
YEAR(O.OrderDate) AS OrderYear, MONTH(O.OrderDate) AS OrderMonth,
DAY(O.OrderDate) AS OrderDay, OD.OrderQty
FROM Sales.SalesOrderHeader AS O
JOIN Sales.SalesOrderDetail AS OD
ON O.SalesOrderID = OD.SalesOrderID
To get the total quantities for each year and month, returning years in rows and months in columns, you use the following query:
SELECT *
FROM (SELECT OrderYear, OrderMonth, Quantity
FROM OrdersFact) AS ORD
PIVOT
(
SUM(Quantity)
FOR OrderMonth IN([1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12])
) AS PVT
Here is the result set:
OrderYear 1 2 3 4 5 6 7 8 9 10 11 12
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
2001 NULL NULL NULL NULL NULL NULL 966 2209 1658 1403 3132 2480
2002 1040 2303 1841 1467 3179 2418 7755 11325 9066 5584 8268 6672
2003 3532 5431 4132 5694 8278 6444 11288 18986 18681 11607 14771 15855
2004 9227 10999 11314 12239 15656 15805 2209 NULL NULL NULL NULL NULL
PIVOT returns null values for nonexistent intersections between year and month. A year appears in the result if it appears in the input table expression (the derived table ORD), regardless of whether or not it has an intersection with any of the specified months. This means that you might get a row with NULL in all columns if you do not specify all existing months. However, null values in the result do not necessarily represent nonexistent intersections. They might result from base null values in the quantity column, unless the column disallows null values. If you want to override NULL and see another value instead, for example 0, you can do so by using the ISNULL() function in the select list:
SELECT OrderYear,
ISNULL([1], 0) AS M01,
ISNULL([2], 0) AS M02,
ISNULL([3], 0) AS M03,
ISNULL([4], 0) AS M04,
ISNULL([5], 0) AS M05,
ISNULL([6], 0) AS M06,
ISNULL([7], 0) AS M07,
ISNULL([8], 0) AS M08,
ISNULL([9], 0) AS M09,
ISNULL([10], 0) AS M10,
ISNULL([11], 0) AS M11,
ISNULL([12], 0) AS M12
FROM (SELECT OrderYear, OrderMonth, Quantity
FROM OrdersFact) AS ORD
PIVOT
(
SUM(Quantity)
FOR OrderMonth IN([1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12])
) AS PVT
Here is the result set:
OrderYear 1 2 3 4 5 6 7 8 9 10 11 12
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
2001 0 0 0 0 0 0 966 2209 1658 1403 3132 2480
2002 1040 2303 1841 1467 3179 2418 7755 11325 9066 5584 8268 6672
2003 3532 5431 4132 5694 8278 6444 11288 18986 18681 11607 14771 15855
2004 9227 10999 11314 12239 15656 15805 2209 0 0 0 0 0
Using ISNULL(Quantity, 0) within the derived table would only take care of base null values in the Quantity column (if such existed) not of null values that PIVOT generated for nonexistent intersections.
Suppose you want to return the total quantities for each customer ID in the range 1–9 against a combination of year and month values in the first quarter of each of the years 2003 and 2004. To get the year and month values in rows and customer IDs in columns, use the following query:
SELECT *
FROM (SELECT CustomerID, OrderYear, OrderMonth, Quantity
FROM OrdersFact
WHERE CustomerID BETWEEN 1 AND 9
AND OrderYear IN(2003, 2004)
AND OrderMonth IN(1, 2, 3)) AS ORD
PIVOT
(
SUM(Quantity)
FOR CustomerID IN([1],[2],[3],[4],[5],[6],[7],[8],[9])
) AS PVT
Here is the result set:
OrderYear OrderMonth 1 2 3 4 5 6 7 8 9
----------- ----------- ---- ---- ---- ---- ---- ---- ---- ---- ----
2003 1 NULL NULL NULL 105 NULL NULL 8 NULL NULL
2004 1 NULL NULL NULL 80 NULL NULL NULL NULL NULL
2003 2 NULL 5 NULL NULL NULL NULL NULL NULL 15
2004 2 NULL 10 NULL NULL NULL NULL NULL 6 3
2003 3 NULL NULL 105 NULL 15 NULL NULL NULL NULL
2004 3 NULL NULL 103 NULL 25 4 NULL NULL NULL
The implied group-by list in this case is OrderYear and OrderMonth, because CustomerID and Quantity are used as the pivot and value columns, respectively.
However, if you want the combination of year and month values to appear as columns, you must concatenate them yourself before passing them to the PIVOT operator, because there can be only one pivot column:
SELECT *
FROM (SELECT CustomerID, OrderYear*100+OrderMonth AS YM, Quantity
FROM OrdersFact
WHERE CustomerID BETWEEN 1 AND 9
AND OrderYear IN(2003, 2004)
AND OrderMonth IN(1, 2, 3)) AS ORD
PIVOT
(
SUM(Quantity)
FOR YM IN([200301],[200302],[200303],[200401],[200402],[200403])
) AS PVT
Here is the result set:
CustomerID 200301 200302 200303 200401 200402 200403
---------- ------ ------ ------ ------ ------ ------
2 NULL 5 NULL NULL 10 NULL
3 NULL NULL 105 NULL NULL 103
6 NULL NULL NULL NULL NULL 4
4 105 NULL NULL 80 NULL NULL
8 NULL NULL NULL NULL 6 NULL
5 NULL NULL 15 NULL NULL 25
7 8 NULL NULL NULL NULL NULL
9 NULL 15 NULL NULL 3 NULL
The UNPIVOT operator allows you to normalize prepivoted data. The syntax and elements of the UNPIVOT operator are similar to those of the PIVOT operator.
For example, consider the AuctionItems table from the previous section:
itemid itemtype whenmade initialprice
----------- ------------------------ ----------- --------------
1 Wine 1822 3000.0000
2 Wine 1807 500.0000
3 Chair 1753 800000.0000
4 Ring -501 1000000.0000
5 Painting 1873 8000000.0000
6 Painting 1889 8000000.0000
Suppose you want each attribute to appear in a different row similar to the way attributes are kept in the ItemAttributes table:
itemid attribute value
----------- --------------- -------
1 itemtype Wine
1 whenmade 1822
1 initialprice 3000.00
2 itemtype Wine
2 whenmade 1807
2 initialprice 500.00
3 itemtype Chair
3 whenmade 1753
3 initialprice 800000.00
4 itemtype Ring
4 whenmade -501
4 initialprice 1000000.00
5 itemtype Painting
5 whenmade 1873
5 initialprice 8000000.00
6 itemtype Painting
6 whenmade 1889
6 initialprice 8000000.00
In the UNPIVOT query, you want to rotate the columns itemtype, whenmade, and initialprice to rows. Each row should have the item ID, attribute, and value. The new column names that you must provide are attribute and value. They correspond to the pivot_column and value_column in the PIVOT operator. The attribute column should get the actual column names that you want to rotate (itemtype, whenmade, and initialprice) as values. The value column should get the values from the three different source columns into one destination column. To help clarify, an UNPIVOT query that is not valid is presented first, followed by a valid one that applies the required restrictions:
SELECT itemid, attribute, value
FROM AuctionItems
UNPIVOT
(
value FOR attribute IN([itemtype], [whenmade], [initialprice])
) AS UPV
As arguments to the UNPIVOT operator, you provide a name for value_column (value in this case) followed by the FOR clause. Following the FOR clause, provide a name for pivot_column (attribute in this case) and then an IN clause with the list of source column names you want to get as values in pivot_column. This list of columns corresponds to <column_list> in the PIVOT operator. This query generates the following error:
.Net SqlClient Data Provider: Msg 8167, Level 16, State 1, Line 1
Type of column 'whenmade' conflicts with the type of other columns specified in the UNPIVOT list.
The destination value column contains values originating from several different source columns (those that appear in <column_list>). Because the target of all column values is a single column, UNPIVOT requires that all columns in <column_list> have the same data type, length, and precision. To meet this restriction, you can provide the UNPIVOT operator with a table expression that converts the three columns to the same data type. The sql_variant data type is a good candidate because you can convert the different source columns to the same data type and still retain their original data types. Applying this restriction, you revise the previous query as follows and get the result you want:
SELECT itemid, attribute, value
FROM (SELECT itemid,
CAST(itemtype AS SQL_VARIANT) AS itemtype,
CAST(whenmade AS SQL_VARIANT) AS whenmade,
CAST(initialprice AS SQL_VARIANT) AS initialprice
FROM AuctionItems) AS ITM
UNPIVOT
(
value FOR attribute IN([itemtype], [whenmade], [initialprice])
) AS UPV
The data type of the result attribute column is sysname. This is the data type SQL Server uses for storing object names.
Note that the UNPIVOT operator eliminates null values in the value column from the result; therefore, it cannot be considered the exact reverse of the PIVOT operator.
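The following minimal sketch, using a hypothetical table T2 with nullable columns, demonstrates the elimination:
CREATE TABLE T2
(
keycol INT NOT NULL PRIMARY KEY,
c1 INT NULL,
c2 INT NULL
)
INSERT INTO T2 VALUES(1, 10, NULL)
-- Only (1, c1, 10) is returned; the NULL in c2 is eliminated
SELECT keycol, attribute, value
FROM T2
UNPIVOT
(
value FOR attribute IN([c1], [c2])
) AS UPV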
Having rotated the columns in AuctionItems into rows, you can now union the result of the UNPIVOT operation with the rows from ItemAttributes to provide a unified result:
SELECT itemid, attribute, value
FROM (SELECT itemid,
CAST(itemtype AS SQL_VARIANT) AS itemtype,
CAST(whenmade AS SQL_VARIANT) AS whenmade,
CAST(initialprice AS SQL_VARIANT) AS initialprice
FROM AuctionItems) AS ITM
UNPIVOT
(
value FOR attribute IN([itemtype], [whenmade], [initialprice])
) AS UPV
UNION ALL
SELECT *
FROM ItemAttributes
ORDER BY itemid, attribute
Here is the result set:
itemid attribute value
----------- --------------- -------------
1 color Red
1 initialprice 3000.00
1 itemtype Wine
1 manufacturer ABC
1 type Pinot Noir
1 whenmade 1822
2 color Red
2 initialprice 500.00
2 itemtype Wine
2 manufacturer XYZ
2 type Porto
2 whenmade 1807
3 initialprice 800000.00
3 itemtype Chair
3 material Wood
3 padding Silk
3 whenmade 1753
4 initialprice 1000000.00
4 inscription One ring
4 itemtype Ring
4 material Gold
4 size 10
4 whenmade -501
5           artist          Claude Monet
5           height          19.625
5           initialprice    8000000.00
5           itemtype        Painting
5           name            Field of Poppies
5           type            Oil
5           whenmade        1873
5           width           25.625
6           artist          Vincent Van Gogh
6           height          28.750
6           initialprice    8000000.00
6           itemtype        Painting
6           name            The Starry Night
6           type            Oil
6           whenmade        1889
6           width           36.250
The APPLY relational operator allows you to invoke a specified table-valued function once for each row of an outer table expression. You specify APPLY in the FROM clause of a query, similar to the way you use the JOIN relational operator. APPLY comes in two forms: CROSS APPLY and OUTER APPLY. With the APPLY operator, SQL Server 2005 Beta 2 allows you to refer to a table-valued function in a correlated subquery.
CROSS APPLY invokes a table-valued function for each row in an outer table expression. You can refer to columns in the outer table as arguments to the table-valued function. CROSS APPLY returns a unified result set out of all of the results returned by the individual invocations of the table-valued function. If the table-valued function returns an empty set for a given outer row, that outer row is not returned in the result. For example, the following table-valued function accepts two integers as arguments and returns a table with one row, with the minimum and maximum values as columns:
CREATE FUNCTION dbo.fn_scalar_min_max(@p1 AS INT, @p2 AS INT) RETURNS TABLE
AS
RETURN
SELECT
CASE
WHEN @p1 < @p2 THEN @p1
WHEN @p2 < @p1 THEN @p2
ELSE COALESCE(@p1, @p2)
END AS mn,
CASE
WHEN @p1 > @p2 THEN @p1
WHEN @p2 > @p1 THEN @p2
ELSE COALESCE(@p1, @p2)
END AS mx
GO
SELECT * FROM fn_scalar_min_max(10, 20)
Here is the result set:
mn mx
----------- -----------
10 20
Given the following T1 table:
CREATE TABLE T1
(
col1 INT NULL,
col2 INT NULL
)
INSERT INTO T1 VALUES(10, 20)
INSERT INTO T1 VALUES(20, 10)
INSERT INTO T1 VALUES(NULL, 30)
INSERT INTO T1 VALUES(40, NULL)
INSERT INTO T1 VALUES(50, 50)
You want to invoke fn_scalar_min_max for each row in T1. You write a CROSS APPLY query as follows:
SELECT *
FROM T1 CROSS APPLY fn_scalar_min_max(col1, col2) AS M
Here is the result set:
col1 col2 mn mx
----------- ----------- ----------- -----------
10 20 10 20
20 10 10 20
NULL 30 30 30
40 NULL 40 40
50 50 50 50
If the table-valued function returns multiple rows for a certain outer row, the outer row is returned multiple times. Consider the Employees table used earlier in this paper in the Recursive Queries and Common Table Expressions section (Employees Organizational Chart scenario). In the same database, you also create the following Departments table:
CREATE TABLE Departments
(
deptid INT NOT NULL PRIMARY KEY,
deptname VARCHAR(25) NOT NULL,
deptmgrid INT NULL REFERENCES Employees
)
SET NOCOUNT ON
INSERT INTO Departments VALUES(1, 'HR', 2)
INSERT INTO Departments VALUES(2, 'Marketing', 7)
INSERT INTO Departments VALUES(3, 'Finance', 8)
INSERT INTO Departments VALUES(4, 'R&D', 9)
INSERT INTO Departments VALUES(5, 'Training', 4)
INSERT INTO Departments VALUES(6, 'Gardening', NULL)
Most departments have a manager ID that corresponds to an employee in the Employees table, but as in the case of the Gardening department, it is possible for a department to have no manager. Note that a manager in the Employees table does not necessarily also manage a department. The following table-valued function accepts an employee ID as an argument and returns that employee and all of his or her subordinates at all levels:
CREATE FUNCTION dbo.fn_getsubtree(@empid AS INT) RETURNS @TREE TABLE
(
empid INT NOT NULL,
empname VARCHAR(25) NOT NULL,
mgrid INT NULL,
lvl INT NOT NULL
)
AS
BEGIN
WITH Employees_Subtree(empid, empname, mgrid, lvl)
AS
(
-- Anchor Member (AM)
SELECT empid, empname, mgrid, 0
FROM employees
WHERE empid = @empid
UNION all
-- Recursive Member (RM)
SELECT e.empid, e.empname, e.mgrid, es.lvl+1
FROM employees AS e
JOIN employees_subtree AS es
ON e.mgrid = es.empid
)
INSERT INTO @TREE
SELECT * FROM Employees_Subtree
RETURN
END
GO
To return all of the subordinates at all levels for the manager of each department, use the following query:
SELECT *
FROM Departments AS D
CROSS APPLY fn_getsubtree(D.deptmgrid) AS ST
Here is the result set:
deptid deptname deptmgrid empid empname mgrid lvl
----------- ---------- ----------- ----------- ---------- ----------- ---
1 HR 2 2 Andrew 1 0
1 HR 2 5 Steven 2 1
1 HR 2 6 Michael 2 1
2 Marketing 7 7 Robert 3 0
2 Marketing 7 11 David 7 1
2 Marketing 7 12 Ron 7 1
2 Marketing 7 13 Dan 7 1
2 Marketing 7 14 James 11 2
3 Finance 8 8 Laura 3 0
4 R&D 9 9 Ann 3 0
5 Training 4 4 Margaret 1 0
5 Training 4 10 Ina 4 1
There are two things to notice here. First, each row from Departments is duplicated as many times as there are rows returned from fn_getsubtree for the department's manager. Second, the Gardening department does not appear in the result because fn_getsubtree returned an empty set for it.
Another practical use of the CROSS APPLY operator answers a common request: returning n rows for each group. For example, the following function returns the requested number of most recent orders for a given customer:
USE AdventureWorks
GO
CREATE FUNCTION fn_getnorders(@custid AS INT, @n AS INT)
RETURNS TABLE
AS
RETURN
SELECT TOP(@n) *
FROM Sales.SalesOrderHeader
WHERE CustomerID = @custid
ORDER BY OrderDate DESC
GO
Using the CROSS APPLY operator, you can get the two most recent orders for each customer by using the following simple query:
SELECT O.*
FROM Sales.Customer AS C
CROSS APPLY fn_getnorders(C.CustomerID, 2) AS O
For more information about the enhancement to TOP, see "TOP Enhancements" later in this paper.
OUTER APPLY is very similar to CROSS APPLY, but it also returns rows from the outer table for which the table-valued function returned an empty set. Null values are returned as the column values that correspond to the columns of the table-valued function. For example, revise the query against the Departments table from the previous section to use OUTER APPLY instead of CROSS APPLY and notice the last row in the output:
SELECT *
FROM Departments AS D
OUTER APPLY fn_getsubtree(D.deptmgrid) AS ST
Here is the result set:
deptid deptname deptmgrid empid empname mgrid lvl
----------- ---------- ----------- ----------- ---------- ----------- ---
1 HR 2 2 Andrew 1 0
1 HR 2 5 Steven 2 1
1 HR 2 6 Michael 2 1
2 Marketing 7 7 Robert 3 0
2 Marketing 7 11 David 7 1
2 Marketing 7 12 Ron 7 1
2 Marketing 7 13 Dan 7 1
2 Marketing 7 14 James 11 2
3 Finance 8 8 Laura 3 0
4 R&D 9 9 Ann 3 0
5 Training 4 4 Margaret 1 0
5 Training 4 10 Ina 4 1
6 Gardening NULL NULL NULL NULL NULL
In SQL Server 2000, you cannot refer to table-valued functions within a correlated subquery. Along with introducing the APPLY relational operator, SQL Server 2005 Beta 2 removes this restriction. Now, within a subquery, you can provide a table-valued function with columns from the outer query as arguments. For example, if you want to return only those departments whose manager has at least three employees, you can write the following query:
SELECT *
FROM Departments AS D
WHERE (SELECT COUNT(*)
FROM fn_getsubtree(D.deptmgrid)) >= 3
Here is the result set:
deptid      deptname                  deptmgrid
----------- ------------------------- -----------
1 HR 2
2 Marketing 7
ANSI SQL defines four possible referential actions in support of a FOREIGN KEY constraint. You specify these actions to indicate how you want the system to react to a DELETE or UPDATE operation against a table that is referenced by a foreign key. SQL Server 2000 supports two of those actions: NO ACTION and CASCADE. SQL Server 2005 Beta 2 adds support for the SET DEFAULT and SET NULL referential actions.
The SET DEFAULT and SET NULL referential actions extend declarative referential integrity (DRI) capabilities. You use these options in conjunction with the ON UPDATE and ON DELETE clauses in a foreign key declaration. SET DEFAULT means that when, in a referenced table, you delete rows (ON DELETE) or update the referenced key (ON UPDATE), SQL Server sets the referencing column values of the related rows in the referencing table to the default value of the column. Similarly, SQL Server can react by setting the values to NULL if you use the SET NULL option, provided that the referencing column allows null values.
For example, the following Customers table has three real customers and a dummy customer:
CREATE TABLE Customers
(
customerid CHAR(5) NOT NULL,
/* other columns */
CONSTRAINT PK_Customers PRIMARY KEY(customerid)
)
INSERT INTO Customers VALUES('DUMMY')
INSERT INTO Customers VALUES('FRIDA')
INSERT INTO Customers VALUES('GNDLF')
INSERT INTO Customers VALUES('BILLY')
The Orders table keeps track of orders. An order does not necessarily have to be assigned to a real customer. If you enter an order and do not specify a customer ID, the DUMMY customer ID is assigned to the order by default. Upon a delete from the Customers table, you want SQL Server to set NULL in the customerid column of the related rows in Orders. Orders with NULL in the customerid column become "orphans," that is, they belong to no customer. Suppose that you also want to allow updates to the customerid column in Customers. You might want to cascade the update to the related rows in Orders, but suppose that a business rule in your company dictates otherwise: orders belonging to a customer whose ID was changed should be related to the default customer (DUMMY). Upon an update to the customerid column in Customers, you want SQL Server to assign the default value 'DUMMY' to the related customer IDs (customerid) in Orders. You create the Orders table with a foreign key as follows and populate it with some orders:
CREATE TABLE Orders
(
orderid INT NOT NULL,
customerid CHAR(5) NULL DEFAULT('DUMMY'),
orderdate DATETIME NOT NULL,
CONSTRAINT PK_Orders PRIMARY KEY(orderid),
CONSTRAINT FK_Orders_Customers
FOREIGN KEY(customerid)
REFERENCES Customers(customerid)
ON DELETE SET NULL
ON UPDATE SET DEFAULT
)
INSERT INTO Orders VALUES(10001, 'FRIDA', '20040101')
INSERT INTO Orders VALUES(10002, 'FRIDA', '20040102')
INSERT INTO Orders VALUES(10003, 'BILLY', '20040101')
INSERT INTO Orders VALUES(10004, 'BILLY', '20040103')
INSERT INTO Orders VALUES(10005, 'GNDLF', '20040104')
INSERT INTO Orders VALUES(10006, 'GNDLF', '20040105')
To test the SET NULL and SET DEFAULT options, issue the following DELETE and UPDATE statements:
DELETE FROM Customers
WHERE customerid = 'FRIDA'
UPDATE Customers
SET customerid = 'DOLLY'
WHERE customerid = 'BILLY'
As a result, FRIDA's orders are assigned null values in the customerid column, and BILLY's orders are assigned DUMMY:
orderid customerid orderdate
----------- ---------- ----------------------
10001 NULL 1/1/2004 12:00:00 AM
10002 NULL 1/2/2004 12:00:00 AM
10003 DUMMY 1/1/2004 12:00:00 AM
10004 DUMMY 1/3/2004 12:00:00 AM
10005 GNDLF 1/4/2004 12:00:00 AM
10006 GNDLF 1/5/2004 12:00:00 AM
Note that if you use the SET DEFAULT option and the referencing column has a non-null default value that does not have a corresponding value in the referenced table, you will get an error when issuing the triggering action. For example, if you delete the DUMMY customer from Customers and then update GNDLF's customerid to GLDRL, you will get an error. The UPDATE triggers a SET DEFAULT action that attempts to assign GNDLF's original orders with the DUMMY customer ID that does not have a corresponding row in Customers:
DELETE FROM Customers
WHERE customerid = 'DUMMY'
UPDATE Customers
SET customerid = 'GLDRL'
WHERE customerid = 'GNDLF'
.Net SqlClient Data Provider: Msg 547, Level 16, State 0, Line 1
UPDATE statement conflicted with COLUMN FOREIGN KEY constraint 'FK_Orders_Customers'. The conflict occurred in database 'tempdb', table 'Customers', column 'customerid'.
The statement has been terminated.
You can find more information about foreign keys, including their defined referential actions, by viewing sys.foreign_keys.
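For example, the following query (a minimal sketch) lists each foreign key together with its configured delete and update actions:
SELECT name,
  delete_referential_action_desc AS on_delete,
  update_referential_action_desc AS on_update
FROM sys.foreign_keys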
This section covers enhancements that address performance issues in previous versions of SQL Server, increase your data loading capabilities, and dramatically improve your error management capabilities. These enhancements include the BULK rowset provider and the TRY...CATCH error-handling construct.
BULK is a new rowset provider specified in the OPENROWSET function that allows you to access file data in a relational format. To retrieve data from a file, you specify the BULK option, the name of the file, and a format file created either with bcp.exe or manually. You can specify names for the result columns in the parentheses following the alias of the table returned from OPENROWSET.
Here is the new syntax of all the options that you can specify with OPENROWSET:
OPENROWSET
( { 'provider_name'
, { 'datasource' ; 'user_id' ; 'password' | 'provider_string' }
, { [ catalog. ] [ schema. ] object | 'query' }
| BULK 'data_filename',
{FORMATFILE = 'format_file_path' [, <bulk_options>] |
SINGLE_BLOB | SINGLE_CLOB | SINGLE_NCLOB}
}
)
<bulk_options> ::=
[ , CODEPAGE = 'ACP' | 'OEM' | 'RAW' | 'code_page' ]
[ , FIRSTROW = first_row ]
[ , LASTROW = last_row ]
[ , ROWS_PER_BATCH = 'rows_per_batch']
[ , MAXERRORS = 'max_errors']
[ , ERRORFILE = 'file_name']
For example, the following query returns three columns from the text file 'c:\temp\textfile1.txt' and provides the column aliases col1, col2, and col3 to the result columns:
SELECT col1, col2, col3
FROM OPENROWSET(BULK 'c:\temp\textfile1.txt',
FORMATFILE = 'c:\temp\textfile1.fmt') AS C(col1, col2, col3)
Note that when you use the BULK option, a format file must also be specified unless you use the SINGLE_BLOB, SINGLE_CLOB, or SINGLE_NCLOB options, which are described later. Therefore, there's no need to specify a data file type, field terminator, or row terminator. Other options that you can optionally specify along with FORMATFILE include: CODEPAGE, FIRSTROW, LASTROW, ROWS_PER_BATCH, MAXERRORS, and ERRORFILE. Most of these options were available with the BULK INSERT command in SQL Server 2000. The ERRORFILE option is conceptually new. This file contains zero or more rows that have formatting errors (that is, rows that cannot be converted to an OLEDB rowset) from the input data file. These rows are copied from the data file "as is" into the error file. The error file is created at the beginning of the command execution; an error is raised if the file already exists. Once the errors are fixed, the data in the error file is already in the desired format and can easily be reloaded using the same commands. By looking at the rows in this file, it is easy to identify the rows that failed, but not the cause of the failure. To address this, a control file with the extension .ERROR.txt is automatically created. This file references each row in ERRORFILE and provides error diagnostics.
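For example, the following code is a minimal sketch that extends the previous query with MAXERRORS and a hypothetical error file; up to 10 badly formatted rows are copied to the error file while the remaining rows are returned:
SELECT col1, col2, col3
FROM OPENROWSET(BULK 'c:\temp\textfile1.txt',
       FORMATFILE = 'c:\temp\textfile1.fmt',
       MAXERRORS = 10,
       ERRORFILE = 'c:\temp\textfile1.err') AS C(col1, col2, col3)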
You can populate a table with the results returned from OPENROWSET using the BULK rowset provider, and specify table options for the bulk load operation. For example, the following code loads the results of the previous query to the table MyTable, requesting to disable constraint checking in the target table:
INSERT INTO MyTable WITH (IGNORE_CONSTRAINTS)
SELECT col1, col2, col3
FROM OPENROWSET(BULK 'c:\temp\textfile1.txt',
FORMATFILE = 'c:\temp\textfile1.fmt') AS C(col1, col2, col3)
Besides the IGNORE_CONSTRAINTS option, other table hints that you can specify in the load operation include: BULK_KEEPIDENTITY, BULK_KEEPNULLS, and IGNORE_TRIGGERS.
You can also use the BULK provider to return file data as a single column value of a large object type by specifying one of the options: SINGLE_CLOB for character data, SINGLE_NCLOB for Unicode data, and SINGLE_BLOB for binary data. When you use one of these options, you don't specify a format file. You can load a file (using an INSERT or an UPDATE statement) into a large object column of one of the following data types: VARCHAR(MAX), NVARCHAR(MAX), VARBINARY(MAX), or XML. You can find details on the MAX specifier for variable-length columns and on the XML data type later in this paper.
As an example of loading a file into a large column, the following UPDATE statement loads the text file 'c:\temp\textfile101.txt' into the column txt_data in the table CustomerData for customer 101:
UPDATE CustomerData
SET txt_data = (SELECT txt_data FROM OPENROWSET(
BULK 'c:\temp\textfile101.txt', SINGLE_CLOB) AS F(txt_data))
WHERE custid = 101
Note that only one large column can be updated at a time.
The following example shows how to load a binary file for customer 102 into a large column using an INSERT statement:
INSERT INTO CustomerData(custid, binary_data)
SELECT 102 AS custid, binary_data
FROM OPENROWSET(
BULK 'c:\temp\binfile102.dat', SINGLE_BLOB) AS F(binary_data)
SQL Server 2005 Beta 2 introduces a simple but very powerful exception-handling mechanism in the form of a TRY...CATCH Transact-SQL construct.
Previous versions of SQL Server require you to include error-handling code after every statement that might cause an error. To centralize error-checking code, you must use labels and GOTO statements. Furthermore, errors such as data type conversion errors cause a batch to terminate; therefore, you cannot trap these errors with Transact-SQL. SQL Server 2005 Beta 2 addresses many of these issues.
Errors that used to cause a batch to terminate can now be caught and handled, provided that those errors do not cause severance of the connection (typically errors with severity 21 and up, such as table or database integrity suspect, hardware errors, and so on).
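For example, the following minimal sketch traps a conversion error that in earlier releases would have terminated the batch:
BEGIN TRY
  SELECT CAST('abc' AS INT) -- fails; conversion errors are now trappable
  PRINT 'Conversion succeeded.'
END TRY
BEGIN CATCH
  PRINT 'Conversion error caught.'
END CATCH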
Write the code you want to execute within a BEGIN TRY/END TRY block and follow with the error-handling code in a BEGIN CATCH/END CATCH block. Note that a TRY block must have a corresponding CATCH block; otherwise, you will get a syntax error. As a simple example, consider the following Employees table:
CREATE TABLE Employees
(
empid INT NOT NULL,
empname VARCHAR(25) NOT NULL,
mgrid INT NULL,
/* other columns */
CONSTRAINT PK_Employees PRIMARY KEY(empid),
CONSTRAINT CHK_Employees_empid CHECK(empid > 0),
CONSTRAINT FK_Employees_Employees
FOREIGN KEY(mgrid) REFERENCES Employees(empid)
)
You want to write code that inserts a new employee row into the table. You also want to respond to a failure situation with some corrective activity. Use the new TRY...CATCH construct as follows:
BEGIN TRY
INSERT INTO Employees(empid, empname, mgrid)
VALUES(1, 'Emp1', NULL)
PRINT 'After INSERT.'
END TRY
BEGIN CATCH
PRINT 'INSERT failed.'
/* perform corrective activity */
END CATCH
When you run this code for the first time, you should get the output 'After INSERT.' When you run it a second time, you should get the output 'INSERT failed.'
If the code within the TRY block completes with no errors, control is passed to the first statement following the corresponding CATCH block. When a statement within the TRY block fails, control is passed to the first statement within the corresponding CATCH block. Note that if an error is trapped by a CATCH block, it is not returned to the calling application. If you also want the application to get the error information, you must provide the information to the application yourself (for example, by using RAISERROR or a result set of a query). All of the error information is available to you in the CATCH block by means of four new functions: ERROR_NUMBER(), ERROR_MESSAGE(), ERROR_SEVERITY(), and ERROR_STATE(). These functions can be queried multiple times, anywhere you like in the CATCH block, and their values remain the same. This is in contrast to the @@error function, which is affected by any statement other than DECLARE, so it must be queried in the first statement of the CATCH block. ERROR_NUMBER() can be used as an alternative to @@error, while the other three functions give you the rest of the information exactly as it was generated by the error. Such information was not available in SQL Server releases earlier than SQL Server 2005.
If an unhandled error is generated in a batch or routine (stored procedure, trigger, user defined function, dynamic code), and an upper level of code invoked that batch or routine within a TRY block, control is passed to the upper level's corresponding CATCH block. If the upper level did not invoke the inner level within a TRY block, SQL Server will keep looking for a TRY block in upper levels in the call stack, and will pass control to the CATCH block of the first TRY...CATCH construct found. If none is found, the error is returned to the calling application.
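The following minimal sketch demonstrates this behavior with a hypothetical procedure usp_inner that has no error handling of its own; an error generated inside it is trapped by the caller's CATCH block:
CREATE PROC dbo.usp_inner
AS
-- No TRY...CATCH here; an unhandled error propagates to the caller
INSERT INTO Employees(empid, empname, mgrid)
  VALUES(1, 'Emp1', NULL) -- PK violation if employee 1 already exists
GO
BEGIN TRY
  EXEC dbo.usp_inner
  PRINT 'Procedure completed.'
END TRY
BEGIN CATCH
  PRINT 'Error ' + CAST(ERROR_NUMBER() AS VARCHAR(10))
    + ' trapped at the outer level.'
END CATCH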
As a more detailed example, the following code reacts differently depending on the type of error that caused the failure, and also prints messages indicating which parts of the code have been activated:
PRINT 'Before TRY...CATCH block.'
BEGIN TRY
PRINT ' Entering TRY block.'
INSERT INTO Employees(empid, empname, mgrid) VALUES(2, 'Emp2', 1)
PRINT ' After INSERT.'
PRINT ' Exiting TRY block.'
END TRY
BEGIN CATCH
PRINT ' Entering CATCH block.'
IF ERROR_NUMBER() = 2627
BEGIN
PRINT ' Handling PK violation...'
END
ELSE IF ERROR_NUMBER() = 547
BEGIN
PRINT ' Handling CHECK/FK constraint violation...'
END
ELSE IF ERROR_NUMBER() = 515
BEGIN
PRINT ' Handling NULL violation...'
END
ELSE IF ERROR_NUMBER() = 245
BEGIN
PRINT ' Handling conversion error...'
END
ELSE
BEGIN
PRINT ' Handling unknown error...'
END
PRINT ' Error Number: ' + CAST(ERROR_NUMBER() AS VARCHAR(10))
PRINT ' Error Message: ' + ERROR_MESSAGE()
PRINT ' Error Severity: ' + CAST(ERROR_SEVERITY() AS VARCHAR(10))
PRINT ' Error State : ' + CAST(ERROR_STATE() AS VARCHAR(10))
PRINT ' Exiting CATCH block.'
END CATCH
PRINT 'After TRY...CATCH block.'
Notice that the ERROR_NUMBER() function is invoked multiple times in the CATCH block, and it always returns the number of the error that caused control to pass to the CATCH block. This code inserts employee 2 as a subordinate to the previously inserted employee 1 and should complete with no errors when run for the first time, generating the following output:
Before TRY...CATCH block.
Entering TRY block.
After INSERT.
Exiting TRY block.
After TRY...CATCH block.
Notice that the CATCH block was skipped. Running this code for the second time should generate the following output:
Before TRY...CATCH block.
Entering TRY block.
Entering CATCH block.
Handling PK violation...
Error Number: 2627
Error Message: Violation of PRIMARY KEY constraint 'PK_Employees'. Cannot insert duplicate key in object 'Employees'.
Error Severity: 14
Error State : 1
Exiting CATCH block.
After TRY...CATCH block.
Notice that the TRY block was entered but not completed. As a result of the primary key violation error, control was passed to the CATCH block, which identified the error and handled it. Similarly, if you assign a value that is not a valid employee ID, such as 0 (which violates the CHECK constraint), NULL (which is not allowed in empid), or 'a' (which cannot be converted to INT), you get the appropriate error, and the appropriate handling code is activated.
If you are using explicit transactions in the TRY block, you might want to investigate the transaction state in your error-handling code in the CATCH block to determine a course of action. SQL Server 2005 provides the new XACT_STATE() function, which returns the transaction state. Possible return values from the function are 0, -1, and 1. A 0 return value means no transaction is open; an attempt to commit or roll back the transaction would generate an error. A 1 return value means that a transaction is open and can be either committed or rolled back; you need to determine which, depending on your needs and your error-handling logic. A -1 return value means that a transaction is open but is in an uncommittable state, which is a new transaction state introduced in SQL Server 2005. A transaction within a TRY block enters an uncommittable state when an error is generated that would otherwise cause the transaction to be aborted (typically, severity 17 or higher). An uncommittable transaction keeps all open locks and allows you only to read data. You cannot submit any activity that requires writes to the transaction log, meaning you cannot change data while the transaction is in an uncommittable state. To terminate the transaction, you must issue a rollback; you cannot commit it. The following example demonstrates how to use the XACT_STATE() function:
BEGIN TRY
BEGIN TRAN
INSERT INTO Employees(empid, empname, mgrid) VALUES(3, 'Emp3', 1)
/* other activity */
COMMIT TRAN
PRINT 'Code completed successfully.'
END TRY
BEGIN CATCH
PRINT 'Error: ' + CAST(ERROR_NUMBER() AS VARCHAR(10)) + ' found.'
IF (XACT_STATE()) = -1
BEGIN
PRINT 'Transaction is open but uncommittable.'
/* ...investigate data... */
ROLLBACK TRANSACTION -- can only ROLLBACK
/* ...handle the error... */
END
ELSE IF (XACT_STATE()) = 1
BEGIN
PRINT 'Transaction is open and committable.'
/* ...handle error... */
COMMIT TRANSACTION -- or ROLLBACK
END
ELSE
BEGIN
PRINT 'No open transaction.'
/* ...handle error... */
END
END CATCH
The TRY block submits code within an explicit transaction. It inserts a new employee row and performs some other activity within the same transaction. The CATCH block prints the error number and investigates the transaction state to determine a course of action. If the transaction is open and uncommittable, the CATCH block investigates the data, rolls back the transaction, and then takes any corrective measures that require data modifications. If the transaction is open and committable, the CATCH block handles the error and commits (or could possibly roll back). If no transaction is open, the error is handled. No commit or rollback is issued. If you run this code for the first time, a new employee row for employee 3 is inserted and the code completes successfully with the following output:
Code completed successfully.
If you run this code for the second time, a primary key violation error is generated, and you get the following output:
Error: 2627 found.
Transaction is open and committable.
This section briefly describes other enhancements in SQL Server 2005 Beta 2 that affect Transact-SQL. These include enhancements to TOP, data manipulation language (DML) with results, MAX specifier for dynamic columns, XML/XQuery, data definition language (DDL) triggers, Queuing and SQL Server Service Broker, and DML events and notifications.
In SQL Server version 7.0 and SQL Server 2000, the TOP option allows you to limit the number or percentage of rows returned by a SELECT query; however, you must provide a constant as an argument. In SQL Server 2005 Beta 2, TOP is enhanced in the following major ways:
- You can now specify a numeric expression returning the number or percentage of rows to be affected by your query, optionally using variables and subqueries.
- You can now use the TOP option in DELETE, UPDATE, and INSERT queries.
The new syntax for queries using the TOP option is:
SELECT [TOP (<expression>) [PERCENT] [WITH TIES]]
FROM <table_name>...[ORDER BY...]
DELETE [TOP (<expression>) [PERCENT]] FROM <table_name> ...
UPDATE [TOP (<expression>) [PERCENT]] <table_name> SET ...
INSERT [TOP (<expression>) [PERCENT]] INTO <table_name> ...
The numeric expression must be specified in parentheses. Specifying constants without parentheses is supported in SELECT queries only for backward compatibility. The expression must be self-contained—if you use a subquery, it cannot refer to columns of the table in the outer query. If you do not specify the PERCENT option, the expression must be implicitly convertible to the bigint data type. If you specify the PERCENT option, the expression must be implicitly convertible to float and fall within the range of 0 to 100. The WITH TIES option and the ORDER BY clause are supported only with SELECT queries.
The following code, for example, uses a variable as an argument to the TOP option, returning the specified number of most recent purchase orders:
USE AdventureWorks
DECLARE @n AS BIGINT
SET @n = 2
SELECT TOP(@n) *
FROM Purchasing.PurchaseOrderHeader
ORDER BY OrderDate DESC
This enhancement is especially useful when you get the number of requested rows as an argument to a stored procedure or a user-defined function. By using a self-contained subquery, you can answer dynamic requests such as "calculate the average number of monthly orders and return that many of the most recent orders:"
USE AdventureWorks
SELECT TOP(SELECT
COUNT(*)/DATEDIFF(month, MIN(OrderDate), MAX(OrderDate))
FROM Purchasing.PurchaseOrderHeader) *
FROM Purchasing.PurchaseOrderHeader
ORDER BY OrderDate DESC
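The PERCENT option also accepts an expression. The following minimal sketch returns the most recent one and a half percent of the purchase orders:
DECLARE @pct AS FLOAT
SET @pct = 1.5
SELECT TOP(@pct) PERCENT *
FROM Purchasing.PurchaseOrderHeader
ORDER BY OrderDate DESC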
The SET ROWCOUNT option in earlier versions of SQL Server allows you to limit the number of rows affected by a query. For example, SET ROWCOUNT is commonly used to periodically purge large amounts of data in several small transactions instead of a single large transaction:
SET ROWCOUNT 1000
DELETE FROM BigTable WHERE datetimecol < '20000101'
WHILE @@rowcount > 0
DELETE FROM BigTable WHERE datetimecol < '20000101'
SET ROWCOUNT 0
Using SET ROWCOUNT in this way allows the backing up and recycling of the transaction log during the purging process, and can also prevent lock escalation. Instead of using SET ROWCOUNT, you can now use TOP this way:
DELETE TOP(1000) FROM BigTable WHERE datetimecol < '20000101'
WHILE @@rowcount > 0
DELETE TOP(1000) FROM BigTable WHERE datetimecol < '20000101'
When you use the TOP option, the optimizer can tell what the row goal is at compile time, allowing it to produce more efficient plans than it can when SET ROWCOUNT is used.
Although you might think that using TOP in INSERT statements is not required because you can always specify it in the SELECT query, you might find it useful when inserting the result of an EXEC command or the result of a UNION operation, for example:
INSERT TOP ... INTO ...
EXEC ...
INSERT TOP ... INTO ...
SELECT ... FROM T1
UNION ALL
SELECT ... FROM T2
ORDER BY ...
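For example, the following minimal sketch, assuming a hypothetical staging table OrdersStage and a hypothetical stored procedure usp_gethistoricorders with a compatible result set, loads at most 1000 of the rows the procedure returns:
-- OrdersStage and usp_gethistoricorders are hypothetical
INSERT TOP(1000) INTO OrdersStage
EXEC dbo.usp_gethistoricorders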
SQL Server 2005 introduces a new OUTPUT clause that allows you to return data from a modification statement (INSERT, UPDATE, DELETE) into a table variable. Useful scenarios for DML with results include purging and archiving, message-processing applications, and others as well. The syntax of the new OUTPUT clause is:
OUTPUT <dml_select_list> INTO @table_variable
You access the old/new image of the modified rows by referring to inserted and deleted tables similar to the way you do with triggers. In an INSERT statement, you are only allowed to access the inserted table. In a DELETE statement, you are only allowed to access the deleted table. In an UPDATE statement, you are allowed to access both the inserted and the deleted tables.
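For example, the following minimal sketch, assuming a hypothetical Prices table with productid and price columns, captures both the old and the new price of each updated row:
DECLARE @Changes TABLE(productid INT, oldprice MONEY, newprice MONEY)
UPDATE Prices
  SET price = price * 1.1
  OUTPUT inserted.productid, deleted.price, inserted.price INTO @Changes
-- @Changes now holds one row per updated product
SELECT * FROM @Changes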
As an example of a purging and archiving scenario where DML with results can be useful, suppose you have a large Orders table, and that you periodically want to purge historic data. You also want to copy the purged data into an archive table called OrdersArchive. You declare a table variable called @DeletedOrders and enter a loop in which you delete historic data (say orders earlier than 2003) in chunks using the purging method described in the "TOP Enhancements" section earlier in this paper. The addition here is the OUTPUT clause which copies all attributes of all deleted rows into the @DeletedOrders table variable, and then, using an INSERT INTO statement, copies all rows from the table variable into the OrdersArchive table:
DECLARE @DeletedOrders TABLE
(
orderid INT,
orderdate DATETIME,
empid INT,
custid VARCHAR(5),
qty INT
)
WHILE 1=1
BEGIN
BEGIN TRAN
DELETE TOP(5000) FROM Orders
OUTPUT deleted.* INTO @DeletedOrders
WHERE orderdate < '20030101'
INSERT INTO OrdersArchive
SELECT * FROM @DeletedOrders
COMMIT TRAN
DELETE FROM @DeletedOrders
IF @@rowcount < 5000
BREAK
END
As an example of a message-processing scenario, consider the following Messages table:
USE tempdb
CREATE TABLE Messages
(
msgid INT NOT NULL IDENTITY ,
msgdate DATETIME NOT NULL DEFAULT(GETDATE()),
msg VARCHAR(MAX) NOT NULL,
status VARCHAR(20) NOT NULL DEFAULT('new'),
CONSTRAINT PK_Messages
PRIMARY KEY NONCLUSTERED(msgid),
CONSTRAINT UNQ_Messages_status_msgid
UNIQUE CLUSTERED(status, msgid),
CONSTRAINT CHK_Messages_status
CHECK (status IN('new', 'open', 'done'))
)
For each message, you store a message ID, entry date, message text, and a status indicating whether the message wasn't processed yet ("new"), is being processed ("open"), or is already processed ("done").
The following code simulates a session that generates messages by using a loop that inserts a message with random text every second. The status of newly inserted messages is "new" because the status column is assigned the default value 'new.' Run this code from multiple sessions at the same time:
USE tempdb
SET NOCOUNT ON
DECLARE @msg AS VARCHAR(MAX)
WHILE 1=1
BEGIN
SET @msg = 'msg' + RIGHT('000000000'
+ CAST(CAST(RAND()*2000000000 AS INT)+1 AS VARCHAR(10)), 10)
INSERT INTO dbo.Messages(msg) VALUES(@msg)
WAITFOR DELAY '00:00:01';
END
The following code simulates a session that processes messages using the following steps:
- Form an endless loop that constantly processes messages.
- Lock one available new message using an UPDATE TOP(1) statement with the READPAST hint to skip locked rows, and change its status to "open."
- Store the message attributes in the @Msgs table variable using the OUTPUT clause.
- Process the message.
- Set the message status to "done" by joining the Messages table and the @Msgs table variable.
- If no new message was found in the Messages table, wait for one second.
Run this code from multiple sessions:
USE tempdb
SET NOCOUNT ON
DECLARE @Msgs TABLE(msgid INT, msgdate DATETIME, msg VARCHAR(MAX))
WHILE 1 = 1
BEGIN
UPDATE TOP(1) Messages WITH(READPAST) SET status = 'open'
OUTPUT inserted.msgid, inserted.msgdate, inserted.msg
INTO @Msgs
WHERE status = 'new'
IF @@rowcount > 0
BEGIN
PRINT 'Processing message...'
-- process message here
SELECT * FROM @Msgs
UPDATE M
SET status = 'done'
FROM Messages AS M
JOIN @Msgs AS N
ON M.msgid = N.msgid;
DELETE FROM @Msgs
END
ELSE
BEGIN
PRINT 'No messages to process.'
WAITFOR DELAY '00:00:01'
END
END
Once you are done running the simulation, stop all sessions that insert and process messages and drop the Messages table:
USE tempdb
DROP TABLE Messages
SQL Server 2005 enhances the capabilities of the variable-length data types VARCHAR, NVARCHAR, and VARBINARY by introducing the MAX specifier using the syntax <datatype>(MAX). A variable-length data type with the MAX specifier substitutes for the data types TEXT, NTEXT, and IMAGE and provides enhanced functionality. There are several advantages to using a variable-length data type with the MAX specifier as a substitute for the large object data types TEXT, NTEXT, and IMAGE. There's no need for explicit pointer manipulation because SQL Server internally determines when to store values inline and when to use a pointer. You can now use a unified programming model for small and large data. Variable-length data types with the MAX specifier are supported in columns, variables, parameters, comparisons, triggers, all string functions, and so on.
As an example of using the MAX specifier, the following code creates a table called CustomerData:
CREATE TABLE CustomerData
(
custid INT NOT NULL PRIMARY KEY,
txt_data VARCHAR(MAX) NULL,
ntxt_data NVARCHAR(MAX) NULL,
binary_data VARBINARY(MAX) NULL
)
The table contains the column custid, which is used as the primary key, and the nullable columns txt_data, ntxt_data, and binary_data defined with the data types VARCHAR(MAX), NVARCHAR(MAX), and VARBINARY(MAX) respectively where you can store large data.
In order to read a chunk from a dynamic column with the MAX specifier, you use the SUBSTRING function the same way you do with regular dynamic columns. In order to update a chunk, you use an enhanced syntax of the UPDATE statement, which now provides you with a WRITE method. The syntax of the enhanced UPDATE statement is:
UPDATE table_name
SET column_name.WRITE(@chunk, @offset, @len)
WHERE ...
The WRITE method removes @len characters starting at position @offset and inserts @chunk at that position. Note that @offset is zero-based, meaning that offset 0 represents the position of the first character of the existing column value. To demonstrate the use of the WRITE method, first insert a row into the CustomerData table with customer ID 102 and the value 'Customer 102 text data' in the txt_data column:
INSERT INTO CustomerData(custid,txt_data)
VALUES(102, 'Customer 102 text data')
The following UPDATE statement replaces '102' with 'one hundred and two':
UPDATE CustomerData
SET txt_data.WRITE('one hundred and two', 9, 3)
WHERE custid = 102
When @chunk is NULL, @len is ignored, and the value is truncated at the @offset position. The following statement removes all data from offset 28 to the end:
UPDATE CustomerData
SET txt_data.WRITE(NULL, 28, 0)
WHERE custid = 102
When @len is NULL, all characters from @offset to the end are removed, and @chunk is appended. The following statement removes all data from offset 9 to the end and appends '102':
UPDATE CustomerData
SET txt_data.WRITE('102', 9, NULL)
WHERE custid = 102
When @offset is NULL, @len is ignored, and @chunk is appended at the end. The following statement appends the string ' is discontinued' at the end:
UPDATE CustomerData
SET txt_data.WRITE(' is discontinued', NULL, 0)
WHERE custid = 102
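At this point, the four WRITE statements have been applied in sequence. As a quick sanity check, the following query returns the final value of the column, which should be the string 'Customer 102 is discontinued':
SELECT txt_data
FROM CustomerData
WHERE custid = 102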
SQL Server 2005 Beta 2 introduces several XML-related enhancements that allow you to store, query, and update XML-structured data natively. You can store both XML and relational data in the same database, leveraging the existing database engine for storage and query processing.
A new xml data type is introduced. The xml data type can be used for table columns and can even be indexed. The xml data type can also be used in variables, views, functions, and stored procedures. The xml data type can be generated by relational FOR XML queries or accessed as a relational rowset using OPENXML. You can import schemas into your database or export schemas out of it. You can use schemas to validate and constrain your XML data. You can query and modify XML-typed data by using XQuery. The xml data type is supported in triggers, replication, bulk copy, DBCC, and full-text search. However, xml is not comparable, meaning that you cannot define a PRIMARY KEY, UNIQUE, or FOREIGN KEY constraint on an xml column.
The following examples use the xml data type. The first defines an XML variable called @x and loads customer order data into it:
USE AdventureWorks
DECLARE @x AS XML
SET @x = (SELECT C.CustomerID, O.SalesOrderID
FROM Sales.Customer C
JOIN Sales.SalesOrderHeader O
ON C.CustomerID=O.CustomerID
ORDER BY C.CustomerID
FOR XML AUTO, TYPE)
SELECT @x
The following code creates a table with an xml column and bulk loads an XML file into it by using the OPENROWSET function:
CREATE TABLE T1
(
keycol INT NOT NULL PRIMARY KEY,
xmldoc XML NULL
)
INSERT INTO T1(keycol, xmldoc)
SELECT 1 AS keycol, xmldoc
FROM OPENROWSET(BULK 'C:\documents\mydoc.xml', SINGLE_NCLOB)
AS X(xmldoc)
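As mentioned earlier, columns of the xml data type can be indexed. The following statement is a sketch of creating a primary XML index on the xmldoc column, assuming the CREATE PRIMARY XML INDEX syntax; the exact syntax may change in later builds. Note that a primary XML index requires the table to have a clustered primary key, which T1 has on keycol:
CREATE PRIMARY XML INDEX idx_xmldoc ON T1(xmldoc)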
SQL Server 2005 Beta 2 also introduces support for XQuery, a W3C standard XML query language. Microsoft provides extensions to the standard in SQL Server that allow using XQuery for inserts, updates, and deletes. XQuery is embedded within Transact-SQL by means of user-defined type (UDT) style methods.
XQuery provides the following querying methods:
- Manipulating XML data: @x.query(xquery string) returning XML
- Checking existence: @x.exist(xquery string) returning bit
- Returning a scalar value: @x.value(xquery string, sql_type string) returning sql_type
XQuery provides the following modification method: @x.modify(xDML string).
As an example, a table called Jobs contains XML-formatted job information in a column called jobinfo. The following query returns the ID and XML data after some manipulation for every row that meets some criteria. The method jobinfo.exist() is invoked in the WHERE clause to filter the rows you want. It returns 1 only for rows whose jobinfo column contains the edited element with a date attribute that is greater than the Transact-SQL variable @date. For each returned row, an XML result is generated by invoking the jobinfo.query() method. For every job element found in jobinfo, the query() method generates a jobschedule element, with an id attribute based on the id attribute of the job, and begin and end subelements with data based on the start and end attributes from jobinfo:
SELECT id, jobinfo.query(
'for $j in //job
return
<jobschedule id="{$j/@id}">
<begin>{data($j/@start)}</begin>
<end>{data($j/@end)}</end>
</jobschedule>')
FROM Jobs
WHERE 1 = jobinfo.exist(
'//edited[@date > sql:variable("@date")]')
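The Jobs table itself is not defined in this paper. If you want to experiment with the queries in this section, the following code is a minimal sketch of a compatible table and sample row; the element and attribute names are assumptions inferred from the queries, and you also need to declare and set the @date datetime variable referenced in the WHERE clause before running the query above:
CREATE TABLE Jobs
(
id INT NOT NULL PRIMARY KEY,
jobinfo XML NOT NULL
)
INSERT INTO Jobs(id, jobinfo)
VALUES(1, '<jobs>
<edited date="2004-07-01"/>
<job id="1" start="2004-06-15T08:00:00" end="2004-06-15T17:00:00"/>
</jobs>')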
The following invocation of the value() method returns the start attribute of the first job in jobinfo in a Transact-SQL datetime format:
SELECT id, jobinfo.value('(//job)[1]/@start', 'DATETIME') AS startdt
FROM Jobs
WHERE id = 1
XQuery can also be used to modify data. For example, you use the following code to update the empinfo XML column in the Employees table for employee 1. You update the date attribute of the edited subelement of the resume element to a new value:
UPDATE Employees SET empinfo.modify(
'update /resume/edited/@date
to xs:date("2000-6-20")')
WHERE empid = 1
In earlier versions of SQL Server, you could define AFTER triggers only for DML statements (INSERT, UPDATE, and DELETE) issued against a table. SQL Server 2005 Beta 2 allows you to define triggers for DDL events with a scope of the entire server or database. You can define a DDL trigger for an individual DDL statement such as CREATE_TABLE, or for a group of statements such as DDL_DATABASE_LEVEL_EVENTS. Within the trigger, you can get data regarding the event that fired it by accessing the eventdata() function. This function returns XML data about the event. The schema for each event inherits the Server Events base schema.
Event information includes:
- When the event took place.
- The SPID from which the event was issued.
- The type of event.
- The affected object.
- SET options.
- The Transact-SQL statement that fired it.
Similar to triggers in earlier versions of SQL Server, DDL triggers run in the context of the transaction that fired them. If you decide to undo the event that fired the trigger, you can issue a ROLLBACK statement. For example, the following trigger prevents new tables from being created in the current database:
CREATE TRIGGER trg_capture_create_table ON DATABASE FOR CREATE_TABLE
AS
-- Print event information for debugging purposes
PRINT 'CREATE TABLE Issued'
PRINT EventData()
-- Can investigate data returned by EventData() and react accordingly.
RAISERROR('New tables cannot be created in this database.', 16, 1)
ROLLBACK
GO
If you issue the following CREATE TABLE statement within the database in which the trigger was created:
CREATE TABLE T1(col1 INT)
you should get the following output:
CREATE TABLE Issued
<EVENT_INSTANCE>
<PostTime>2003-04-17T13:55:47.093</PostTime>
<SPID>53</SPID>
<EventType>CREATE_TABLE</EventType>
<Database>testdb</Database>
<Schema>dbo</Schema>
<Object>T1</Object>
<ObjectType>TABLE</ObjectType>
<TSQLCommand>
<SetOptions ANSI_NULLS="ON" ANSI_NULL_DEFAULT="ON" ANSI_PADDING="ON"
QUOTED_IDENTIFIER="ON" ENCRYPTED="FALSE" />
<CommandText>CREATE TABLE T1(col1 INT)</CommandText>
</TSQLCommand>
</EVENT_INSTANCE>
.Net SqlClient Data Provider: Msg 50000, Level 16, State 1, Procedure trg_capture_create_table, Line 10
New tables cannot be created in this database.
.Net SqlClient Data Provider: Msg 3609, Level 16, State 1, Line 1
Transaction ended in trigger. Batch has been aborted.
Note that the XML output was manually formatted in this paper for readability purposes. You would get unformatted XML output when you run this code.
To drop the trigger, issue the following statement:
DROP TRIGGER trg_capture_create_table ON DATABASE
DDL triggers are especially useful for enforcing the integrity of DDL changes and for auditing, among other scenarios. As an example of DDL integrity enforcement, the following database-level trigger rejects attempts to create a table without a primary key:
CREATE TRIGGER trg_create_table_with_pk ON DATABASE FOR CREATE_TABLE
AS
DECLARE @eventdata AS XML, @objectname AS NVARCHAR(257),
@msg AS NVARCHAR(500)
SET @eventdata = eventdata()
SET @objectname =
N'[' + CAST(@eventdata.query('data(//SchemaName)') AS SYSNAME)
+ N'].[' +
CAST(@eventdata.query('data(//ObjectName)') AS SYSNAME) + N']'
IF OBJECTPROPERTY(OBJECT_ID(@objectname), 'TableHasPrimaryKey') = 0
BEGIN
SET @msg = N'Table ' + @objectname + ' does not contain a primary key.'
+ CHAR(10) + N'Table creation rolled back.'
RAISERROR(@msg, 16, 1)
ROLLBACK
RETURN
END
The trigger is fired when a CREATE TABLE statement is issued. Using XQuery, the trigger extracts the schema and object names and, using the OBJECTPROPERTY function, checks whether the table contains a primary key. If it doesn't, the trigger generates an error and rolls back the transaction. After creating the trigger, the following attempt to create a table without a primary key fails:
CREATE TABLE T1(col1 INT NOT NULL)
Msg 50000, Level 16, State 1, Procedure trg_create_table_with_pk, Line 19
Table [dbo].[T1] does not contain a primary key.
Table creation rolled back.
Msg 3609, Level 16, State 2, Line 1
Transaction ended in trigger. Batch has been aborted.
And the following succeeds:
CREATE TABLE T1(col1 INT NOT NULL PRIMARY KEY)
To drop the trigger and the table T1, run the following code:
DROP TRIGGER trg_create_table_with_pk ON DATABASE
DROP TABLE T1
As an example of an auditing trigger, the following database-level trigger audits all DDL statements to the AuditDDLEvents table:
CREATE TABLE AuditDDLEvents
(
LSN INT NOT NULL IDENTITY,
posttime DATETIME NOT NULL,
eventtype SYSNAME NOT NULL,
loginname SYSNAME NOT NULL,
schemaname SYSNAME NOT NULL,
objectname SYSNAME NOT NULL,
targetobjectname SYSNAME NOT NULL,
eventdata XML NOT NULL,
CONSTRAINT PK_AuditDDLEvents PRIMARY KEY(LSN)
)
GO
CREATE TRIGGER trg_audit_ddl_events ON DATABASE FOR DDL_DATABASE_LEVEL_EVENTS
AS
DECLARE @eventdata AS XML
SET @eventdata = eventdata()
INSERT INTO dbo.AuditDDLEvents(
posttime, eventtype, loginname, schemaname,
objectname, targetobjectname, eventdata)
VALUES(
CAST(@eventdata.query('data(//PostTime)') AS VARCHAR(23)),
CAST(@eventdata.query('data(//EventType)') AS SYSNAME),
CAST(@eventdata.query('data(//LoginName)') AS SYSNAME),
CAST(@eventdata.query('data(//SchemaName)') AS SYSNAME),
CAST(@eventdata.query('data(//ObjectName)') AS SYSNAME),
CAST(@eventdata.query('data(//TargetObjectName)') AS SYSNAME),
@eventdata)
GO
The trigger simply extracts all event attributes of interest from the eventdata() function using XQuery, and inserts those into the AuditDDLEvents table. To test the trigger, submit a few DDL statements and query the audit table:
CREATE TABLE T1(col1 INT NOT NULL PRIMARY KEY)
ALTER TABLE T1 ADD col2 INT NULL
ALTER TABLE T1 ALTER COLUMN col2 INT NOT NULL
CREATE NONCLUSTERED INDEX idx1 ON T1(col2)
SELECT * FROM AuditDDLEvents
To check who changed the schema of table T1 in the last 24 hours and how they changed it, run the following query:
SELECT posttime, eventtype, loginname,
CAST(eventdata.query('data(//TSQLCommand)') AS NVARCHAR(2000))
AS tsqlcommand
FROM dbo.AuditDDLEvents
WHERE schemaname = N'dbo' AND N'T1' IN(objectname, targetobjectname)
ORDER BY posttime
To drop the trigger and the tables you just created, run the following code:
DROP TRIGGER trg_audit_ddl_events ON DATABASE
DROP TABLE dbo.T1
DROP TABLE dbo.AuditDDLEvents
As an example of a server-level audit trigger, the following trigger audits all DDL login-related events to an audit table called AuditDDLLogins:
USE master
CREATE TABLE dbo.AuditDDLLogins
(
LSN INT NOT NULL IDENTITY,
posttime DATETIME NOT NULL,
eventtype SYSNAME NOT NULL,
loginname SYSNAME NOT NULL,
objectname SYSNAME NOT NULL,
logintype SYSNAME NOT NULL,
eventdata XML NOT NULL,
CONSTRAINT PK_AuditDDLLogins PRIMARY KEY(LSN)
)
CREATE TRIGGER audit_ddl_logins ON ALL SERVER
FOR DDL_LOGIN_EVENTS
AS
DECLARE @eventdata AS XML
SET @eventdata = eventdata()
INSERT INTO master.dbo.AuditDDLLogins(
posttime, eventtype, loginname,
objectname, logintype, eventdata)
VALUES(
CAST(@eventdata.query('data(//PostTime)') AS VARCHAR(23)),
CAST(@eventdata.query('data(//EventType)') AS SYSNAME),
CAST(@eventdata.query('data(//LoginName)') AS SYSNAME),
CAST(@eventdata.query('data(//ObjectName)') AS SYSNAME),
CAST(@eventdata.query('data(//LoginType)') AS SYSNAME),
@eventdata)
GO
To test the trigger, issue the following DDL login statements, which create, alter, and drop a login, and then query the audit table:
CREATE LOGIN login1 WITH PASSWORD = '123'
ALTER LOGIN login1 WITH PASSWORD = 'xyz'
DROP LOGIN login1
SELECT * FROM AuditDDLLogins
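As with the database-level audit table, you can then query the captured events in a more targeted way. For example, the following query (a simple sketch) lists the login-related events in the order in which they occurred:
SELECT posttime, eventtype, loginname, objectname, logintype
FROM master.dbo.AuditDDLLogins
ORDER BY posttime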
To drop the trigger, the audit table, and the testdb database used in the earlier examples, run the following code:
DROP TRIGGER audit_ddl_logins ON ALL SERVER
DROP TABLE dbo.AuditDDLLogins
DROP DATABASE testdb
SQL Server 2005 Beta 2 allows you to capture DDL and system events and send an event notification to a Service Broker service. Whereas triggers are processed synchronously, event notifications are an event delivery mechanism that allows asynchronous consumption. An event notification sends XML data to a specified Service Broker service, and event consumers consume it asynchronously. An event consumer can wait for new data to arrive using extensions to the WAITFOR clause.
An event notification is defined by:
- Scope (SERVER, DATABASE, ASSEMBLY, individual object)
- List of events or event groups (for example, CREATE_TABLE, DDL_EVENTS, and so on)
- Service Broker service name that implements the SQL Server Events message type and contract
Event data is sent in XML format using the SQL Server Events schema. The general syntax for creating an event notification is:
CREATE EVENT NOTIFICATION <name>
ON <scope>
FOR <list_of_event_or_event_groups>
TO SERVICE <service_name>
When an event notification is created, a Service Broker conversation is established between a system service and the service specified by the user. The <service_name> specifies the Service Broker service with which SQL Server opens a conversation to deliver data about an event. The service specified must implement the SQL Server Events message type and contract. When an event occurs for which an event notification exists, an XML message is constructed from the pertinent event data and sent through the event notification's conversation to the specified service.
For example, the following code creates a table called T1 and defines an event notification that sends a notice to a certain service every time the T1 table's schema is altered:
CREATE TABLE dbo.T1(col1 INT);
GO
-- Create a queue.
CREATE QUEUE SchemaChangeQueue;
GO
--Create a service on the queue that references
--the event notifications contract.
CREATE SERVICE SchemaChangeService
ON QUEUE SchemaChangeQueue
(
[//s.ms.net/SQL/Notifications/PostEventNotification/v1.0]
);
GO
--Create a route on the service to define the address
--to which Service Broker sends messages for the service.
CREATE ROUTE SchemaChangeRoute
WITH SERVICE_NAME = 'SchemaChangeService',
ADDRESS = 'LOCAL';
GO
--Create the event notification.
CREATE EVENT NOTIFICATION NotifySchemaChangeT1
ON TABLE dbo.T1
FOR ALTER_TABLE TO SERVICE [SchemaChangeService];
The following ALTER TABLE statement would then cause an XML message to be sent to SchemaChangeService, which is built on the SchemaChangeQueue:
ALTER TABLE dbo.T1 ADD col2 INT;
The XML message could then be retrieved from the queue with the following statement:
RECEIVE TOP (1) CAST(message_body AS nvarchar(MAX))
FROM SchemaChangeQueue
The resulting output would be as follows (minus the formatting):
<EVENT_INSTANCE>
<PostTime>2004-06-15T11:16:32.963</PostTime>
<SPID>55</SPID>
<EventType>ALTER_TABLE</EventType>
<ServerName>MATRIX\S1</ServerName>
<LoginName>MATRIX\Gandalf</LoginName>
<UserName>MATRIX\Gandalf</UserName>
<DatabaseName>testdb</DatabaseName>
<SchemaName>dbo</SchemaName>
<ObjectName>T1</ObjectName>
<ObjectType>TABLE</ObjectType>
<TSQLCommand>
<SetOptions ANSI_NULLS="ON" ANSI_NULL_DEFAULT="ON" ANSI_PADDING="ON" QUOTED_IDENTIFIER="ON" ENCRYPTED="FALSE" />
<CommandText>ALTER TABLE dbo.T1 ADD col2 INT;</CommandText>
</TSQLCommand>
</EVENT_INSTANCE>
The WAITFOR statement can be used to receive notifications in blocking mode, as in the following example, which waits on the queue used in this section:
WAITFOR (RECEIVE * FROM SchemaChangeQueue)
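If you prefer not to block indefinitely, the WAITFOR(RECEIVE ...) syntax also accepts a TIMEOUT clause, specified in milliseconds. The following sketch, based on the RECEIVE example shown earlier, waits up to five seconds for a message to arrive:
WAITFOR (
RECEIVE TOP(1) CAST(message_body AS NVARCHAR(MAX))
FROM SchemaChangeQueue
), TIMEOUT 5000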
The Transact-SQL enhancements in SQL Server 2005 Beta 2 increase your expressive power in query writing, allow you to improve the performance of your code, and extend your error management capabilities. The continuous effort being put into enhancing Transact-SQL shows a firm belief in its significant role in SQL Server, in its power, and in its future.