Filtering SQL Queries on XML Columns

Makale
05/20/2006

Author: Brian Otto

Storing xml data in the database is now very easy with the introduction of the xml data type. Xml data is inherently more complex than scalar values so when it comes time to retrieve data out of a table based on information contained in an xml column, it is just as important if not more important that this be performed efficiently.

SQL Server provides the ability to index your XML data. I won’t go into detail about the differences between the primary and secondary xml indexes. For more information on xml indexes refer to the References at the end. What I’ll explore below is how to construct SQL queries that filter results on xml columns so that the xml indexes can be used to their full potential.

Various circumstances may hinder the ability to always create queries that use the optimal access pattern, but understanding the techniques will help you select the best solution for your circumstance.

In addition, my examples do not have wildcard paths (e.g. //, /@*), but the techniques should apply equally as well. To compare the different queries, I used SQL Server Profiler and captured the CPU and Read measurements on my computer. These numbers will likely differ from your own measurements and should only be used for comparative analysis between queries run on the same computer.

I’ll be using the AdventureWorks sample database and all my queries will retrieve data from the Sales.Individual table. This table has a typed XML column called Demographics. Here is a sample of the data stored in this column:

<Education>Bachelors </Education>

<Occupation>Professional</Occupation>

<CommuteDistance>5-10 Miles</CommuteDistance>

</IndividualSurvey>

There are 18,484 records in the table and I’ve chosen to locate the 725 records (just a count(*)) where the Demographic column has /IndividualSurvey/Occupation = ‘Professional’ and /IndividualSurvey/NumberCarsOwned = 3.

I’ll define my filter conditions in variables that each query will use:

DECLARE @job varchar(30), @cars int

SELECT @job = 'Professional', @cars = 3;

My first query is a naïve attempt at retrieving the records. It extracts out each value we want to filter on as separate columns in a derived table. The outer query can then apply our filter:

WITH XMLNAMESPACES(DEFAULT 'https://schemas.microsoft.com/sqlserver/2004/07/adventure-works/IndividualSurvey')

SELECT count(*) FROM (

SELECT

[Demographics].value(N'/IndividualSurvey[1]/Occupation[1]', 'nvarchar(30)') AS [Occupation]

,[Demographics].value(N'/IndividualSurvey[1]/NumberCarsOwned[1]', 'integer') AS [NumberCarsOwned]

FROM [Sales].[Individual] i ) as A

WHERE A.Occupation = @job AND A.NumberCarsOwned = @cars

CPU	Reads
1,172	175,460

As you might expect, the derived table performs a full scan of the base table since there was no filter criteria. This next query moves the value() methods into the WHERE clause allowing for a single SELECT statement. However, it still treats each filter condition separately, extracting the value out of the xml and comparing it to the filter condition:

WITH XMLNAMESPACES(DEFAULT 'https://schemas.microsoft.com/sqlserver/2004/07/adventure-works/IndividualSurvey')

SELECT count(*) FROM Sales.Individual

WHERE Demographics.value('/IndividualSurvey[1]/Occupation', 'varchar(30)') = @job

AND

Demographics.value('/IndividualSurvey[1]/NumberCarsOwned', 'int') = @cars

CPU	Reads
1,203	173,999

These queries essentially have the same performance and looking at the show plan reveals that they are executed exactly the same (value.sqlplan). To put the 174k reads into perspective, a full table scan retrieving all columns in the table (including Demographics) completes with less than 5,000 reads. Clearly I must be doing something wrong!

If we look at the XQuery expressions themselves, you see there are no filters within the expressions. The XQuery expressions merely retrieve an element out of the xml and convert it to a SQL data type. The comparison of the xml value to the filter condition is being performed outside the context of the XQuery expression. This severely limits the effectiveness of the xml indexes that exist on the column.

SQL server chooses to perform a clustered index scan of the base table and probe into the primary XML index for each record to retrieve the value of the xml node. The probes to get the xml node values are leveraging the primary xml index which saves the server from having to parse the xml in each record, but this probing is costing us a lot of reads. We would like to see the xml index put to better use to keep from scanning the entire base table.

Cardinality

If we know that only 1 survey will be stored per record in the table, the XML data can be considered a document rather than a fragment. In addition, if we know the Occupation and NumberCarsOwned elements are also single occurrence, we can modify the queries slightly to indicate the known singleton aspect of our data.

WITH XMLNAMESPACES(DEFAULT 'https://schemas.microsoft.com/sqlserver/2004/07/adventure-works/IndividualSurvey')

SELECT count(*) FROM Sales.Individual

WHERE

[Demographics].value(N'(/IndividualSurvey/Occupation)[1]', 'nvarchar(30)') = @job

AND

[Demographics].value(N'(/IndividualSurvey/NumberCarsOwned)[1]', 'int') = @cars

CPU	Reads
906	111,032

This yields 60,000 fewer reads than before. The execution plan is very similar but a Compute Scalar task was taken out of each primary xml lookup (valueSingleton.sqlplan).

Anytime you have known singleton cardinality be sure to specify it in your queries. SQL Server can infer the cardinality from the schema (for typed xml), however if there are wildcard paths in your expressions SQL Server will no longer be able to infer cardinality. The cardinality of the root node can only be inferred from the column definition when assigning the schema (Document vs. Content).

The Demographics column was defined as Content which makes specifying cardinality in the query important. Obviously if we needed to store multiple surveys per record, we could not leverage the singleton cardinality optimization. Based on the IndividualSurvey schema used by the Demographics column, SQL Server knows the Occupation and NumberCarsOwned nodes are single occurrence, but it can’t perform any optimizations because the root node is not defined as single occurrence.

APPLY Operator

An interesting variation of our earlier derived table query involves using the nodes() method and the new CROSS APPLY operator:

WITH XMLNAMESPACES(DEFAULT 'https://schemas.microsoft.com/sqlserver/2004/07/adventure-works/IndividualSurvey')

SELECT count(*) FROM (

SELECT

[IndividualSurvey].ref.value(N'Occupation[1]', 'nvarchar(30)') AS [Occupation]

,[IndividualSurvey].ref.value(N'NumberCarsOwned[1]', 'integer') AS [NumberCarsOwned]

FROM [Sales].[Individual] i

CROSS APPLY i.Demographics.nodes(N'/IndividualSurvey') AS [IndividualSurvey](ref)) as A

WHERE A.Occupation = @job AND A.NumberCarsOwned = @cars

CPU	Reads
1,500	86,917

This query has a different execution plan and the measurements reflect this (crossapply.sqlplan). There is more CPU consumption, but more than a 50% reduction in the number of reads from the original derived table query. Part of this reduction is based on SQL Server’s use of the PATH xml index in a Merge join. However, this query is still scanning the base table and extracting out the xml values for each row.

Making It all Better

There are a couple issues with the queries we’ve looked at so far. First, where I placed a filter in the WHERE clause, I did so with separate XQuery expressions for each filter criteria and most importantly, the filter value was outside of the XQuery expression. As a result, when SQL Server processes the query, it chooses to do a clustered index scan on the base table followed by a lookup into the primary xml index for Occupation. For each of the remaining records, a second lookup is performed into the primary xml index for the NumberCarsOwned.

The second issue is where I used derived tables, there was no filter condition in the inner query which results in a full table scan on the base table.

If we want to try and place our filter criteria in the XQuery expression, we need to use a different method. Enter the exist() method. This is the preferred means of filtering XML data. It returns 1 indicating a match and 0 if no match is found. Now we can write a single XQuery expression that also contains our filter conditions:

WITH XMLNAMESPACES(DEFAULT 'https://schemas.microsoft.com/sqlserver/2004/07/adventure-works/IndividualSurvey')

SELECT count(*)

FROM sales.individual

WHERE Demographics.exist('(/IndividualSurvey[Occupation = sql:variable("@job") and NumberCarsOwned = sql:variable("@cars")])[1]') = 1

CPU	Reads
140	12,193

This query gives us the best performance so far. The execution plan shows the xml indexes are now the driving force behind the plan (exist.sqlplan). By placing the filter criteria inside the XQuery expression, the query processor is able generate a query plan that filters out records based on the xml indexes.

We’ve gone from over 170k reads down to 12k reads by following better practices to filter on xml data. As I already mentioned, a full table scan retrieving all column values is less than 5,000 reads. Accessing the xml data definitely has its cost. However, there is more we can do to improve this if we are willing to add computed columns to the table.

Promoted Columns

Since the data I’m looking for are simple values, it is easy to promote them to computed columns in the table. Doing so will allow me to place a traditional index on the computed columns. Functions.sql creates two user defined schema bound functions that return the xml node values we want. We can then use these functions to add two new columns to the Sales.Individual table:

ALTER TABLE Sales.Individual

ADD cOccupation AS dbo.udf_get_Individual_Occupation(Demographics),

cNumberCarsOwned AS dbo.udf_get_Individual_NumberCarsOwned(Demographics)

For performance comparison, I want to add two additional columns and extract the xml values into these columns so they are stored as regular varchar and int values:

ALTER TABLE Sales.Individual

ADD rOccupation varchar(30), rNumberCarsOwned int

UPDATE Sales.Individual

SET rOccupation = dbo.udf_get_Individual_Occupation(Demographics),

rNumberCarsOwned = dbo.udf_get_Individual_NumberCarsOwned(Demographics)

We can now run queries against these new columns:

--Computed Columns

SELECT count(*) FROM Sales.Individual WHERE cOccupation = @job and cNumberCarsOwned = @cars

--Regular Columns

SELECT count(*) FROM Sales.Individual WHERE rOccupation = @job and rNumberCarsOwned = @cars

No Index

Indexed

Computed columns

CPU: 4,203

CPU: 0

Regular columns

CPU: 31

CPU: 0

Without indexes on these new columns, both queries are effectively doing a full scan of the table (clustered index scan). Although we have xml indexes on the Demographics column, the computed columns query doesn’t use them. This is expected but perhaps not obvious. For each row, the xml data is being extracted out of the table and copied into the xml variable parameter of the function. The XQuery expression is being executed against the xml variable so the indexes on the xml column are irrelevant. The XQuery expression must parse the xml for each row and then search for the proper node resulting in very high CPU consumption.

For the indexed tests, there was no meaningful difference in performance between having a single multi column index and having separate indexes on each column. Be sure both columns are indexed, if not you will see much higher CPU utilization. This applies to any access to the column regardless of whether it is in a select clause or where clause. One exception to creating an index would be if you are going to do an exact match on the primary key of the table or some other highly restrictive filter criteria on relational data, you could get away without having the index since you would be accessing very few rows.

It is worth mentioning that having the index on the computed column will slow data modification operations that update the xml column since the value in the computed column will have to be recalculated and stored in the index.

Conclusions

I’ve talked about how to efficiently filter SQL queries based on xml data. The best performance comes from mapping data you want to filter on to computed columns and indexing the columns. However, this obviously can’t be performed for all xml values you might want to filter on. Our next best scenario is using the exist() method. With this method we can combine multiple filter conditions in one XQuery expression and we can also use the sql:variable() and sql:column() functions to give XML access to SQL relational values. This latter ability is important for maximizing the usefulness of the XML indexes.

There are many more ways to access xml data in SQL Server. I have focused on filtering query results based on xml data. Understanding the performance impacts of each technique should help you make better decisions on how to store and access your data.

References:

SQL Server BOL: XML Best Practices

SQL Server BOL: xml Data Type

XML Indexes in SQL Server 2005

https://www.microsoft.com/indonesia/msdn/xmlindexes.aspx

XML Best Practices for Microsoft SQL Server 2005

https://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql90/html/sql25xmlbp.asp

XQuery Implementation in a Relational Database System

https://www.vldb2005.org/program/paper/thu/p1175-pal.pdf

Indexing XML Data Stored in a Relational Database

www.vldb.org/conf/2004/IND5P2.PDF

files.zip