Presorting the Data

Article
04/05/2007

This article may contain URLs that were valid when originally published, but now link to sites or pages that no longer exist. To maintain the flow of the article, we've left these URLs in the text, but disabled the links.

Inside SQL Server
Tuning of a Different Sort
Use indexes to retrieve sorted data more efficiently

Kalen Delaney

Learning how SQL Server indexes are physically organized and how they can help speed up data access in a table is the most important step in understanding query tuning. By adding a good index to a large table that previously had no useful index for a query, you can obtain one of the largest gains possible for any type of SQL Server tuning. If a large table, with tens of thousands of pages or more, has no useful index, SQL Server must scan the entire table to find the data your query needs, even if you're interested in only one row.

Until SQL Server finishes searching the entire table, it doesn't know how many rows qualify. With a useful index, SQL Server might need to access only a handful of pages as it traverses each index level. The difference between accessing tens of thousands of pages and accessing only a handful of pages makes indexes a powerful tuning tool.

But even when you're concerned with data in just one table, you need to address queries other than those that search for rows meeting a certain condition. Sometimes, queries involve organizing the data in a table. Queries involving the GROUP BY, DISTINCT, ORDER BY, and TOP clauses fall into this category. Technically, ORDER BY and TOP apply to the resultset a query generates, not to the table data itself; however, for query-tuning purposes, that distinction isn't relevant.

Let's look first at ORDER BY, which requests data from SQL Server in a particular sorted order. One reason to look at ORDER BY first is that you can apply the same principles that let you tune queries that involve sorting to queries that involve the other organizational clauses I mentioned: GROUP BY, DISTINCT, and TOP. Although ORDER BY applies to a SELECT statement's resultset, SQL Server can sometimes use indexes on the data to help it retrieve the rows in the order you want them so that it doesn't need to further sort the resultset.

One obvious way that you can use indexes to make retrieving sorted data more efficient is to create a clustered index on the column that you're going to order by. When you create a clustered index, SQL Server reorganizes the data pages so that the rows are logically stored in clustered-index order. SQL Server doesn't necessarily store the data physically on the disk in clustered-index order, but while creating an index, SQL Server attempts to physically order the data as close to the logical order as possible. Each page in an index's leaf level has a pointer to the page that logically precedes the current page and to the page that logically follows the current page, thereby creating a doubly linked list. The sysindexes table contains the address of the first leaf-level page. Because the data is guaranteed to be logically in clustered-index order, SQL Server can just start at the first page and follow the index pointers from one page to the next to retrieve the data in order.

Now let's look at a query plan involving a sort on a clustered-index column. In the Northwind database, the Customers table has a clustered index on the CustomerID column. (SQL Server automatically created the index when the primary key was defined on the CustomerID column.) If I select all the rows from the Customers table and sort by the CustomerID column, I might expect SQL Server to use the clustered index to retrieve the rows already sorted in the correct order. After running SET SHOWPLAN_TEXT ON, run the following query:

  
--Query 1:
SELECT * FROM Customers
ORDER BY CustomerID

Here's the SHOWPLAN output, which tells you that SQL Server scanned the clustered index to retrieve the data, just as you'd expect:

  
StmtText
-----------------------------------------  
|--Clustered Index Scan(OBJECT:
([Northwind].[dbo].[Customers].
[PK_Customers]), ORDERED FORWARD)

For comparison, let's execute Query 2, which requests the data sorted by a column that doesn't have an index—Country:

  
--Query 2:
SELECT * FROM Customers
ORDER BY Country

The SHOWPLAN output includes the extra step of sorting the data, so you can see that the query is more efficient if a clustered index already exists on the sort column:

  
StmtText
--------------------------------------------
|--Sort(ORDER BY:([Customers].[Country] ASC))
    |--Clustered Index Scan(OBJECT:([Northwind]
.[dbo].[Customers].[PK_Customers]))

Note that the query plan for the first query includes the ORDERED FORWARD option, indicating that the index must be scanned in a particular logical order. The second query plan doesn't include that option; the order in which SQL Server scans the clustered index is irrelevant. The only important occurrence is that all the key values are accessed so that they can then be sorted.

Turn It Around

You're probably aware that T-SQL's ORDER BY clause lets you request that the data be returned sorted in descending order, as Query 3 illustrates:

  
--Query 3:
SELECT * FROM Customers
ORDER BY CustomerID DESC

SQL Server can use the same clustered index to retrieve the data in both ascending and descending order: Because the pages at the leaf level are stored in a doubly linked list, SQL Server can simply traverse the pages from last to first. The SHOWPLAN output for Query 3 is almost identical to the plan for Query 1:

  
StmtText
--------------------------------------------
  |--Clustered Index Scan(OBJECT:([Northwind]
.[dbo].[Customers].[PK_Customers]), ORDERED BACKWARD)

The only difference is that the clustered index scan for Query 3 is a BACKWARD scan.

Here's a query-tuning trick you can use to reduce the initial wait time for queries that sort data on a nonclustered-index key. Generally, you shouldn't use index hints in production applications if you can avoid them. In fact, in most cases, the need to use a hint might be considered a bug in the SQL Server query optimizer. However, for the case of sorting data on a nonclustered index key, you can safely use an index hint. Using the hint doesn't mean that the optimizer isn't working right, and you don't have to be concerned that the hint might no longer be necessary or beneficial after a service pack "fixes" the optimizer. One form of index hint is called FASTFIRSTROW; you can use it as a table hint in your FROM clause. This hint tells the optimizer to use a plan involving a nonclustered index on the sort column. To use the nonclustered index, SQL Server traverses the index's leaf level, retrieving each index key in sorted order; then it follows the bookmark for every index key to find the corresponding data row. (If no such index exists, SQL Server ignores the hint.) In the Order Details table, this process entails 2155 bookmark-lookup operations. The first few rows return very quickly, so SQL Server can begin to fill in your user's drop-down list. However, the time to complete the sort is much longer than if you didn't use the hint.

You can use the SET STATISTICS TIME ON option to see the difference between using and not using FASTFIRSTROW. The code in Listing 1 turns on this SET option, then executes Query 6 both with and without the FASTFIRSTROW hint. On my Toshiba Tecra 8000 laptop computer (with 256MB of RAM, running SQL Server 2000 Enterprise Edition on Windows 2000 Server), the first query (without the hint) used 30 milliseconds (ms) of CPU time. With the hint, the second query used 70ms of CPU time—more than twice as long as the first query. This dataset is so small that I didn't notice the delay in returning the first row when I didn't use the hint. However, when I've run similar queries on tables with just 20,000 rows, the delay before any data is returned was noticeable when I didn't use the FASTFIRSTROW hint; I got no delay with the hint.

When you have a nonclustered index on the sort column, you can choose whether to use the hint. You decide what's more important to you: Do you want the sort to take the least amount of total processing time, or do you want the sorted data to start coming back to the client as quickly as possible? In other words, do you want to minimize initial response time, or maximize throughput?

I mentioned that using the FASTFIRSTROW hint was just one form of the hint to control SQL Server's use of a nonclustered index on a sort column. Another form of this hint, using the OPTION clause with the FAST N hint, lets you specify how many rows (n) SQL Server should return as quickly as possible. For example, using the hint OPTION (FAST 10) tells SQL Server to return the first 10 rows as quickly as possible, then access the rest of the rows in a way that maximizes throughput.

Both clustered and nonclustered indexes can be useful for sorting more efficiently. If sorting is an important part of your applications' data-management operations, you need to consider those operations carefully when deciding the optimal indexes to build on your tables. Next time, I look at indexing for other organizational operations, such as TOP, GROUP BY, and DISTINCT.

Bugs, comments, suggestions | Legal | Privacy | Advertising

Presorting the Data

Turn It Around

Additional resources