More on TOP

Last week I wrote about a special case of the TOP operator known as ROWCOUNT TOP.  This week I'll take a look at some other interesting TOP scenarios.  In general, TOP is a fairly mundane operator.  It simply counts and returns the specified number of rows.  SQL Server 2005 does include two enhancements to TOP that were not present in SQL Server 2000.  First, in SQL Server 2000, we can only specify an integer constant for the number of rows to return.  In SQL Server 2005, we can specify an arbitrary expression, including an expression containing T-SQL variables or parameters.  Second, SQL Server 2000 only permits TOP in a SELECT statement (though it does support ROWCOUNT TOP in INSERT, UPDATE, and DELETE statements).  SQL Server 2005 permits TOP with SELECT, INSERT, UPDATE, and DELETE statements.

In this post, I'm going to focus only on some simple examples involving SELECT statements.  Let's create a small table to get started:

CREATE TABLE T (A INT, B INT)
CREATE CLUSTERED INDEX TA ON T(A)

SET NOCOUNT ON
DECLARE @i INT
SET @i = 0
WHILE @i < 100
  BEGIN
    INSERT T VALUES (@i, @i)
    SET @i = @i + 1
  END
SET NOCOUNT OFF

The plan for the simplest possible TOP query should not need any explanation:

SELECT TOP 5 * FROM T

Rows   Executes
5      1        |--Top(TOP EXPRESSION:((5)))
5      1             |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[T].[TA]))

TOP is often used in conjunction with ORDER BY.  Combining TOP with ORDER BY adds determinism to the set of rows returned.  Without the ORDER BY, the set of rows returned depends on the query plan and may even vary from execution to execution.  If we have an index to provide order, we continue to get a simple query plan (notice the ORDERED FORWARD keywords):

SELECT TOP 5 * FROM T ORDER BY A

Rows   Executes
5      1        |--Top(TOP EXPRESSION:((5)))
5      1             |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[T].[TA]), ORDERED FORWARD)

If we do not have an index to provide order, SQL Server must sort the data:

SELECT TOP 5 * FROM T ORDER BY B

Rows   Executes
5      1        |--Sort(TOP 5, ORDER BY:([tempdb].[dbo].[T].[B] ASC))
100    1             |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[T].[TA]))

Note that without an index to find the top 5 rows, SQL Server must scan all 100 rows in the input table.  Also note that the sort is actually a "top sort."  A top sort generally uses less memory than a regular sort as it only needs to identify and sort the top few rows rather than sorting the entire input.

Now let's consider what happens if we request the top 5% of rows.  To figure out the actual number of rows to return, SQL Server must count all of the rows and calculate 5%.  This makes queries that use TOP PERCENT less efficient than queries that use TOP with an absolute row count.

SELECT TOP 5 PERCENT * FROM T

Rows   Executes
5      1        |--Top(TOP EXPRESSION:((5.000000000000000e+000)) PERCENT)
5      1             |--Table Spool
100    1                 |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[T].[TA]))

Like the prior example, SQL Server must scan all 100 rows in the input table.  In this example, SQL Server uses an eager spool which reads and counts all of the input rows before returning any rows.  The top then requests the row count from the spool, calculates 5%, and proceeds like any other top.

If SQL Server must sort, the sort can also provide the count of input rows.  However, only a regular sort can provide this count.  A top sort must know the number of rows to return from the start.

SELECT TOP 5 PERCENT * FROM T ORDER BY B

Rows   Executes
5      1        |--Top(TOP EXPRESSION:((5.000000000000000e+000)) PERCENT)
5      1             |--Sort(ORDER BY:([tempdb].[dbo].[T].[B] ASC))
100    1                  |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[T].[TA]))

TOP WITH TIES is also incompatible with top sort.  TOP WITH TIES cannot know for certain how many rows will be returned until the number of ties is determined.  For this example, let's add a tie for the fifth row returned:

INSERT T VALUES (4, 4)

SELECT TOP 5 WITH TIES * FROM T ORDER BY B

Rows   Executes
6      1        |--Top(TOP EXPRESSION:((5)))
7      1             |--Sort(ORDER BY:([tempdb].[dbo].[T].[B] ASC))
101    1                  |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[T].[TA]))

Although the above plan does not seem to reflect that we have a TOP WITH TIES, the argument column of SHOWPLAN_ALL or STATISTICS PROFILE includes this information:  "TIE COLUMNS:([T].[B])".  This information is also available in the graphical and XML plans on SQL Server 2005.  Note that the TOP actually returns one extra row.  When a TOP N WITH TIES reaches the Nth row, it keeps a copy of the tie columns for this row (in this example B==4) and compares each subsequent row to that row.  As long as there are rows that match, it keeps returning them.  Because the top must compare subsequent rows until it finds a non-match to the Nth row, the top retrieves one more row from the sort than it returns.

Finally, there are a couple of special cases.  The optimizer knows that TOP 0 and TOP 0 PERCENT never return any rows and replaces any query plan with a TOP 0 with a constant scan:

SELECT TOP 0 * FROM T

  |--Constant Scan

The optimizer also knows that TOP 100 PERCENT always returns all rows and removes the top operator from the query plan:

SELECT TOP 100 PERCENT * FROM T

  |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[T].[TA]))

Both of these optimizations require that the number of rows is a constant.  For example, using an expression that includes a T-SQL variable or parameter will prevent the optimization.  Both optimizations also work with INSERT, UPDATE, and DELETE statements.

Note that using TOP to work around SQL language restrictions on the use of ORDER BY in sub-selects or in views or to force specific plan evaluation orders is not recommended.

Comments

  • Anonymous
    August 09, 2007
    The comment has been removed

  • Anonymous
    August 15, 2007
    The table spool does create a copy of the table though it only copies those columns that are specified in the SELECT list of the query.  In some cases, this example is likely one of those cases, it might be faster to perform two clustered index scans.  However, if the table has many large columns that are not output, the spool could actually make the query faster by eliminating unnecessary columns from the second scan and, thus, reducing the number of I/Os.  More importantly, if the input to the spool is a complex operation such as an aggregation or a join, it could be much faster to spool the results than to recompute them.

  • Anonymous
    November 18, 2007
    The comment has been removed

  • Anonymous
    November 26, 2007
    The comment has been removed

  • Anonymous
    July 26, 2010
    Hi there is an issue I am facing with TOP Clause. Please guide me on this. The scenario is: In my Product table I have a ProductID & ProductPrice column. ProductID column has unique values while in ProductPrice column some values/data are similar. The issue is when using top 1 the query returns the second row and not the first one. Query Select ProductID from dbo.Product order by ProductPrice select top 1 ProductID from dbo.Product order by ProductPrice The result set is different. Why?

  • Anonymous
    July 27, 2010
    SQL Server uses a full sort for the first query and a top sort for the second query.  Top sort is faster than a full sort when you only need a small number of rows.  However, the two operators can (as you've observed) produce different orders for duplicate rows.  You can resolve this issue by either adding a second column to the sort (e.g., ... order by ProductPrice, ProductID) or by requesting duplicates from the top operation (e.g., select top 1 with ties Product ID from ...).  Note that using top with ties will force a full sort (which may hurt performance) since the actual number of rows needed is unknown.

  • Anonymous
    July 27, 2010
    Thank you Craig for the reply. It means that the query is not able to distinguish the rows having same data for the ordered column. So every time making a query, I always need to bear in mind that both the queries result set will be different :) & query accordingly. Does it make sense from the programmers point of view? As they can't always predict the data whether it  will be same or not. In order to make the query faster why did they forgot that one should receive the correct data/result set evenif using any sql clauses . I am grateful to you for making the point clear to me. If possible can you send me some link or document that explains exactly how the TOP Clause along with Order By clause works internally. Thanks

  • Anonymous
    August 03, 2010
    The comment has been removed