OPTIMIZED Nested Loops Joins

In my past two posts, I explained how SQL Server may add a sort to the outer side of a nested loops join and showed how this sort can significantly improve performance.  In an earlier post, I discussed how SQL Server can use random prefetching to improve the performance of a nested loops join.  In this post, I'm going to explore one more nested loops join performance feature.  I'll use the same database that I used in my two prior posts.  Let's start with the following simple query:

SELECT SUM(Data)
FROM T
WHERE RandKey < 1000

  |--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1011]=(0) THEN NULL ELSE [Expr1012] END))
       |--Stream Aggregate(DEFINE:([Expr1011]=COUNT_BIG([T].[Data]), [Expr1012]=SUM([T].[Data])))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T].[PK], [Expr1010]) OPTIMIZED WITH UNORDERED PREFETCH)
                 |--Index Seek(OBJECT:([T].[IRandKey]), SEEK:([T].[RandKey] < (1000)) ORDERED FORWARD)
                 |--Clustered Index Seek(OBJECT:([T].[PK__T__...]), SEEK:([T].[PK]=[T].[PK]) LOOKUP ORDERED FORWARD)

Notice that the nested loops join includes an extra keyword: OPTIMIZED.  This keyword indicates that the nested loops join may try to reorder the input rows to improve I/O performance.  This behavior is similar to the explicit sorts that we saw in my two previous posts, but unlike a full sort it is a best-effort reordering.  That is, the results from an optimized nested loops join may not be (and in fact are highly unlikely to be) fully sorted.

SQL Server only uses an optimized nested loops join when the optimizer concludes, based on its cardinality and cost estimates, that a sort is most likely not required but that there is still a possibility that a sort could be helpful in the event that the cardinality or cost estimates are incorrect.  In other words, an optimized nested loops join may be thought of as a "safety net" for those cases where SQL Server chooses a nested loops join but would have done better with an alternative plan such as a full scan or a nested loops join with an explicit sort.  For the above query, which only joins a few rows, the optimization is unlikely to have any impact at all.
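
This property is also visible in the XML form of the plan, where the Nested Loops operator carries an Optimized attribute.  As a rough illustration, the following sketch searches the plan cache for plans that contain an optimized nested loops join.  The DMVs and the showplan namespace are standard; treat the XQuery predicate as an approximation that you may need to adjust for your own plans:

WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT cp.plan_handle, qp.query_plan
FROM sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp
-- Nested loops joins shown with the OPTIMIZED keyword in the text plan appear
-- in the showplan XML as NestedLoops operators with an Optimized attribute.
WHERE qp.query_plan.exist('//NestedLoops[@Optimized="true"]') = 1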

Let's look at an example where the optimization actually helps:

SELECT SUM(Data)
FROM T
WHERE RandKey < 100000000 AND
    Flags & 0x1 = 0x1 AND
    Flags & 0x2 = 0x2 AND
    Flags & 0x4 = 0x4 AND
    Flags & 0x8 = 0x8

  |--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1014]=(0) THEN NULL ELSE [Expr1015] END))
       |--Stream Aggregate(DEFINE:([Expr1014]=COUNT_BIG([T].[Data]), [Expr1015]=SUM([T].[Data])))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T].[PK], [Expr1013]) OPTIMIZED WITH UNORDERED PREFETCH)
                 |--Index Seek(OBJECT:([T].[IRandKey]), SEEK:([T].[RandKey] < (100000000)),  WHERE:(([T].[Flags]&(1))=(1) AND ([T].[Flags]&(2))=(2) AND ([T].[Flags]&(4))=(4) AND ([T].[Flags]&(8))=(8)) ORDERED FORWARD)
                 |--Clustered Index Seek(OBJECT:([T].[PK__T__...]), SEEK:([T].[PK]=[T].[PK]) LOOKUP ORDERED FORWARD)

The Flags column contains the value 0xFF in every row.  Thus, every one of the bitwise AND predicates evaluates to true and this query returns about 2.5 million rows or 10% of the table.  Ordinarily, when faced with a query like this one, SQL Server would resort to a sequential scan of the entire table.  Indeed, if you try this query without the extra bitwise filters, you will get a sequential scan.  However, SQL Server does not realize that these predicates are always true, estimates a much lower cardinality of fewer than 10,000 rows, and chooses a simple nested loops join plan.  Note that I would generally recommend against using predicates like these in a real-world application precisely because they will lead to cardinality estimation errors and poor plans.
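
If you want to see the misestimate for yourself, one option is to capture the runtime plan and compare the estimated and actual row counts on the index seek.  This is just a sketch; it actually runs the query, so expect it to take a while on a cold cache:

SET STATISTICS XML ON

SELECT SUM(Data)
FROM T
WHERE RandKey < 100000000 AND
    Flags & 0x1 = 0x1 AND
    Flags & 0x2 = 0x2 AND
    Flags & 0x4 = 0x4 AND
    Flags & 0x8 = 0x8

SET STATISTICS XML OFF

-- In the returned showplan XML, compare the EstimateRows attribute on the
-- Index Seek operator with the ActualRows value reported under RunTimeInformation.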

To see what effect the optimized nested loops join has, let's compare the above plan with an "un-optimized" nested loops join.  We can eliminate the optimization by using the following UPDATE STATISTICS statement to trick SQL Server into believing that the table is very small:

UPDATE STATISTICS T WITH ROWCOUNT = 1, PAGECOUNT = 1

I'll compare the above query with the following simpler query which uses essentially the same plan and touches the same data but has an "un-optimized" nested loops join:

SELECT SUM(Data)
FROM T WITH (INDEX (IRandKey))
WHERE RandKey < 100000000

  |--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1009]=(0) THEN NULL ELSE [Expr1010] END))
       |--Stream Aggregate(DEFINE:([Expr1009]=COUNT_BIG([T].[Data]), [Expr1010]=SUM([T].[Data])))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T].[PK]))
                 |--Index Seek(OBJECT:([T].[IRandKey]), SEEK:([T].[RandKey] < (100000000)) ORDERED FORWARD)
                 |--Clustered Index Seek(OBJECT:([T].[PK__T__...]), SEEK:([T].[PK]=[T].[PK]) LOOKUP ORDERED FORWARD)

We can reset the statistics using the following statement:

UPDATE STATISTICS T WITH ROWCOUNT = 25600000, PAGECOUNT = 389323

As in my last post, I'm going to simulate a larger table by reducing the memory available to the server to 1 GByte with SP_CONFIGURE 'MAX SERVER MEMORY', and I'm also going to flush the buffer pool between runs with DBCC DROPCLEANBUFFERS.

Note that you will NOT want to run these statements on a production server.
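
For reference, the setup described above amounts to something like the following.  This is only a sketch; the 1024 MB value matches the 1 GByte figure used for this experiment, and 'MAX SERVER MEMORY' is an advanced option, so it must be exposed with 'SHOW ADVANCED OPTIONS' first:

-- Again, do NOT run this on a production server.
EXEC SP_CONFIGURE 'SHOW ADVANCED OPTIONS', 1
RECONFIGURE

-- Reduce the memory available to the server to roughly 1 GByte.
EXEC SP_CONFIGURE 'MAX SERVER MEMORY', 1024
RECONFIGURE

-- Flush the buffer pool before each test run.
DBCC DROPCLEANBUFFERS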

I ran both of the above queries with three different constants.  Here are my results.  Keep in mind that these results depend greatly on the specific hardware.  If you try this experiment, your results may vary.

Constant                      OPTIMIZED Execution Time    "un-OPTIMIZED" Execution Time    Increase
10,000,000 (1% of rows)       6.5 minutes                 26 minutes                       4x
100,000,000 (10% of rows)     10.4 minutes                4.3 hours                        25x
250,000,000 (25% of rows)     11.3 minutes                10.6 hours                       56x

Clearly the optimized nested loops join can have a huge impact on performance.  Moreover, as the plan touches more rows the benefit of the optimization grows dramatically.  Although a full scan or a nested loops join with an explicit sort would be faster, the optimized nested loops join really is a safety net protecting against a much worse alternative.
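
If you want to reproduce these measurements on your own hardware, SET STATISTICS TIME and SET STATISTICS IO are a convenient way to capture the elapsed time and read counts for each run.  A minimal sketch, using the simpler hinted query from above as the test query:

SET STATISTICS TIME ON
SET STATISTICS IO ON

SELECT SUM(Data)
FROM T WITH (INDEX (IRandKey))
WHERE RandKey < 100000000

SET STATISTICS TIME OFF
SET STATISTICS IO OFF

-- The elapsed time and the logical and physical read counts are reported
-- on the Messages tab.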

Comments

  • Anonymous
    June 26, 2012
    Hi, I would like to point out that the merge join algorithm is not efficient.  When the inner table is accessed, the scan starts at the first row and continues until it finds a join key value greater than the join key value from the outer table; as soon as the join key value exceeds the outer table's join key, the scan stops, which is an efficient way to end the scan.  However, the scan always starts at the first row (assuming there are no other SARGs that cause the scan to start somewhere else or cause another index to be used).  This part can be very inefficient.  For example, if we have a clustered index on id and the outer table has a minimum value of 100000 and a maximum value of 105000, then the scan of the inner table, which has a clustered index on id, should start at 100000 and end at 105000.  Currently that is not happening: it starts at 1 and finishes at 105000, so the rows from 1 to 99999 are uselessly processed.  This wastes CPU and causes more logical reads.  I have used an alternate method to make sure that the scan starts at 100000 instead of 1.  You will find the query and more details at the link below.  gullimeelsqlsybase.wordpress.com/.../improved-merge-join-algorithm-in-sql-server-2008  Thanks

  • Anonymous
    December 31, 2012
    @Steve I was curious for more details about the implementation too.  It is of course possible to infer some general idea by trying different values of @i and looking at the actual number of rows in the plan.
    DECLARE @Data INT, @i INT = 1
    SELECT TOP (@i) @Data = Data
    FROM T
    WHERE RandKey < 100000000 AND
        Flags & 0x1 = 0x1 AND
        Flags & 0x2 = 0x2 AND
        Flags & 0x4 = 0x4 AND
        Flags & 0x8 = 0x8
    OPTION (MAXDOP 1, OPTIMIZE FOR (@i = 10000))