Tipping point Nested loop join and hash join ?

sakuraime 2,321 Reputation points
2020-09-03T05:41:16.94+00:00

are there any tipping point for the query optimizer to choose between Nested loop join and hash join

SQL Server
SQL Server
A family of Microsoft relational database management and analysis systems for e-commerce, line-of-business, and data warehousing solutions.
12,987 questions
0 comments No comments
{count} votes

Accepted answer
  1. m 4,271 Reputation points
    2020-09-03T08:59:17.253+00:00

    Hi @sakuraime ,

    are there any tipping point for the query optimizer to choose between Nested loop join and hash join?

    In fact, SQL Server has three Join methods, Nested Loops Join, Merge Join and Hash Join. Of these three methods, no one is always the best, but they all have their most suitable context. SQL Server will select the most appropriate connection method according to the table structure on which the two result sets are based and the size of the result set. Of course, the user can also specify the Join method in the statement. SQL Server will try to respect your choice. (Some queries may not have an execution plan according to the specified Join method, and SQL Server will report an error.)
    Let us first look at a set of comparisons:
    22406-20200903joincomparison.jpg

    Next,let's compare Nested Loops Join and Hash Join in detail:

    Nested Loops Join is a basic connection method. It does not require SQL Server to establish another data structure for the connection, so it saves memory space and does not need to use Tempdb space. It applies to a wide range of Join types. Some connections cannot be done by Merge and Hash, but Nexted Loops can be done, so the advantages of this connection method are obvious, but the disadvantages are also obvious.

    1. The complexity of the algorithm is equal to the Inner Table multiplied by the Outer Table
    2. The data set of Outer Table should be sorted in advance in order to improve retrieval efficiency
    3. It is better to have an index on the Inner Table to support retrieval

    In short, Nested Loops Join has the highest efficiency for a relatively small data set, so it is widely used in SQL Server. When SQL Server finds that it can choose a small data set as the Outer Table, it often chooses Nested Loops, which has better performance, but Nested Loops Join is too sensitive to the size of the data set. If SQL Server predicts an error and uses a large data set as an Outer Table, performance will drop sharply. Many statement performance problems are caused by this.

    The advantages of Hash Join are obvious:

    1. Its algorithm complexity is to traverse the data sets on both sides separately.
    2. It does not require the data set to be sorted in advance, nor does it require an index on it
    3. It can be easily upgraded to a parallel execution plan using multiple processors

    In short, Hash Join is suitable for the case where the data set to be joined is relatively large, and there is no suitable index above. However, Hash Join is not an optimal Join algorithm, but SQL Server is not optimized for input (Join's data set is relatively large, or
    There is no suitable index above) a last resort option. This is because Hash Join is the most resource-consuming Join algorithm. Before it does Join, it must first create a Hash table in memory. The creation process requires CPU resources, and the Hash table needs to be stored in memory or Tempdb. The Join process also uses CPU resources to calculate ("Probe"). If many users are using the Hash algorithm to join at the same time, it will affect the overall SQL Server
    The burden is relatively heavy. From the perspective of reducing the overall load of SQL Server, it is still necessary to reduce the size of the data set input by Join as much as possible, with an appropriate index, and guide SQL Server to use Nested Loops Join or Merge Join as much as possible.

    More information: understanding-sql-server-physical-joins

    If the reply is helped, please do “Accept Answer”.
    BR,
    Mia

    0 comments No comments

2 additional answers

Sort by: Most helpful
  1. Chang, Joe 111 Reputation points
    2020-09-03T12:58:33.727+00:00

    Preliminary: in a Loop Join, the join condition can be used as a search argument to the inner source. In Hash and Merge, the sources are processed separately. In the situation that a SARG is specified for only one table, and an index exists on the Join column of the second table, then a Loop join could potentially have a more efficient plan than Hash and Merge, for which a scan of the second table is required.
    Main: there are formulas behind the cost of operations in a SQL Server execution plan, example: Index Seek/Scan initial/first row, then subsequent rows and pages.
    There are also formula's for the cost of Hash and Merge joins. Note the hash join cost depends on the size of the rows it must keep, and depending on SQL Server target memory, it may spill to tempdb, affecting plan cost. Parallelism is another factor.
    I have partial models of the cost model in general here : http://qdpma.com/CBO/SQLServerCostBasedOptimizer.html,
    and for joins: http://qdpma.com/CBO/CBO05_Joins.html
    Also, search for Paul White (page free space) https://www.sql.kiwi/ and Benjamin Nevarez http://www.benjaminnevarez.com/

    0 comments No comments

  2. m 4,271 Reputation points
    2020-09-04T01:34:47.937+00:00

    Hi @sakuraime ,

    Is the reply helpful?

    If the reply is helped,please do "Accept Answer".
    BR,
    Mia

    0 comments No comments