Troubleshoot queries that seem to never end in SQL Server

This article describes the troubleshooting steps for the issue where you have a query that seems to never complete, or getting it to complete may take many hours or days.

What is a never-ending query?

This document focuses on queries that continue to execute or compile, that is, their CPU continues to increase. It doesn't apply to queries that are blocked or waiting on some resource that is never released (the CPU remains constant or changes very little).

Important

If a query is left to finish its execution, it will eventually complete. It could take just a few seconds, or it could take several days.

The term never-ending is used to describe the perception of a query not completing when in fact, the query will eventually complete.

Identify a never-ending query

To identify whether a query is continuously executing or stuck on a bottleneck, follow these steps:

  1. Run the following query:

    DECLARE @cntr int = 0
    
    WHILE (@cntr < 3)
    BEGIN
        SELECT TOP 10 s.session_id,
                        r.status,
                        r.wait_time,
                        r.wait_type,
                        r.wait_resource,
                        r.cpu_time,
                        r.logical_reads,
                        r.reads,
                        r.writes,
                        r.total_elapsed_time / (1000 * 60) 'Elaps M',
                        SUBSTRING(st.TEXT, (r.statement_start_offset / 2) + 1,
                        ((CASE r.statement_end_offset
                            WHEN -1 THEN DATALENGTH(st.TEXT)
                            ELSE r.statement_end_offset
                        END - r.statement_start_offset) / 2) + 1) AS statement_text,
                        COALESCE(QUOTENAME(DB_NAME(st.dbid)) + N'.' + QUOTENAME(OBJECT_SCHEMA_NAME(st.objectid, st.dbid)) 
                        + N'.' + QUOTENAME(OBJECT_NAME(st.objectid, st.dbid)), '') AS command_text,
                        r.command,
                        s.login_name,
                        s.host_name,
                        s.program_name,
                        s.last_request_end_time,
                        s.login_time,
                        r.open_transaction_count,
                        atrn.name as transaction_name,
                        atrn.transaction_id,
                        atrn.transaction_state
            FROM sys.dm_exec_sessions AS s
            JOIN sys.dm_exec_requests AS r ON r.session_id = s.session_id 
                    CROSS APPLY sys.Dm_exec_sql_text(r.sql_handle) AS st
            LEFT JOIN (sys.dm_tran_session_transactions AS stran 
                 JOIN sys.dm_tran_active_transactions AS atrn
                    ON stran.transaction_id = atrn.transaction_id)
            ON stran.session_id =s.session_id
            WHERE r.session_id != @@SPID
            ORDER BY r.cpu_time DESC
    
        SET @cntr = @cntr + 1
    WAITFOR DELAY '00:00:05'
    END
    
  2. Check the sample output.

    • The troubleshooting steps in this article are specifically applicable when you notice an output similar to the following one where the CPU is increasing proportionately with the elapsed time, without significant wait times. It's important to note that changes in logical_reads aren't relevant in this case as some CPU-bound T-SQL requests might not do any logical reads at all (for example performing computations or a WHILE loop).

      session_id status cpu_time logical_reads wait_time wait_type
      56 running 7038 101000 0 NULL
      56 runnable 12040 301000 0 NULL
      56 running 17020 523000 0 NULL
    • This article isn't applicable if you observe a wait scenario similar to the following one where the CPU doesn't change or changes very slightly, and the session is waiting on a resource.

      session_id status cpu_time logical_reads wait_time wait_type
      56 suspended 0 3 8312 LCK_M_U
      56 suspended 0 3 13318 LCK_M_U
      56 suspended 0 5 18331 LCK_M_U

    For more information, see Diagnose waits or bottlenecks.

Long compilation time

On rare occasions, you might observe that the CPU is increasing continuously over time but that's not driven by query execution. Instead, it could be driven by an excessively long compilation (the parsing and compiling of a query). In those cases, check the transaction_name output column and look for a value of sqlsource_transform. This transaction name indicates a compilation.

Collect diagnostic data

To collect diagnostic data by using SQL Server Management Studio (SSMS), follow these steps:

  1. Capture the estimated query execution plan XML.

  2. Review the query plan to see if there are any obvious indications of where the slowness can come from. Typical examples include:

    • Table or index scans (look at estimated rows).
    • Nested loops driven by a huge outer table data set.
    • Nested loops with a large branch in the inner side of the loop.
    • Table spools.
    • Functions in the SELECT list that take a long time to process each row.
  3. If the query runs fast at any time, you can capture the "fast" executions Actual XML Execution Plan to compare.

Method to review the collected plans

This section will illustrate how to review the collected data. It will use the multiple XML query plans (using extension *.sqlplan) collected in SQL Server 2016 SP1 and later builds and versions.

Follow these steps to compare execution plans:

  1. Open a previously saved query execution plan file (.sqlplan).

  2. Right-click in a blank area of the execution plan and select Compare Showplan.

  3. Choose the second query plan file that you would like to compare.

  4. Look for thick arrows that indicate a large number of rows flowing between operators. Then select the operator before or after the arrow, and compare the number of actual rows across two plans.

  5. Compare the second and third plans to see if the largest flow of rows happens in the same operators.

    Here's an example:

    Compare query plans in SSMS.

Resolution

  1. Ensure that statistics are updated for the tables used in the query.

  2. Look for a missing index recommendation in the query plan and apply any.

  3. Rewrite the query with the goal to simplify it:

    • Use more selective WHERE predicates to reduce the data processed up-front.
    • Break it apart.
    • Select some parts into temp tables, and join them later.
    • Remove TOP, EXISTS, and FAST (T-SQL) in the queries that run for a very long time due to optimizer row goal. Alternatively, you can use the DISABLE_OPTIMIZER_ROWGOAL hint. For more information, see Row Goals Gone Rogue.
    • Avoid using Common Table Expressions (CTEs) in such cases as they combine statements into a single big query.
  4. Try using query hints to produce a better plan:

    • HASH JOIN or MERGE JOIN hint
    • FORCE ORDER hint
    • FORCESEEK hint
    • RECOMPILE
    • USE PLAN N'<xml_plan>' if you have a fast query plan that you can force
  5. Use Query Store (QDS) to force a good known plan if such a plan exists and if your SQL Server version supports Query Store.

Diagnose waits or bottlenecks

This section is included here as a reference in case your issue isn't a long-running CPU driving query. You can use it to troubleshoot queries that are long due to waits.

To optimize a query that's waiting on bottlenecks, identify how long the wait is and where the bottleneck is (the wait type). Once the wait type is confirmed, reduce the wait time or eliminate the wait completely.

To calculate the approximate wait time, subtract the CPU time (worker time) from the elapsed time of a query. Typically, the CPU time is the actual execution time, and the remaining part of the lifetime of the query is waiting.

Examples of how to calculate approximate wait duration:

Elapsed Time (ms) CPU Time (ms) Wait time (ms)
3200 3000 200
7080 1000 6080

Identify the bottleneck or wait

  • To identify historical long-waiting queries (for example, >20% of the overall elapsed time is wait time), run the following query. This query uses performance statistics for cached query plans since the start of SQL Server.

    SELECT t.text,
             qs.total_elapsed_time / qs.execution_count
             AS avg_elapsed_time,
             qs.total_worker_time / qs.execution_count
             AS avg_cpu_time,
             (qs.total_elapsed_time - qs.total_worker_time) / qs.execution_count
             AS avg_wait_time,
             qs.total_logical_reads / qs.execution_count
             AS avg_logical_reads,
             qs.total_logical_writes / qs.execution_count
             AS avg_writes,
             qs.total_elapsed_time
             AS cumulative_elapsed_time
    FROM sys.dm_exec_query_stats qs
             CROSS apply sys.Dm_exec_sql_text (sql_handle) t
    WHERE (qs.total_elapsed_time - qs.total_worker_time) / qs.total_elapsed_time
             > 0.2
    ORDER BY qs.total_elapsed_time / qs.execution_count DESC
    
  • To identify currently executing queries with waits longer than 500 ms, run the following query:

    SELECT r.session_id, r.wait_type, r.wait_time AS wait_time_ms
    FROM sys.dm_exec_requests r 
       JOIN sys.dm_exec_sessions s ON r.session_id = s.session_id 
    WHERE wait_time > 500
    AND is_user_process = 1
    
  • If you can collect a query plan, check the WaitStats from the execution plan properties in SSMS:

    1. Run the query with Include Actual Execution Plan on.
    2. Right-click the left-most operator in the Execution plan tab
    3. Select Properties and then WaitStats property.
    4. Check the WaitTimeMs and WaitType.
  • If you're familiar with PSSDiag/SQLdiag or SQL LogScout LightPerf/GeneralPerf scenarios, consider using either of them to collect performance statistics and identify waiting queries on your SQL Server instance. You can import the collected data files and analyze the performance data with SQL Nexus.

References to help eliminate or reduce waits

The causes and resolutions for each wait type vary. There's no one general method to resolve all wait types. Here are articles to troubleshoot and resolve common wait type issues:

For descriptions of many Wait types and what they indicate, see the table in Types of Waits.