Troubleshoot a query that shows a significant performance difference between two servers

Článok
06/20/2023

Applies to: SQL Server

This article provides troubleshooting steps for a performance issue where a query runs slower on one server than on another server.

Symptoms

Assume there are two servers with SQL Server installed. One of the SQL Server instances contains a copy of a database in the other SQL Server instance. When you run a query against the databases on both servers, the query runs slower on one server than the other.

The following steps can help troubleshoot this issue.

Step 1: Determine whether it's a common issue with multiple queries

Use one of the following two methods to compare the performance for two or more queries on the two servers:

Manually test the queries on both servers:
1. Choose several queries for testing with priority placed on queries that are:
  - Significantly faster on one server than on the other.
  - Important to the user/application.
  - Frequently executed or designed to reproduce the issue on demand.
  - Sufficiently long to capture data on it (for example, instead of a 5-milliseconds query, choose a 10-seconds query).
2. Run the queries on the two servers.
3. Compare the elapsed time (duration) on two servers for each query.
Analyze performance data with SQL Nexus.
1. Collect PSSDiag/SQLdiag or SQL LogScout data for the queries on the two servers.
2. Import the collected data files with SQL Nexus and compare the queries from the two servers. For more information, see Performance Comparison between two log collections (Slow and Fast for example).

Scenario 1: Only one single query performs differently on the two servers

If only one query performs differently, the issue is more likely specific to the individual query rather than to the environment. In this case, go to Step 2: Collect data and determine the type of performance issue.

Scenario 2: Multiple queries perform differently on the two servers

If multiple queries run slower on one server than the other, the most probable cause is the differences in server or data environment. Go to Diagnose environment differences and see if the comparison between the two servers is valid.

Step 2: Collect data and determine the type of performance issue

Collect Elapsed time, CPU time, and Logical Reads

To collect elapsed time and CPU time of the query on both servers, use one of the following methods that best fits your situation:

For currently executing statements, check total_elapsed_time and cpu_time columns in sys.dm_exec_requests. Run the following query to get the data:

SELECT 
    req.session_id
    , req.total_elapsed_time AS duration_ms
    , req.cpu_time AS cpu_time_ms
    , req.total_elapsed_time - req.cpu_time AS wait_time
    , req.logical_reads
    , SUBSTRING (REPLACE (REPLACE (SUBSTRING (ST.text, (req.statement_start_offset/2) + 1, 
       ((CASE statement_end_offset
           WHEN -1
           THEN DATALENGTH(ST.text)  
           ELSE req.statement_end_offset
         END - req.statement_start_offset)/2) + 1) , CHAR(10), ' '), CHAR(13), ' '), 
      1, 512)  AS statement_text  
FROM sys.dm_exec_requests AS req
    CROSS APPLY sys.dm_exec_sql_text(req.sql_handle) AS ST
ORDER BY total_elapsed_time DESC;

For past executions of the query, check last_elapsed_time and last_worker_time columns in sys.dm_exec_query_stats. Run the following query to get the data:

SELECT t.text,
     (qs.total_elapsed_time/1000) / qs.execution_count AS avg_elapsed_time,
     (qs.total_worker_time/1000) / qs.execution_count AS avg_cpu_time,
     ((qs.total_elapsed_time/1000) / qs.execution_count ) - ((qs.total_worker_time/1000) / qs.execution_count) AS avg_wait_time,
     qs.total_logical_reads / qs.execution_count AS avg_logical_reads,
     qs.total_logical_writes / qs.execution_count AS avg_writes,
     (qs.total_elapsed_time/1000) AS cumulative_elapsed_time_all_executions
FROM sys.dm_exec_query_stats qs
     CROSS apply sys.Dm_exec_sql_text (sql_handle) t
WHERE t.text like '<Your Query>%'
-- Replace <Your Query> with your query or the beginning part of your query. The special chars like '[','_','%','^' in the query should be escaped.
ORDER BY (qs.total_elapsed_time / qs.execution_count) DESC

Note

If avg_wait_time shows a negative value, it's a parallel query.

If you can execute the query on demand in SQL Server Management Studio (SSMS) or Azure Data Studio, run it with SET STATISTICS TIME ON and SET STATISTICS IO ON.

SET STATISTICS TIME ON
SET STATISTICS IO ON
<YourQuery>
SET STATISTICS IO OFF
SET STATISTICS TIME OFF

Then from Messages, you'll see the CPU time, elapsed time, and logical reads like this:

  Table 'tblTest'. Scan count 1, logical reads 3, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.

  SQL Server Execution Times:
    CPU time = 460 ms,  elapsed time = 470 ms.

If you can collect a query plan, check the data from the Execution plan properties.
1. Run the query with Include Actual Execution Plan on.
2. Select the left-most operator from Execution plan.
3. From Properties, expand QueryTimeStats property.
4. Check ElapsedTime and CpuTime.

Compare the elapsed time and CPU time of the query to determine the issue type for both servers.

Type 1: CPU-bound (runner)

If the CPU time is close, equal to, or higher than the elapsed time, you can treat it as a CPU-bound query. For example, if the elapsed time is 3000 milliseconds (ms) and the CPU time is 2900 ms, that means most of the elapsed time is spent on the CPU. Then we can say it's a CPU-bound query.

Examples of running (CPU-bound) queries:

Elapsed Time (ms)	CPU Time (ms)	Reads (logical)
3200	3000	300000
1080	1000	20

Logical reads - reading data/index pages in cache - are most frequently the drivers of CPU utilization in SQL Server. There could be scenarios where CPU use comes from other sources: a while loop (in T-SQL or other code like XProcs or SQL CRL objects). The second example in the table illustrates such a scenario, where the majority of the CPU is not from reads.

Note

If the CPU time is greater than the duration, this indicates a parallel query is executed; multiple threads are using the CPU at the same time. For more information, see Parallel queries - runner or waiter.

Type 2: Waiting on a bottleneck (waiter)

A query is waiting on a bottleneck if the elapsed time is significantly greater than the CPU time. The elapsed time includes the time executing the query on the CPU (CPU time) and the time waiting for a resource to be released (wait time). For example, if the elapsed time is 2000 ms and the CPU time is 300 ms, the wait time is 1700 ms (2000 - 300 = 1700). For more information, see Types of Waits.

Examples of waiting queries:

Elapsed Time (ms)	CPU Time (ms)	Reads (logical)
2000	300	28000
10080	700	80000

Parallel queries - runner or waiter

Parallel queries may use more CPU time than the overall duration. The goal of parallelism is to allow multiple threads to run parts of a query simultaneously. In one second of clock time, a query may use eight seconds of CPU time by executing eight parallel threads. Therefore, it becomes challenging to determine a CPU-bound or a waiting query based on the elapsed time and CPU time difference. However, as a general rule, follow the principles listed in the above two sections. The summary is:

If the elapsed time is much greater than the CPU time, consider it a waiter.
If the CPU time is much greater than the elapsed time, consider it a runner.

Examples of Parallel queries:

Elapsed Time (ms)	CPU Time (ms)	Reads (logical)
1200	8100	850000
3080	12300	1500000

Step 3: Compare data from both servers, figure out the scenario, and troubleshoot the issue

Let's assume that there are two machines named Server1 and Server2. And the query runs slower on Server1 than on Server2. Compare the times from both servers and then follow the actions of the scenario that best matches yours from the following sections.

Scenario 1: The query on Server1 uses more CPU time, and the logical reads are higher on Server1 than on Server2

If the CPU time on Server1 is much greater than on Server2 and the elapsed time matches the CPU time closely on both servers, there are no major waits or bottlenecks. The increase in CPU time on Server1 is most likely caused by an increase in logical reads. A significant change in logical reads typically indicates a difference in query plans. For example:

Server	Elapsed Time (ms)	CPU Time (ms)	Reads (logical)
Server1	3100	3000	300000
Server2	1100	1000	90200

Action: Check execution plans and environments

Compare execution plans of the query on both servers. To do this, use one of the two methods:
- Compare execution plans visually. For more information, see Display an Actual Execution Plan.
- Save the execution plans and compare them by using SQL Server Management Studio Plan Comparison feature.
Compare environments. Different environments may lead to query plan differences or direct differences in CPU usage. Environments include server versions, database or server configuration settings, trace flags, CPU count or clock speed, and Virtual Machine versus Physical Machine. See Diagnose query plan differences for details.

Scenario 2: The query is a waiter on Server1 but not on Server2

If the CPU times for the query on both servers are similar but the elapsed time on Server1 is much greater than on Server2, the query on Server1 spends a much longer time waiting on a bottleneck. For example:

Server	Elapsed Time (ms)	CPU Time (ms)	Reads (logical)
Server1	4500	1000	90200
Server2	1100	1000	90200

Waiting time on Server1: 4500 - 1000 = 3500 ms
Waiting time on Server2: 1100 - 1000 = 100 ms

Action: Check wait types on Server1

Identify and eliminate the bottleneck on Server1. Examples of waits are blocking (lock waits), latch waits, disk I/O waits, network waits, and memory waits. To troubleshoot common bottleneck issues, proceed to Diagnose waits or bottlenecks.

Scenario 3: The queries on both servers are waiters, but the wait types or times are different

For example:

Server	Elapsed Time (ms)	CPU Time (ms)	Reads (logical)
Server1	8000	1000	90200
Server2	3000	1000	90200

Waiting time on Server1: 8000 - 1000 = 7000 ms
Waiting time on Server2: 3000 - 1000 = 2000 ms

In this case, the CPU times are similar on both servers, which indicates query plans are likely the same. The queries would perform equally on both servers if they don't wait for the bottlenecks. So the duration differences come from the different amounts of wait time. For example, the query waits on locks on Server1 for 7000 ms while it waits on I/O on Server2 for 2000 ms.

Action: Check wait types on both servers

Address each bottleneck wait individually on each server and speed up executions on both servers. Troubleshooting this issue is labor-intensive because you need to eliminate bottlenecks on both servers and make the performance comparable. To troubleshoot common bottleneck issues, proceed to Diagnose waits or bottlenecks.

Scenario 4: The query on Server1 uses more CPU time than on Server2, but the logical reads are close

For example:

Server	Elapsed Time (ms)	CPU Time (ms)	Reads (logical)
Server1	3000	3000	90200
Server2	1000	1000	90200

If the data matches the following conditions:

The CPU time on Server1 is much greater than on Server2.
The elapsed time matches the CPU time closely on each server, which indicates no waits.
The logical reads, typically the highest driver of CPU time, are similar on both servers.

Then the additional CPU time comes from some other CPU-bound activities. This scenario is the rarest of all the scenarios.

Causes: Tracing, UDFs and CLR integration

This issue may be caused by:

XEvents/SQL Server tracing, especially with filtering on text columns (database name, login name, query text, and so on). If tracing is enabled on one server but not on the other, this could be the reason for the difference.
User-defined functions (UDFs) or other T-SQL code that performs CPU-bound operations. This would typically be the cause when other conditions are different on Server1 and Server2, such as data size, CPU clock speed, or Power plan.
SQL Server CLR integration or Extended Stored procedures (XPs) that may drive CPU but don't perform logical reads. Differences in the DLLs may lead to different CPU times.
Difference in SQL Server functionality that is CPU-bound (e.g., string-manipulation code).

Action: Check traces and queries

Check traces on both servers for the following:
1. If there's any trace enabled on Server1 but not on Server2.
2. If any trace is enabled, disable the trace and run the query again on Server1.
3. If the query runs faster this time, enable the trace back but remove text filters from it, if there are any.
Check if the query uses UDFs that do string manipulations or do extensive processing on data columns in the SELECT list.
Check if the query contains loops, function recursions, or nestings.

Diagnose environment differences

Check the following questions and determine whether the comparison between the two servers is valid.

Are the two SQL Server instances the same version or build?

If not, there could be some fixes that caused the differences. Run the following query to get version information on both servers:
```
SELECT @@VERSION
```
Is the amount of physical memory similar on both servers?

If one server has 64 GB of memory while the other has 256 GB of memory, that would be a significant difference. With more memory available to cache data/index pages and query plans, the query could be optimized differently based on hardware resource availability.
Are CPU-related hardware configurations similar on both servers? For example:
- Number of CPUs varies between machines (24 CPUs on one machine versus 96 CPUs on the other).
- Power plans—balanced versus high performance.
- Virtual Machine (VM) versus physical (bare metal) machine.
- Hyper-V versus VMware—difference in configuration.
- Clock speed difference (lower clock speed versus higher clock speed). For example, 2 GHz versus 3.5 GHz can make a difference. To get the clock speed on a server, run the following PowerShell command:
```
Get-CimInstance Win32_Processor | Select-Object -Expand MaxClockSpeed
```
Use one of the following two ways to test the CPU speed of the servers. If they don't produce comparable results, the issue is outside of SQL Server. It could be a power plan difference, fewer CPUs, VM-software issue, or clock speed difference.
- Run the following PowerShell script on both servers and compare the outputs.
```
$bf = [System.DateTime]::Now
for ($i = 0; $i -le 20000000; $i++) {}
$af = [System.DateTime]::Now
Write-Host ($af - $bf).Milliseconds " milliseconds"
Write-Host ($af - $bf).Seconds " Seconds"
```
- Run the following Transact-SQL code on both servers and compare the outputs.
```
SET NOCOUNT ON 
DECLARE @spins INT = 0
DECLARE @start_time DATETIME = GETDATE(), @time_millisecond INT

WHILE (@spins < 20000000)
BEGIN
   SET @spins = @spins +1
END

SELECT @time_millisecond = DATEDIFF(millisecond, @start_time, getdate())

SELECT @spins Spins, @time_millisecond Time_ms,  @spins / @time_millisecond Spins_Per_ms
```

Diagnose waits or bottlenecks

To optimize a query that's waiting on bottlenecks, identify how long the wait is and where the bottleneck is (the wait type). Once the wait type is confirmed, reduce the wait time or eliminate the wait completely.

To calculate the approximate wait time, subtract the CPU time (worker time) from the elapsed time of a query. Typically, the CPU time is the actual execution time, and the remaining part of the lifetime of the query is waiting.

Examples of how to calculate approximate wait duration:

Elapsed Time (ms)	CPU Time (ms)	Wait time (ms)
3200	3000	200
7080	1000	6080

Identify the bottleneck or wait

To identify historical long-waiting queries (for example, >20% of the overall elapsed time is wait time), run the following query. This query uses performance statistics for cached query plans since the start of SQL Server.

SELECT t.text,
         qs.total_elapsed_time / qs.execution_count
         AS avg_elapsed_time,
         qs.total_worker_time / qs.execution_count
         AS avg_cpu_time,
         (qs.total_elapsed_time - qs.total_worker_time) / qs.execution_count
         AS avg_wait_time,
         qs.total_logical_reads / qs.execution_count
         AS avg_logical_reads,
         qs.total_logical_writes / qs.execution_count
         AS avg_writes,
         qs.total_elapsed_time
         AS cumulative_elapsed_time
FROM sys.dm_exec_query_stats qs
         CROSS apply sys.Dm_exec_sql_text (sql_handle) t
WHERE (qs.total_elapsed_time - qs.total_worker_time) / qs.total_elapsed_time
         > 0.2
ORDER BY qs.total_elapsed_time / qs.execution_count DESC

To identify currently executing queries with waits longer than 500 ms, run the following query:

SELECT r.session_id, r.wait_type, r.wait_time AS wait_time_ms
FROM sys.dm_exec_requests r 
   JOIN sys.dm_exec_sessions s ON r.session_id = s.session_id 
WHERE wait_time > 500
AND is_user_process = 1

If you can collect a query plan, check the WaitStats from the execution plan properties in SSMS:
1. Run the query with Include Actual Execution Plan on.
2. Right-click the left-most operator in the Execution plan tab
3. Select Properties and then WaitStats property.
4. Check the WaitTimeMs and WaitType.
If you're familiar with PSSDiag/SQLdiag or SQL LogScout LightPerf/GeneralPerf scenarios, consider using either of them to collect performance statistics and identify waiting queries on your SQL Server instance. You can import the collected data files and analyze the performance data with SQL Nexus.

References to help eliminate or reduce waits

The causes and resolutions for each wait type vary. There's no one general method to resolve all wait types. Here are articles to troubleshoot and resolve common wait type issues:

For descriptions of many Wait types and what they indicate, see the table in Types of Waits.

Diagnose query plan differences

Here are some common causes for differences in query plans:

Data size or data values differences

Is the same database being used on both servers—using the same database backup? Has the data been modified on one server compared to the other? Data differences can lead to different query plans. For example, joining table T1 (1000 rows) with table T2 (2,000,000 rows) is different from joining table T1 (100 rows) with table T2 (2,000,000 rows). The type and speed of the JOIN operation can be significantly different.
Statistics differences

Have statistics been updated on one database and not on the other? Have statistics been updated with a different sample rate (for example, 30% versus 100% full scan)? Ensure that you update statistics on both sides with the same sample rate.
Database compatibility level differences

Check if the compatibility levels of the databases are different between the two servers. To get the database compatibility level, run the following query:
```
SELECT name, compatibility_level
FROM sys.databases
WHERE name = '<YourDatabase>'
```
Server version/build differences

Are the versions or builds of SQL Server different between the two servers? For example, is one server SQL Server version 2014 and the other SQL Server version 2016? There could be product changes that can lead to changes in how a query plan is selected. Make sure you compare the same version and build of SQL Server.
```
SELECT ServerProperty('ProductVersion')
```
Cardinality Estimator (CE) version differences

Check if the legacy cardinality estimator is activated at the database level. For more information about CE, see Cardinality Estimation (SQL Server).
```
SELECT name, value, is_value_default
FROM sys.database_scoped_configurations
WHERE name = 'LEGACY_CARDINALITY_ESTIMATION'
```
Optimizer hotfixes enabled/disabled

If the query optimizer hotfixes are enabled on one server but disabled on the other, then different query plans can be generated. For more information, see SQL Server query optimizer hotfix trace flag 4199 servicing model.

To get the state of query optimizer hotfixes, run the following query:
```
-- Check at server level for TF 4199
DBCC TRACESTATUS (-1)
-- Check at database level
USE <YourDatabase>
SELECT name, value, is_value_default 
FROM sys.database_scoped_configurations
WHERE name = 'QUERY_OPTIMIZER_HOTFIXES'
```
Trace flags differences

Some trace flags affect query plan selection. Check if there are trace flags enabled on one server that aren't enabled on the other. Run the following query on both servers and compare the results:
```
-- Check at server level for trace flags
DBCC TRACESTATUS (-1)
```
Hardware differences (CPU count, Memory size)

To get the hardware information, run the following query:
```
SELECT cpu_count, physical_memory_kb/1024/1024 PhysicalMemory_GB 
FROM sys.dm_os_sys_info
```

Hardware differences according to the query optimizer

Check the OptimizerHardwareDependentProperties of a query plan and see if hardware differences are considered significant for different plans.

WITH xmlnamespaces(DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT
  txt.text,
  t.OptHardw.value('@EstimatedAvailableMemoryGrant', 'INT') AS EstimatedAvailableMemoryGrant , 
  t.OptHardw.value('@EstimatedPagesCached', 'INT') AS EstimatedPagesCached, 
  t.OptHardw.value('@EstimatedAvailableDegreeOfParallelism', 'INT') AS EstimatedAvailDegreeOfParallelism,
  t.OptHardw.value('@MaxCompileMemory', 'INT') AS MaxCompileMemory
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
CROSS APPLY qp.query_plan.nodes('//OptimizerHardwareDependentProperties') AS t(OptHardw)
CROSS APPLY sys.dm_exec_sql_text (CP.plan_handle) txt
WHERE text Like '%<Part of Your Query>%'

Optimizer timeout

Is there an optimizer timeout issue? The query optimizer can stop evaluating plan options if the query being executed is too complex. When it stops, it picks the plan with the lowest cost available at the time. This can lead to what seems like an arbitrary plan choice on one server versus another.
SET options

Some SET options are plan-affecting, such as SET ARITHABORT. For more information, see SET Options.
Query Hint differences

Does one query use query hints and the other not? Check the query text manually to establish the presence of query hints.
Parameter-sensitive plans (parameter sniffing issue)

Are you testing the query with the exact same parameter values? If not, then you may start there. Was the plan compiled earlier on one server based on a different parameter value? Test the two queries by using the RECOMPILE query hint to ensure there's no plan reuse taking place. For more information, see Investigate and resolve parameter-sensitive issues.

Different database options/scoped configuration settings

Are the same database options or scoped configuration settings used on both servers? Some database options may influence plan choices. For example, database compatibility, legacy CE versus default CE, and parameter sniffing. Run the following query from one server to compare the database options used on the two servers:

-- On Server1 add a linked server to Server2 
EXEC master.dbo.sp_addlinkedserver @server = N'Server2', @srvproduct=N'SQL Server'

-- Run a join between the two servers to compare settings side by side
SELECT 
   s1.name AS srv1_config_name, 
   s2.name AS srv2_config_name,
   s1.value_in_use AS srv1_value_in_use, 
   s2.value_in_use AS srv2_value_in_use, 
   Variance = CASE WHEN ISNULL(s1.value_in_use, '##') != ISNULL(s2.value_in_use,'##') THEN 'Different' ELSE '' END
FROM sys.configurations s1 
FULL OUTER JOIN [server2].master.sys.configurations s2 ON s1.name = s2.name


SELECT 
   s1.name AS srv1_config_name,
   s2.name AS srv2_config_name,
   s1.value srv1_value_in_use,
   s2.value srv2_value_in_use,
   s1.is_value_default,
   s2.is_value_default,
   Variance = CASE WHEN ISNULL(s1.value, '##') != ISNULL(s2.value, '##') THEN 'Different' ELSE '' END
FROM sys.database_scoped_configurations s1
FULL OUTER JOIN [server2].master.sys.database_scoped_configurations s2 ON s1.name = s2.name

Plan guides

Are any plan guides used for your queries on one server but not on the other? Run the following query to establish differences:
```
SELECT * FROM sys.plan_guides
```

Zdieľať cez