Performance: Investigate Inconsistent Query Execution Times
This post applies to both APS and Azure SQL DW
There is a reasonable expectation that if a query is executed in a controlled environment multiple times it will have minimal variance in total execution time when no changes are made to the system, data or query. When a variance is experienced it should be investigated to find the cause. In my experience with PDW a variance of less than 5% is not necessarily an indication of an underlying issue unless there are other contributing factors to make one believe so.
Any experienced variance could be due to a changes to the data, changes to the query, or concurrent workload. Its best to start with the basics. First confirm the query itself is indeed identical to previous runs. The slightest change such as adding or removing columns in the select list or a slight change in predicate can have a significant impact on query performance.
Next I like to look at the two extremes as it will be the easiest to find the differences. Pick the fastest and slowest execution you can find. First check to make sure the query was not suspended waiting on any resource locks. Details can be found in this post.
If the two executions had similar wait times for concurrency and object locks, next you need to drill into the execution of the query.
Compare the overall execution for different runs of the same query: SELECT fast.step_index, fast.operation_type, fast.total_elapsed_time AS fast_total, slow.total_elapsed_time AS slow_total, slow.total_elapsed_time - fast.total_elapsed_time AS time_delta, fast.row_count AS fast_row_count, slow.row_count AS slow_row_count, slow.row_count - fast.row_count AS row_count_delta FROM (SELECT * FROM sys.dm_pdw_request_steps WHERE request_id = '<fast_request_id>') fast INNER JOIN (SELECT * FROM sys.dm_pdw_request_steps WHERE request_id = '<slow_request_id>') slow ON fast.step_index = slow.step_index ORDER BY step_index ASC
The operation list should be identical between two identical queries. Pick the operation with the most variance and compare between two request_ids. Depending on the type of operation will determine which DMV will have the appropriate data:
Data movement step:
SELECT fast.pdw_node_id, fast.distribution_id, fast.type, fast.total_elapsed_time AS fast_time_total, slow.total_elapsed_time AS slow__time_total, slow.total_elapsed_time - fast.total_elapsed_time AS time_delta, slow.rows_processed - fast.rows_processed AS row_count_delta, slow.cpu_time - fast.cpu_time AS CPU_time_delta, slow.query_time - fast.query_time AS query_time_delta, slow.bytes_processed - fast.bytes_processed AS bytes_processed_delta FROM (SELECT * FROM sys.dm_pdw_dms_workers WHERE request_id = '<fast query id>' AND step_index = <step index>) FAST INNER JOIN (SELECT * FROM sys.dm_pdw_dms_workers WHERE request_id = '<slow query id>' AND step_index = <step_index>) SLOW ON FAST.pdw_node_id = slow.pdw_node_id AND fast.distribution_id = slow.distribution_id AND fast.type = slow.type
Other operation:
SELECT fast.step_index, fast.pdw_node_id, fast.distribution_id, fast.total_elapsed_time AS fast_time, slow.total_elapsed_time AS slow_time, slow.total_elapsed_time - fast.total_elapsed_time AS time_delta FROM (SELECT * FROM sys.dm_pdw_sql_requests WHERE request_id = '<fast query id>') fast INNER JOIN (SELECT * FROM sys.dm_pdw_sql_requests WHERE request_id = '<slow query id>') slow ON fast.distribution_id = slow.distribution_id
Pay attention to start and end times for each distribution as well as rows and bytes processed to look for any discrepancies. Concurrency can significantly impact performance by creating a bottleneck on a system resource but will not be evident in these results. It is always best to baseline a query with no concurrent executions if possible.
The goal of this article is not to identify a specific reason query execution times vary, but rather how to access the data detailing the execution and identify areas which require further investigation. Once an anomaly is identified, further investigation may be needed.