Enable scan statistics for queries

Important

This feature is currently in preview. The Supplemental Terms of Use for Microsoft Azure Previews include more legal terms that apply to Azure features that are in beta, in preview, or otherwise not yet released into general availability. For information about this specific preview, see Azure HDInsight on AKS preview information. For questions or feature suggestions, please submit a request on AskHDInsight with the details and follow us for more updates on Azure HDInsight Community.

Often data teams are required to investigate performance or optimize queries to improve resource utilization or meet business requirements.

A new capability has been added in Trino for HDInsight on AKS that allows user to capture Scan statistics for any connector. This capability provides deeper insights into query performance profile beyond what is available in statistics produced by Trino.

You can enable this feature using session property collect_raw_scan_statistics, and by following Trino command:

SET SESSION collect_raw_scan_statistics=true

Once enabled, source operators in the query like TableScanOperator, ScanFilterAndProject etc. have statistics on data scans, the granularity is per operator instance in a pipeline.

Tip

Scan stats are helpful in identifying bottlenecks when the cluster or query is not CPU constrained, and read performance of the query needs investigation. It also helps to understand the execution profile of the query from a split level perspective.

Note

Currently, captured number of splits per worker is limited to 1000 due to size constraints of produced data. If the number of splits per worker for the query exceeds this limit, top 1000 longest running splits are returned.

How to access scan statistics

Once the session property is set, subsequent queries in the session start capturing statistics from source operators whenever they're available. There are multiple ways users can consume and analyze scan statistics generated for a query.

Query Json

The Json tab on Query details page provides the JSON representation of query, which included statistics on every stage, pipeline of the query. When the session property is set, the json includes a new field called scanStats in queryStats.operatorSummaries[*]. The array contains one object per instance of operator.

The following example shows a json for a query using hive connector and scan statistics enabled.

Note

The scan statistics summary includes splitInfo which is populated by the connector. This allows users to control what information about the store they would like to include in their custom connectors.

Screenshot showing query performance summary.

Scan Statistics UI

You can find a new tab called Scan Stats in Query details page that visualizes the statistics produced by this feature and provides insights in split grain performance of each worker. The page allows users to view trino's execution profile for the query with information like, concurrent reads over time and throughput.

Screenshot showing scan status addition.

The following example shows a page for a query with scan statistics enabled.

Screenshot showing scan status tab.

Using Microsoft Query logger**

Microsoft Query logger has builtin support for this feature. When enabled with this feature, the query logger plugin populates a table called operatorstats along with the query events table, this table is denormalized so that every operator instance is one row for each query.