Configure caching

Note

We will retire Azure HDInsight on AKS on January 31, 2025. Before January 31, 2025, you will need to migrate your workloads to Microsoft Fabric or an equivalent Azure product to avoid abrupt termination of your workloads. The remaining clusters on your subscription will be stopped and removed from the host.

Only basic support will be available until the retirement date.

Important

This feature is currently in preview. The Supplemental Terms of Use for Microsoft Azure Previews include more legal terms that apply to Azure features that are in beta, in preview, or otherwise not yet released into general availability. For information about this specific preview, see Azure HDInsight on AKS preview information. For questions or feature suggestions, please submit a request on AskHDInsight with the details and follow us for more updates on Azure HDInsight Community.

Querying object storage using the Hive connector is a common use case for Trino. This process often involves sending large amounts of data. Objects are retrieved from HDFS or another supported object store by multiple workers and processed by those workers. Repeated queries with different parameters, or even different queries from different users, often access and transfer the same objects.

HDInsight on AKS added final result caching capability for Trino, which provides the following benefits:

  • Reduce the load on object storage.
  • Improve the query performance.
  • Reduce the query cost.

Caching options

The following caching options are available:

  • Final result caching: When enabled (in the coordinator component configuration section), the result of any query against any catalog is cached on the coordinator VM.
  • Hive/Iceberg/Delta Lake catalog caching: When enabled (for a specific catalog of the corresponding type), split data for each query is cached within the cluster on worker VMs.

Final result caching

Final result caching can be configured in two ways: using the Azure portal or using an ARM template.

The following configuration parameters are available:

Property                            Default   Description
query.cache.enabled                 false     Enables final result caching if true.
query.cache.ttl                     -         Defines how long cached data is kept before eviction. For example: "10m", "1h".
query.cache.disk-usage-percentage   80        Percentage of disk space used for cached data.
query.cache.max-result-data-size    0         Maximum data size for a result. If a result exceeds this value, it isn't cached.

Note

Final result caching uses the query plan and the TTL as the cache key.
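To illustrate what keying on the query plan and TTL implies, here is a minimal conceptual sketch in Python. It is not the service's implementation: `ResultCache`, `plan_fingerprint`, and the use of the query text as a stand-in for the query plan are all assumptions made for illustration. Under those assumptions, the same query requested with a different TTL maps to a different cache entry.

```python
import hashlib
import time


def plan_fingerprint(plan_text):
    """Hypothetical stand-in for a query plan identity: a hash of its text."""
    return hashlib.sha256(plan_text.encode("utf-8")).hexdigest()


class ResultCache:
    """Toy cache keyed by (plan fingerprint, TTL in seconds). Illustration only."""

    def __init__(self, clock=time.monotonic):
        self._entries = {}
        self._clock = clock

    def put(self, plan_text, ttl_seconds, result):
        key = (plan_fingerprint(plan_text), ttl_seconds)
        self._entries[key] = (self._clock(), result)

    def get(self, plan_text, ttl_seconds):
        key = (plan_fingerprint(plan_text), ttl_seconds)
        entry = self._entries.get(key)
        if entry is None:
            return None  # cache miss: nothing stored under this key
        stored_at, result = entry
        if self._clock() - stored_at >= ttl_seconds:
            del self._entries[key]  # entry outlived its TTL, evict it
            return None
        return result


cache = ResultCache()
plan = "select count(*) from tpch.tiny.orders"
cache.put(plan, 600, ["cached rows"])

same_ttl = cache.get(plan, 600)    # same plan and TTL: served from the cache
other_ttl = cache.get(plan, 3600)  # same plan, different TTL: a different key
print(same_ttl, other_ttl)
```

One practical consequence of this model is that changing the TTL for a query is likely to start with a cache miss, even if the same query was cached under the previous TTL.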

Final result caching can also be controlled through the following session parameters:

Session parameter                  Default                        Description
query_cache_enabled                Original configuration value   Enables or disables final result caching for a query or session.
query_cache_ttl                    Original configuration value   Defines how long cached data is kept before eviction.
query_cache_max_result_data_size   Original configuration value   Maximum data size for a result. If a result exceeds this value, it isn't cached.
query_cache_forced_refresh         false                          When set to true, forces the result of the query execution to be cached (that is, the result replaces existing cached data, if any).

Note

Session parameters can be set for a whole session (for example, when using the Trino CLI) or in a multi-statement request before the query text. For example:

set session query_cache_enabled=true;
select cust.name, *
from tpch.tiny.orders 
join tpch.tiny.customer as cust on cust.custkey = orders.custkey
order by cust.name
limit 10;
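When queries are submitted from a script, the multi-statement form above can be assembled programmatically. The following Python sketch only builds the statement text and does not connect to a cluster. The helper names `render_value` and `with_session` are hypothetical, and quoting the TTL as a string literal (`'10m'`) is an assumption about how the session parameter expects its value.

```python
def render_value(value):
    """Render a Python value as a literal for a SET SESSION statement."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    # Everything else becomes a quoted string with single quotes escaped.
    return "'" + str(value).replace("'", "''") + "'"


def with_session(query, **session_params):
    """Prefix a query with one SET SESSION statement per parameter."""
    statements = [
        f"set session {name}={render_value(value)}"
        for name, value in session_params.items()
    ]
    statements.append(query.strip().rstrip(";"))
    return ";\n".join(statements) + ";"


text = with_session(
    "select * from tpch.tiny.orders limit 10;",
    query_cache_enabled=True,
    query_cache_ttl="10m",
)
print(text)
```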

Final result caching produces JMX metrics, which can be viewed using Managed Prometheus and Grafana. The following metrics are available:

Metric                                        Description
trino_cache_cachestats_requestcount           Total number of queries going through the cache layer. This number doesn't include queries executed with the cache off.
trino_cache_cachestats_hitcount               Number of cache hits, that is, the number of queries for which data was available and returned from the cache.
trino_cache_cachestats_misscount              Number of cache misses, that is, the number of queries for which data wasn't available and had to be cached.
trino_cache_cachestats_hitrate                Cache hits as a percentage of the total number of queries.
trino_cache_cachestats_totalevictedcount      Number of cached queries evicted from the cache.
trino_cache_cachestats_totalbytesfromsource   Number of bytes read from the source.
trino_cache_cachestats_totalbytesfromcache    Number of bytes read from the cache.
trino_cache_cachestats_totalcachedbytes       Total number of bytes cached.
trino_cache_cachestats_totalevictedbytes      Total number of bytes evicted.
trino_cache_cachestats_spaceused              Current size of the cache.
trino_cache_cachestats_cachereadfailures      Number of times data couldn't be read from the cache because of an error.
trino_cache_cachestats_cachewritefailures     Number of times data couldn't be written to the cache because of an error.
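As a rough guide to reading these counters together, the following Python sketch derives a hit rate and the share of bytes served from the cache from sampled values. The function names and the sample numbers are hypothetical, and the sketch assumes that the request count is the right denominator for the hit rate. Whether the exported hit rate metric is scaled 0 to 1 or 0 to 100 isn't stated here, so the sketch computes its own percentage.

```python
def hit_rate_percent(hit_count, request_count):
    """Cache hits as a percentage of queries that went through the cache layer."""
    if request_count == 0:
        return 0.0  # no cached-path queries yet, avoid dividing by zero
    return 100.0 * hit_count / request_count


def cache_byte_share_percent(bytes_from_cache, bytes_from_source):
    """Share of all bytes read that were served from the cache."""
    total = bytes_from_cache + bytes_from_source
    if total == 0:
        return 0.0
    return 100.0 * bytes_from_cache / total


# Hypothetical sample values, for example copied from a Grafana panel.
sample = {
    "requestcount": 200,
    "hitcount": 50,
    "totalbytesfromcache": 3_000,
    "totalbytesfromsource": 1_000,
}

rate = hit_rate_percent(sample["hitcount"], sample["requestcount"])
share = cache_byte_share_percent(
    sample["totalbytesfromcache"], sample["totalbytesfromsource"]
)
print(f"hit rate: {rate:.1f}%, bytes served from cache: {share:.1f}%")
```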

Using Azure portal

  1. Sign in to the Azure portal.

  2. In the Azure portal search bar, type "HDInsight on AKS cluster" and select "Azure HDInsight on AKS clusters" from the drop-down list.

    Screenshot showing search option for getting started with HDInsight on AKS Cluster.

  3. Select your cluster name from the list page.

    Screenshot showing selecting the HDInsight on AKS Cluster you require from the list.

  4. Navigate to the Configuration Management blade.

    Screenshot showing Azure portal configuration management.

  5. Go to config.properties -> Custom configurations and then click Add.

    Screenshot showing custom configuration.

  6. Set the required properties, and click OK.

    Screenshot showing configuration properties.

  7. Save the configuration.

    Screenshot showing how to save the configuration.

Using ARM template

Prerequisites

You need to define the properties for the coordinator component in the properties.clusterProfile.serviceConfigsProfiles section of the ARM template. The following example demonstrates where to add the properties.

{
    "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {},
    "resources": [
        {
            "type": "microsoft.hdinsight/clusterpools/clusters",
            "apiVersion": "<api-version>",
            "name": "<cluster-pool-name>/<cluster-name>",
            "location": "<region, e.g. westeurope>",
            "tags": {},
            "properties": {
                "clusterType": "Trino",

                "clusterProfile": {

                    "serviceConfigsProfiles": [
                        {
                            "serviceName": "trino",
                            "configs": [
                                {
                                    "component": "coordinator",
                                    "files": [
                                        {
                                            "fileName": "config.properties",
                                            "values": {
                                                "query.cache.enabled": "true",
                                                "query.cache.ttl": "10m"
                                            }
                                        }
                                    ]
                                }
                            ]
                        }
                    ]
                }

            }
        }
    ]
}
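If the template is generated by a script rather than edited by hand, the serviceConfigsProfiles entry shown above can be built as a plain data structure and serialized. This is a minimal sketch: the function name `coordinator_cache_profile` is hypothetical, and the output still has to be placed under properties.clusterProfile in a complete cluster template.

```python
import json


def coordinator_cache_profile(enabled=True, ttl="10m"):
    """Build the Trino service profile that carries final result caching settings."""
    return {
        "serviceName": "trino",
        "configs": [
            {
                "component": "coordinator",
                "files": [
                    {
                        "fileName": "config.properties",
                        # Property values are passed as strings, as in the template.
                        "values": {
                            "query.cache.enabled": "true" if enabled else "false",
                            "query.cache.ttl": ttl,
                        },
                    }
                ],
            }
        ],
    }


profile = coordinator_cache_profile(enabled=True, ttl="10m")
fragment = json.dumps({"serviceConfigsProfiles": [profile]}, indent=4)
print(fragment)
```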

Hive/Iceberg/Delta Lake caching

All three connectors share the same set of parameters as described in Hive caching.

Note

Certain parameters are not configurable and are always set to their default values:

  • hive.cache.data-transfer-port=8898
  • hive.cache.bookkeeper-port=8899
  • hive.cache.location=/etc/trino/cache
  • hive.cache.disk-usage-percentage=80

The following example demonstrates where to add the properties to enable Hive caching using ARM template.

{
    "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {},
    "resources": [
        {
            "type": "microsoft.hdinsight/clusterpools/clusters",
            "apiVersion": "<api-version>",
            "name": "<cluster-pool-name>/<cluster-name>",
            "location": "<region, e.g. westeurope>",
            "tags": {},
            "properties": {
                "clusterType": "Trino",

                "clusterProfile": {

                    "serviceConfigsProfiles": [
                        {
                            "serviceName": "trino",
                            "configs": [
                                {
                                    "component": "catalogs",
                                    "files": [
                                        {
                                            "fileName": "hive1.properties",
                                            "values": {
                                                "connector.name": "hive",
                                                "hive.cache.enabled": "true",
                                                "hive.cache.ttl": "5d"
                                            }
                                        }
                                    ]
                                }
                            ]
                        }
                    ]
                }

            }
        }
    ]
}

Deploy the updated ARM template to reflect the changes in your cluster. Learn how to deploy an ARM template.
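Before deploying, it can be useful to confirm that the template parses as JSON and that the caching properties sit where you expect, because a single missing comma in a values block makes the whole template invalid. The following Python sketch does that with the standard library only. The function name `list_trino_config_values` and the trimmed sample template are hypothetical.

```python
import json


def list_trino_config_values(template_text):
    """Parse an ARM template and list (component, file, key, value) tuples.

    json.loads raises json.JSONDecodeError if the template has a syntax error.
    """
    template = json.loads(template_text)
    found = []
    for resource in template.get("resources", []):
        cluster_profile = resource.get("properties", {}).get("clusterProfile", {})
        for profile in cluster_profile.get("serviceConfigsProfiles", []):
            for config in profile.get("configs", []):
                for file_entry in config.get("files", []):
                    for key, value in file_entry.get("values", {}).items():
                        found.append(
                            (config.get("component"), file_entry.get("fileName"), key, value)
                        )
    return found


# Trimmed, hypothetical template containing only the parts inspected here.
sample_template = """
{
    "resources": [
        {
            "properties": {
                "clusterProfile": {
                    "serviceConfigsProfiles": [
                        {
                            "serviceName": "trino",
                            "configs": [
                                {
                                    "component": "catalogs",
                                    "files": [
                                        {
                                            "fileName": "hive1.properties",
                                            "values": {
                                                "connector.name": "hive",
                                                "hive.cache.enabled": "true",
                                                "hive.cache.ttl": "5d"
                                            }
                                        }
                                    ]
                                }
                            ]
                        }
                    ]
                }
            }
        }
    ]
}
"""

settings = list_trino_config_values(sample_template)
for component, file_name, key, value in settings:
    print(f"{component}/{file_name}: {key}={value}")
```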