Configure caching
Note
We will retire Azure HDInsight on AKS on January 31, 2025. Before January 31, 2025, you will need to migrate your workloads to Microsoft Fabric or an equivalent Azure product to avoid abrupt termination of your workloads. The remaining clusters on your subscription will be stopped and removed from the host.
Only basic support will be available until the retirement date.
Important
This feature is currently in preview. The Supplemental Terms of Use for Microsoft Azure Previews include more legal terms that apply to Azure features that are in beta, in preview, or otherwise not yet released into general availability. For information about this specific preview, see Azure HDInsight on AKS preview information. For questions or feature suggestions, please submit a request on AskHDInsight with the details and follow us for more updates on Azure HDInsight Community.
Querying object storage using the Hive connector is a common use case for Trino. This process often involves sending large amounts of data. Objects are retrieved from HDFS or another supported object store by multiple workers and processed by those workers. Repeated queries with different parameters, or even different queries from different users, often access and transfer the same objects.
HDInsight on AKS added final result caching capability for Trino, which provides the following benefits:
- Reduce the load on object storage.
- Improve the query performance.
- Reduce the query cost.
The following caching options are available:
- Final result caching: When enabled (in the coordinator component configuration section), the result of any query against any catalog is cached on the coordinator VM.
- Hive/Iceberg/Delta Lake catalog caching: When enabled (for a specific catalog of the corresponding type), the split data for each query is cached within the cluster on worker VMs.
Final result caching can be configured in two ways: in the Azure portal or with an ARM template.
Available configuration parameters are:
Property | Default | Description |
---|---|---|
query.cache.enabled | false | Enables final result caching if true. |
query.cache.ttl | - | Defines how long cached data is kept before eviction. For example: "10m", "1h". |
query.cache.disk-usage-percentage | 80 | Percentage of disk space used for cached data. |
query.cache.max-result-data-size | 0 | Maximum data size for a result. If a result exceeds this value, it isn't cached. |
Note
Final result caching uses the query plan and TTL as the cache key.
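One consequence of this key can be sketched with a toy model: two runs of a query share a cached entry only when they resolve to the same plan and the same TTL. The hashing scheme, the `cache_key` helper, and the plan string below are assumptions made for illustration; they are not the service's actual key format.

```python
import hashlib


def cache_key(query_plan: str, ttl: str) -> str:
    """Toy cache key: a digest over the plan text and the TTL string.

    Hypothetical layout for illustration only; the real key format
    used by final result caching is not described in this article.
    """
    material = f"{query_plan}\n{ttl}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()


# A made-up textual plan; a real plan is produced by the Trino planner.
plan = "Limit[10] <- Sort[name] <- Join[custkey] <- (orders, customer)"

same_ttl_matches = cache_key(plan, "10m") == cache_key(plan, "10m")
other_ttl_matches = cache_key(plan, "10m") == cache_key(plan, "1h")
print(same_ttl_matches, other_ttl_matches)
```

Under this model, changing only the TTL for a session produces a different key, so the first run with the new TTL would be treated as a miss.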
Session parameter | Default | Description |
---|---|---|
query_cache_enabled | Original configuration value | Enables/disables final result caching for a query/session. |
query_cache_ttl | Original configuration value | Defines how long cached data is kept before eviction. |
query_cache_max_result_data_size | Original configuration value | Maximum data size for a result. If a result exceeds this value, it isn't cached. |
query_cache_forced_refresh | false | When set to true, forces the result of query execution to be cached (that is, the result replaces existing cached data if it exists). |
Note
Session parameters can be set for a whole session (for example, when using the Trino CLI) or in a multi-statement request, before the query text. For example:
set session query_cache_enabled=true;
select cust.name, *
from tpch.tiny.orders
join tpch.tiny.customer as cust on cust.custkey = orders.custkey
order by cust.name
limit 10;
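When queries are submitted from a script, the multi-statement form above can be assembled programmatically. The following is a minimal sketch; the `with_session` helper is hypothetical, and it assumes that duration-style values such as the TTL are passed as quoted strings, as is usual for Trino session properties.

```python
def with_session(query: str, **session_params) -> str:
    """Prefix a query with one `set session` statement per parameter."""

    def literal(value):
        # Booleans and numbers are written bare; other values are quoted.
        if isinstance(value, bool):
            return "true" if value else "false"
        if isinstance(value, (int, float)):
            return str(value)
        return "'" + str(value).replace("'", "''") + "'"

    lines = [
        f"set session {name}={literal(value)};"
        for name, value in session_params.items()
    ]
    # Normalize the query so it ends with exactly one semicolon.
    lines.append(query.rstrip().rstrip(";") + ";")
    return "\n".join(lines)


text = with_session(
    "select count(*) from tpch.tiny.orders",
    query_cache_enabled=True,
    query_cache_ttl="10m",
)
print(text)
```

The parameter names are the session parameters listed in the table above; the helper only builds text and doesn't contact a cluster.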
Final result caching produces JMX metrics which can be viewed using Managed Prometheus and Grafana. The following metrics are available:
Metric | Description |
---|---|
trino_cache_cachestats_requestcount | Total number of queries going through cache layer. This number doesn't include queries executed with cache off. |
trino_cache_cachestats_hitcount | Number of cache hits i.e. number of queries when data was available and returned from the cache. |
trino_cache_cachestats_misscount | Number of cache misses i.e. number of queries when data wasn't available and had to be cached. |
trino_cache_cachestats_hitrate | Percentage representation of cache hits against total number of queries. |
trino_cache_cachestats_totalevictedcount | Number of cached queries evicted from the cache. |
trino_cache_cachestats_totalbytesfromsource | Number of bytes read from the source. |
trino_cache_cachestats_totalbytesfromcache | Number of bytes read from the cache. |
trino_cache_cachestats_totalcachedbytes | Total number of bytes cached. |
trino_cache_cachestats_totalevictedbytes | Total number of bytes evicted. |
trino_cache_cachestats_spaceused | Current size of the cache. |
trino_cache_cachestats_cachereadfailures | Number of times when data can't be read from the cache due to any error. |
trino_cache_cachestats_cachewritefailures | Number of times when data can't be written into the cache due to any error. |
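As a rough illustration of how the counters relate, the sketch below derives a hit rate and a cache byte share from sampled values. It assumes that the request count equals hits plus misses, which this article doesn't state explicitly, so the exported `trino_cache_cachestats_hitrate` value may be computed differently. The function names are hypothetical.

```python
def hit_rate_percent(hit_count: int, miss_count: int) -> float:
    """Cache hits as a percentage of all queries that went through the cache."""
    total = hit_count + miss_count  # assumed to match requestcount
    if total == 0:
        return 0.0  # no cached queries yet; avoid division by zero
    return 100.0 * hit_count / total


def cache_byte_share_percent(bytes_from_cache: int, bytes_from_source: int) -> float:
    """Share of all bytes read that were served from the cache."""
    total = bytes_from_cache + bytes_from_source
    if total == 0:
        return 0.0
    return 100.0 * bytes_from_cache / total


# Made-up sample values, as they might be read from a dashboard.
print(hit_rate_percent(hit_count=30, miss_count=10))
print(cache_byte_share_percent(bytes_from_cache=750, bytes_from_source=250))
```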
Sign in to the Azure portal.
In the Azure portal search bar, type "HDInsight on AKS cluster" and select "Azure HDInsight on AKS clusters" from the drop-down list.
Select your cluster name from the list page.
Navigate to the Configuration Management blade.
Go to config.properties -> Custom configurations and then click Add.
Set the required properties, and click OK.
Save the configuration.
- An operational Trino cluster with HDInsight on AKS.
- Create an ARM template for your cluster.
- Review complete cluster ARM template sample.
- Familiarity with ARM template authoring and deployment.
Define the properties in the coordinator component, in the properties.clusterProfile.serviceConfigsProfiles section of the ARM template.
The following example demonstrates where to add the properties.
{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {},
  "resources": [
    {
      "type": "microsoft.hdinsight/clusterpools/clusters",
      "apiVersion": "<api-version>",
      "name": "<cluster-pool-name>/<cluster-name>",
      "location": "<region, e.g. westeurope>",
      "tags": {},
      "properties": {
        "clusterType": "Trino",
        "clusterProfile": {
          "serviceConfigsProfiles": [
            {
              "serviceName": "trino",
              "configs": [
                {
                  "component": "coordinator",
                  "files": [
                    {
                      "fileName": "config.properties",
                      "values": {
                        "query.cache.enabled": "true",
                        "query.cache.ttl": "10m"
                      }
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    }
  ]
}
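If the template is maintained by a script, the same nesting can be filled in programmatically. The following is a minimal sketch that merges properties into a given component and file of an already loaded template. The helper names `set_trino_config` and `_find_or_add` are hypothetical, and the sketch assumes the template follows the structure shown above.

```python
import copy
import json

CLUSTER_TYPE = "microsoft.hdinsight/clusterpools/clusters"


def _find_or_add(items, key, wanted):
    """Return the dict in `items` whose `key` equals `wanted`, adding it if absent."""
    for item in items:
        if item.get(key) == wanted:
            return item
    item = {key: wanted}
    items.append(item)
    return item


def set_trino_config(template, component, file_name, values):
    """Return a copy of `template` with `values` merged into the given file."""
    result = copy.deepcopy(template)  # leave the caller's template untouched
    for resource in result.get("resources", []):
        if resource.get("type", "").lower() != CLUSTER_TYPE:
            continue
        profile = resource.setdefault("properties", {}).setdefault("clusterProfile", {})
        services = profile.setdefault("serviceConfigsProfiles", [])
        service = _find_or_add(services, "serviceName", "trino")
        comp = _find_or_add(service.setdefault("configs", []), "component", component)
        file = _find_or_add(comp.setdefault("files", []), "fileName", file_name)
        file.setdefault("values", {}).update(values)
    return result


template = {
    "resources": [
        {"type": CLUSTER_TYPE, "properties": {"clusterType": "Trino"}}
    ]
}
updated = set_trino_config(
    template,
    "coordinator",
    "config.properties",
    {"query.cache.enabled": "true", "query.cache.ttl": "10m"},
)
print(json.dumps(updated, indent=2))
```

Working on a deep copy keeps the original template available for comparison before deployment.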
The Hive, Iceberg, and Delta Lake connectors share the same set of caching parameters, as described in Hive caching.
Note
Certain parameters are not configurable and are always set to their default values:
- hive.cache.data-transfer-port=8898
- hive.cache.bookkeeper-port=8899
- hive.cache.location=/etc/trino/cache
- hive.cache.disk-usage-percentage=80
The following example demonstrates where to add the properties to enable Hive caching using an ARM template.
{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {},
  "resources": [
    {
      "type": "microsoft.hdinsight/clusterpools/clusters",
      "apiVersion": "<api-version>",
      "name": "<cluster-pool-name>/<cluster-name>",
      "location": "<region, e.g. westeurope>",
      "tags": {},
      "properties": {
        "clusterType": "Trino",
        "clusterProfile": {
          "serviceConfigsProfiles": [
            {
              "serviceName": "trino",
              "configs": [
                {
                  "component": "catalogs",
                  "files": [
                    {
                      "fileName": "hive1.properties",
                      "values": {
                        "connector.name": "hive",
                        "hive.cache.enabled": "true",
                        "hive.cache.ttl": "5d"
                      }
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    }
  ]
}
Deploy the updated ARM template to apply the changes to your cluster. Learn how to deploy an ARM template.
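Before deploying, it can help to confirm that the edited template is still well-formed JSON, because a missing comma between two properties is easy to introduce by hand. The sketch below parses the template text and lists the Trino configuration files it defines. The `list_trino_files` helper is hypothetical and assumes the template structure shown in this article.

```python
import json


def list_trino_files(template_text: str) -> dict:
    """Parse a template and return {(component, fileName): values}.

    Raises json.JSONDecodeError if the text is not valid JSON.
    """
    template = json.loads(template_text)
    found = {}
    for resource in template.get("resources", []):
        profiles = (
            resource.get("properties", {})
            .get("clusterProfile", {})
            .get("serviceConfigsProfiles", [])
        )
        for service in profiles:
            for config in service.get("configs", []):
                for file in config.get("files", []):
                    key = (config.get("component"), file.get("fileName"))
                    found[key] = file.get("values", {})
    return found


# A reduced template used only for this illustration.
good = json.dumps({"resources": [{"properties": {"clusterProfile": {
    "serviceConfigsProfiles": [{"serviceName": "trino", "configs": [{
        "component": "catalogs", "files": [{
            "fileName": "hive1.properties",
            "values": {"connector.name": "hive", "hive.cache.enabled": "true"},
        }]}]}]}}}]})
print(list_trino_files(good))

# Dropping one comma makes the template unparseable.
bad = good.replace('"hive", "hive.cache.enabled"', '"hive" "hive.cache.enabled"')
try:
    list_trino_files(bad)
except json.JSONDecodeError as err:
    print("Template is not valid JSON:", err.msg)
```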