Materialized views
Materialized views expose an aggregation query over a source table, or over another materialized view.
Materialized views always return an up-to-date result of the aggregation query (always fresh). Querying a materialized view is more performant than running the aggregation directly over the source table.
Note
- Review the materialized views use cases to decide whether materialized views are suitable for you.
- Materialized views have some limitations. Review the performance considerations before working with the feature.
- Consider using update policies where appropriate - see How to choose between materialized views and update policies? for more details.
- Monitor the health of your materialized views based on the recommendations in the materialized views monitoring page.
Use the following commands to manage materialized views:
.create materialized-view
.drop materialized-view
.disable | .enable materialized-view
.show materialized-views
Why use materialized views?
By investing resources (data storage, background CPU cycles) for materialized views of commonly used aggregations, you get the following benefits:
Performance improvement: Querying a materialized view commonly performs better than querying the source table for the same aggregation function(s).
Freshness: A materialized view query always returns the most up-to-date results, independent of when materialization last took place. The query combines the materialized part of the view with the records in the source table, which haven't yet been materialized (the
delta
part), always providing the most up-to-date results.Cost reduction: Querying a materialized view consumes less resources from the cluster than doing the aggregation over the source table. Retention policy of source table can be reduced if only aggregation is required. This setup reduces hot cache costs for the source table.
Materialized views use cases
The following are common scenarios that can be addressed by using a materialized view:
Update data by returning the last record per entity using
arg_max()
(aggregation function).Reduce the resolution of data by calculating periodic statistics over the raw data. Use various aggregation functions by period of time.
- For example, use
T | summarize dcount(User) by bin(Timestamp, 1d)
to maintain an up-to-date snapshot of distinct users per day.
- For example, use
Deduplicate records in a table using
take_any()
(aggregation function).- In deduplication scenarios, it might sometimes be useful to "hide" the source table with the materialized view, such that callers querying the table will query the deduplicated materialized view instead.
- You can achieve this by creating a function with same name as the source table, that will reference the view instead of the source table. Since functions override tables with same name, users calling the "table" will actually query the materialized view.
- When doing so, the materialized view definition must reference the source table using the table() function, to avoid cyclic references in the view definition:
.create materialized-view MV on table T { table('T') | summarize take_any(*) by EventId }
For examples of all use cases, see materialized view create command.
How to choose between materialized views and update policies?
Materialized views and update policies work differently and serve different use cases. Use the following guidelines to identify which one you should use:
Materialized views are suitable for aggregations, while update policies are not. Update policies run separately for each ingestion batch, and therefore can only perform aggregations within the same ingestion batch. If you require an aggregation query, always use materialized views.
Update policies are useful for data transformations, enrichments with dimension tables (usually using lookup operator) and other data manipulations that can run in the scope of a single ingestion.
Update policies run during ingestion time. Data is not available for queries, neither in source table nor in target table(s), until all update policies have run on it. Materialized views, on the other hand, are not part of the ingestion pipeline. The materialization process runs periodically in the background, post ingestion. Records in source table are available for queries before they are materialized.
Neither update policies nor materialized views are suitable for joins. Both can include joins, but they are limited to specific use cases. Namely, only when matching data from both sides of the join is available when the update policy / materialization process runs. If the matching entities are expected to be ingested to the join left and right tables during the same time, there is a chance data will be missed when the update policy / materialization runs. See more about
dimension tables
in materialized view query argument and in fact and dimension tables.- If you do need to materialize joins, which are not suitable for update policies and materialized views, you can orchestrate your own process for doing so, using orchestration tools and ingest from query commands.
How materialized views work
A materialized view is made of two components:
- A materialized part - an Azure Data Explorer table holding aggregated records from the source table, which have already been processed. This table always holds a single record per the aggregation's group-by combination.
- A delta - the newly ingested records in the source table that haven't yet been processed.
Querying the materialized view combines the materialized part with the delta part, providing an up-to-date result of the aggregation query. The offline materialization process ingests new records from the delta to the materialized table, and replaces existing records. The replacement is done by rebuilding extents that hold records to replace. If records in the delta constantly intersect with all data shards in the materialized part, each materialization cycle will require rebuilding the entire materialized part, and may not keep up with the ingestion rate. In that case, the view will become unhealthy and the delta will constantly grow. The materialized views monitoring page explains how to troubleshoot such situations.
Materialized views queries
There are 2 ways to query a materialized view:
Query the entire view: when you query the materialized view by its name, similarly to querying a table, the materialized view query combines the materialized part of the view with the records in the source table that haven't been materialized yet (the
delta
).- Querying the materialized view will always return the most up-to-date results, based on all records ingested to the source table. For more information about the materialized vs. non-materialized parts in materialized view, see how materialized views work.
- This option might not perform best as it needs to materialize the
delta
part during query time. Performance in this case depends on the view's age and the filters applied in the query. The materialized view query optimizer section includes possible ways to improve query performance when querying the entire view.
Query the materialized part only: another way of querying the view is by using the
materialized_view()
function. This option supports querying only the materialized part of the view, while specifying the max latency the user is willing to tolerate.- This option isn't guaranteed to return the most up-to-date records, but it should always be more performant than querying the entire view.
- This function is useful for scenarios in which you're willing to sacrifice some freshness for performance, for example for telemetry dashboards.
Tip
Queries over the materialized part only will always perform better than querying the entire view. Always use the materialized_view()
function when applicable for your use case.
Materialized views participate in cross-cluster or cross-database queries, but aren't included in wildcard unions or searches.
- The following examples will all include materialized views by the name
ViewName
:
cluster('cluster1').database('db').ViewName cluster('cluster1').database('*').ViewName database('*').ViewName database('DB*').ViewName database('*').materialized_view('ViewName') database('DB*').materialized_view('ViewName')
- The following examples will not include records from materialized views:
cluster('cluster1').database('db').* database('*').View* search in (*) search *
- The following examples will all include materialized views by the name
Materialized view query optimizer
When querying the entire view, the materialized part is combined with the delta
during query time. This includes aggregating the delta
and joining it with the materialized part.
- Querying the entire view will perform better if the query includes filters on the group by keys of the materialized view query. See more tips about how to create your materialized view, based on your query pattern, in the
.create materialized-view
performance tips section. - Azure Data Explorer's query optimizer chooses summarize/join strategies that are expected to improve query performance. For example, the decision on whether to shuffle the query is based on number of records in
delta
part. The following client request properties provide some control over the optimizations applied. You can test these properties with your materialized view queries and evaluate their impact on queries performance.
Client request property name | Type | Description |
---|---|---|
materialized_view_query_optimization_costbased_enabled |
bool | If set to false , disables summarize/join optimizations in materialized view queries. Will use default strategies. Default is true . |
materialized_view_shuffle |
dynamic | Force shuffling of the materialized view query, and (optionally) provide specific keys to shuffle by. See examples below. |
Examples
Query the entire view. The most recent records in source table are included:
ViewName
Query the materialized part of the view only, regardless of when it was last materialized.
materialized_view("ViewName")
Query the entire view, and provide a "hint" to use
shuffle
strategy. The most recent records in source table are included:- Example #1: shuffle based on the
Id
column (similarly to usinghint.shufflekey=Id
):
set materialized_view_shuffle = dynamic([{"Name" : "ViewName", "Keys" : [ "Id" ] }]); ViewName
- Example #2: shuffle based on all keys (similarly to using
hint.strategy=shuffle
):
set materialized_view_shuffle = dynamic([{"Name" : "ViewName" }]); ViewName
- Example #1: shuffle based on the
Performance considerations
The main contributors that can impact a materialized view health are:
Cluster resources: Like any other process running on the cluster, materialized views consume resources (CPU, memory) from the cluster. If the cluster is overloaded, adding materialized views to it may cause a degradation in the cluster's performance. Monitor your cluster's health using cluster health metrics. Optimized autoscale currently doesn't take materialized views health under consideration as part of autoscale rules.
- The materialization process is limited by the amount of memory and CPU it can consume. These limits are defined, and can be changed, in the materialized views workload group.
Overlap with materialized data: During materialization, all new records ingested to the source table since the last materialization (the delta) are processed and materialized into the view. The higher the intersection between new records and already materialized records is, the worse the performance of the materialized view will be. A materialized view works best if the number of records being updated (for example, in
arg_max
view) is a small subset of the source table. If all or most of the materialized view records need to be updated in every materialization cycle, then the materialized view might not perform well.Engine V3: Materialized views perform better on Engine V3 clusters, especially if the number of records to update in each iteration is high. If your cluster is not an Engine V3, you can create a support ticket to request migration.
Ingestion rate: There are no hard-coded limits on the data volume or ingestion rate in the source table of the materialized view. However, the recommended ingestion rate for materialized views is no more than 1-2GB/sec. Higher ingestion rates may still perform well. Performance depends on cluster size, available resources, and amount of intersection with existing data.
Number of materialized views in cluster: The above considerations apply to each individual materialized view defined in the cluster. Each view consumes its own resources, and many views will compete with each other on available resources. While there are no hard-coded limits to the number of materialized views in a cluster, the cluster may not be able to handle all materialized views, when there are many defined. The capacity policy can be adjusted if there is more than a single materialized view in the cluster. Increase the value of
ClusterMinimumConcurrentOperations
in the policy to run more materialized views concurrently.Materialized view definition: The materialized view definition must be defined according to query best practices for best query performance. For more information, see create command performance tips.
Materialized view over materialized view
A materialized view can be created over another materialized view if the source materialized view is a deduplication view. Specifically, the aggregation of the source materialized view must be take_any(*)
in order to deduplicate source records. The second materialized view can use any supported aggregation functions. For specific information on how to create a materialized view over a materialized view, see .create materialized-view
command.
Tip
When querying a materialized view that is defined over another materialized view, we recommend querying the materialized part only using the materialized_view()
function. Querying the entire view will not be performant when both views aren't fully materialized. For more information, see materialized views queries.