Monitor Azure Data Explorer performance, health, and usage with metrics

Article
10/09/2024

Azure Data Explorer metrics provide key indicators of the health and performance of your Azure Data Explorer cluster resources. Use the metrics detailed in this article to monitor cluster usage, health, and performance in your specific scenario as standalone metrics. You can also use metrics as the basis for operational Azure Dashboards and Azure Alerts.

For more information about Azure Metrics Explorer, see Metrics Explorer.

Prerequisites

An Azure subscription. Create a free Azure account.
An Azure Data Explorer cluster and database. Create a cluster and database.

Use metrics to monitor your Azure Data Explorer resources

Sign in to the Azure portal.
In the left-hand pane of your Azure Data Explorer cluster, search for metrics.
Select Metrics to open the metrics pane and begin analysis on your cluster.

Work in the metrics pane

In the metrics pane, select specific metrics to track, choose how to aggregate your data, and create metric charts to view on your dashboard.

The Resource and Metric Namespace pickers are pre-selected for your Azure Data Explorer cluster. The numbers in the following image correspond to the numbered list below. They guide you through different options in setting up and viewing your metrics.

Metrics pane.

1. To create a metric chart, select a Metric name and the relevant Aggregation per metric. For more information about different metrics, see supported Azure Data Explorer metrics.
2. Select Add metric to see multiple metrics plotted in the same chart.
3. Select + New chart to see multiple charts in one view.
4. Use the time picker to change the time range (default: past 24 hours).
5. Use Add filter and Apply splitting for metrics that have dimensions.
6. Select Pin to dashboard to add your chart configuration to the dashboards so that you can view it again.
7. Select New alert rule to create an alert rule from the criteria set in your chart. The new alert rule includes your target resource, metric, splitting, and filter dimensions from your chart. Modify these settings in the alert rule creation pane.
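The same choices made in the metrics pane (metric name, aggregation, time range, and an optional dimension filter) can also be expressed as a query against the Azure Monitor metrics REST endpoint. The following is a minimal sketch that only builds the request URL. It makes several assumptions that aren't stated in this article: the API version, the REST metric names `CPU` and `KeepAlive`, and the placeholder resource ID. Verify each of them against the Azure Monitor REST reference before use.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import quote, urlencode

# Assumed endpoint and API version; check the Azure Monitor REST reference.
ARM_ENDPOINT = "https://management.azure.com"
API_VERSION = "2018-01-01"


def build_metrics_url(resource_id, metric_names, aggregations,
                      end=None, hours=24, dimension_filter=None):
    """Build a metrics query URL that mirrors the metrics pane pickers."""
    end = end or datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)  # default mirrors "past 24 hours"
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    params = {
        "api-version": API_VERSION,
        "metricnames": ",".join(metric_names),   # Metric name
        "aggregation": ",".join(aggregations),   # Aggregation
        "timespan": f"{start.strftime(fmt)}/{end.strftime(fmt)}",  # time picker
    }
    if dimension_filter:
        params["$filter"] = dimension_filter     # Add filter
    query = urlencode(params, quote_via=quote, safe=",$'")
    return f"{ARM_ENDPOINT}{resource_id}/providers/Microsoft.Insights/metrics?{query}"


# Hypothetical cluster resource ID, used only for illustration.
cluster_id = ("/subscriptions/0000/resourceGroups/rg"
              "/providers/Microsoft.Kusto/clusters/demo")
print(build_metrics_url(cluster_id, ["CPU", "KeepAlive"], ["Average", "Maximum"]))
```

Sending the request also requires an authenticated bearer token, which this sketch leaves out.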

Supported Azure Data Explorer metrics

The Azure Data Explorer metrics give insight into both overall performance and use of your resources, as well as information about specific actions, such as ingestion or query. The metrics in this article have been grouped by usage type.

The types of metrics are:

Cluster metrics
Export metrics
Ingestion metrics
Streaming ingest metrics
Query metrics
Materialized view metrics
Partitioning metrics

For an alphabetical list of Azure Monitor's metrics for Azure Data Explorer, see supported Azure Data Explorer cluster metrics.

Cluster metrics

The cluster metrics track the general health of the cluster, such as resource use, ingestion use, and responsiveness.

Metric	Unit	Aggregation	Metric description	Dimensions
Cache utilization (deprecated)	Percent	Avg, Max, Min	The percentage of allocated cache resources currently in use by the cluster. Cache is the size of SSD allocated for user activity according to the defined cache policy. An average cache utilization of 80% or less is a sustainable state for a cluster. If the average cache utilization is above 80%, the cluster should be scaled up to a storage-optimized pricing tier or scaled out to more instances. Alternatively, adapt the cache policy to fewer days in cache. If cache utilization is over 100%, the size of data to be cached is larger than the total size of cache on the cluster. This metric is deprecated and presented for backward compatibility only. Use the ‘Cache utilization factor’ metric instead.	None
Cache utilization factor	Percent	Avg, Max, Min	The percentage of utilized disk space dedicated for hot cache in the cluster. 100% means that the disk space assigned to hot data is optimally utilized. No action is needed, and the cluster is completely fine. Less than 100% means that the disk space assigned for hot data is not fully utilized. More than 100% means that the cluster's disk space is not large enough to accommodate the hot data, as defined by your caching policies. To ensure that sufficient space is available for all the hot data, the amount of hot data needs to be reduced or the cluster needs to be scaled out. We recommend enabling auto scale.	None
CPU	Percent	Avg, Max, Min	The percentage of allocated compute resources currently in use by machines in the cluster. An average CPU of 80% or less is sustainable for a cluster. The maximum value of CPU is 100%, which means there are no additional compute resources to process data. When a cluster isn't performing well, check the maximum value of the CPU to determine if there are specific CPUs that are blocked.	None
Ingestion utilization	Percent	Avg, Max, Min	The percentage of actual resources used to ingest data from the total resources allocated, in the capacity policy, to perform ingestion. The default capacity policy is no more than 512 concurrent ingestion operations or 75% of the cluster resources invested in ingestion. Average ingestion utilization of 80% or less is a sustainable state for a cluster. Maximum value of ingestion utilization is 100%, which means all cluster ingestion ability is used and an ingestion queue may result.	None
InstanceCount	Count	Avg	The total instance count.
Keep alive	Count	Avg	Tracks the responsiveness of the cluster. A fully responsive cluster returns value 1 and a blocked or disconnected cluster returns 0.
Total number of throttled commands	Count	Avg, Max, Min, Sum	The number of throttled (rejected) commands in the cluster, since the maximum allowed number of concurrent (parallel) commands was reached.	None
Total number of extents	Count	Avg, Max, Min, Sum	The total number of data extents in the cluster. Changes in this metric can imply massive data structure changes and high load on the cluster, since merging data extents is a CPU-heavy activity.	None
Follower latency	Milliseconds	Avg, Max, Min	The follower databases synchronize changes in the leader databases. Because of the synchronization, there’s a data lag of a few seconds to a few minutes in data availability. This metric measures the length of the time lag. The time lag depends on several factors, such as the overall size and rate of the data ingested to the leader, the number of databases followed, and the rate of internal operations performed on the leader (merge/rebuild operations). This is a cluster-level metric: the followers fetch metadata of all databases that are followed, and the metric represents the latency of that process.	None
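The thresholds described in the table can be combined into a simple review of averaged metric values. The following is a minimal sketch, assuming you have already retrieved the averages for a time window. The `ClusterSnapshot` type and `review_cluster` function are hypothetical names used for illustration and aren't part of any Azure SDK.

```python
from dataclasses import dataclass

SUSTAINABLE_PERCENT = 80.0   # sustainable average for CPU and ingestion utilization
FULL_CACHE_PERCENT = 100.0   # cache utilization factor at which hot cache is fully used


@dataclass
class ClusterSnapshot:
    """Averaged metric values over one time window (hypothetical container)."""
    cpu_avg: float
    cache_utilization_factor_avg: float
    ingestion_utilization_avg: float
    keep_alive_avg: float


def review_cluster(s):
    """Return a list of findings based on the thresholds in the table."""
    findings = []
    if s.keep_alive_avg < 1:
        findings.append("Keep alive below 1: cluster was blocked or disconnected")
    if s.cpu_avg > SUSTAINABLE_PERCENT:
        findings.append("CPU above 80%: check the maximum CPU value")
    if s.cache_utilization_factor_avg > FULL_CACHE_PERCENT:
        findings.append("Cache utilization factor above 100%: reduce hot data or scale out")
    if s.ingestion_utilization_avg > SUSTAINABLE_PERCENT:
        findings.append("Ingestion utilization above 80%: an ingestion queue may build up")
    return findings


for finding in review_cluster(ClusterSnapshot(85.0, 120.0, 40.0, 1.0)):
    print(finding)
```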

Export metrics

Export metrics track the general health and performance of export operations like lateness, results, number of records, and utilization.

Metric	Unit	Aggregation	Metric description	Dimensions
Continuous export number of exported records	Count	Sum	The number of exported records in all continuous export jobs.	ContinuousExportName
Continuous export max lateness	Count	Max	The lateness (in minutes) reported by the continuous export jobs in the cluster.	None
Continuous export pending count	Count	Max	The number of pending continuous export jobs. These jobs are ready to run but are waiting in a queue (possibly due to insufficient capacity).
Continuous export result	Count	Count	The Failure/Success result of each continuous export run.	ContinuousExportName
Export utilization	Percent	Max	The export capacity used, out of the total export capacity in the cluster (between 0 and 100).	None

Ingestion metrics

Ingestion metrics track the general health and performance of ingestion operations like latency, results, and volume. To refine your analysis:

Apply filters to charts to plot partial data by dimensions. For example, explore ingestion to a specific Database.
Apply splitting to a chart to visualize data by different components. This process is useful for analyzing metrics that are reported by each step of the ingestion pipeline, for example Blobs received.
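To make the difference between filtering and splitting concrete, the following sketch applies both to a handful of made-up data points. The sample values, database names, and helper functions are assumptions for illustration only. In practice, the metrics pane does this for you.

```python
from collections import defaultdict

# Made-up "Blobs received" data points, each tagged with its dimensions.
points = [
    {"database": "Telemetry", "component_type": "DataConnection", "value": 10},
    {"database": "Telemetry", "component_type": "BatchingManager", "value": 8},
    {"database": "Telemetry", "component_type": "StorageEngine", "value": 8},
    {"database": "Sales", "component_type": "DataConnection", "value": 3},
]


def apply_filter(rows, dimension, wanted):
    """Keep only the data points whose dimension equals the wanted value."""
    return [r for r in rows if r[dimension] == wanted]


def apply_splitting(rows, dimension):
    """Sum values separately for each distinct value of the dimension."""
    totals = defaultdict(int)
    for r in rows:
        totals[r[dimension]] += r["value"]
    return dict(totals)


telemetry_only = apply_filter(points, "database", "Telemetry")
print(apply_splitting(telemetry_only, "component_type"))
```

Filtering narrows the chart to one database, and splitting then produces one series per ingestion component.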

Metric	Unit	Aggregation	Metric description	Dimensions
Batch blob count	Count	Avg, Max, Min	The number of data sources in a completed batch for ingestion.	Database
Batch duration	Seconds	Avg, Max, Min	The duration of the batching phase in the ingestion flow.	Database
Batch size	Bytes	Avg, Max, Min	The uncompressed expected data size in an aggregated batch for ingestion.	Database
Batches processed	Count	Sum, Max, Min	The number of batches completed for ingestion. `Batching Type`: The trigger for sealing a batch. For a complete list of batching types, see Batching types.	Database, Batching Type
Blobs received	Count	Sum, Max, Min	The number of blobs received from input stream by a component. Use apply splitting to analyze each component.	Database, Component Type, Component Name
Blobs processed	Count	Sum, Max, Min	The number of blobs processed by a component. Use apply splitting to analyze each component.	Database, Component Type, Component Name
Blobs dropped	Count	Sum, Max, Min	The number of blobs permanently dropped by a component. For each such blob, an `Ingestion result` metric with a failure reason is sent. Use apply splitting to analyze each component.	Database, Component Type, Component Name
Discovery latency	Seconds	Avg	Time from data enqueue until discovery by data connections. This time isn't included in the Stage latency or in the Ingestion latency metrics. Discovery latency might increase in the following situations: When cross-region data connections are used. In Event Hubs data connections, if the number of Event Hubs partitions isn't enough for the data egress volume or if the events are unevenly distributed across partitions.	Component Type, Component Name
Events received	Count	Sum, Max, Min	The number of events received by data connections from input stream.	Component Type, Component Name
Events processed	Count	Sum, Max, Min	The number of events processed by data connections.	Component Type, Component Name
Events dropped	Count	Sum, Max, Min	The number of events permanently dropped by data connections. For each such event, an `Ingestion result` metric with a failure reason is sent.	Component Type, Component Name
Ingestion latency	Seconds	Avg, Max, Min	The latency of data ingested, from the time the data was received in the cluster until it's ready for query. The ingestion latency period depends on the ingestion scenario. `Ingestion Kind`: Streaming Ingestion or Queued Ingestion	Ingestion Kind
Ingestion result	Count	Sum	The total number of sources that either failed or succeeded to be ingested. `Status`: Success for successful ingestion or the failure category for failures. For a complete list of possible failure categories see Ingestion error codes in Azure Data Explorer. `Failure Status Type`: Whether the failure is permanent or transient. For successful ingestion, this dimension is `None`. Note: Event Hubs and IoT Hub ingestion events are pre-aggregated into one blob, and then treated as a single source to be ingested. Therefore, pre-aggregated events appear as a single ingestion result after pre-aggregation. Transient failures are retried internally a limited number of times. Each transient failure is reported as a transient ingestion result. Therefore, a single ingestion may result in more than one ingestion result.	Status, Failure Status Type
Ingestion volume (in bytes)	Count	Max, Sum	The total size of data ingested to the cluster (in bytes) before compression.	Database
Queue length	Count	Avg	The number of pending messages in a component's input queue. The batching manager component has one message per blob. The ingestion manager component has one message per batch. A batch is a single ingest command with one or more blobs.	Component Type
Queue oldest message	Seconds	Avg	The time in seconds from when the oldest message in a component's input queue has been inserted.	Component Type
Received data size bytes	Bytes	Avg, Sum	The size of data received by data connections from input stream.	Component Type, Component Name
Stage latency	Seconds	Avg	The time from when a message is accepted by Azure Data Explorer, until its content is received by an ingestion component for processing. Use apply filters and select Component Type > StorageEngine to show the total ingestion latency.	Database, Component Type
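Because each transient failure is reported as its own ingestion result, the `Ingestion result` count can exceed the number of sources ingested. The following sketch tallies made-up results by the `Status` and `Failure Status Type` dimensions. The `source` key and the `ExampleFailure` status are placeholders for illustration: the metric itself doesn't expose a per-source dimension, and `ExampleFailure` isn't a real failure category.

```python
from collections import Counter

# Made-up results: blob-1 fails transiently twice and then succeeds on retry.
results = [
    {"source": "blob-1", "status": "ExampleFailure", "failure_status_type": "Transient"},
    {"source": "blob-1", "status": "ExampleFailure", "failure_status_type": "Transient"},
    {"source": "blob-1", "status": "Success", "failure_status_type": "None"},
    {"source": "blob-2", "status": "Success", "failure_status_type": "None"},
]


def tally_by_dimensions(rows):
    """Count results per (Status, Failure Status Type) pair."""
    return Counter((r["status"], r["failure_status_type"]) for r in rows)


def distinct_sources(rows):
    """Count how many different sources produced the results."""
    return len({r["source"] for r in rows})


tally = tally_by_dimensions(results)
for dims, count in tally.items():
    print(dims, count)
print(sum(tally.values()), "results for", distinct_sources(results), "sources")
```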

Streaming ingest metrics

Streaming ingest metrics track streaming ingestion data and request rate, duration, and results.

Metric	Unit	Aggregation	Metric description	Dimensions
Streaming Ingest Data Rate	Count	RateRequestsPerSecond	The total volume of data ingested to the cluster.	None
Streaming Ingest Duration	Milliseconds	Avg, Max, Min	The total duration of all streaming ingestion requests.	None
Streaming Ingest Request Rate	Count	Count, Avg, Max, Min, Sum	The total number of streaming ingestion requests.	None
Streaming Ingest Result	Count	Avg	The total number of streaming ingestion requests by result type.	Result

Query metrics

Query performance metrics track query duration and total number of concurrent or throttled queries.

Metric	Unit	Aggregation	Metric description	Dimensions
Query duration	Milliseconds	Avg, Min, Max, Sum	The total time until query results are received (doesn't include network latency).	QueryStatus
QueryResult	Count	Count	The total number of queries.	QueryStatus
Total number of concurrent queries	Count	Avg, Max, Min, Sum	The number of queries run in parallel in the cluster. This metric is a good way to estimate the load on the cluster.	None
Total number of throttled queries	Count	Avg, Max, Min, Sum	The number of throttled (rejected) queries in the cluster. The maximum number of concurrent (parallel) queries allowed is defined in the request rate limit policy.	None

Materialized view metrics

Metric	Unit	Aggregation	Metric description	Dimensions
MaterializedViewHealth	1, 0	Avg	The value is 1 if the view is considered healthy, otherwise 0.	Database, MaterializedViewName
MaterializedViewAgeSeconds	Seconds	Avg	The `age` of the view is defined as the current time minus the last ingestion time processed by the view. The metric value is the time in seconds (the lower the value, the "healthier" the view).	Database, MaterializedViewName
MaterializedViewResult	1	Avg	The metric includes a `Result` dimension indicating the result of the last materialization cycle (see the MaterializedViewResult metric for details about possible values). Metric value always equals 1.	Database, MaterializedViewName, Result
MaterializedViewRecordsInDelta	Records count	Avg	The number of records currently in the non-processed part of the source table. For more information, see how materialized views work.	Database, MaterializedViewName
MaterializedViewExtentsRebuild	Extents count	Avg	The number of extents that required updates in the materialization cycle.	Database, MaterializedViewName
MaterializedViewDataLoss	1	Max	The metric is fired when unprocessed source data is approaching retention. Indicates that the materialized view is unhealthy.	Database, MaterializedViewName, Kind
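The age definition and the health indicators in the table can be sketched as follows. The function names and the 15-minute age limit are assumptions chosen for illustration. Pick a limit that matches your own freshness requirements.

```python
from datetime import datetime, timedelta, timezone


def view_age_seconds(now, last_processed_ingestion_time):
    """Age = current time minus the last ingestion time processed by the view."""
    return (now - last_processed_ingestion_time).total_seconds()


def view_needs_attention(health, age_seconds, data_loss, max_age_seconds=900):
    """Flag a view that is unhealthy, stale, or approaching data loss.

    max_age_seconds is a hypothetical limit, not a documented default.
    """
    return health < 1 or data_loss >= 1 or age_seconds > max_age_seconds


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
age = view_age_seconds(now, now - timedelta(minutes=5))
print(age, view_needs_attention(health=1, age_seconds=age, data_loss=0))
```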

Partitioning metrics

Partitioning metrics monitor the partitioning process for tables with a partitioning policy.

Metric	Unit	Aggregation	Metric description	Dimensions
PartitioningPercentage	Percent	Avg, Min, Max	The percentage of records partitioned relative to the total number of records.	Database, Table
PartitioningPercentageHot	Percent	Avg, Min, Max	The percentage of records partitioned relative to the total number of records (in 'hot' cache only).	Database, Table
ProcessedPartitionedRecords	Count	Avg, Min, Max, Sum	The number of records partitioned in the measured time window.	Database, Table
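As a worked example of the percentage definition above, the following sketch computes the share of partitioned records. The function name is hypothetical, and returning 0.0 for an empty table is an assumption, not documented behavior.

```python
def partitioning_percentage(partitioned_records, total_records):
    """Percentage of records partitioned relative to the total number of records."""
    if total_records == 0:
        return 0.0  # assumption: treat an empty table as 0% partitioned
    return 100.0 * partitioned_records / total_records


print(partitioning_percentage(750, 1000))
```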
