Metric alert rules in Container insights (preview)
Metric alerts in Azure Monitor proactively identify issues related to system resources of your Azure resources, including monitored Kubernetes clusters. Container insights provides preconfigured alert rules so that you don't have to create your own. This article describes the different types of alert rules you can create and how to enable and configure them.
Important
Container insights in Azure Monitor now supports alerts based on Prometheus metrics, and metric rules will be retired on March 14, 2026. If you already use alerts based on custom metrics, you should migrate to Prometheus alerts and disable the equivalent custom metric alerts.
Types of metric alert rules
Container insights uses two types of metric rules, based on either Prometheus metrics or custom metrics. For a list of the specific alert rules for each type, see Alert rule details.
Alert rule type | Description |
---|---|
Prometheus rules | Alert rules that use metrics stored in Azure Monitor managed service for Prometheus (preview). There are two sets of Prometheus alert rules that you can choose to enable. - Community alerts are handpicked alert rules from the Prometheus community. Use this set of alert rules if you don't have any other alert rules enabled. - Recommended alerts are the equivalent of the custom metric alert rules. Use this set if you're migrating from custom metrics to Prometheus metrics and want to retain identical functionality. |
Metric rules | Alert rules that use custom metrics collected for your Kubernetes cluster. Use these alert rules if you're not ready to move to Prometheus metrics yet or if you want to manage your alert rules in the Azure portal. Metric rules will be retired on March 14, 2026. |
Prometheus alert rules
Prometheus alert rules use metric data from your Kubernetes cluster sent to Azure Monitor managed service for Prometheus.
Prerequisites
Your cluster must be configured to send metrics to Azure Monitor managed service for Prometheus. For more information, see Collect Prometheus metrics with Container insights.
Enable Prometheus alert rules
The only method currently available for creating Prometheus alert rules is an Azure Resource Manager template (ARM template).
1. Download the template that includes the set of alert rules you want to enable. For a list of the rules in each set, see Alert rule details.
2. Deploy the template by using any standard method for installing ARM templates. For guidance, see ARM template samples for Azure Monitor. A sample Azure CLI deployment follows the note below.
Note
Although you can create the Prometheus alert rule in a resource group different from the target resource, you should use the same resource group as your target resource.
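For example, if you deploy the template with the Azure CLI, a minimal sketch looks like the following. The resource group and file name are placeholders, and any parameters you need to pass depend on the template you downloaded.

```bash
# Deploy the downloaded alert rule template into the resource group that contains your cluster.
# The resource group and file name below are placeholders; add --parameters if the template
# you downloaded defines any required parameters.
az deployment group create \
  --resource-group myResourceGroup \
  --template-file ./container-insights-alert-rules.json
```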
Edit Prometheus alert rules
To edit the query and threshold or configure an action group for your alert rules, edit the appropriate values in the ARM template and redeploy it by using any deployment method.
Configure alertable metrics in ConfigMaps
Perform the following steps to configure your ConfigMap configuration file to override the default utilization thresholds. These steps only apply to the following alertable metrics:
- cpuExceededPercentage
- cpuThresholdViolated
- memoryRssExceededPercentage
- memoryRssThresholdViolated
- memoryWorkingSetExceededPercentage
- memoryWorkingSetThresholdViolated
- pvUsageExceededPercentage
- pvUsageThresholdViolated
Tip
Download the new ConfigMap from this GitHub content.
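Alternatively, if a `container-azm-ms-agentconfig` ConfigMap has already been applied to your cluster, you can start from the currently deployed copy instead of downloading a fresh one. This is only a convenience sketch and assumes the ConfigMap lives in the `kube-system` namespace, where the Container insights agent normally deploys it.

```bash
# Export the ConfigMap currently applied to the cluster so you can edit it locally.
# Assumes the Container insights ConfigMap exists in the kube-system namespace.
kubectl get configmap container-azm-ms-agentconfig -n kube-system -o yaml > container-azm-ms-agentconfig.yaml
```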
1. Edit the ConfigMap YAML file under the section `[alertable_metrics_configuration_settings.container_resource_utilization_thresholds]` or `[alertable_metrics_configuration_settings.pv_utilization_thresholds]`.

   Example: Use the following ConfigMap configuration to modify the `cpuExceededPercentage` threshold to 90%:

   ```
   [alertable_metrics_configuration_settings.container_resource_utilization_thresholds]
       # Threshold for container cpu, metric will be sent only when cpu utilization exceeds or becomes equal to the following percentage
       container_cpu_threshold_percentage = 90.0
       # Threshold for container memoryRss, metric will be sent only when memory rss exceeds or becomes equal to the following percentage
       container_memory_rss_threshold_percentage = 95.0
       # Threshold for container memoryWorkingSet, metric will be sent only when memory working set exceeds or becomes equal to the following percentage
       container_memory_working_set_threshold_percentage = 95.0
   ```

   Example: Use the following ConfigMap configuration to modify the `pvUsageExceededPercentage` threshold to 80%:

   ```
   [alertable_metrics_configuration_settings.pv_utilization_thresholds]
       # Threshold for persistent volume usage bytes, metric will be sent only when persistent volume utilization exceeds or becomes equal to the following percentage
       pv_usage_threshold_percentage = 80.0
   ```
2. Run the following kubectl command: `kubectl apply -f <configmap_yaml_file.yaml>`.

   Example: `kubectl apply -f container-azm-ms-agentconfig.yaml`.
The configuration change can take a few minutes to take effect. All omsagent pods in the cluster then restart. The restart is a rolling restart for all omsagent pods, so they don't all restart at the same time. When the restarts are finished, a message similar to the following example includes the result: `configmap "container-azm-ms-agentconfig" created`.
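To confirm that the change was picked up, a quick check like the following sketch can help. The pod name prefix shown is an assumption and varies by agent version (for example, newer agents use `ama-logs` instead of `omsagent`).

```bash
# Confirm the ConfigMap was created and watch the agent pods complete their rolling restart.
# Pod name prefixes vary by agent version (for example, omsagent or ama-logs).
kubectl get configmap container-azm-ms-agentconfig -n kube-system
kubectl get pods -n kube-system | grep -E 'omsagent|ama-logs'
```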
Metric alert rules
Important
Metric alerts (preview) are being retired and are no longer recommended. For migration guidance, see Migrate from Container insights recommended alerts to Prometheus recommended alert rules (preview).
Prerequisites
- You might need to enable collection of custom metrics for your cluster. See Metrics collected by Container insights.
- See the supported regions for custom metrics at Supported regions.
Enable and configure metric alert rules
Enable metric alert rules
1. On the Insights menu for your cluster, select Recommended alerts.
2. Toggle the Status for each alert rule that you want to enable. The alert rule is created, and the rule name updates to include a link to the new alert resource.
3. Alert rules aren't associated with an action group to notify users that an alert has been triggered. Select No action group assigned to open the Action Groups page. Specify an existing action group or create an action group by selecting Create action group. (A CLI alternative is sketched below.)
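If you prefer to script action group creation instead of using the portal, the Azure CLI can create one. The following is a minimal sketch; the resource group, names, and email address are placeholders.

```bash
# Create an action group with a single email receiver (all names and the address are placeholders).
az monitor action-group create \
  --resource-group myResourceGroup \
  --name ci-recommended-alerts-ag \
  --short-name ciAlerts \
  --action email oncall alerts@contoso.com
```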
Edit metric alert rules
To edit the threshold for a rule or configure an action group for your Azure Kubernetes Service (AKS) cluster:
- From Container insights for your cluster, select Recommended alerts.
- Select the Rule Name to open the alert rule.
- See Create an alert rule for information on the alert rule settings.
Disable metric alert rules
- From Container insights for your cluster, select Recommended alerts.
- Change the status for the alert rule to Disabled.
Migrate from metric rules to Prometheus rules (preview)
If you're using metric alert rules to monitor your Kubernetes cluster, you should transition to Prometheus recommended alert rules (preview) before March 14, 2026, when metric alerts are retired.
- Follow the steps at Enable Prometheus alert rules to configure Prometheus recommended alert rules (preview).
- Follow the steps at Disable metric alert rules to remove metric alert rules from your clusters.
Alert rule details
The following sections present information on the alert rules provided by Container insights.
Community alert rules
These handpicked alerts come from the Prometheus community. Source code for these mixin alerts can be found in GitHub:
Alert name | Description | Default threshold |
---|---|---|
NodeFilesystemSpaceFillingUp | An extrapolation algorithm predicts that disk space usage for a node on a device in a cluster will run out of space within the upcoming 24 hours. | NA |
NodeFilesystemSpaceUsageFull85Pct | Disk space usage for a node on a device in a cluster is greater than 85%. | 85% |
KubePodCrashLooping | Pod is in CrashLoop, which means the app dies or is unresponsive and Kubernetes tries to restart it automatically. | NA |
KubePodNotReady | Pod has been in a non-ready state for more than 15 minutes. | NA |
KubeDeploymentReplicasMismatch | Deployment has not matched the expected number of replicas. | NA |
KubeStatefulSetReplicasMismatch | StatefulSet has not matched the expected number of replicas. | NA |
KubeJobNotCompleted | Job is taking more than 1h to complete. | NA |
KubeJobFailed | Job failed to complete. | NA |
KubeHpaReplicasMismatch | Horizontal Pod Autoscaler has not matched the desired number of replicas for longer than 15 minutes. | NA |
KubeHpaMaxedOut | Horizontal Pod Autoscaler has been running at max replicas for longer than 15 minutes. | NA |
KubeCPUQuotaOvercommit | Cluster has overcommitted CPU resource requests for Namespaces and cannot tolerate node failure. | 1.5 |
KubeMemoryQuotaOvercommit | Cluster has overcommitted memory resource requests for Namespaces. | 1.5 |
KubeQuotaAlmostFull | Cluster is reaching the allowed limits for the given namespace. | Between 0.9 and 1 |
KubeVersionMismatch | Different semantic versions of Kubernetes components running. | NA |
KubeNodeNotReady | Fires when a Kubernetes node is not in Ready state for a certain period. | NA |
KubeNodeUnreachable | Kubernetes node is unreachable and some workloads may be rescheduled. | NA |
KubeletTooManyPods | The alert fires when a specific node is running >95% of its capacity of pods. | 0.95 |
KubeNodeReadinessFlapping | The readiness status of a node has changed a few times in the last 15 minutes. | 2 |
Recommended alert rules
The following table lists the recommended alert rules that you can enable for either Prometheus metrics or custom metrics. Source code for the recommended alerts can be found in GitHub:
Prometheus alert name | Custom metric alert name | Description | Default threshold |
---|---|---|---|
Average container CPU % | Average container CPU % | Calculates average CPU used per container. | 95% |
Average container working set memory % | Average container working set memory % | Calculates average working set memory used per container. | 95% |
Average CPU % | Average CPU % | Calculates average CPU used per node. | 80% |
Average Disk Usage % | Average Disk Usage % | Calculates average disk usage for a node. | 80% |
Average Persistent Volume Usage % | Average Persistent Volume Usage % | Calculates average persistent volume usage per pod. | 80% |
Average Working set memory % | Average Working set memory % | Calculates average Working set memory for a node. | 80% |
Restarting container count | Restarting container count | Calculates number of restarting containers. | 0 |
Failed Pod Counts | Failed Pod Counts | Calculates the number of pods in a failed state. | 0 |
Node NotReady status | Node NotReady status | Calculates if any node is in NotReady state. | 0 |
OOM Killed Containers | OOM Killed Containers | Calculates number of OOM killed containers. | 0 |
Pods ready % | Pods ready % | Calculates the average ready state of pods. | 80% |
Completed job count | Completed job count | Calculates number of jobs completed more than six hours ago. | 0 |
Note
The recommended alert rules in the Azure portal also include a log alert rule called Daily Data Cap Breach. This rule alerts when the total data ingestion to your Log Analytics workspace exceeds the designated quota. This alert rule isn't included with the Prometheus alert rules.
You can create this rule on your own by creating a log alert rule that uses the query `_LogOperation | where Operation == "Data collection Status" | where Detail contains "OverQuota"`.
Common properties across all these alert rules include:
- All alert rules are evaluated once per minute, and they look back at the last five minutes of data.
- All alert rules are disabled by default.
- Alert rules don't have an action group assigned to them by default. To add an action group to the alert, either select an existing action group or create a new action group while you edit the alert rule.
- You can modify the threshold for alert rules by directly editing the template and redeploying it. Refer to the guidance provided in each alert rule before you modify its threshold.
The following metrics have unique behavior characteristics:
Prometheus and custom metrics
- The `completedJobsCount` metric is only sent when there are jobs that completed more than six hours ago.
- The `containerRestartCount` metric is only sent when there are containers restarting.
- The `oomKilledContainerCount` metric is only sent when there are OOM killed containers.
- The `cpuExceededPercentage`, `memoryRssExceededPercentage`, and `memoryWorkingSetExceededPercentage` metrics are sent when the CPU, memory RSS, and memory working set values exceed the configured threshold. The default threshold is 95%. The `cpuThresholdViolated`, `memoryRssThresholdViolated`, and `memoryWorkingSetThresholdViolated` metrics are equal to 0 if the usage percentage is below the threshold and are equal to 1 if the usage percentage is above the threshold. These thresholds are exclusive of the alert condition threshold specified for the corresponding alert rule.
- The `pvUsageExceededPercentage` metric is sent when the persistent volume usage percentage exceeds the configured threshold. The default threshold is 60%. The `pvUsageThresholdViolated` metric is equal to 0 when the persistent volume usage percentage is below the threshold and is equal to 1 if the usage is above the threshold. This threshold is exclusive of the alert condition threshold specified for the corresponding alert rule.
Prometheus only
- If you want to collect `pvUsageExceededPercentage` and analyze it from metrics explorer, configure the threshold to a value lower than your alerting threshold. The configuration related to the collection settings for persistent volume utilization thresholds can be overridden in the ConfigMaps file under the section `[alertable_metrics_configuration_settings.pv_utilization_thresholds]`. For details on configuring your ConfigMap file, see Configure alertable metrics in ConfigMaps. Collection of persistent volume metrics with claims in the `kube-system` namespace is excluded by default. To enable collection in this namespace, use the section `[metric_collection_settings.collect_kube_system_pv_metrics]` in the ConfigMap file. For more information, see Metric collection settings.
- The `cpuExceededPercentage`, `memoryRssExceededPercentage`, and `memoryWorkingSetExceededPercentage` metrics are sent when the CPU, memory RSS, and memory working set values exceed the configured threshold. The default threshold is 95%. The `cpuThresholdViolated`, `memoryRssThresholdViolated`, and `memoryWorkingSetThresholdViolated` metrics are equal to 0 if the usage percentage is below the threshold and are equal to 1 if the usage percentage is above the threshold. These thresholds are exclusive of the alert condition threshold specified for the corresponding alert rule. If you want to collect these metrics and analyze them from metrics explorer, configure the threshold to a value lower than your alerting threshold. The configuration related to the collection settings for container resource utilization thresholds can be overridden in the ConfigMaps file under the section `[alertable_metrics_configuration_settings.container_resource_utilization_thresholds]`. For details on configuring your ConfigMap file, see Configure alertable metrics in ConfigMaps.
View alerts
View fired alerts for your cluster, along with other fired alerts in your subscription, from Alerts in the Monitor menu in the Azure portal. You can also select View in alerts on the Recommended alerts pane to view alerts from custom metrics.
Note
Currently, Prometheus alerts won't be displayed when you select Alerts from your AKS cluster because the alert rule doesn't use the cluster as its target.
Next steps
- Read about the different alert rule types in Azure Monitor.
- Read about alerting rule groups in Azure Monitor managed service for Prometheus.