Metric alert rules in Container insights (preview)

Metric alerts in Azure Monitor proactively identify issues related to system resources of your Azure resources, including monitored Kubernetes clusters. Container insights provides preconfigured alert rules so that you don't have to create your own. This article describes the different types of alert rules you can create and how to enable and configure them.

Important

Container insights in Azure Monitor now supports alerts based on Prometheus metrics. If you already use alerts based on custom metrics, you should migrate to Prometheus alerts and disable the equivalent custom metric alerts.

Types of metric alert rules

Container insights uses two types of metric alert rules, based on either Prometheus metrics or custom metrics. For a list of the specific alert rules of each type, see Alert rule details.

Prometheus rules: Alert rules that use metrics stored in Azure Monitor managed service for Prometheus (preview). There are two sets of Prometheus alert rules that you can choose to enable:

  • Community alerts are handpicked alert rules from the Prometheus community. Use this set of alert rules if you don't have any other alert rules enabled.
  • Recommended alerts are the equivalent of the custom metric alert rules. Use this set if you're migrating from custom metrics to Prometheus metrics and want to retain identical functionality.

Metric rules: Alert rules that use custom metrics collected for your Kubernetes cluster. Use these alert rules if you're not ready to move to Prometheus metrics yet or if you want to manage your alert rules in the Azure portal.

Prometheus alert rules

Prometheus alert rules use metric data from your Kubernetes cluster sent to Azure Monitor managed service for Prometheus.

Prerequisites

Your cluster must be configured to send metrics to Azure Monitor managed service for Prometheus. For more information, see Collect Prometheus metrics with Container insights.

Enable alert rules

The only method currently available for creating Prometheus alert rules is an Azure Resource Manager template (ARM template).

  1. Download the template that includes the set of alert rules you want to enable. For a list of the rules for each, see Alert rule details.

  2. Deploy the template by using any standard methods for installing ARM templates. For guidance, see ARM template samples for Azure Monitor.
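
As one possible illustration of step 2, the following Python sketch builds the argument list for the Azure CLI command az deployment group create. The resource group name, template file name, and the clusterName parameter are hypothetical placeholders. Check the parameters section of the template you downloaded for the names it actually expects.

```python
import shlex
from typing import Dict, List, Optional


def build_deploy_command(
    resource_group: str,
    template_file: str,
    parameters: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Build the argv for `az deployment group create`.

    Parameter names are template-specific; the ones used in the demo
    below are placeholders, not names taken from the real templates.
    """
    cmd = [
        "az", "deployment", "group", "create",
        "--resource-group", resource_group,
        "--template-file", template_file,
    ]
    if parameters:
        cmd.append("--parameters")
        cmd.extend(f"{name}={value}" for name, value in parameters.items())
    return cmd


if __name__ == "__main__":
    argv = build_deploy_command(
        "my-cluster-rg",               # hypothetical resource group
        "recommended-alerts.json",     # hypothetical local file name
        {"clusterName": "my-cluster"}  # hypothetical template parameter
    )
    # Print the command for review; to run it, pass argv to
    # subprocess.run(argv, check=True) on a machine with the Azure CLI.
    print(shlex.join(argv))
```

Building the command as a list rather than a single string avoids shell quoting problems if a parameter value contains spaces.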

Note

Although you can create the Prometheus alert rules in a resource group other than the one that contains the target resource, we recommend that you use the same resource group as the target resource.

Edit alert rules

To edit the query and threshold or configure an action group for your alert rules, edit the appropriate values in the ARM template and redeploy it by using any deployment method.
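
The following Python sketch shows one way to attach an action group to every rule in a template before redeploying it. It assumes the template defines its rules in resources of type Microsoft.AlertsManagement/prometheusRuleGroups, with each rule carrying an actions list of actionGroupId entries. Verify this against the template you downloaded. The sample template and the action group ID in the sketch are hypothetical.

```python
import copy
import json

# Assumed resource type for Prometheus rule groups in the template.
RULE_GROUP_TYPE = "Microsoft.AlertsManagement/prometheusRuleGroups"


def attach_action_group(template: dict, action_group_id: str) -> dict:
    """Return a copy of the template with the action group added to each rule."""
    updated = copy.deepcopy(template)
    for resource in updated.get("resources", []):
        if resource.get("type") != RULE_GROUP_TYPE:
            continue
        for rule in resource.get("properties", {}).get("rules", []):
            actions = rule.setdefault("actions", [])
            # Skip rules that already reference this action group.
            if not any(a.get("actionGroupId") == action_group_id for a in actions):
                actions.append({"actionGroupId": action_group_id})
    return updated


# Hypothetical, heavily trimmed template used only to exercise the function.
SAMPLE_TEMPLATE = {
    "resources": [
        {
            "type": RULE_GROUP_TYPE,
            "properties": {
                "rules": [
                    {"alert": "KubePodCrashLooping",
                     "expression": "<PromQL from the downloaded template>"},
                ]
            },
        }
    ]
}

if __name__ == "__main__":
    result = attach_action_group(SAMPLE_TEMPLATE, "<action-group-resource-id>")
    print(json.dumps(result, indent=2))
```

The function works on a copy, so the original template stays unchanged and can be compared with the edited version before you redeploy. Editing a threshold still means changing the rule's query by hand.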

Configure alertable metrics in ConfigMaps

Perform the following steps to configure your ConfigMap configuration file to override the default utilization thresholds. These steps only apply to the following alertable metrics:

  • cpuExceededPercentage
  • cpuThresholdViolated
  • memoryRssExceededPercentage
  • memoryRssThresholdViolated
  • memoryWorkingSetExceededPercentage
  • memoryWorkingSetThresholdViolated
  • pvUsageExceededPercentage
  • pvUsageThresholdViolated

Tip

Download the new ConfigMap from this GitHub content.

  1. Edit the ConfigMap YAML file under the section [alertable_metrics_configuration_settings.container_resource_utilization_thresholds] or [alertable_metrics_configuration_settings.pv_utilization_thresholds].

    • Example: Use the following ConfigMap configuration to modify the cpuExceededPercentage threshold to 90%:

      [alertable_metrics_configuration_settings.container_resource_utilization_thresholds]
          # Threshold for container cpu, metric will be sent only when cpu utilization exceeds or becomes equal to the following percentage
          container_cpu_threshold_percentage = 90.0
          # Threshold for container memoryRss, metric will be sent only when memory rss exceeds or becomes equal to the following percentage
          container_memory_rss_threshold_percentage = 95.0
          # Threshold for container memoryWorkingSet, metric will be sent only when memory working set exceeds or becomes equal to the following percentage
          container_memory_working_set_threshold_percentage = 95.0
      
    • Example: Use the following ConfigMap configuration to modify the pvUsageExceededPercentage threshold to 80%:

      [alertable_metrics_configuration_settings.pv_utilization_thresholds]
          # Threshold for persistent volume usage bytes, metric will be sent only when persistent volume utilization exceeds or becomes equal to the following percentage
          pv_usage_threshold_percentage = 80.0
      
  2. Run the following kubectl command: kubectl apply -f <configmap_yaml_file.yaml>.

    Example: kubectl apply -f container-azm-ms-agentconfig.yaml.

The configuration change can take a few minutes to take effect. All omsagent pods in the cluster then restart. The restart is rolling, so the pods don't all restart at the same time. When the restarts finish, a message similar to the following is displayed: configmap "container-azm-ms-agentconfig" created.
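
If you prefer to script the threshold edit instead of changing the file by hand, a minimal Python sketch might look like the following. It treats the settings as plain text and rewrites a single key = value line. The helper name set_threshold is an illustration only and isn't part of any Azure tool, and the sample text is a shortened copy of the section shown earlier.

```python
import re


def set_threshold(config_text: str, key: str, percentage: float) -> str:
    """Rewrite the `key = <number>` line in the settings text.

    Raises ValueError for a percentage outside 0-100 and KeyError if
    the key isn't present in the text.
    """
    if not 0.0 <= percentage <= 100.0:
        raise ValueError(f"percentage must be between 0 and 100, got {percentage}")
    pattern = re.compile(
        rf"^([ \t]*{re.escape(key)}[ \t]*=[ \t]*)[0-9.]+[ \t]*$",
        re.MULTILINE,
    )
    new_text, count = pattern.subn(
        lambda m: f"{m.group(1)}{float(percentage)}", config_text
    )
    if count == 0:
        raise KeyError(key)
    return new_text


# Shortened sample of the settings section, for illustration only.
SAMPLE = """\
[alertable_metrics_configuration_settings.container_resource_utilization_thresholds]
    container_cpu_threshold_percentage = 95.0
    container_memory_rss_threshold_percentage = 95.0
"""

if __name__ == "__main__":
    print(set_threshold(SAMPLE, "container_cpu_threshold_percentage", 90))
```

After you write the result back to the YAML file, apply it with kubectl apply as described in step 2.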

Metric alert rules

Metric alert rules use custom metric data from your Kubernetes cluster.

Prerequisites

Enable and configure alert rules

Enable alert rules

  1. On the Insights menu for your cluster, select Recommended alerts.

    Screenshot that shows recommended alerts option in Container insights.

  2. Toggle the Status for each alert rule to enable. The alert rule is created and the rule name updates to include a link to the new alert resource.

    Screenshot that shows a list of recommended alerts and options for enabling each.

  3. Alert rules aren't associated with an action group to notify users that an alert has been triggered. Select No action group assigned to open the Action Groups page. Specify an existing action group or create an action group by selecting Create action group.

    Screenshot that shows selecting an action group.

Edit alert rules

To edit the threshold for a rule or configure an action group for your Azure Kubernetes Service (AKS) cluster:

  1. From Container insights for your cluster, select Recommended alerts.
  2. Select the Rule Name to open the alert rule.
  3. See Create an alert rule for information on the alert rule settings.

Disable alert rules

  1. From Container insights for your cluster, select Recommended alerts.
  2. Change the status for the alert rule to Disabled.

Alert rule details

The following sections present information on the alert rules provided by Container insights.

Community alert rules

These handpicked alerts come from the Prometheus community. Source code for these mixin alerts can be found in GitHub. The set includes the following alert rules:

  • KubeJobNotCompleted
  • KubeJobFailed
  • KubePodCrashLooping
  • KubePodNotReady
  • KubeDeploymentReplicasMismatch
  • KubeStatefulSetReplicasMismatch
  • KubeHpaReplicasMismatch
  • KubeHpaMaxedOut
  • KubeQuotaAlmostFull
  • KubeMemoryQuotaOvercommit
  • KubeCPUQuotaOvercommit
  • KubeVersionMismatch
  • KubeNodeNotReady
  • KubeNodeReadinessFlapping
  • KubeletTooManyPods
  • KubeNodeUnreachable

Recommended alert rules

The following table lists the recommended alert rules that you can enable for either Prometheus metrics or custom metrics. Source code for the recommended alerts can be found in GitHub.

Prometheus alert name | Custom metric alert name | Description | Default threshold
Average container CPU % | Average container CPU % | Calculates average CPU used per container. | 95%
Average container working set memory % | Average container working set memory % | Calculates average working set memory used per container. | 95%
Average CPU % | Average CPU % | Calculates average CPU used per node. | 80%
Average Disk Usage % | Average Disk Usage % | Calculates average disk usage for a node. | 80%
Average Persistent Volume Usage % | Average Persistent Volume Usage % | Calculates average persistent volume usage per pod. | 80%
Average Working set memory % | Average Working set memory % | Calculates average working set memory for a node. | 80%
Restarting container count | Restarting container count | Calculates number of restarting containers. | 0
Failed Pod Counts | Failed Pod Counts | Calculates number of pods in a failed state. | 0
Node NotReady status | Node NotReady status | Calculates if any node is in NotReady state. | 0
OOM Killed Containers | OOM Killed Containers | Calculates number of OOM killed containers. | 0
Pods ready % | Pods ready % | Calculates the average ready state of pods. | 80%
Completed job count | Completed job count | Calculates number of jobs completed more than six hours ago. | 0

Note

The recommended alert rules in the Azure portal also include a log alert rule called Daily Data Cap Breach. This rule alerts when the total data ingestion to your Log Analytics workspace exceeds the designated quota. This alert rule isn't included with the Prometheus alert rules.

You can create this rule on your own by creating a log alert rule that uses the query _LogOperation | where Operation == "Data collection Status" | where Detail contains "OverQuota".

Common properties across all these alert rules include:

  • All alert rules are evaluated once per minute, and they look back at the last five minutes of data.
  • All alert rules are disabled by default.
  • Alert rules don't have an action group assigned to them by default. To add an action group to an alert rule, either select an existing action group or create a new action group while you edit the alert rule.
  • You can modify the threshold for alert rules by directly editing the template and redeploying it. Refer to the guidance provided in each alert rule before you modify its threshold.
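
To make the evaluation schedule concrete, the following Python sketch evaluates a rule once per minute over the previous five minutes of samples. It assumes one sample per minute, an average aggregation, and a greater-than comparison. The actual aggregation and operator depend on the individual rule, and the sample values are invented.

```python
from statistics import mean
from typing import List, Sequence


def evaluate_rule(
    samples: Sequence[float], threshold: float, window: int = 5
) -> List[bool]:
    """Return, for each minute, whether the rule condition is met.

    samples holds one value per minute. At minute i the rule looks back
    over the last `window` minutes (fewer at the start of the series)
    and compares their average with the threshold.
    """
    results = []
    for i in range(len(samples)):
        recent = samples[max(0, i - window + 1): i + 1]
        results.append(mean(recent) > threshold)
    return results


if __name__ == "__main__":
    node_cpu_percent = [70, 75, 85, 90, 95, 99]  # invented sample data
    print(evaluate_rule(node_cpu_percent, threshold=80.0))
    # [False, False, False, False, True, True]
```

In this example the fourth minute averages exactly 80, which doesn't exceed the threshold, so the condition is first met in the fifth minute even though individual samples crossed 80% earlier.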

The following metrics have unique behavior characteristics:

Prometheus and custom metrics

  • The completedJobsCount metric is only sent when there are jobs that completed more than six hours ago.
  • The containerRestartCount metric is only sent when there are containers restarting.
  • The oomKilledContainerCount metric is only sent when there are OOM killed containers.
  • The cpuExceededPercentage, memoryRssExceededPercentage, and memoryWorkingSetExceededPercentage metrics are sent when the CPU, memory RSS, and memory working set values exceed the configured threshold. The default threshold is 95%. The cpuThresholdViolated, memoryRssThresholdViolated, and memoryWorkingSetThresholdViolated metrics are equal to 0 if the usage percentage is below the threshold and are equal to 1 if the usage percentage is above the threshold. These thresholds are exclusive of the alert condition threshold specified for the corresponding alert rule.
  • The pvUsageExceededPercentage metric is sent when the persistent volume usage percentage exceeds the configured threshold. The default threshold is 60%. The pvUsageThresholdViolated metric is equal to 0 when the persistent volume usage percentage is below the threshold and is equal to 1 if the usage is above the threshold. This threshold is exclusive of the alert condition threshold specified for the corresponding alert rule.
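
The relationship between an ExceededPercentage metric and its ThresholdViolated counterpart can be sketched in Python as follows. This is a simplified model of the behavior described above and isn't the agent's code. It assumes the ThresholdViolated metric is always sent, and it treats a value equal to the threshold as a violation, following the "exceeds or becomes equal" wording in the ConfigMap comments.

```python
from typing import Dict


def alertable_metrics(
    usage_percentage: float, threshold: float = 95.0
) -> Dict[str, float]:
    """Model which CPU alertable metrics are sent for one container sample."""
    # Assumption: equality counts as a violation, per the ConfigMap comments.
    violated = usage_percentage >= threshold
    metrics = {"cpuThresholdViolated": 1.0 if violated else 0.0}
    if violated:
        # The percentage metric is sent only once the threshold is reached.
        metrics["cpuExceededPercentage"] = usage_percentage
    return metrics


if __name__ == "__main__":
    print(alertable_metrics(97.0))
    # {'cpuThresholdViolated': 1.0, 'cpuExceededPercentage': 97.0}
    print(alertable_metrics(50.0))
    # {'cpuThresholdViolated': 0.0}
```

Because the percentage metric is absent below the collection threshold, an alert rule whose condition threshold is lower than the collection threshold has no data to evaluate until the higher value is reached.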

Prometheus only

  • If you want to collect pvUsageExceededPercentage and analyze it from metrics explorer, configure the threshold to a value lower than your alerting threshold. You can override the collection settings for persistent volume utilization thresholds in the ConfigMap file under the section [alertable_metrics_configuration_settings.pv_utilization_thresholds]. For details on configuring your ConfigMap configuration file, see Configure alertable metrics in ConfigMaps. Persistent volume metrics with claims in the kube-system namespace are excluded from collection by default. To enable collection in this namespace, use the section [metric_collection_settings.collect_kube_system_pv_metrics] in the ConfigMap file. For more information, see Metric collection settings.
  • If you want to collect the cpuExceededPercentage, memoryRssExceededPercentage, and memoryWorkingSetExceededPercentage metrics and analyze them from metrics explorer, configure the threshold to a value lower than your alerting threshold. You can override the collection settings for container resource utilization thresholds in the ConfigMap file under the section [alertable_metrics_configuration_settings.container_resource_utilization_thresholds]. For details on configuring your ConfigMap configuration file, see Configure alertable metrics in ConfigMaps.

View alerts

View fired alerts for your cluster, along with other fired alerts in your subscription, from Alerts in the Monitor menu in the Azure portal. You can also select View in alerts on the Recommended alerts pane to view alerts from custom metrics.

Note

Currently, Prometheus alerts won't be displayed when you select Alerts from your AKS cluster because the alert rule doesn't use the cluster as its target.

Next steps