Monitor Azure Kubernetes Service (AKS)

When you have critical applications and business processes that rely on Azure resources, you want to monitor those resources for availability, performance, and operation. This article describes the monitoring data generated by Azure Kubernetes Service (AKS) and how you can analyze it with Azure Monitor. If you're unfamiliar with the features of Azure Monitor common to all Azure services that use it, read Monitoring Azure resources with Azure Monitor.

Important

This article provides basic information to get started monitoring an AKS cluster. For complete monitoring of Kubernetes clusters in Azure with Container insights, see Monitor Azure Kubernetes Service (AKS) with Azure Monitor.

Monitoring data

AKS generates the same kinds of monitoring data as other Azure resources, as described in Monitoring data from Azure resources. See Monitoring AKS data reference for detailed information on the metrics and logs created by AKS. Other Azure services and features collect additional data and enable other analysis options, as shown in the following diagram and list.

Diagram of collection of monitoring data from AKS.

  • Platform metrics: Platform metrics are automatically collected for AKS clusters at no cost. You can analyze these metrics with metrics explorer or use them for metric alerts.
  • Prometheus metrics: When you enable metric scraping for your cluster, Prometheus metrics are collected by Azure Monitor managed service for Prometheus and stored in an Azure Monitor workspace. Analyze them with prebuilt dashboards in Azure Managed Grafana and with Prometheus alerts.
  • Activity logs: The Activity log is collected automatically for AKS clusters at no cost. These logs track information such as when a cluster is created or has a configuration change. Send the Activity log to a Log Analytics workspace to analyze it with your other log data.
  • Resource logs: Control plane logs for AKS are implemented as resource logs. Create a diagnostic setting to send them to a Log Analytics workspace, where you can analyze and alert on them with log queries in Log Analytics.
  • Container insights: Container insights collects various logs and performance data from a cluster, including stdout/stderr streams, and stores them in a Log Analytics workspace and Azure Monitor Metrics. Analyze this data with the views and workbooks included with Container insights, or with Log Analytics and metrics explorer.
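For example, you can retrieve platform metrics for a cluster from the command line. The following is a minimal sketch using the Azure CLI, assuming a cluster named my-cluster in myresourcegroup and the node_cpu_usage_percentage platform metric; substitute your own names and metric:

# Look up the cluster resource ID, then list average node CPU usage per 5-minute interval
az monitor metrics list \
  --resource $(az aks show --name my-cluster --resource-group myresourcegroup --query id --output tsv) \
  --metric node_cpu_usage_percentage \
  --interval PT5M \
  --aggregation Average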

Monitoring overview page in Azure portal

The Monitoring tab on the Overview page offers a quick way to get started viewing monitoring data in the Azure portal for each AKS cluster. This includes graphs with common metrics for the cluster separated by node pool. Click on any of these graphs to further analyze the data in metrics explorer.

The Overview page also includes links to Managed Prometheus and Container insights for the current cluster. If you haven't already enabled these tools, you'll be prompted to do so. You may also see a banner at the top of the screen recommending that you enable additional features to improve monitoring of your cluster.

Screenshot of AKS overview page.

Tip

Access monitoring features for all AKS clusters in your subscription from the Monitoring menu in the Azure portal, or for a single AKS cluster from the Monitor section of the Kubernetes services menu.

Resource logs

Control plane logs for AKS clusters are implemented as resource logs in Azure Monitor. Resource logs are not collected and stored until you create a diagnostic setting to route them to one or more locations. You'll typically send them to a Log Analytics workspace, which is where most of the data for Container insights is stored.

See Create diagnostic settings for the detailed process for creating a diagnostic setting using the Azure portal, CLI, or PowerShell. When you create a diagnostic setting, you specify which categories of logs to collect. The categories for AKS are listed in AKS monitoring data reference.
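As an illustration, the following is a minimal sketch of a diagnostic setting created with the Azure CLI that collects only the kube-audit-admin category. The setting name and the resource IDs in angle brackets are placeholders; a fuller example appears in the note later in this section.

# Send only kube-audit-admin logs to a Log Analytics workspace in resource-specific mode
az monitor diagnostic-settings create \
  --name aks-audit-admin \
  --resource <cluster-resource-id> \
  --logs '[{"category": "kube-audit-admin", "enabled": true}]' \
  --workspace <workspace-resource-id> \
  --export-to-resource-specific true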

Important

Collecting resource logs for AKS, particularly kube-audit logs, can incur substantial cost. Consider the following recommendations to reduce the amount of data collected:

  • Disable kube-audit logging when not required.
  • Enable collection from kube-audit-admin, which excludes the get and list audit events.
  • Enable resource-specific logs as described below and configure the AKSAudit table as basic logs, as shown in the sketch after this list.
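A minimal sketch of that last recommendation using the Azure CLI, assuming resource-specific logs are already enabled and using placeholder resource group and workspace names. Basic logs have a reduced ingestion cost but limited query capabilities.

# Switch the AKSAudit table to the Basic table plan to reduce ingestion cost
az monitor log-analytics workspace table update \
  --resource-group myresourcegroup \
  --workspace-name myworkspace \
  --name AKSAudit \
  --plan Basic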

See Monitor Kubernetes clusters using Azure services and cloud native tools for further recommendations, and Cost optimization and Azure Monitor for more strategies to reduce your monitoring costs.

Screenshot of AKS diagnostic setting dialog box.

AKS supports either Azure diagnostics mode or resource-specific mode for resource logs. The mode determines the tables in the Log Analytics workspace where the data is sent. Azure diagnostics mode sends all data to the AzureDiagnostics table, while resource-specific mode sends data to the AKSAudit, AKSAuditAdmin, and AKSControlPlane tables, as shown in the table at Resource logs.

Resource-specific mode is recommended for AKS for the following reasons:

  • Data is easier to query because it's in individual tables dedicated to AKS.
  • It supports configuration as basic logs for significant cost savings.

For more details on the difference between collection modes including how to change an existing setting, see Select the collection mode.

Note

The ability to select the collection mode isn't available in the Azure portal in all regions yet. For regions where it's not yet available, use the Azure CLI to create the diagnostic setting with a command such as the following:

az monitor diagnostic-settings create \
  --name AKS-Diagnostics \
  --resource /subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/myresourcegroup/providers/Microsoft.ContainerService/managedClusters/my-cluster \
  --logs '[{"category": "kube-audit", "enabled": true}, {"category": "kube-audit-admin", "enabled": true}, {"category": "kube-apiserver", "enabled": true}, {"category": "kube-controller-manager", "enabled": true}, {"category": "kube-scheduler", "enabled": true}, {"category": "cluster-autoscaler", "enabled": true}, {"category": "cloud-controller-manager", "enabled": true}, {"category": "guard", "enabled": true}, {"category": "csi-azuredisk-controller", "enabled": true}, {"category": "csi-azurefile-controller", "enabled": true}, {"category": "csi-snapshot-controller", "enabled": true}]' \
  --workspace /subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourcegroups/myresourcegroup/providers/microsoft.operationalinsights/workspaces/myworkspace \
  --export-to-resource-specific true

Sample log queries

Important

When you select Logs from the menu for an AKS cluster, Log Analytics is opened with the query scope set to the current cluster. This means that log queries will only include data from that resource. If you want to run a query that includes data from other clusters or data from other Azure services, select Logs from the Azure Monitor menu. See Log query scope and time range in Azure Monitor Log Analytics for details.

If the diagnostic setting for your cluster uses Azure diagnostics mode, the resource logs for AKS are stored in the AzureDiagnostics table. You can distinguish different logs with the Category column. For a description of each category, see AKS reference resource logs.

Count logs for each category (Azure diagnostics mode):

AzureDiagnostics
| where ResourceType == "MANAGEDCLUSTERS"
| summarize count() by Category

All API server logs (Azure diagnostics mode):

AzureDiagnostics
| where Category == "kube-apiserver"

All kube-audit logs in a time range (Azure diagnostics mode):

let starttime = datetime("2023-02-23");
let endtime = datetime("2023-02-24");
AzureDiagnostics
| where TimeGenerated between(starttime..endtime)
| where Category == "kube-audit"
| extend event = parse_json(log_s)
| extend HttpMethod = tostring(event.verb)
| extend User = tostring(event.user.username)
| extend Apiserver = pod_s
| extend SourceIP = tostring(event.sourceIPs[0])
| project TimeGenerated, Category, HttpMethod, User, Apiserver, SourceIP, OperationName, event

All audit logs (resource-specific mode):

AKSAudit

All audit logs excluding the get and list audit events (resource-specific mode):

AKSAuditAdmin

All API server logs (resource-specific mode):

AKSControlPlane
| where Category == "kube-apiserver"

To access a set of prebuilt queries in the Log Analytics workspace, open the Log Analytics queries interface and select the resource type Kubernetes Services. For a list of common queries for Container insights, see Container insights queries.

Integrations

The following Azure services and features of Azure Monitor can be used for additional monitoring of your Kubernetes clusters. You can enable these features when you create your AKS cluster (on the Integrations tab in the Azure portal), or onboard your cluster to them later. Each of these features may incur additional cost, so refer to the pricing information for each before you enable them.

  • Container insights: Uses a containerized version of the Azure Monitor agent to collect stdout/stderr logs, performance metrics, and Kubernetes events from each node in your cluster, supporting a variety of monitoring scenarios for AKS clusters. If you don't enable Container insights when you create your cluster, see Enable Container insights for Azure Kubernetes Service (AKS) cluster for other options to enable it. Container insights stores most of its data in a Log Analytics workspace, and you'll typically use the same workspace as the resource logs for your cluster. See Design a Log Analytics workspace architecture for guidance on how many workspaces you should use and where to locate them.
  • Azure Monitor managed service for Prometheus: Prometheus is a cloud-native metrics solution from the Cloud Native Computing Foundation and the most common tool used for collecting and analyzing metric data from Kubernetes clusters. Azure Monitor managed service for Prometheus is a fully managed Prometheus-compatible monitoring solution in Azure. If you don't enable managed Prometheus when you create your cluster, see Collect Prometheus metrics from an AKS cluster for other options to enable it. Azure Monitor managed service for Prometheus stores its data in an Azure Monitor workspace, which is linked to a Grafana workspace so that you can analyze the data with Azure Managed Grafana.
  • Azure Managed Grafana: A fully managed implementation of Grafana, an open-source data visualization platform commonly used to present Prometheus data. Multiple predefined Grafana dashboards are available for monitoring Kubernetes and full-stack troubleshooting. If you don't enable managed Grafana when you create your cluster, see Link a Grafana workspace for details on linking it to your Azure Monitor workspace so it can access Prometheus metrics for your cluster.
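If you didn't enable these integrations at cluster creation, you can onboard an existing cluster with the Azure CLI. The following is a minimal sketch with placeholder cluster and resource group names; by default these commands create or reuse default workspaces, so pass explicit workspace IDs if you want to control where the data is stored:

# Enable Container insights (the monitoring add-on) on an existing cluster
az aks enable-addons --addons monitoring --name my-cluster --resource-group myresourcegroup

# Enable managed Prometheus metrics collection
az aks update --enable-azure-monitor-metrics --name my-cluster --resource-group myresourcegroup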

Alerts

Azure Monitor alerts proactively notify you when important conditions are found in your monitoring data. They allow you to identify and address issues in your system before your customers notice them. You can set alerts on metrics, logs, and the activity log. Different types of alerts have benefits and drawbacks.

Metric alerts

The following are the recommended metric alert rules for AKS clusters. You can choose to automatically enable these alert rules when the cluster is created. These alerts are based on platform metrics for the cluster.

  • CPU Usage Percentage > 95: Fires when the average CPU usage across all nodes exceeds the threshold.
  • Memory Working Set Percentage > 100: Fires when the average working set across all nodes exceeds the threshold.
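As an illustration, an alert rule like the first condition could be created with the Azure CLI. The following is a minimal sketch, assuming the node_cpu_usage_percentage platform metric backs the CPU Usage Percentage condition and using placeholder names; add an action group if you want to be notified when it fires:

# Fire when average node CPU usage across the cluster exceeds 95%
az monitor metrics alert create \
  --name node-cpu-usage-high \
  --resource-group myresourcegroup \
  --scopes $(az aks show --name my-cluster --resource-group myresourcegroup --query id --output tsv) \
  --condition "avg node_cpu_usage_percentage > 95" \
  --window-size 5m \
  --evaluation-frequency 1m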

Prometheus alerts

When you enable collection of Prometheus metrics for your cluster, you can download a collection of recommended Prometheus alert rules, which includes the following:

  • Average CPU usage per container is greater than 95%
  • Average Memory usage per container is greater than 95%
  • Number of OOM killed containers is greater than 0
  • Average PV usage is greater than 80%
  • Pod container restarted in last 1 hour
  • Node is not ready
  • Ready state of pods is less than 80%
  • Job did not complete in time
  • Average node CPU utilization is greater than 80%
  • Working set memory for a node is greater than 80%
  • Disk space usage for a node is greater than 85%
  • Number of pods in failed state is greater than 0

Next steps