Best practices for monitoring Kubernetes with Azure Monitor

This article provides best practices for monitoring the health and performance of your Azure Kubernetes Service (AKS) and Azure Arc-enabled Kubernetes clusters. The guidance is based on the five pillars of architecture excellence described in the Azure Well-Architected Framework.

Reliability

In the cloud, we acknowledge that failures happen. Instead of trying to prevent failures altogether, the goal is to minimize the effects of a single failing component. Use the following information to best leverage Azure Monitor to ensure the reliability of your Kubernetes clusters and monitoring environment.

Design checklist

  • Enable scraping of Prometheus metrics for your cluster.
  • Enable Container insights for collection of logs and performance data from your cluster.
  • Create diagnostic settings to collect control plane logs for AKS clusters.
  • Enable recommended Prometheus alerts.
  • Ensure the availability of the Log Analytics workspace supporting Container insights.

Configuration recommendations

| Recommendation | Benefit |
| --- | --- |
| Enable scraping of Prometheus metrics for your cluster. | Enable Prometheus on your cluster with Azure Monitor managed service for Prometheus if you don't already have a Prometheus environment. Use Azure Managed Grafana to analyze the Prometheus data collected. See Customize scraping of Prometheus metrics in Azure Monitor managed service for Prometheus to collect additional metrics beyond the default configuration. |
| Enable Container insights for collection of logs and performance data from your cluster. | Container insights collects stdout/stderr logs, performance metrics, and Kubernetes events from each node in your cluster. It provides dashboards and reports for analyzing this data, including the availability of your nodes and other components. Use Log Analytics to identify any availability errors in your collected logs. |
| Create diagnostic settings to collect control plane logs for AKS clusters. | AKS implements control plane logs as resource logs in Azure Monitor. Create a diagnostic setting to send these logs to your Log Analytics workspace so you can use log queries to identify errors and issues affecting availability. |
| Enable recommended Prometheus alerts. | Alerts in Azure Monitor proactively notify you when issues are detected. Start with a set of recommended Prometheus alert rules that detect the most common availability and performance issues with your cluster. Consider adding log search alerts that use data collected by Container insights. |
| Ensure the availability of the Log Analytics workspace supporting Container insights. | Container insights relies on a Log Analytics workspace. See Best practices for Azure Monitor Logs for recommendations to ensure the reliability of the workspace. |
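
The first three recommendations above can be scripted with the Azure CLI. The following is a minimal sketch rather than a definitive procedure: it only assembles the `az` commands as argument lists so you can review them before running anything. The cluster name, resource group, subscription ID, workspace name, and diagnostic setting name are hypothetical placeholders, and the list of control plane log categories is an illustrative selection, not a recommendation from this article.

```python
import json
import shlex

# Hypothetical placeholders; substitute your own names and resource IDs.
CLUSTER = "my-aks-cluster"
RESOURCE_GROUP = "my-resource-group"
SUBSCRIPTION = "00000000-0000-0000-0000-000000000000"
WORKSPACE_ID = (
    f"/subscriptions/{SUBSCRIPTION}/resourceGroups/{RESOURCE_GROUP}"
    "/providers/Microsoft.OperationalInsights/workspaces/my-workspace"
)
CLUSTER_ID = (
    f"/subscriptions/{SUBSCRIPTION}/resourceGroups/{RESOURCE_GROUP}"
    f"/providers/Microsoft.ContainerService/managedClusters/{CLUSTER}"
)

# Illustrative selection of AKS control plane log categories (an assumption).
CONTROL_PLANE_CATEGORIES = [
    "kube-apiserver",
    "kube-controller-manager",
    "kube-scheduler",
    "kube-audit-admin",
]


def monitoring_commands(categories=CONTROL_PLANE_CATEGORIES):
    """Return three az commands as argument lists, without running them."""
    # The --logs argument takes a JSON array of category settings.
    logs = json.dumps([{"category": c, "enabled": True} for c in categories])
    return [
        # 1. Enable scraping of Prometheus metrics (Managed Prometheus).
        ["az", "aks", "update", "--enable-azure-monitor-metrics",
         "--name", CLUSTER, "--resource-group", RESOURCE_GROUP],
        # 2. Enable Container insights (the "monitoring" add-on).
        ["az", "aks", "enable-addons", "--addons", "monitoring",
         "--name", CLUSTER, "--resource-group", RESOURCE_GROUP,
         "--workspace-resource-id", WORKSPACE_ID],
        # 3. Send control plane logs to the Log Analytics workspace.
        ["az", "monitor", "diagnostic-settings", "create",
         "--name", "aks-control-plane",
         "--resource", CLUSTER_ID, "--workspace", WORKSPACE_ID,
         "--logs", logs],
    ]


if __name__ == "__main__":
    for cmd in monitoring_commands():
        print(shlex.join(cmd))
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere. To apply them, pass each list to `subprocess.run(cmd, check=True)` on a machine where the Azure CLI is installed and signed in.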

Security

Security is one of the most important aspects of any architecture. Azure Monitor provides features to employ both the principle of least privilege and defense-in-depth. Use the following information to monitor your Kubernetes clusters and ensure that only authorized users access collected data.

Design checklist

  • Use managed identity authentication for your cluster to connect to Container insights.
  • Consider using Azure Private Link for your cluster to connect to your Azure Monitor workspace using a private endpoint.
  • Use traffic analytics to monitor network traffic to and from your cluster.
  • Enable network observability.
  • Ensure the security of the Log Analytics workspace supporting Container insights.

Configuration recommendations

| Recommendation | Benefit |
| --- | --- |
| Use managed identity authentication for your cluster to connect to Container insights. | Managed identity authentication is the default for new clusters. If you're using legacy authentication, you should migrate to managed identity to remove the certificate-based local authentication. |
| Consider using Azure Private Link for your cluster to connect to your Azure Monitor workspace using a private endpoint. | Azure Monitor managed service for Prometheus stores its data in an Azure Monitor workspace, which uses a public endpoint by default. Connections to public endpoints are secured with end-to-end encryption. If you require a private endpoint, you can use Azure Private Link to allow your cluster to connect to the workspace through authorized private networks. Private Link can also be used to force workspace data ingestion through ExpressRoute or a VPN. See Private Link for data ingestion for Managed Prometheus and Azure Monitor workspace for details on configuring your cluster for Private Link. See Use private endpoints for Managed Prometheus and Azure Monitor workspace for details on querying your data using Private Link. |
| Use traffic analytics to monitor network traffic to and from your cluster. | Traffic analytics analyzes Azure Network Watcher NSG flow logs to provide insights into traffic flow in your Azure cloud. Use this tool to ensure there's no data exfiltration for your cluster and to detect if any unnecessary public IPs are exposed. |
| Enable network observability. | The network observability add-on for AKS provides observability across the multiple layers in the Kubernetes networking stack. Use it to monitor and observe access between services in the cluster (east-west traffic). |
| Ensure the security of the Log Analytics workspace supporting Container insights. | Container insights relies on a Log Analytics workspace. See Best practices for Azure Monitor Logs for recommendations to ensure the security of the workspace. |

Cost optimization

Cost optimization refers to ways to reduce unnecessary expenses and improve operational efficiencies. You can significantly reduce your cost for Azure Monitor by understanding your different configuration options and opportunities to reduce the amount of data that it collects. See Azure Monitor cost and usage to understand the different ways that Azure Monitor charges and how to view your monthly bill.

Note

See Optimize costs in Azure Monitor for cost optimization recommendations across all features of Azure Monitor.

Design checklist

  • Don't enable Container insights collection of Prometheus metrics.
  • Configure agent collection to modify data collection in Container insights.
  • Modify settings for collection of metric data by Container insights.
  • Disable Container insights collection of metric data if you don't use the Container insights experience in the Azure portal.
  • If you don't query the container logs table regularly or use it for alerts, configure it as basic logs.
  • Limit collection of resource logs you don't need.
  • Use resource-specific logging for AKS resource logs and configure tables as basic logs.
  • Use OpenCost to collect details about your Kubernetes costs.

Configuration recommendations

| Recommendation | Benefit |
| --- | --- |
| Don't enable Container insights collection of Prometheus metrics in your Log Analytics workspace if you've enabled scraping of metrics with Prometheus. | In addition to scraping Prometheus metrics from your cluster using Azure Monitor managed service for Prometheus, you can configure Container insights to collect Prometheus metrics in your Log Analytics workspace. This is redundant with the data in Managed Prometheus and will result in additional cost. |
| Configure the agent to modify data collection in Container insights. | Analyze the data collected by Container insights as described in Controlling ingestion to reduce cost and adjust your configuration to stop collection of data you don't need. |
| Modify settings for collection of metric data by Container insights. | See Enable cost optimization settings for details on modifying both the frequency that metric data is collected and the namespaces that are collected by Container insights. |
| Disable Container insights collection of metric data if you don't use the Container insights experience in the Azure portal. | Container insights collects many of the same metric values as Managed Prometheus. You can disable collection of these metrics by configuring Container insights to only collect Logs and events as described in Enable cost optimization settings in Container insights. This configuration disables the Container insights experience in the Azure portal, but you can use Grafana to visualize Prometheus metrics and Log Analytics to analyze log data collected by Container insights. |
| If you don't query the container logs table regularly or use it for alerts, configure it as basic logs. | Convert your Container insights schema to ContainerLogV2, which is compatible with Basic logs and can provide significant cost savings as described in Controlling ingestion to reduce cost. |
| Limit collection of resource logs you don't need. | Control plane logs for AKS clusters are implemented as resource logs in Azure Monitor. Create a diagnostic setting to send this data to a Log Analytics workspace. See Collect control plane logs for AKS clusters for recommendations on which categories you should collect. |
| Use resource-specific logging for AKS resource logs and configure tables as basic logs. | AKS supports either Azure diagnostics mode or resource-specific mode for resource logs. Specify resource-specific mode to enable the option to configure the tables for basic logs, which provide a reduced ingestion charge for logs that you only occasionally query and don't use for alerting. |
| Use OpenCost to collect details about your Kubernetes costs. | OpenCost is an open-source, vendor-neutral CNCF sandbox project for understanding your Kubernetes costs and supporting AKS cost visibility. It exports detailed costing data, in addition to customer-specific Azure pricing, to Azure storage to assist the cluster administrator in analyzing and categorizing costs. |
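
Two of the recommendations above lend themselves to a short illustration: reducing the collection frequency and namespaces for Container insights, and moving the ContainerLogV2 table to the Basic plan. The following is a sketch under stated assumptions. The settings field names (`interval`, `namespaceFilteringMode`, `namespaces`, `enableContainerLogV2`) and the 1 to 30 minute interval range are assumptions to verify against Enable cost optimization settings before use, and the resource group and workspace names are hypothetical placeholders.

```python
import json

# Hypothetical placeholders; substitute your own names.
RESOURCE_GROUP = "my-resource-group"
WORKSPACE_NAME = "my-workspace"


def data_collection_settings(interval_minutes=5, excluded_namespaces=("kube-system",)):
    """Build a Container insights data collection settings document.

    Field names and the allowed interval range are assumptions; check them
    against the cost optimization settings article before relying on them.
    """
    if not 1 <= interval_minutes <= 30:
        raise ValueError("interval_minutes must be between 1 and 30")
    return {
        # Collect metric data less often than the assumed 1-minute default.
        "interval": f"{interval_minutes}m",
        # Skip namespaces whose data you don't need.
        "namespaceFilteringMode": "Exclude",
        "namespaces": list(excluded_namespaces),
        # Use the ContainerLogV2 schema, which is compatible with Basic logs.
        "enableContainerLogV2": True,
    }


def basic_plan_command(table="ContainerLogV2"):
    """Return the az command that switches a workspace table to the Basic plan."""
    return [
        "az", "monitor", "log-analytics", "workspace", "table", "update",
        "--resource-group", RESOURCE_GROUP,
        "--workspace-name", WORKSPACE_NAME,
        "--name", table,
        "--plan", "Basic",
    ]


if __name__ == "__main__":
    print(json.dumps(data_collection_settings(), indent=2))
    print(" ".join(basic_plan_command()))
```

The settings document is only generated here, not applied. Switch a table to the Basic plan only if you don't use it for alerting and query it only occasionally, as described in the table above.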

Operational excellence

Operational excellence refers to the operations processes required to keep a service running reliably in production. Use the following information to minimize the operational requirements for monitoring your Kubernetes clusters.

Design checklist

  • Review guidance for monitoring all layers of your Kubernetes environment.
  • Use Azure Arc-enabled Kubernetes to monitor your clusters outside of Azure.
  • Use Azure managed services for cloud native tools.
  • Integrate AKS clusters into your existing monitoring tools.
  • Use Azure Policy to enable data collection from your Kubernetes cluster.

Configuration recommendations

| Recommendation | Benefit |
| --- | --- |
| Review guidance for monitoring all layers of your Kubernetes environment. | See Monitor your Kubernetes cluster performance with Container insights for guidance and best practices for monitoring your entire Kubernetes environment across the network, cluster, and application layers. |
| Use Azure Arc-enabled Kubernetes to monitor your clusters outside of Azure. | Azure Arc-enabled Kubernetes allows your Kubernetes clusters running in other clouds to be monitored using the same tools as your AKS clusters, including Container insights and Azure Monitor managed service for Prometheus. |
| Use Azure managed services for cloud native tools. | Azure Monitor managed service for Prometheus and Azure Managed Grafana support all the features of the cloud native tools Prometheus and Grafana without having to operate their underlying infrastructure. You can quickly provision these tools and onboard your Kubernetes clusters with minimal overhead. These services allow you to access an extensive library of community rules and dashboards to monitor your Kubernetes environment. |
| Integrate AKS clusters into your existing monitoring tools. | If you have an existing investment in Prometheus and Grafana, integrate your AKS clusters and Azure managed services into your existing environment using the guidance in Monitor Kubernetes clusters using Azure services and cloud native tools. |
| Use Azure Policy to enable data collection from your Kubernetes cluster. | Use Azure Policy to enable data collection, including Prometheus metrics, Container insights, and diagnostic settings. This ensures that any new clusters are automatically monitored and enforces their monitoring configuration. |

Performance efficiency

Performance efficiency is the ability of your workload to scale to meet the demands placed on it by users in an efficient manner. Use the following information to monitor the performance of your Kubernetes clusters and ensure they're configured for maximum performance.

Design checklist

  • Enable collection of Prometheus metrics for your cluster.
  • Enable Container insights to track performance of your cluster.
  • Enable recommended Prometheus alerts.

Configuration recommendations

| Recommendation | Benefit |
| --- | --- |
| Enable collection of Prometheus metrics for your cluster. | Prometheus is a cloud-native metrics solution from the Cloud Native Computing Foundation and the most common tool used for collecting and analyzing metric data from Kubernetes clusters. Enable Prometheus on your cluster with Azure Monitor managed service for Prometheus if you don't already have a Prometheus environment. Use Azure Managed Grafana to analyze the Prometheus data collected. See Customize scraping of Prometheus metrics in Azure Monitor managed service for Prometheus to collect additional metrics beyond the default configuration. |
| Enable Container insights to track performance of your cluster. | When you enable Container insights for your Kubernetes cluster, you can use views and workbooks to track the performance of the components of your cluster. This data may overlap with data collected by Prometheus. See Cost optimization for recommendations regarding cost. |
| Enable recommended Prometheus alerts. | Alerts in Azure Monitor proactively notify you when issues are detected. Start with a set of recommended Prometheus alert rules that detect the most common availability and performance issues with your cluster. Consider adding log search alerts that use data collected by Container insights. |
