Scrape Prometheus metrics at scale in Azure Monitor

This article provides guidance on the performance you can expect when collecting metrics at high scale for Azure Monitor managed service for Prometheus.

CPU and memory

CPU and memory usage correlates with the number of bytes in each sample and the number of samples scraped. The benchmarks below are based on the default targets scraped, the volume of custom metrics scraped, and the number of nodes, pods, and containers. Treat these numbers as a reference, since usage can still vary significantly depending on the number of time series and bytes per metric.

The upper volume limit per pod is currently about 3 to 3.5 million samples per minute, depending on the number of bytes per sample. This limitation is expected to be addressed when sharding is added in the future.
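To compare a workload against this range, it can help to estimate samples per minute from the number of active time series and the scrape interval. The following is a minimal sketch, assuming each time series produces one sample per scrape. The function names and the three result labels are hypothetical and aren't part of the agent.

```python
# Approximate per-pod range from the text above: 3 to 3.5 million samples per minute.
LIMIT_LOW = 3_000_000
LIMIT_HIGH = 3_500_000


def samples_per_minute(active_series: int, scrape_interval_seconds: float) -> float:
    """Estimate samples per minute, assuming one sample per series per scrape."""
    if scrape_interval_seconds <= 0:
        raise ValueError("scrape_interval_seconds must be positive")
    return active_series * 60.0 / scrape_interval_seconds


def compare_to_limit(samples: float) -> str:
    """Place an estimated volume relative to the approximate per-pod range."""
    if samples < LIMIT_LOW:
        return "below the range"
    if samples <= LIMIT_HIGH:
        return "within the range"
    return "above the range"


if __name__ == "__main__":
    # Hypothetical workload: 100,000 series scraped every 30 seconds.
    estimate = samples_per_minute(100_000, 30)
    print(estimate, compare_to_limit(estimate))
```

Because the actual limit depends on the number of bytes per sample, treat the result as a rough indicator rather than a guarantee.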

The agent consists of a deployment with one replica and a DaemonSet for scraping metrics. The DaemonSet scrapes any node-level targets such as cAdvisor, kubelet, and node exporter. You can also configure it to scrape any custom targets at the node level with static configs. The replica scrapes everything else, such as kube-state-metrics or custom scrape jobs that use service discovery.

Comparison between small and large clusters for the replica

| Scrape Targets | Samples Sent / Minute | Node Count | Pod Count | Prometheus-Collector CPU Usage (cores) | Prometheus-Collector Memory Usage (bytes) |
|---|---|---|---|---|---|
| default targets | 11,344 | 3 | 40 | 12.9 mc | 148 Mi |
| default targets | 260,000 | 340 | 13000 | 1.10 c | 1.70 GB |
| default targets + custom targets | 3.56 million | 340 | 13000 | 5.13 c | 9.52 GB |

Comparison between small and large cluster for DaemonSets

| Scrape Targets | Samples Sent / Minute Total | Samples Sent / Minute / Pod | Node Count | Pod Count | Prometheus-Collector CPU Usage Total (cores) | Prometheus-Collector Memory Usage Total (bytes) | Prometheus-Collector CPU Usage / Pod (cores) | Prometheus-Collector Memory Usage / Pod (bytes) |
|---|---|---|---|---|---|---|---|---|
| default targets | 9,858 | 3,327 | 3 | 40 | 41.9 mc | 581 Mi | 14.7 mc | 189 Mi |
| default targets | 2.3 million | 14,400 | 340 | 13000 | 805 mc | 305.34 GB | 2.36 mc | 898 Mi |

When more custom metrics are scraped, a single DaemonSet pod behaves the same as the replica pod, with its usage depending on the volume of custom metrics.

Schedule ama-metrics replica pod on a node pool with more resources

A large volume of metrics per pod needs a node with enough CPU and memory. If the ama-metrics replica pod isn't scheduled on a node or node pool with enough resources, it might get OOMKilled and go into CrashLoopBackOff. To fix this, add the label azuremonitor/metrics.replica.preferred=true to a node or node pool in your cluster that has more resources (in the system node pool). This ensures the replica pod gets scheduled on that node. You can also create extra system pools with larger nodes and add the same label. It's better to label node pools rather than individual nodes so that new nodes in the pool can also be used for scheduling.

kubectl label nodes <node-name> azuremonitor/metrics.replica.preferred="true"
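If several existing nodes need the label, the command above can be generated for all of them at once. The following is a minimal sketch that only builds and prints the kubectl invocation. The helper name build_label_command and the placeholder node names are hypothetical. It relies on kubectl label accepting more than one resource name.

```python
import shlex

# Label taken from the text above.
LABEL = "azuremonitor/metrics.replica.preferred=true"


def build_label_command(node_names):
    """Return the kubectl argument list that labels every given node."""
    if not node_names:
        raise ValueError("at least one node name is required")
    return ["kubectl", "label", "nodes", *node_names, LABEL]


if __name__ == "__main__":
    # Placeholder names; replace them with node names from your cluster.
    command = build_label_command(["<node-name-1>", "<node-name-2>"])
    # Print the shell-quoted command for review instead of running it.
    print(shlex.join(command))
```

Labeling individual nodes this way doesn't cover nodes that are added to the pool later, which is why labeling the node pool itself is preferable.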

Next steps