Is receiving prometheus KubeAggregatedAPIErrors alerts a sign of an unhealthy AKS cluster?

AlexandraGroschner-3808 40 Reputation points
2025-06-11T15:04:17.38+00:00

Following setup:

For about two weeks I have frequently been getting alerts of type "KubeAggregatedAPIErrors" with a description like "Kubernetes aggregated API v1beta1.metrics.k8s.io/default has reported errors. It has appeared unavailable 125.3 times averaged over the past 10m."

The alert then resolves on its own but fires again later.

I found https://github.com/prometheus-community/helm-charts/issues/3539 which, among other things, suggests that restarts of the metrics server (due to a lack of resources) could be the cause, but that is not the case here.

Some comments in the issue also mention that people contacted Azure support and that it is due to their "normal" control plane activities.

But this now happens on 2 of my 4 clusters, and I wanted to check whether it indicates a problem.

Thanks in advance for any tip!

Azure Kubernetes Service

Accepted answer
  1. Siva Pavuluri 570 Reputation points Microsoft External Staff Moderator
    2025-06-12T04:08:17.26+00:00

    Hi Alexandra Groschner,

    The alert is triggered when Prometheus detects errors or unavailability in aggregated API endpoints, averaged over a specific period (e.g., 10 minutes).
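To illustrate how a value such as "125.3 times over the past 10m" can arise, here is a minimal sketch. It assumes the rule is built on a counter of unavailability events (in the kubernetes-mixin rules this is, as far as I know, `aggregator_unavailable_apiservice_total`) and compares its increase over the window with a threshold. The function names and the threshold of 4 are hypothetical and chosen only for illustration. Prometheus itself also extrapolates to the window boundaries, which is why the reported value can be fractional; this sketch does not do that.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Sum the increase of a counter across consecutive samples, tolerating resets."""
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # A drop means the counter was reset (e.g. the process restarted),
            # so the new value is counted as the increase since the reset.
            total += curr
    return total


def alert_fires(samples: Sequence[float], threshold: float = 4.0) -> bool:
    """Hypothetical alert condition: increase over the window exceeds the threshold."""
    return counter_increase(samples) > threshold
```

A short burst of unavailability events inside one window is enough to cross such a threshold, and the alert clears once the window no longer contains the burst, which matches the fire-and-resolve pattern described in the question.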

    In your case, the API v1beta1.metrics.k8s.io is provided by the Kubernetes metrics-server, which is crucial for collecting resource metrics (CPU/memory) across the cluster. If the metrics-server pod is restarting or under-resourced, it can temporarily become unavailable, triggering this alert. However, you've noted that this is not happening in your clusters.
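If you want to double-check the restart theory, the following is a rough sketch that sums container restart counts from the JSON printed by `kubectl get pods ... -o json`. It assumes that metrics-server runs in the `kube-system` namespace with the label `k8s-app=metrics-server`; please verify both in your cluster. The helper names are made up for this example.

```python
import json
import subprocess


def restart_counts(pod_list: dict) -> dict:
    """Map each pod name to the total restart count of its containers."""
    counts = {}
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        statuses = pod.get("status", {}).get("containerStatuses", [])
        counts[name] = sum(s.get("restartCount", 0) for s in statuses)
    return counts


def fetch_metrics_server_pods() -> dict:
    """Run kubectl and parse its JSON output (namespace and label are assumptions)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", "kube-system",
         "-l", "k8s-app=metrics-server", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

On a machine with access to the cluster, `restart_counts(fetch_metrics_server_pods())` returns one entry per metrics-server pod. Counts that stay at zero while the alert keeps firing support your observation that restarts are not the cause.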

    If the alert resolves on its own and there are no visible impacts on workload performance or metrics collection, it is generally considered a transient or benign issue. This is especially true in managed environments like AKS, where the control plane components are not under direct user control. For background, see the Kubernetes documentation on apiserver-aggregation.
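To see what the API server itself currently reports for the aggregated API, you can read the `Available` condition on the APIService object. This is a minimal sketch that parses the JSON from `kubectl get apiservice v1beta1.metrics.k8s.io -o json`. The function names and the `NoAvailableCondition` placeholder are invented for this example and are not part of any Kubernetes API.

```python
import json
import subprocess


def availability(apiservice: dict) -> tuple:
    """Return (is_available, reason, message) from the APIService 'Available' condition."""
    for cond in apiservice.get("status", {}).get("conditions", []):
        if cond.get("type") == "Available":
            return (cond.get("status") == "True",
                    cond.get("reason", ""),
                    cond.get("message", ""))
    # Placeholder reason used by this sketch when no such condition is present.
    return (False, "NoAvailableCondition", "")


def fetch_apiservice(name: str = "v1beta1.metrics.k8s.io") -> dict:
    """Run kubectl and parse the APIService object as JSON."""
    result = subprocess.run(
        ["kubectl", "get", "apiservice", name, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

Running `availability(fetch_apiservice())` a few times while the alert is firing shows whether the condition really flips to unavailable, and the reason and message fields usually hint at why.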

    If you found this information helpful, please click "Upvote" on the post to let us know.

    If you have any further queries, feel free to ask. We are happy to assist you.

    Thank You.

    1 person found this answer helpful.

0 additional answers
