AKS CoreDNS Override Reverting Automatically — DNS Resolution Failure in Production Cluster

Praveen Karape 0 Reputation points
2025-07-03T04:39:59.4266667+00:00

Hello Team,

We are encountering a critical DNS resolution issue in our production Azure Kubernetes Service (AKS) cluster and require assistance to understand the root cause and implement a stable solution.


Issue Summary:

  • Cluster Name: abc-AKS-Production

Namespace: kube-system

Deployment: coredns

Problem: Pods in the cluster are unable to resolve external domains (e.g., www.abc.com, api.abc.com), causing widespread service disruptions.


Error Logs:

CoreDNS pods repeatedly show the following timeout error:

bashCopyEdit[ERROR] plugin/errors: www.abc.com. A: read udp ...->168.63.129.16:53: i/o timeout

Attempts & Observations:

Tried CoreDNS override using coredns-custom ConfigMap with correct volume mount, following

Added entries like:

bashCopyEditforward . 168.63.129.16 8.8.8.8

Restarted CoreDNS pods — DNS resolution worked temporarily but reverted within minutes, and failures resumed.

Attempted using Google (8.8.8.8) and Cloudflare (1.1.1.1) DNS as fallback.

Verified that no manual or automated process from our side is overwriting the ConfigMap.


Additional Behavior Observed:

DNS config changes revert automatically, likely due to AKS internal reconciliation process.

Connection to the cluster via Lens intermittently fails with some APIs not responding.

After DNS auto-reverted to default (*.aksdns.net), resolution began working again without manual intervention.

However, this behavior is inconsistent and unpredictable.


Request for Clarification & Help:

Why is the supported CoreDNS override not being honored consistently?

Is this automatic rollback behavior a known AKS issue or expected due to AKS CoreDNS reconciliation logic?

Is there a stable, long-term supported method to override upstream DNS settings in AKS clusters, especially when Azure DNS becomes unreliable?

What caused the DNS failures and Lens connection issues, and what preventive measures can be taken to avoid this in the future?


We are currently using the default AKS DNS configuration and have not made persistent changes previously. However, this issue has raised significant concerns regarding stability and resilience.

We would appreciate any guidance, Microsoft documentation, or best practices on ensuring stable DNS resolution in AKS clusters, especially under intermittent Azure DNS outages.

Thank you,

Azure Kubernetes Service
Azure Kubernetes Service
An Azure service that provides serverless Kubernetes, an integrated continuous integration and continuous delivery experience, and enterprise-grade security and governance.
2,463 questions
{count} votes

1 answer

Sort by: Most helpful
  1. ArkoSen-2904 10 Reputation points
    2025-07-04T04:39:27.6966667+00:00

    Hello Praveen Karape,
    When using the default AKS DNS add-on, AKS regularly reconciles and enforces the original configuration. This means any direct modifications to the CoreDNS config will be overwritten unless you disable the managed DNS add-on. To apply a persistent override, you have two options-

    1. Disable AKS-managed DNS to prevent it from reverting custom CoreDNS changes. This may require checking if your cluster version supports this feature via the az aks update command.
    2. Alternatively you can deploy your own self-managed CoreDNS instance within the cluster and configure your workloads to use it explicitly by setting dnsPolicy: None and defining dnsConfig to route queries through your custom resolver.

    The intermittent DNS failures you're seeing could be due to transient Azure DNS availability issues, which are rare but can impact critical workloads if there's no fallback.

    For more details please check this MS document- https://learn.microsoft.com/en-us/azure/aks/coredns-custom


Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.