Hello Team,
We are encountering a critical DNS resolution issue in our production Azure Kubernetes Service (AKS) cluster and require assistance to understand the root cause and implement a stable solution.
Issue Summary:
- Cluster Name:
abc-AKS-Production
Namespace: kube-system
Deployment: coredns
Problem: Pods in the cluster are unable to resolve external domains (e.g., www.abc.com
, api.abc.com
), causing widespread service disruptions.
Error Logs:
CoreDNS pods repeatedly show the following timeout error:
bashCopyEdit[ERROR] plugin/errors: www.abc.com. A: read udp ...->168.63.129.16:53: i/o timeout
Attempts & Observations:
Tried CoreDNS override using coredns-custom
ConfigMap with correct volume mount, following
Added entries like:
bashCopyEditforward . 168.63.129.16 8.8.8.8
Restarted CoreDNS pods — DNS resolution worked temporarily but reverted within minutes, and failures resumed.
Attempted using Google (8.8.8.8) and Cloudflare (1.1.1.1) DNS as fallback.
Verified that no manual or automated process from our side is overwriting the ConfigMap.
Additional Behavior Observed:
DNS config changes revert automatically, likely due to AKS internal reconciliation process.
Connection to the cluster via Lens intermittently fails with some APIs not responding.
After DNS auto-reverted to default (*.aksdns.net
), resolution began working again without manual intervention.
However, this behavior is inconsistent and unpredictable.
Request for Clarification & Help:
Why is the supported CoreDNS override not being honored consistently?
Is this automatic rollback behavior a known AKS issue or expected due to AKS CoreDNS reconciliation logic?
Is there a stable, long-term supported method to override upstream DNS settings in AKS clusters, especially when Azure DNS becomes unreliable?
What caused the DNS failures and Lens connection issues, and what preventive measures can be taken to avoid this in the future?
We are currently using the default AKS DNS configuration and have not made persistent changes previously. However, this issue has raised significant concerns regarding stability and resilience.
We would appreciate any guidance, Microsoft documentation, or best practices on ensuring stable DNS resolution in AKS clusters, especially under intermittent Azure DNS outages.
Thank you,