Problem with Azure CoreDNS (DNS resolution is not working)

Shreyas Arani 266 Reputation points
2021-11-11T11:19:46.497+00:00

Hi, I think I have a problem with CoreDNS on Azure (AKS). I am following this guide: dns-debugging-resolution

Following that doc, I deployed the sample pod for testing: kubectl apply -f https://k8s.io/examples/admin/dns/dnsutils.yaml
Then I ran kubectl exec -i -t dnsutils -- nslookup kubernetes.default, but I get the following error:

kubectl exec -i -t dnsutils -- nslookup kubernetes.default  
;; connection timed out; no servers could be reached  
  
command terminated with exit code 1  

So I suspect a problem with CoreDNS. I checked whether the CoreDNS pods and service are running, and the commands below show that they are:

kubectl get pods --namespace=kube-system -l k8s-app=kube-dns  
NAME                       READY   STATUS    RESTARTS   AGE  
coredns-84d976c568-jhbvw   1/1     Running   0          47h  
coredns-84d976c568-wdkgg   1/1     Running   0          47h  
  
kubectl get svc --namespace=kube-system | grep dns  
kube-dns         ClusterIP   10.0.0.10      <none>        53/UDP,53/TCP                  135d  
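
The same Kubernetes DNS debugging guide also suggests checking the pod's local DNS configuration. The sketch below is one way to confirm that /etc/resolv.conf inside the pod points at the kube-dns ClusterIP listed above (10.0.0.10). The helper name check_nameserver is hypothetical, and it only parses text passed on stdin:

```shell
# Hypothetical helper: reads resolv.conf content on stdin and reports whether
# the expected DNS service IP is listed as a nameserver.
check_nameserver() {
  expected_ip="$1"
  # Print the second field of every "nameserver <ip>" line, then look for an
  # exact, whole-line match of the expected IP.
  if awk '$1 == "nameserver" { print $2 }' | grep -qxF "$expected_ip"; then
    echo "OK: nameserver $expected_ip is configured"
  else
    echo "MISMATCH: nameserver $expected_ip not found"
    return 1
  fi
}

# Usage against the cluster (the dnsutils pod and 10.0.0.10 come from the output above):
#   kubectl exec dnsutils -- cat /etc/resolv.conf | check_nameserver 10.0.0.10
```

If the nameserver matches and lookups still time out, the fault is more likely on the network path to CoreDNS than in the pod's DNS settings.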

Checking the logs of the CoreDNS pods, I see the warnings below, which look suspicious and unexpected to me and make me suspect CoreDNS even more.

kubectl logs --namespace=kube-system -l k8s-app=kube-dns  
[WARNING] No files matching import glob pattern: custom/*.override  
[WARNING] No files matching import glob pattern: custom/*.server  
[WARNING] No files matching import glob pattern: custom/*.override  
[WARNING] No files matching import glob pattern: custom/*.server  
[WARNING] No files matching import glob pattern: custom/*.override  
[WARNING] No files matching import glob pattern: custom/*.server  
[WARNING] No files matching import glob pattern: custom/*.override  
[WARNING] No files matching import glob pattern: custom/*.server  
[WARNING] No files matching import glob pattern: custom/*.override  
[WARNING] No files matching import glob pattern: custom/*.server  
[WARNING] No files matching import glob pattern: custom/*.override  
[WARNING] No files matching import glob pattern: custom/*.server  
[WARNING] No files matching import glob pattern: custom/*.override  
[WARNING] No files matching import glob pattern: custom/*.server  
[WARNING] No files matching import glob pattern: custom/*.override  
[WARNING] No files matching import glob pattern: custom/*.server  
[WARNING] No files matching import glob pattern: custom/*.override  
[WARNING] No files matching import glob pattern: custom/*.server  
[WARNING] No files matching import glob pattern: custom/*.override  
[WARNING] No files matching import glob pattern: custom/*.server  
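
As far as I know, AKS CoreDNS logs these import-glob warnings whenever the optional coredns-custom ConfigMap contains no *.override or *.server entries, so on their own they usually do not indicate a fault. A minimal sketch for hiding them so that any other log messages stand out (the helper name filter_import_warnings is hypothetical):

```shell
# Hypothetical helper: drops the import-glob warnings from CoreDNS log text on
# stdin and passes every other line through unchanged.
filter_import_warnings() {
  # grep exits non-zero when nothing is left to print; treat that as success.
  grep -v 'No files matching import glob pattern' || true
}

# Usage against the cluster:
#   kubectl logs --namespace=kube-system -l k8s-app=kube-dns | filter_import_warnings
```

Empty output means the pods logged nothing besides those warnings, which points the investigation away from the CoreDNS configuration.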

Because of this DNS issue, our Istio pods cannot resolve DNS names and fail with timeout errors:

2021-11-11T06:02:17.996723Z     error   citadelclient   Failed to create certificate: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup istiod.istio-system.svc on 10.X.X.X:53: read udp 10.X.X.X:46002->10.X.X.X:53: i/o timeout"  
2021-11-11T06:02:17.996737Z     error   cache   resource:default request:b7411448-02b9-48bb-aab8-a36966e829fb CSR retrial timed out: rpc error: code = Unavailable desc= connection error: desc = "transport: Error while dialing dial tcp: lookup istiod.istio-system.svc on 10.X.X.X:53: read udp 10.X.X.X:46002->10.X.X.X:53: i/o timeout"  
2021-11-11T06:02:17.996751Z     error   cache   resource:default failed to generate secret for proxy: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup istiod.istio-system.svc on 10.X.X.X:53: read udp 10.X.X.X:46002->10.X.X.X:53: i/o timeout"  
2021-11-11T06:02:17.996759Z     error   sds     resource:default Close connection. Failed to get secret for proxy "sidecar~10.X.X.X~sleep-78c656c8ff-bhx5k.foo~foo.svc.cluster.local" from secret cache: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup istiod.istio-system.svc on 10.X.X.X:53: read udp 10.X.X.X:46002->10.X.X.X:53: i/o timeout"  
2021-11-11T06:02:17.996842Z     info    sds     resource:default connection is terminated: rpc error: code = Canceled desc = context canceled  

We have another AKS cluster where we followed the same steps and got the expected output from the same command:

kubectl exec -i -t dnsutils -- nslookup kubernetes.default  
Server:         10.0.0.10  
Address:        10.0.0.10#53  
  
Name:   kubernetes.default.svc.cluster.local  
Address: 10.0.0.1  

Can anyone please help me to resolve this issue?

Thanks in advance

@SRIJIT-BOSE-MSFT can you please help me to resolve this issue?

Azure Kubernetes Service (AKS)

2 answers

  1. Amjad Nagori 286 Reputation points
    2022-01-14T10:01:27.08+00:00

    I had the same issue and was able to resolve it. I am not sure whether your problem is the same, but it is worth checking.
    In my scenario DNS was working fine until I increased the number of nodes in the AKS node pool, after which it stopped working. I looked for the root cause and found the following.
    Whenever the number of nodes is increased, AKS creates a new route in the route table for the new node. My organization has a policy that blocks changes to any route table, so the route was not added and I started getting this issue.

    ![165020-image.png][2]

    I had only one route (apart from the default), which was created initially while the policy was exempted. Now that the policy is active again, AKS cannot create further routes, which is why we started getting this issue.
    Technically, every AKS node should have its own route whose next hop is that node's IP address.

    [2]: /api/attachments/165020-image.png?platform=QnA
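
    The comparison described in this answer can be sketched roughly as follows: list the internal IP of every node, list the next hop of every route, and print the node IPs that have no route. This presumably applies only to clusters where AKS manages per-node routes (kubenet networking). The helper name nodes_without_route, the file names, and the resource names in the usage comment are all placeholders, not taken from the answer:

    ```shell
    # Hypothetical helper: prints every node IP (first file) that does not appear
    # as a next hop in the route list (second file). Both files hold one IP per line.
    nodes_without_route() {
      node_ips_file="$1"
      next_hops_file="$2"
      # -v: invert match, -x: whole line, -F: fixed strings, -f: patterns from file
      grep -vxF -f "$next_hops_file" "$node_ips_file" || true
    }

    # Usage sketch (resource group and route table names are placeholders):
    #   kubectl get nodes -o jsonpath='{range .items[*]}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}' > node-ips.txt
    #   az network route-table route list --resource-group <node-resource-group> --route-table-name <route-table> --query '[].nextHopIpAddress' --output tsv > next-hops.txt
    #   nodes_without_route node-ips.txt next-hops.txt
    ```

    Any IP printed belongs to a node whose route is missing, which would match the situation described here.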

    1 person found this answer helpful.

  2. Denis Shendrik 0 Reputation points
    2023-12-19T20:34:41.1+00:00

    I am posting a similar problem and its solution, just in case it proves helpful to someone.

    After a manual certificate rotation (az aks rotate-certs) of a cluster, the pods could not connect to a Redis server in Azure. After following the https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/troubleshoot-dns-failure-from-pod-but-not-from-worker-node docs and running the commands suggested in this thread, it became clear that the problem was DNS-related: nslookup against 10.0.0.10#53 was timing out, and only 1 of the 3 coredns pods was able to resolve "google.com"; the other 2 pods were unreachable.

    After stopping and starting the k8s cluster the issue was resolved.
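
    For anyone who wants to repeat the per-pod check, a rough sketch is below. It queries each CoreDNS pod IP directly instead of going through the 10.0.0.10 service address. It assumes the dnsutils test pod from the question is deployed in the current namespace; the helper name check_each_coredns_pod is hypothetical and not part of any tool:

    ```shell
    # Hypothetical helper: reads CoreDNS pod IPs on stdin (one per line) and asks
    # each one directly to resolve a name from inside the dnsutils pod.
    check_each_coredns_pod() {
      name="$1"
      while read -r pod_ip; do
        # Skip blank lines.
        [ -n "$pod_ip" ] || continue
        # stdin is redirected so the command cannot consume the list of IPs.
        if kubectl exec dnsutils -- nslookup "$name" "$pod_ip" >/dev/null 2>&1 </dev/null; then
          echo "$pod_ip OK"
        else
          echo "$pod_ip FAILED"
        fi
      done
    }

    # Usage against the cluster:
    #   kubectl get pods --namespace=kube-system -l k8s-app=kube-dns -o jsonpath='{range .items[*]}{.status.podIP}{"\n"}{end}' | check_each_coredns_pod google.com
    ```

    Running it again after the cluster has been restarted should show whether every CoreDNS pod now answers.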

