problem with azure core DNS (dns resolve is not working)


Shreyas Arani 271

Hi, I think I am having a problem with CoreDNS on Azure (AKS). I am following this link: dns-debugging-resolution

Following that doc, I deployed the sample pod to test: kubectl apply -f https://k8s.io/examples/admin/dns/dnsutils.yaml
Then I ran kubectl exec -i -t dnsutils -- nslookup kubernetes.default, but I am getting the following error:

kubectl exec -i -t dnsutils -- nslookup kubernetes.default  
;; connection timed out; no servers could be reached  
  
command terminated with exit code 1

So I guess there is some problem with CoreDNS. I checked whether the CoreDNS pods and Service are running; the commands below show that they are running fine:

kubectl get pods --namespace=kube-system -l k8s-app=kube-dns  
NAME                       READY   STATUS    RESTARTS   AGE  
coredns-84d976c568-jhbvw   1/1     Running   0          47h  
coredns-84d976c568-wdkgg   1/1     Running   0          47h  
  
 kubectl get svc --namespace=kube-system | grep dns  
kube-dns         ClusterIP   10.0.0.10      <none>        53/UDP,53/TCP                  135d

Checking the logs of the CoreDNS pods further, I see the warnings below, which look suspicious and unexpected. This makes it more evident that CoreDNS has some issue.

kubectl logs --namespace=kube-system -l k8s-app=kube-dns  
[WARNING] No files matching import glob pattern: custom/*.override  
[WARNING] No files matching import glob pattern: custom/*.server  
[WARNING] No files matching import glob pattern: custom/*.override  
[WARNING] No files matching import glob pattern: custom/*.server  
[WARNING] No files matching import glob pattern: custom/*.override  
[WARNING] No files matching import glob pattern: custom/*.server  
[WARNING] No files matching import glob pattern: custom/*.override  
[WARNING] No files matching import glob pattern: custom/*.server  
[WARNING] No files matching import glob pattern: custom/*.override  
[WARNING] No files matching import glob pattern: custom/*.server  
[WARNING] No files matching import glob pattern: custom/*.override  
[WARNING] No files matching import glob pattern: custom/*.server  
[WARNING] No files matching import glob pattern: custom/*.override  
[WARNING] No files matching import glob pattern: custom/*.server  
[WARNING] No files matching import glob pattern: custom/*.override  
[WARNING] No files matching import glob pattern: custom/*.server  
[WARNING] No files matching import glob pattern: custom/*.override  
[WARNING] No files matching import glob pattern: custom/*.server  
[WARNING] No files matching import glob pattern: custom/*.override  
[WARNING] No files matching import glob pattern: custom/*.server

Due to this DNS issue, our Istio pods are failing DNS resolution and getting timeout errors.

2021-11-11T06:02:17.996723Z     error   citadelclient   Failed to create certificate: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup istiod.istio-system.svc on 10.X.X.X:53: read udp 10.X.X.X:46002->10.X.X.X:53: i/o timeout"  
2021-11-11T06:02:17.996737Z     error   cache   resource:default request:b7411448-02b9-48bb-aab8-a36966e829fb CSR retrial timed out: rpc error: code = Unavailable desc= connection error: desc = "transport: Error while dialing dial tcp: lookup istiod.istio-system.svc on 10.X.X.X:53: read udp 10.X.X.X:46002->10.X.X.X:53: i/o timeout"  
2021-11-11T06:02:17.996751Z     error   cache   resource:default failed to generate secret for proxy: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup istiod.istio-system.svc on 10.X.X.X:53: read udp 10.X.X.X:46002->10.X.X.X:53: i/o timeout"  
2021-11-11T06:02:17.996759Z     error   sds     resource:default Close connection. Failed to get secret for proxy "sidecar~10.X.X.X~sleep-78c656c8ff-bhx5k.foo~foo.svc.cluster.local" from secret cache: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup istiod.istio-system.svc on 10.X.X.X:53: read udp 10.X.X.X:46002->10.X.X.X:53: i/o timeout"  
2021-11-11T06:02:17.996842Z     info    sds     resource:default connection is terminated: rpc error: code = Canceled desc = context canceled

We have another AKS cluster where we followed the same steps, and there we got the expected output from the command below:

kubectl exec -i -t dnsutils -- nslookup kubernetes.default  
Server:         10.0.0.10  
Address:        10.0.0.10#53  
  
Name:   kubernetes.default.svc.cluster.local  
Address: 10.0.0.1

Can anyone please help me to resolve this issue?

Thanks in advance

@SRIJIT-BOSE-MSFT can you please help me to resolve this issue?

SRIJIT-BOSE-MSFT 4,346 Reputation points Microsoft Employee

2021-11-12T04:27:44.927+00:00

@Shreyas Arani , thanks for the question. Are you using the coredns-custom Configmap in the kube-system namespace to customize CoreDNS? If so, then please check if the data is correctly set.

Quick reference: https://github.com/coredns/coredns/issues/3600
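
As far as I know, the default AKS Corefile imports optional snippets from the coredns-custom ConfigMap, so the "No files matching import glob pattern" warnings in the question are expected when that ConfigMap has no *.override or *.server keys. A minimal sketch for hiding them so that other log lines stand out; the function name filter_import_warnings is hypothetical:

```shell
# Drop the "No files matching import glob pattern" warnings from CoreDNS
# log text so that any remaining warnings or errors stand out.
# Reads log text on stdin and writes the filtered log to stdout.
filter_import_warnings() {
  # grep -v exits non-zero when nothing is left; that is not an error here.
  grep -v 'No files matching import glob pattern' || true
}

# Example usage (requires cluster access):
#   kubectl logs --namespace=kube-system -l k8s-app=kube-dns | filter_import_warnings
#   kubectl get configmap coredns-custom --namespace=kube-system -o yaml
```

If the filtered output is empty, the logs contain nothing except the import warnings.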
Shreyas Arani 271 Reputation points

2021-11-12T06:17:09.327+00:00

Hi @SRIJIT-BOSE-MSFT, I haven't used any ConfigMap to customize CoreDNS. I am just using the default that comes with deploying an AKS cluster.
SRIJIT-BOSE-MSFT 4,346 Reputation points Microsoft Employee

2021-11-12T08:01:39.607+00:00

@Shreyas Arani , do you have any Network Policy configured which denies ingress to the coredns pods?

Also can you please share the output from the following commands?

kubectl logs coredns-xxxxxxxxxx-xxxxx -n kube-system for each of the coredns pods separately,
kubectl describe po -l k8s-app=kube-dns -A, and,
for i in $(kubectl get po --namespace=kube-system -l k8s-app=kube-dns -o jsonpath='{.items[*].spec.nodeName}'); do kubectl top node $i; done

SRIJIT-BOSE-MSFT 4,346 Reputation points Microsoft Employee

2021-11-12T08:15:59.397+00:00

@Shreyas Arani , can you please also perform the following?

Get the Pod IPs of the coredns Pods and the ClusterIP of the kube-dns Service and check if trying to resolve a domain name fails with both or only with the Service's ClusterIP.

For example,

kubectl get svc,po -n kube-system -l=k8s-app=kube-dns -o wide  
NAME               TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)         AGE   SELECTOR  
service/kube-dns   ClusterIP   10.0.0.10    <none>        53/UDP,53/TCP   29d   k8s-app=kube-dns  
  
NAME                          READY   STATUS    RESTARTS   AGE   IP            NODE                                NOMINATED NODE   READINESS GATES  
pod/coredns-9d6c6c99b-fhgjs   1/1     Running   5          14d   10.240.0.24   aks-agentpool52984259-vmss000000   <none>           <none>  
pod/coredns-9d6c6c99b-p9jpg   1/1     Running   5          14d   10.240.0.82   aks-agentpool-52984259-vmss000000   <none>           <none>  
  
kubectl exec -it dnsutils -- nslookup google.com 10.0.0.10  
;; connection timed out; no servers could be reached  
  
kubectl exec -it dnsutils -- nslookup google.com 10.240.0.24  
Server:         10.240.0.24  
Address:        10.240.0.24#53  
  
Non-authoritative answer:  
Name:   google.com  
Address: 142.250.183.206  
Name:   google.com  
Address: 2404:6800:4009:824::200e

Note: You can also replace google.com with kubernetes.default
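
The comparison above can be scripted so that every endpoint is tried in one pass. A minimal sketch, assuming the dnsutils pod from the debugging guide is running in the current namespace; the function name probe_dns_endpoints is hypothetical:

```shell
# Query one name against the kube-dns ClusterIP and every CoreDNS pod IP.
# $1 = name to resolve (defaults to kubernetes.default).
# Assumes a pod named dnsutils exists in the current namespace.
probe_dns_endpoints() {
  local name="${1:-kubernetes.default}" svc_ip pod_ips ip
  svc_ip=$(kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}')
  pod_ips=$(kubectl get pods -n kube-system -l k8s-app=kube-dns -o jsonpath='{.items[*].status.podIP}')
  # Word splitting on $pod_ips is intentional: one loop pass per IP.
  for ip in $svc_ip $pod_ips; do
    if kubectl exec dnsutils -- nslookup "$name" "$ip" >/dev/null 2>&1; then
      echo "$ip ok"
    else
      echo "$ip FAILED"
    fi
  done
}
```

Calling probe_dns_endpoints google.com prints one line per endpoint. A failure only on the ClusterIP points at the Service path, while failures on the pod IPs as well point at pod-to-pod connectivity.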

Shreyas Arani 271 Reputation points

2021-11-12T08:44:09.93+00:00

@SRIJIT-BOSE-MSFT I haven't configured any network policies that block ingress/egress. I have two nodes in my cluster. I am attaching the logs of both coredns pods along with the output of the describe command: 148740-dnspod1-logs.txt, 148786-dnspod2-logs.txt, 148825-dnsdescribepod-logs.txt

Shreyas Arani 271 Reputation points

2021-11-12T08:47:40.473+00:00

@SRIJIT-BOSE-MSFT I am unable to perform nslookup against either the ClusterIP or the coredns pod IP.

 kubectl get svc,po -n kube-system -l=k8s-app=kube-dns -o wide  
NAME               TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)         AGE    SELECTOR  
service/kube-dns   ClusterIP   10.0.0.10    <none>        53/UDP,53/TCP   136d   k8s-app=kube-dns  
  
NAME                           READY   STATUS    RESTARTS   AGE   IP            NODE                                NOMINATED NODE   READINESS GATES  
pod/coredns-84d976c568-5dcw8   1/1     Running   0          18h   10.244.1.17   aks-agentpool-28249642-vmss00002j   <none>           <none>  
pod/coredns-84d976c568-zctgs   1/1     Running   0          18h   10.244.1.16   aks-agentpool-28249642-vmss00002j   <none>           <none>  
  
 kubectl exec -it dnsutils -- nslookup google.com 10.0.0.10  
;; connection timed out; no servers could be reached  
  
command terminated with exit code 1  
 kubectl exec -it dnsutils -- nslookup kubernetes.default 10.0.0.10  
;; connection timed out; no servers could be reached  
  
command terminated with exit code 1  
  kubectl exec -it dnsutils -- nslookup google.com 10.244.1.17  
;; connection timed out; no servers could be reached  
  
command terminated with exit code 1  
 kubectl exec -it dnsutils -- nslookup kubernetes.default 10.244.1.17  
;; connection timed out; no servers could be reached  
  
command terminated with exit code 1
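
Since the lookups fail even when sent straight to a coredns pod IP, it may help to check whether the client pod and the coredns pods run on different nodes, because with kubenet cross-node pod traffic depends on the routes in the cluster's route table. A minimal sketch, assuming the client pod is named dnsutils in the current namespace; the function name compare_dns_nodes is hypothetical:

```shell
# Print the node of the client pod and of each CoreDNS pod, then report
# whether the client shares a node with any CoreDNS pod.
# $1 = client pod name (defaults to dnsutils) in the current namespace.
compare_dns_nodes() {
  local client="${1:-dnsutils}" client_node dns_nodes
  client_node=$(kubectl get pod "$client" -o jsonpath='{.spec.nodeName}')
  dns_nodes=$(kubectl get pods -n kube-system -l k8s-app=kube-dns -o jsonpath='{.items[*].spec.nodeName}')
  echo "client node: $client_node"
  echo "coredns nodes: $dns_nodes"
  # Pad with spaces so only whole node names match.
  case " $dns_nodes " in
    *" $client_node "*) echo "placement: same node" ;;
    *) echo "placement: different nodes" ;;
  esac
}
```

A "different nodes" result together with the timeouts above would be consistent with a cross-node routing problem, though it does not prove one.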

Shreyas Arani 271 Reputation points

2021-11-12T09:04:38.037+00:00

@SRIJIT-BOSE-MSFT please find the output of the command below:

for i in $(kubectl get po --namespace=kube-system -l k8s-app=kube-dns -o jsonpath='{.items[*].spec.nodeName}'); do kubectl top node $i; done  
NAME                                CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%  
aks-agentpool-28249642-vmss00002j   161m         8%     3212Mi          70%  
NAME                                CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%  
aks-agentpool-28249642-vmss00002j   161m         8%     3212Mi          70%

SRIJIT-BOSE-MSFT 4,346 Reputation points Microsoft Employee

2021-11-12T10:33:31.18+00:00

@Shreyas Arani , thank you for sharing all the logs.

What immediately jumps out at me are errors similar to this:

[ERROR] plugin/errors: 2 2235095750251820864.5183428657909233302. HINFO: read udp 10.244.1.17:54531->168.63.129.16:53: i/o timeout  
[ERROR] plugin/errors: 2 2235095750251820864.5183428657909233302. HINFO: read udp 10.244.1.17:39967->168.63.129.16:53: i/o timeout  
[ERROR] plugin/errors: 2 2235095750251820864.5183428657909233302. HINFO: read udp 10.244.1.17:53534->168.63.129.16:53: i/o timeout  
[ERROR] plugin/errors: 2 2235095750251820864.5183428657909233302. HINFO: read udp 10.244.1.17:48475->168.63.129.16:53: i/o timeout
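
168.63.129.16 is the Azure platform address that serves DNS to virtual networks, so these lines show CoreDNS timing out against its upstream resolver. To see how widespread this is across a longer log, the timeouts can be counted per upstream. A minimal sketch that only parses log text; the function name count_upstream_timeouts is hypothetical:

```shell
# Count "i/o timeout" errors per upstream resolver in CoreDNS log text.
# Reads logs on stdin; prints "<count> <upstream ip:port>" per upstream.
count_upstream_timeouts() {
  grep 'i/o timeout' \
    | sed -n 's|.*->\([0-9.]*:[0-9]*\): i/o timeout.*|\1|p' \
    | sort | uniq -c | awk '{print $1, $2}'
}

# Example usage (requires cluster access):
#   kubectl logs --namespace=kube-system -l k8s-app=kube-dns | count_upstream_timeouts
```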

One last question, so that I can get a good understanding of this situation: are you using the kubenet network plugin and an existing VNet/subnet with the AKS cluster, or are you using the Azure CNI network plugin?

You should be able to check the network plugin for the AKS cluster using:

az aks show -g $ResourceGroupName -n $ClusterName --query networkProfile.networkPlugin -o tsv

Shreyas Arani 271 Reputation points

2021-11-12T10:40:54.607+00:00

@SRIJIT-BOSE-MSFT we are using the default kubenet network plugin and are not sure about the subnet. How can we check which subnet the AKS cluster is using?

az aks show -g rg-infra-prototyping-1 -n test-cluster --query networkProfile.networkPlugin -o tsv
kubenet
Shreyas Arani 271 Reputation points

2021-11-12T11:28:34.87+00:00

@SRIJIT-BOSE-MSFT this is the network info related to aks cluster
SRIJIT-BOSE-MSFT 4,346 Reputation points Microsoft Employee

2021-11-12T11:42:59.807+00:00

@Shreyas Arani , please use the following steps to find the VNet/subnet being used by the AKS cluster nodes:

Go to the AKS node resource group (by default MC_<resourcegroupname>_<clustername>_<location>) on the Azure Portal.

Navigate to the node virtual machine scale set and go to the Networking menu from the left hand panel.

Click on the Virtual network/subnet link as shown below:
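
The same information should also be reachable from the Azure CLI. A minimal sketch, assuming the cluster uses the AKS-managed VNet in the node resource group; the function name list_aks_subnets is hypothetical:

```shell
# List the subnet IDs of every virtual network in the AKS node resource group.
# $1 = resource group of the cluster, $2 = cluster name.
# Assumption: the VNet is the AKS-managed one. With a bring-your-own VNet the
# network lives in another resource group and the second command lists nothing.
list_aks_subnets() {
  local rg="$1" cluster="$2" node_rg
  node_rg=$(az aks show -g "$rg" -n "$cluster" --query nodeResourceGroup -o tsv)
  echo "node resource group: $node_rg"
  az network vnet list -g "$node_rg" --query '[].subnets[].id' -o tsv
}

# Example usage (requires az login):
#   list_aks_subnets rg-infra-prototyping-1 test-cluster
```

For a cluster deployed into an existing VNet, az aks show with --query 'agentPoolProfiles[].vnetSubnetId' should return the subnet ID directly.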
Shreyas Arani 271 Reputation points

2021-11-12T12:14:10.877+00:00

@SRIJIT-BOSE-MSFT the following is the VNet/subnet used by my AKS cluster
SRIJIT-BOSE-MSFT 4,346 Reputation points Microsoft Employee

2021-11-12T13:33:13.897+00:00
@Shreyas Arani , thank you for sharing the details.

One possibility is that the subnet is not associated with the AKS route table. If that is the case, please associate it, which should resolve the issue.

To do this, first browse to the virtual network being used by AKS in the portal.

From there, click Subnets in the left-hand panel, then click the name of the subnet being used by AKS in the list.

You will see a section labeled Route table on the overlay that appears; below it is either the name of the route table or "None". If "None" is shown, click it to select the route table to use, then click Save on the subnet blade. You can confirm the name of the AKS route table from the node resource group.

[Reference]

If this is not the case, then we would recommend opening a Microsoft Technical Support Request to proceed with in-depth troubleshooting.
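
The route table association described above can also be read from the Azure CLI. A minimal sketch, with the resource group, VNet name and subnet name passed in as arguments; the function name subnet_route_table is hypothetical:

```shell
# Report which route table, if any, is associated with a subnet.
# $1 = resource group of the VNet, $2 = VNet name, $3 = subnet name.
subnet_route_table() {
  local rt
  rt=$(az network vnet subnet show -g "$1" --vnet-name "$2" -n "$3" \
    --query routeTable.id -o tsv)
  # Assumption: a missing association yields empty output (or the literal
  # string "None") with -o tsv, so both cases are treated as "no route table".
  if [ -n "$rt" ] && [ "$rt" != "None" ]; then
    echo "route table: $rt"
  else
    echo "route table: None"
  fi
}
```

If the result is "None", az network vnet subnet update with the --route-table option is the CLI counterpart of the portal steps.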
Shreyas Arani 271 Reputation points

2021-11-12T13:42:19.84+00:00

@SRIJIT-BOSE-MSFT the route table is already associated with the subnet. Following snapshot confirms it

Answer 1

I had the same issue and was able to resolve it. I am not sure whether you have the same problem, but it is worth checking.
In my scenario DNS was working fine. Then I increased the number of nodes in the AKS node pool, and after that it stopped working, so I looked for the root cause and found the following.
Whenever the number of nodes is increased, AKS creates a new route in the route table for the new node. In my org we had a policy restricting any changes to route tables, so the route was not added and I started getting this issue.

![165020-image.png](/api/attachments/165020-image.png?platform=QnA)

I had only one route (besides the default), which was created initially while the policy was exempted. Now that the policy is active again, AKS cannot create further routes, and so we started getting this issue.
Technically, every AKS node should have its own route, with that node's IP address as the next hop.
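
One way to check this is to compare each node's internal IP with the next hops in the route table. A minimal sketch, assuming a kubenet cluster and that the route table's resource group and name are known; the function name nodes_missing_routes is hypothetical:

```shell
# List nodes whose internal IP is not the next hop of any route in the table.
# $1 = resource group of the route table, $2 = route table name.
nodes_missing_routes() {
  local next_hops
  next_hops=$(az network route-table route list -g "$1" --route-table-name "$2" \
    --query '[].nextHopIpAddress' -o tsv)
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}' \
    | while read -r node ip; do
        [ -n "$node" ] || continue
        # -x matches whole lines, -F treats the IP as a fixed string.
        if ! printf '%s\n' "$next_hops" | grep -qxF "$ip"; then
          echo "$node ($ip) has no route"
        fi
      done
}
```

Empty output means every node IP appears as a next hop. Any node listed would match the situation described in this answer.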

Answer 2

I am posting a similar problem and its solution, just in case it proves helpful to someone.

After a manual certificate rotation (az aks rotate-certs) of a cluster, the pods could not connect to a Redis server in Azure. After following the https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/troubleshoot-dns-failure-from-pod-but-not-from-worker-node docs and the commands suggested in this thread, it became clear that the problem was DNS related: nslookup against 10.0.0.10#53 was timing out, and only 1 of the 3 coredns pods was able to resolve "google.com"; the other 2 pods were unreachable.

After stopping and starting the k8s cluster the issue was resolved.
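
The stop and start can be done from the Azure CLI. A minimal sketch, with the resource group and cluster name passed in as arguments; the function name restart_aks_cluster is hypothetical:

```shell
# Stop and then start an AKS cluster.
# $1 = resource group, $2 = cluster name.
# Every workload on the cluster is offline until the start completes.
restart_aks_cluster() {
  # Only start the cluster if the stop succeeded.
  az aks stop --resource-group "$1" --name "$2" &&
    az aks start --resource-group "$1" --name "$2"
}
```

Because this takes the whole cluster down, it is better suited to test clusters or a maintenance window.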
