AKS with Application Gateway ingress controller (AGIC): after upgrading AKS to 1.24.6 all probes are unhealthy inside AG.

Pavel 11 Reputation points
2022-10-17T05:55:13.893+00:00

After upgrading AKS with Application Gateway ingress controller (AGIC) to 1.24.6, all probes are unhealthy and throw timeout errors:

Time taken by the backend to respond to application gateway's health probe is more than the time-out threshold in the probe configuration. Either increase the time-out threshold in the probe configuration or resolve the backend issues. Note: for default probe, the http timeout is 30s.

Pods and services have no issues inside the cluster, and the ingress controller pod's logs show no errors either. Listeners, backend settings, rules, and health probes are all created successfully.

Downgrading to 1.22.15 solved the issue.

Is there anything specific to AKS 1.24 and the AGIC ingress controller, or is this a bug?

Azure Application Gateway
Azure Kubernetes Service (AKS)

3 answers

  1. Pavel 11 Reputation points
    2022-10-19T00:48:44.54+00:00

    None of this applies to our case. We upgraded the QA environment and probes stopped working (just timeouts, no traces). To replicate it we upgraded Development, and it hit the same issue. QA was downgraded to 1.22.15 while Dev stayed on 1.24.6. Without any changes, after 2 days Dev (1.24.6) recovered and started working again. I have no explanation for that and will keep monitoring.

    1 person found this answer helpful.

  2. Dharmaraj Upase 6 Reputation points
    2022-11-25T06:55:05.407+00:00

    We are facing exactly the same issue with the same version. Since a downgrade is not supported, can you please confirm whether there is an issue with this particular version?

    1 person found this answer helpful.

  3. Agustín Barriada 0 Reputation points
    2023-01-13T08:46:25.1033333+00:00

    We found this while facing the same issue.

    What we have is an Application Gateway with a single backend IP address.

    So, the listener in the AppGW has just the IP address of the k8s service managing the LB IP of the nginx ingress controller svc.

    The nginx ingress controller has more than one pod, so the single IP is not a single point of failure.

    What is strange to me is that I tested this from one k8s pod:

    kubectl -n nginx exec deploy/nginx-ingress-ingress-nginx-controller -it -- /bin/bash
    bash-5.1$ curl -v http://172.21.1.253/healthz
    *   Trying 172.21.1.253:80...
    * Connected to 172.21.1.253 (172.21.1.253) port 80 (#0)
    > GET /healthz HTTP/1.1
    > Host: 172.21.1.253
    > User-Agent: curl/7.79.1
    > Accept: */*
    >
    * Mark bundle as not supporting multiuse
    < HTTP/1.1 200 OK
    < Date: Fri, 13 Jan 2023 08:34:06 GMT
    < Content-Type: text/html
    < Content-Length: 0
    < Connection: keep-alive
    <
    * Connection #0 to host 172.21.1.253 left intact

    This curl seems to go outside AKS to the nginx SLB and back to an AKS pod.

    But while investigating, we found the SLB health probes had been changed from TCP to HTTP.

    We reverted this change manually and everything worked again for a few minutes, then something suddenly changed it back to HTTP.

    Whatever changes this must be an internal Azure service account, so it must be something in the service itself.

    For the nginx ingress controller we use an internal SLB, declared in the service with only this annotation:

    service.beta.kubernetes.io/azure-load-balancer-internal: 'true'

    Adding just this other one:

    service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: healthz

    resolved our issue, because nginx responds HTTP 200 to /healthz but HTTP 404 to /, which is the default probe path if you do not set another one with this annotation.
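    Putting both annotations together, the fix can be sketched as a complete Service manifest (illustrative only; the service name, namespace, selector labels, and ports are placeholders, not details taken from this thread):

    ```yaml
    # Internal Azure SLB for the nginx ingress controller.
    # The health-probe-request-path annotation makes the cloud provider
    # probe /healthz (which nginx answers with 200) instead of the
    # default path /, which nginx answers with 404.
    apiVersion: v1
    kind: Service
    metadata:
      name: nginx-ingress-ingress-nginx-controller   # placeholder name
      namespace: nginx
      annotations:
        service.beta.kubernetes.io/azure-load-balancer-internal: 'true'
        service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: healthz
    spec:
      type: LoadBalancer
      selector:
        app.kubernetes.io/name: ingress-nginx        # placeholder selector
      ports:
        - name: http
          port: 80
          targetPort: 80
        - name: https
          port: 443
          targetPort: 443
    ```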

    Somehow, and I don't know exactly why, the curl command to the SLB worked, but between Azure services (AppGW health probes and SLB health probes) the SLB was considered unhealthy because of the change from TCP to HTTP.

    You can find more information here (the second link covers the changes in probes from k8s 1.23 to 1.24):

    https://cloud-provider-azure.sigs.k8s.io/topics/loadbalancer/#loadbalancer-annotations

    https://cloud-provider-azure.sigs.k8s.io/topics/loadbalancer/#custom-load-balancer-health-probe

    Now we have our AKS working on k8s 1.25.2.

    I hope you find this interesting.

    Regards
