AKS with Application Gateway ingress controller (AGIC): after upgrading AKS to 1.24.6 all probes are unhealthy inside AG.

Pavel 11 Reputation points
2022-10-17T05:55:13.893+00:00

After upgrading AKS with Application Gateway ingress controller (AGIC) to 1.24.6, all probes are unhealthy and throw timeout errors:

Time taken by the backend to respond to application gateway's health probe is more than the time-out threshold in the probe configuration. Either increase the time-out threshold in the probe configuration or resolve the backend issues. Note: for default probe, the http timeout is 30s.

Pods and services have no issues inside the cluster, and the ingress controller pod's logs show no errors either. Listeners, backend settings, rules, and health probes are all created successfully.

Downgrading to 1.22.15 solved the issue.

Is there anything specific to AKS 1.24 and the AGIC ingress controller, or is this a bug?

Azure Application Gateway
Azure Kubernetes Service (AKS)

3 answers

  1. Pavel 11 Reputation points
    2022-10-19T00:48:44.54+00:00

    None of this applies to our case. We upgraded the QA environment and probes stopped working (just timeouts, no traces). To replicate it we upgraded Development, and it hit the same issue. QA was downgraded to 1.22.15 while Dev stayed on 1.24.6. Without any changes, after 2 days Dev (1.24.6) recovered and started working again. I have no explanation for that and will keep monitoring.

    1 person found this answer helpful.

  2. Dharmaraj Upase 6 Reputation points
    2022-11-25T06:55:05.407+00:00

    We are facing exactly the same issue with the same version. Since a downgrade is not supported, can you please confirm whether there is an issue with this particular version?

    1 person found this answer helpful.

  3. Agustín Barriada 0 Reputation points
    2023-01-13T08:46:25.1033333+00:00

    We found this while facing the same issue.

    What we have is an Application Gateway with a single backend IP address.

    So, the listener in the AppGW has just the IP address of the k8s service managing the LB IP of the nginx ingress controller svc.

    The nginx ingress controller has more than one pod, so the single IP is not a single point of failure.

    What is strange to me is that I tested this from one k8s pod:

    kubectl -n nginx exec deploy/nginx-ingress-ingress-nginx-controller -it -- /bin/bash
    bash-5.1$ curl -v http://172.21.1.253/healthz
    *   Trying 172.21.1.253:80...
    * Connected to 172.21.1.253 (172.21.1.253) port 80 (#0)
    > GET /healthz HTTP/1.1
    > Host: 172.21.1.253
    > User-Agent: curl/7.79.1
    > Accept: */*
    >
    * Mark bundle as not supporting multiuse
    < HTTP/1.1 200 OK
    < Date: Fri, 13 Jan 2023 08:34:06 GMT
    < Content-Type: text/html
    < Content-Length: 0
    < Connection: keep-alive
    <
    * Connection #0 to host 172.21.1.253 left intact

    This curl seems to go outside AKS to the nginx SLB and back to an AKS pod.

    But while investigating, we found the SLB health probes had been changed from TCP to HTTP.

    We reverted this change manually and everything worked again for a few minutes, then something suddenly changed it back to HTTP.

    Whatever changes this must be an internal Azure service account, so it must be something in the service itself.

    For the nginx ingress controller we use an internal SLB, declared in the service with only this annotation:

    service.beta.kubernetes.io/azure-load-balancer-internal: 'true'

    Adding just this other one:

    service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: healthz

    resolved our issue, because nginx responds HTTP 200 to /healthz but HTTP 404 to /, which is the default probe path if you do not set another one with this annotation.
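    Putting both annotations together, the fix can be sketched as a complete Service manifest (illustrative only; the service name, namespace, selector labels, and ports are placeholders, not details taken from this thread):

    ```yaml
    # Internal Azure SLB for the nginx ingress controller.
    # The health-probe-request-path annotation makes the cloud provider
    # probe /healthz (which nginx answers with 200) instead of the
    # default path /, which nginx answers with 404.
    apiVersion: v1
    kind: Service
    metadata:
      name: nginx-ingress-ingress-nginx-controller   # placeholder name
      namespace: nginx
      annotations:
        service.beta.kubernetes.io/azure-load-balancer-internal: 'true'
        service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: healthz
    spec:
      type: LoadBalancer
      selector:
        app.kubernetes.io/name: ingress-nginx        # placeholder selector
      ports:
        - name: http
          port: 80
          targetPort: 80
        - name: https
          port: 443
          targetPort: 443
    ```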

    Somehow, and I don't know exactly why, the curl command to the SLB worked, but between Azure services (AppGW health probes and SLB health probes) the SLB was considered unhealthy because of the change from TCP to HTTP.

    You can find more information here (the second link covers the changes in probes from k8s 1.23 to 1.24):

    https://cloud-provider-azure.sigs.k8s.io/topics/loadbalancer/#loadbalancer-annotations

    https://cloud-provider-azure.sigs.k8s.io/topics/loadbalancer/#custom-load-balancer-health-probe

    Now we have our AKS working on k8s 1.25.2.

    I hope you find this interesting.

    Regards
