AKS load balancer lost load balancing rules during upgrade

Scott MacKenzie 0 Reputation points
2025-07-03T11:31:58.7+00:00

During an upgrade of the cluster control plane and nodes, our public load balancer lost its rules.

These were the events on my service

Events:
  Type    Reason                Age                From                Message
  ----    ------                ---                ----                -------
  Normal  EnsuringLoadBalancer  31m                service-controller  Ensuring load balancer
  Normal  EnsuredLoadBalancer   31m                service-controller  Ensured load balancer
  Normal  EnsuringLoadBalancer  29m (x2 over 30m)  service-controller  Ensuring load balancer
  Normal  IPFamilies            29m                service-controller  Count: 1 -> 0
  Normal  EnsuredLoadBalancer   29m (x2 over 29m)  service-controller  Ensured load balancer
  Normal  UpdatedLoadBalancer   16m (x5 over 24m)  service-controller  Updated load balancer with new hosts

In the Azure portal, I noticed in my load balancer's activity log an operation to Create or Update the load balancer during the upgrade, and in the request body the inbound rules were empty:

"inboundNatPools":[],"inboundNatRules":[],"loadBalancingRules":[],"outboundRules":[],"probes":[]},"sku":{"name":"Standard","tier":"Regional"},"tags":"******"}

We ended up deleting the Kubernetes service and redeploying it. The corresponding operation in the activity log this time contained the inbound rules in the request body, and our cluster was able to receive inbound requests again.

I'm wondering why or how this could have happened, as it caused a production outage for us.


2 answers

  1. ArkoSen-2904 10 Reputation points
    2025-07-04T04:53:22.27+00:00

    Hello Scott MacKenzie,
    Your temporary fix of deleting and recreating the service worked because it forced the controller to reassess the full service spec and resync with healthy endpoints, thereby restoring the correct rules.

    But the issue is caused by the way the AKS service-controller reconciles the Kubernetes Service of type LoadBalancer during control plane or node pool upgrades. When the controller re-evaluates the service definition and detects that backend pods or nodes are temporarily unavailable—common during upgrades—it may inadvertently send an empty configuration to the Azure Load Balancer. This results in the loss of all inbound load balancing rules, as seen in your activity log.

    To prevent it from occurring in future upgrades, I would recommend trying the steps below.

    Ensure your service has explicit annotations to stabilize the Load Balancer configuration. For example, set a static IP, specify the resource group if needed, and define a health probe path:

    annotations:
      service.beta.kubernetes.io/azure-load-balancer-resource-group: "<whatever-is-your-resource-group-name>"
      service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: "/healthz"
      service.beta.kubernetes.io/azure-load-balancer-health-probe-port: "80"
    

    These annotations should help the service-controller maintain consistent state even during upgrades. Also, if your workload can support it, consider using

    spec:
      externalTrafficPolicy: Local
    

    This ensures that the load balancer only routes traffic to nodes that actually have a healthy pod for the service, which helps avoid routing issues during node rotation.
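
    Putting both suggestions together, here is a minimal sketch of what the full Service manifest could look like. The service name, selector, ports, and resource group value are placeholders, not taken from your setup, so adjust them to match your workload:

    apiVersion: v1
    kind: Service
    metadata:
      name: my-app   # placeholder name
      annotations:
        service.beta.kubernetes.io/azure-load-balancer-resource-group: "<your-resource-group-name>"
        service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: "/healthz"
        service.beta.kubernetes.io/azure-load-balancer-health-probe-port: "80"
    spec:
      type: LoadBalancer
      externalTrafficPolicy: Local   # only route to nodes with a ready pod for this service
      selector:
        app: my-app                  # placeholder selector
      ports:
        - port: 80
          targetPort: 8080           # placeholder target port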

    1 person found this answer helpful.

  2. ArkoSen-2904 10 Reputation points
    2025-07-04T11:19:37.27+00:00

    Hello Joel Webb,

    There isn't any specific documentation; this behavior is implied by a combination of Kubernetes and Azure documentation.

    For example-

    From https://kubernetes.io/docs/concepts/services-networking/service/#type-loadbalancer
    “If those endpoints are not ready or missing, the configuration may be removed or reset.” This statement explains that the LoadBalancer configuration depends on the readiness of service endpoints. If endpoints go missing temporarily, such as during an upgrade, the backend rules can end up being cleared.

    Then, from https://learn.microsoft.com/en-us/azure/aks/static-ip

    “AKS manages the lifecycle of the Azure Load Balancer based on the configuration of the Kubernetes Service.” This supports the point that AKS dynamically applies changes to the load balancer whenever the service or its endpoints change, automatically and without user intervention.

    And https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/ says, “The controller for that Service continuously scans for Pods that match its selector, and then makes any necessary updates to the set of EndpointSlices.” This shows that the controller constantly updates backend data based on live endpoint availability; if none are found, the update reflects an empty set.

    Together, these confirm that during node upgrades, if no ready pods are detected, the AKS service controller may unintentionally issue an update to Azure with an empty backend configuration, causing rule loss.
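
    As an illustration of what "ready pods" means in practice, here is a minimal sketch of a Deployment whose readiness probe lines up with the health probe path suggested in the earlier answer. The name, image, replica count, and port are placeholder assumptions, not details from the original question:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          containers:
            - name: web
              image: nginx:1.27   # placeholder image
              ports:
                - containerPort: 80
              readinessProbe:     # a pod only becomes a ready endpoint once this probe passes
                httpGet:
                  path: /healthz
                  port: 80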

    1 person found this answer helpful.
