Hello Scott MacKenzie,
Your temporary fix of deleting and recreating the service worked because it forced the controller to reassess the full service spec and resync with healthy endpoints, thereby restoring the correct rules.
However, the underlying issue comes from the way the AKS service controller reconciles a Kubernetes Service of type LoadBalancer during control plane or node pool upgrades. When the controller re-evaluates the service definition and finds that backend pods or nodes are temporarily unavailable (which is common during upgrades), it can inadvertently push an empty configuration to the Azure Load Balancer. That wipes out all inbound load balancing rules, which is what you are seeing in your activity log.
To prevent this from recurring in future upgrades, I would recommend the following steps:
First, make sure your Service carries explicit annotations that stabilize the load balancer configuration. For example, set a static IP, specify its resource group if needed, and define a health probe path and port:
annotations:
  service.beta.kubernetes.io/azure-load-balancer-resource-group: "<whatever-is-your-resource-group-name>"
  service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: "/healthz"
  service.beta.kubernetes.io/azure-load-balancer-health-probe-port: "80"
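For context, here is a minimal sketch of how those annotations sit inside a complete Service manifest. The service name, selector, ports, and IP address are placeholders I made up for illustration; loadBalancerIP assumes you have already created a static public IP in the resource group named by the annotation (loadBalancerIP is deprecated in newer Kubernetes releases, but the Azure cloud provider still honors it):

apiVersion: v1
kind: Service
metadata:
  name: my-app                       # placeholder service name
  annotations:
    # Resource group that holds the pre-created static public IP
    service.beta.kubernetes.io/azure-load-balancer-resource-group: "<whatever-is-your-resource-group-name>"
    # Explicit health probe path and port so the probe stays stable across reconciles
    service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: "/healthz"
    service.beta.kubernetes.io/azure-load-balancer-health-probe-port: "80"
spec:
  type: LoadBalancer
  loadBalancerIP: 20.0.0.10          # placeholder static IP, created ahead of time
  selector:
    app: my-app                      # placeholder pod selector
  ports:
    - name: http
      port: 80
      targetPort: 8080               # placeholder container port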
These annotations should help the service controller maintain a consistent state even during upgrades. In addition, if your workload can support it, consider setting:
spec:
  externalTrafficPolicy: Local
This ensures that the load balancer only routes traffic to nodes that actually have a healthy pod for the service, which helps avoid routing issues during node rotation.
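Applied to the manifest sketched above, that is one extra field in the spec (the other values are the same placeholders as before). With the Local policy, Kubernetes also allocates a healthCheckNodePort, and the Azure load balancer probes it to determine which nodes currently host a ready pod:

spec:
  type: LoadBalancer
  externalTrafficPolicy: Local       # route traffic only to nodes with a ready pod
  loadBalancerIP: 20.0.0.10          # placeholder static IP from the sketch above
  selector:
    app: my-app
  ports:
    - name: http
      port: 80
      targetPort: 8080

One thing to keep in mind with Local: traffic is not forwarded between nodes (which is also how the client source IP is preserved), so run enough replicas spread across nodes that at least one probed node stays healthy while nodes are being rotated.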