VMs in Kubernetes VMSS start up when the cluster is stopped

Chris 26 Reputation points
2022-10-04T16:26:07.77+00:00

I have noticed that after I have stopped my Kubernetes cluster, the VMs in the cluster's VMSS start up again after a period of time. The cluster status remains Succeeded (Stopped).

This is a test environment, so I only need it running during working hours, and I stop it with a runbook. The runbook runs every night at 6pm, but if the cluster status is already Succeeded (Stopped) it does not try to shut the cluster down, and it does not shut down any VMs that have started up in the VMSS. As a result, the VMs incur costs from the first time they start up again.
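The gap described above, where the runbook skips VMs that restart after the cluster is already stopped, could be narrowed by checking instance power states separately from the cluster status. The following is a minimal sketch of that decision logic only. The function name and its inputs are hypothetical; it assumes the runbook has already fetched the cluster power state and a power state code for each VMSS instance (Azure reports codes such as `PowerState/running` and `PowerState/deallocated`). Whether deallocating the instances of a stopped AKS cluster by hand is supported is not confirmed here.

```python
def instances_to_deallocate(cluster_power_state, instance_power_states):
    """Return the names of instances still running while the cluster is stopped.

    cluster_power_state: "Running" or "Stopped" (hypothetical input).
    instance_power_states: dict mapping instance name -> power state code.
    """
    if cluster_power_state != "Stopped":
        # Cluster is running: the runbook's normal stop path applies.
        return []
    return sorted(
        name
        for name, state in instance_power_states.items()
        if state == "PowerState/running"
    )


states = {
    "aks-nodepool1-12197728-vmss00005L": "PowerState/running",
    # Second instance name is made up for illustration.
    "example-instance-b": "PowerState/deallocated",
}
print(instances_to_deallocate("Stopped", states))
```

Keeping the decision in a pure function makes it easy to test without touching Azure; the calls that fetch the states and act on the result are left out deliberately.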

As my VMs are Linux nodes, I believe that security updates are automatically applied. Is this causing the VMs to start up but not shut down?

https://learn.microsoft.com/en-us/azure/aks/node-updates-kured

Is there a way to prevent this from happening until I start my cluster up and then apply security updates?

The error below is shown on the VMSS after the VMs are started, which consistently happens just after 19:00:00 UTC:

Message
VM has reported a failure when processing extension 'vmssCSE'. Error message: "Enable failed: failed to execute command: command terminated with exit status=53 [stdout] { "ExitCode": "53", "Output": "XDOMAIN\n\nServer:\t\t[IP_ADDRESS]\nAddress:\t[IP_ADDRESS]#53\n\n** server can't find [KUBERNETES_CLUSTER].hcp.uksouth.azmk8s.io: NXDOMAIN\n\nServer:\t\t[IP_ADDRESS]\nAddress:\t[IP_ADDRESS]#53\n\n** server can't find [KUBERNETES_CLUSTER].hcp.uksouth.azmk8s.io: NXDOMAIN\n\nServer:\t\t[IP_ADDRESS]\nAddress:\t[IP_ADDRESS]#53\n\n** server can't find [KUBERNETES_CLUSTER].hcp.uksouth.azmk8s.io: NXDOMAIN\n\nServer:\t\t[IP_ADDRESS]\nAddress:\t[IP_ADDRESS]#53\n\n** server can't find [KUBERNETES_CLUSTER].hcp.uksouth.azmk8s.io: NXDOMAIN\n\nServer:\t\t[IP_ADDRESS]\nAddress:\t[IP_ADDRESS]#53\n\n** server can't find [KUBERNETES_CLUSTER].hcp.uksouth.azmk8s.io: NXDOMAIN\n\nServer:\t\t[IP_ADDRESS]\nAddress:\t[IP_ADDRESS]#53\n\n** server can't find [KUBERNETES_CLUSTER].hcp.uksouth.azmk8s.io: NXDOMAIN\n\nServer:\t\t[IP_ADDRESS]\nAddress:\t[IP_ADDRESS]#53\n\n** server can't find [KUBERNETES_CLUSTER].hcp.uksouth.azmk8s.io: NXDOMAIN\n\nServer:\t\t[IP_ADDRESS]\nAddress:\t[IP_ADDRESS]#53\n\n** server can't find [KUBERNETES_CLUSTER].hcp.uksouth.azmk8s.io: NXDOMAIN\n\nServer:\t\t[IP_ADDRESS]\nAddress:\t[IP_ADDRESS]#53\n\n** server can't find [KUBERNETES_CLUSTER].hcp.uksouth.azmk8s.io: NXDOMAIN\n\nServer:\t\t[IP_ADDRESS]\nAddress:\t[IP_ADDRESS]#53\n\n** server can't find [KUBERNETES_CLUSTER].hcp.uksouth.azmk8s.io: NXDOMAIN\n\nServer:\t\t[IP_ADDRESS]\nAddress:\t[IP_ADDRESS]#53\n\n** server can't find [KUBERNETES_CLUSTER].hcp.uksouth.azmk8s.io: NXDOMAIN\n\nServer:\t\t[IP_ADDRESS]\nAddress:\t[IP_ADDRESS]#53\n\n** server can't find [KUBERNETES_CLUSTER].hcp.uksouth.azmk8s.io: NXDOMAIN\n\nServer:\t\t[IP_ADDRESS]\nAddress:\t[IP_ADDRESS]#53\n\n** server can't find [KUBERNETES_CLUSTER].hcp.uksouth.azmk8s.io: NXDOMAIN\n\nServer:\t\t[IP_ADDRESS]\nAddress:\t[IP_ADDRESS]#53\n\n** server can't find 
[KUBERNETES_CLUSTER].hcp.uksouth.azmk8s.io: NXDOMAIN\n\nExecuted \"nslookup [KUBERNETES_CLUSTER].hcp.uksouth.azmk8s.io\" 100 times == [IP_ADDRESS] ]]\n+ VALIDATION_ERR=53\n+ REBOOTREQUIRED=false\n+ false\n+ [[ UBUNTU == UBUNTU ]]\n+ /usr/lib/apt/apt.systemd.daily\n+ aptmarkWALinuxAgent unhold\n+ echo 'Custom script finished. API server connection check code:' 53\nCustom script finished. API server connection check code: 53\n++ date\n++ date\n++ hostname\n+ echo Thu Sep 22 19:10:56 UTC 2022,aks-nodepool1-12197728-vmss00005L, startAptmarkWALinuxAgent unhold\nThu Sep 22 19:10:56 UTC 2022,aks-nodepool1-12197728-vmss00005L, startAptmarkWALinuxAgent unhold\n+ wait_for_apt_locks\n+ fuser /var/lib/dpkg/lock /var/lib/apt/lists/lock /var/cache/apt/archives/lock\n++ hostname\n+ echo Thu Sep 22 19:10:56 UTC 2022,aks-nodepool1-12197728-vmss00005L, endcustomscript\n+ mkdir -p /opt/azure/containers\n+ touch /opt/azure/containers/provision.complete\n+ exit 53", "Error": "", "ExecDuration": "146", "KernelStartTime": "Thu 2022-09-22 19:07:54 UTC", "CSEStartTime": "Thu Sep 22 19:08:30 UTC 2022", "GuestAgentStartTime": "Thu 2022-09-22 19:08:23 UTC", "SystemdSummary": "Startup finished in 507ms (firmware) + 13.486s (loader) + 6.374s (kernel) + 47.635s (userspace) = 1min 8.004s\ngraphical.target reached after 28.854s in userspace", "BootDatapoints": { "KernelStartTime": "Thu 2022-09-22 19:07:54 UTC", "CSEStartTime": "Thu Sep 22 19:08:30 UTC 2022", "GuestAgentStartTime": "Thu 2022-09-22 19:08:23 UTC", "KubeletStartTime": "Thu 2022-09-22 19:09:14 UTC" } } [stderr] " More information on troubleshooting is available at https://aka.ms/VMExtensionCSELinuxTroubleshoot
Time
Thursday, 22 September 2022 at 20:11:08

Azure Kubernetes Service (AKS)

2 answers

  1. shiva patpi 13,146 Reputation points Microsoft Employee
    2022-10-04T20:23:22.067+00:00

    Hello @Chris ,
    Sometimes this happens if the VMSS instances are unable to communicate with the API server address because of DNS failures.
    Are you using Azure-provided DNS or custom DNS?
    If it is custom DNS, kindly check whether anything has changed on the DNS side.

    Basically, error code 53 indicates that the nodes are not able to resolve the API server FQDN.
    This node, aks-nodepool1-12197728-vmss00005L, was not able to communicate with the API server.

    As per the error message, it was querying the DNS server over port 53. From the log: "Server:\t\t[IP_ADDRESS]\nAddress:\t[IP_ADDRESS]#53"

    It ran nslookup repeatedly and failed every time. From the log: Executed \"nslookup [KUBERNETES_CLUSTER].hcp.uksouth.azmk8s.io\" 100 times == [IP_ADDRESS] ]]\n+ VALIDATION_ERR=53
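The reading of the log above can be checked mechanically. This is a rough sketch, assuming the extension error text has been copied into a Python string; the function name is illustrative, the regular expressions are based only on the message shown in this thread, and the sample hostname is a placeholder.

```python
import re


def summarize_cse_failure(message):
    """Extract the exit code and NXDOMAIN details from a vmssCSE error message."""
    exit_match = re.search(r"exit status=(\d+)", message)
    # Hostnames that the DNS server reported as non-existent.
    hosts = re.findall(r"server can't find (\S+): NXDOMAIN", message)
    return {
        "exit_code": int(exit_match.group(1)) if exit_match else None,
        "nxdomain_count": len(hosts),
        "unresolved_hosts": sorted(set(hosts)),
    }


# Shortened sample in the same shape as the message in the question.
sample = (
    "Enable failed: failed to execute command: command terminated with "
    'exit status=53 [stdout] { "ExitCode": "53", "Output": "'
    "** server can't find example.hcp.uksouth.azmk8s.io: NXDOMAIN\\n\\n"
    "** server can't find example.hcp.uksouth.azmk8s.io: NXDOMAIN\\n\\n"
    '" }'
)
print(summarize_cse_failure(sample))
```

A single unresolved hostname with a high NXDOMAIN count points at name resolution for the API server rather than at a connectivity problem after resolution.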


  2. Chris 26 Reputation points
    2022-10-05T08:21:15.937+00:00

    Hi @shiva patpi

    Thank you for your response.

    Am I correct in assuming that this process takes place:

    1. The custom script vmssCSE that gets deployed with the cluster is run automatically on Tuesdays and Thursdays, judging by my logs
    2. The script provisions my Linux VMs in my cluster ready for patching
    3. The script fails because it can't reach the API server FQDN
    4. The VMs are left running

    If this is true, the script will fail because my cluster is stopped to save on billing costs. I can reach the API server if I start up my cluster, but I don't want it running outside of my working hours.
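Step 3 of the list above can be reproduced from any machine by checking whether the API server FQDN resolves at all. The sketch below uses Python's standard `socket` module; the function name, the retry count, and the injectable `resolver` parameter are illustrative choices, not part of any AKS tooling.

```python
import socket


def api_server_resolves(fqdn, attempts=3, resolver=socket.getaddrinfo):
    """Return True if the FQDN resolves within the given number of attempts."""
    for _ in range(attempts):
        try:
            resolver(fqdn, 443)
            return True
        except socket.gaierror:
            # Name did not resolve (for example NXDOMAIN); try again.
            continue
    return False
```

Calling `api_server_resolves("<cluster>.hcp.uksouth.azmk8s.io")` with the real cluster FQDN while the cluster is stopped would be expected to return False if the name is not published in that state, which is what the NXDOMAIN responses in the log suggest.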

    This article suggests that security updates are automatically applied but I would like clarification on whether my cluster needs to be running when patches are auto applied:

    https://learn.microsoft.com/en-us/azure/aks/node-updates-kured

    If they are always applied at 7:00pm UTC, is it every Tuesday and Thursday?

    Do I need a runbook to make sure my cluster is running when patches are auto applied?
