Troubleshoot Kubernetes management and workload cluster issues in AKS Arc

Applies to: AKS on Azure Stack HCI, AKS on Windows Server Use this article to help you troubleshoot and resolve issues on Kubernetes management and workload clusters in AKS Arc.

Running the Remove-ClusterNode command evicts the node from the failover cluster, but the node still exists

When running the Remove-ClusterNode command, the node is evicted from the failover cluster, but if Remove-AksHciNode isn't run afterwards, the node will still exist in CloudAgent.

Since the node was removed from the cluster, but not from CloudAgent, if you use the VHD to create a new node, a "File not found" error appears. This issue occurs because the VHD is in shared storage, and the removed node doesn't have access to it.

To resolve this issue, remove a physical node from the cluster and then follow these steps:

  1. Run Remove-AksHciNode to de-register the node from CloudAgent.
  2. Perform routine maintenance, such as re-imaging the machine.
  3. Add the node back to the cluster.
  4. Run Add-AksHciNode to register the node with CloudAgent.

When using large clusters, the Get-AksHciLogs command fails with an exception

With large clusters, the Get-AksHciLogs command may throw an exception, fail to enumerate nodes, or won't generate the c:\wssd\wssdlogs.zip output file.

This is because the PowerShell command to zip a file, 'Compress-Archive', has an output file size limit of 2 GB.

The certificate renewal pod is in a crash loop state

After upgrading or scaling up the workload cluster, the certificate renewal pod is now in a crash loop state because the pod is expecting the certificate tattoo YAML file from the file location /etc/Kubernetes/pki.

This issue may be due to a configuration file that's present in the control plane VMs but not in the worker node VMs.

To resolve this issue, manually copy the certificate tattoo YAML file from the control plane node to all worker nodes.

  1. Copy the YAML file from control plane VM on the workload cluster to the current directory of your host machine:
ssh clouduser@<comtrol-plane-vm-ip> -i (get-mocconfig).sshprivatekey
sudo cp /etc/kubernetes/pki/cert-tattoo-kubelet.yaml ~/
sudo chown clouduser cert-tattoo-kubelet.yaml
sudo chgrp clouduser cert-tattoo-kubelet.yaml
(Change file permissions here, so that `scp` will work)
scp -i (get-mocconfig).sshprivatekey clouduser@<comtrol-plane-vm-ip>:~/cert*.yaml .\
  1. Copy the YAML file from the host machine to the worker node VM. Before you copy the file, you must change the name of the machine in the YAML file to the node name to which you're copying (this is the node name for the workload cluster control plane).
scp -i (get-mocconfig).sshprivatekey .\cert-tattoo-kubelet.yaml clouduser@<workernode-vm-ip>:~/cert-tattoo-kubelet.yaml
  1. If the owner and group information in the YAML file isn't already set to root, set the information to the root:
ssh clouduser@<workernode-vm-ip> -i (get-mocconfig).sshprivatekey
sudo cp ~/cert-tattoo-kubelet.yaml /etc/kubernetes/pki/cert-tattoo-kubelet.yaml (copies the certificate file to the correct location)
sudo chown root cert-tattoo-kubelet.yaml
sudo chgrp root cert-tattoo-kubelet.yaml
  1. Repeat steps 2 and 3 for all worker nodes.

PowerShell deployment doesn't check for available memory before creating a new workload cluster

The Aks-Hci PowerShell commands don't validate the available memory on the host server before creating Kubernetes nodes. This issue can lead to memory exhaustion and virtual machines that don't start.

This failure is currently not handled gracefully, and the deployment will stop responding with no clear error message. If you have a deployment that stops responding, open Event Viewer and check for a Hyper-V-related error message indicating there's not enough memory to start the VM.

When running Get-AksHciCluster, a "release version not found" error occurs

When running Get-AksHciCluster to verify the status of an AKS Arc installation in Windows Admin Center, the output shows an error: "A release with version 1.0.3.10818 was NOT FOUND." However, when running Get-AksHciVersion, it showed the same version was installed. This error indicates that the build has expired.

To resolve this issue, run Uninstall-AksHci, and then run a new AKS Arc build.

Moving virtual machines between Azure Stack HCI cluster nodes quickly leads to VM startup failures

When using the cluster administration tool to move a VM from one node (Node A) to another node (Node B) in the Azure Stack HCI cluster, the VM may fail to start on the new node. After moving the VM back to the original node, it will fail to start there as well.

This issue happens because the logic to clean up the first migration runs asynchronously. As a result, Azure Kubernetes Service's "update VM location" logic finds the VM on the original Hyper-V on node A, and deletes it, instead of unregistering it.

To work around this issue, ensure the VM has started successfully on the new node before moving it back to the original node.

Attempt to increase the number of worker nodes fails

When using PowerShell to create a cluster with static IP addresses and then attempt to increase the number of worker nodes in the workload cluster, the installation gets stuck at "control plane count at 2, still waiting for desired state: 3." After a period of time, another error message appears: "Error: timed out waiting for the condition."

When Get-AksHciCluster was run, it showed that the control plane nodes were created and provisioned and were in a Ready state. However, when kubectl get nodes was run, it showed that the control plane nodes had been created but not provisioned and weren't in a Ready state.

If you get this error, verify that the IP addresses have been assigned to the created nodes using either Hyper-V Manager or PowerShell:

(Get-VM |Get-VMNetworkAdapter).IPAddresses |fl

Then, verify the network settings to ensure there are enough IP addresses left in the pool to create more VMs.

Error: You must be logged in to the server (Unauthorized)

Commands such as Update-AksHci, Update-AksHciCertificates, Update-AksHciClusterCertificates, and all interactions with the management cluster, can return "Error: You must be logged in to the server (Unauthorized)."

This error can occur when the kubeconfig file is expired. To check if it's expired, run the following script:

$kubeconfig= $(get-kvaconfig).kubeconfig
$certDataRegex = '(?ms).*client-certificate-data:\s*(?<CertData>[^\n\r]+)'
$certDataMatch = Select-String -Path $kubeconfig -Pattern $certDataRegex
$certData = $certDataMatch.Matches[0].Groups['CertData'].Value
$certObject = New-Object System.Security.Cryptography.X509Certificates.X509Certificate2
$certObject.Import([System.Convert]::FromBase64String($certData))
$expiryDate = $certObject.GetExpirationDateString()
$expiryDate

If this script displays a date that is earlier than the current date, then the kubeconfig file has expired.

To mitigate the issue, copy the admin.conf file, which is inside the management control plane VM, to the HCI host:

  • SSH to the management control plane VM:

    • Get the management control plane VM IP:

      get-clusternode | % {Get-vm -computerName $_ } | get-vmnetworkadapter | select Name, IPAddresses
      
    • SSH into it:

      ssh -o StrictHostKeyChecking=no -i (Get-MocConfig).sshPrivateKey clouduser@<mgmt-control-plane-ip>
      
  • Copy the file to the clouduser location:

    cp /etc/kubernetes/admin.conf /home/clouduser/admin.conf
    
  • Give access to clouduser:

    sudo chown clouduser:clouduser admin.conf
    
  • [Optional] Create a backup of the kubeconfig file:

    mv $(Get-KvaConfig).kubeconfig "$((Get-KvaConfig).kubeconfig).backup"
    
  • Copy the file:

    scp -i (get-mocconfig).sshPrivateKey clouduser@10.64.92.14:~/admin.conf $(Get-KvaConfig).kubeconfig
    

Hyper-V manager shows high CPU and/or memory demands for the management cluster (AKS host)

When you check Hyper-V manager, high CPU and memory demands for the management cluster can be safely ignored. They're related to spikes in compute resource usage when provisioning workload clusters.

Increasing the memory or CPU size for the management cluster hasn't shown a significant improvement and can be safely ignored.

When using kubectl to delete a node, the associated VM might not be deleted

You'll see this issue if you follow these steps:

  1. Create a Kubernetes cluster.
  2. Scale the cluster to more than two nodes.
  3. Delete a node by running the following command:
kubectl delete node <node-name>
  1. Return a list of the nodes by running the following command:
kubectl get nodes

The removed node isn't listed in the output. 5. Open a PowerShell window with administrative privileges, and run the following command:

get-vm

The removed node is still listed.

This failure causes the system not to recognize that the node is missing, and therefore, a new node won't spin up.

If a management or workload cluster isn't used for more than 60 days, the certificates will expire

If you don't use a management or workload cluster for longer than 60 days, the certificates expire, and you must renew them before you can upgrade AKS Arc. When an AKS cluster isn't upgraded within 60 days, the KMS plug-in token and the certificates both expire within the 60 days. The cluster is still functional. However, since it's beyond 60 days, you need to call Microsoft support to upgrade. If the cluster is rebooted after this period, it will remain in a non-functional state.

To resolve this issue, run the following steps:

  1. Repair the management cluster certificate by manually rotating the token, and then restart the KMS plug-in and container.
  2. Repair the mocctl certificates by running Repair-MocLogin.
  3. Repair the workload cluster certificates by manually rotating the token, and then restart the KMS plug-in and container.

The workload cluster isn't found

The workload cluster may not be found if the IP address pools of two AKS on Azure Stack HCI deployments are the same or overlap. If you deploy two AKS hosts and use the same AksHciNetworkSetting configuration for both, PowerShell and Windows Admin Center will potentially fail to find the workload cluster as the API server will be assigned the same IP address in both clusters resulting in a conflict.

The error message you receive will look similar to the example shown below.

A workload cluster with the name 'clustergroup-management' was not found.
At C:\Program Files\WindowsPowerShell\Modules\Kva\0.2.23\Common.psm1:3083 char:9
+         throw $("A workload cluster with the name '$Name' was not fou ...
+         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : OperationStopped: (A workload clus... was not found.:String) [], RuntimeException
    + FullyQualifiedErrorId : A workload cluster with the name 'clustergroup-management' was not found.

Note

Your cluster name will be different.

New-AksHciCluster times out when creating an AKS cluster with 200 nodes

The deployment of a large cluster may time out after two hours. However, this is a static time-out.

You can ignore this time-out occurrence as the operation is running in the background. Use the kubectl get nodes command to access your cluster and monitor the progress.

The API server isn't responsive after several days

When attempting to bring up an AKS on Azure Stack HCI deployment after a few days, Kubectl didn't execute any of its commands. The Key Management Service (KMS) plug-in log displayed the error message rpc error:code = Unauthenticated desc = Valid Token Required_. After running [Repair-AksHciCerts](./reference/ps/repair-akshcicerts.md) to try to fix the issue, a different error appeared: _failed to repair cluster certificates.

The Repair-AksHciClusterCerts cmdlet fails if the API server is down. If AKS on Azure Stack HCI hasn't been upgraded for 60 or more days, when you try to restart the KMS plug-in, it won't start. This failure also causes the API server to fail.

To fix this issue, you need to manually rotate the token and then restart the KMS plug-in and container to get the API server backup. To do this, run the following steps:

  1. Rotate the token by running the following command:

    $ mocctl.exe security identity rotate --name "KMSPlugin-<cluster-name>-moc-kms-plugin" --encode=false --cloudFqdn (Get-AksHciConfig).Moc.cloudfqdn > cloudlogin.yaml
    
  2. Copy the token to the VM using the following command. The ip setting in the command below refers to the IP address of the AKS host's control plane.

    $ scp -i (Get-AksHciConfig).Moc.sshPrivateKey .\cloudlogin.yaml clouduser@<ip>:~/cloudlogin.yaml $ ssh -i (Get-AksHciConfig).Moc.sshPrivateKey clouduser@<ip> sudo mv cloudlogin.yaml /opt/wssd/k8s/cloudlogin.yaml
    
  3. Restart the KMS plug-in and the container.

    To get the container ID, run the following command:

    $ ssh -i (Get-AksHciConfig).Moc.sshPrivateKey clouduser@<ip> "sudo docker container ls | grep 'kms'"
    

    The output should appear with the following fields:

    CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
    

    The output should look similar to this example:

    4169c10c9712 f111078dbbe1 "/kms-plugin" 22 minutes ago Up 22 minutes k8s_kms-plugin_kms-plugin-moc-lll6bm1plaa_kube-system_d4544572743e024431874e6ba7a9203b_1
    
  4. Finally, restart the container by running the following command:

    $ ssh -i (Get-AksHciConfig).Moc.sshPrivateKey clouduser@<ip> "sudo docker container kill <Container ID>"
    

Creating a workload cluster fails with the error "A parameter cannot be found that matches parameter name nodePoolName"

On an AKS on Azure Stack HCI installation with the Windows Admin Center extension version 1.82.0, the management cluster was set up using PowerShell, and an attempt was made to deploy a workload cluster using Windows Admin Center. One of the machines had PowerShell module version 1.0.2 installed, and other machines had PowerShell module 1.1.3 installed. The attempt to deploy the workload cluster failed with the error "A parameter can't be found that matches parameter name 'nodePoolName'." This error may have occurred because of a version mismatch. Starting with PowerShell version 1.1.0, the -nodePoolName <String> parameter was added to the New-AksHciCluster cmdlet, and by design, this parameter is now mandatory when using the Windows Admin Center extension version 1.82.0.

To resolve this issue, do one of the following:

  • Use PowerShell to manually update the workload cluster to version 1.1.0 or later.
  • Use Windows Admin Center to update the cluster to version 1.1.0 or to the latest PowerShell version.

This issue doesn't occur if the management cluster is deployed using Windows Admin Center as it already has the latest PowerShell modules installed.

When running 'kubectl get pods', pods were stuck in a 'Terminating' state

When you deploy AKS on Azure Stack HCI, and then run kubectl get pods, pods in the same node are stuck in the Terminating state. The machine rejects SSH connections because the node is likely experiencing high memory demand. This issue occurs because the Windows nodes are over-provisioned, and there's no reserve for core components.

To avoid this situation, add the resource limits and resource requests for CPU and memory to the pod specification to ensure that the nodes aren't over-provisioned. Windows nodes don't support eviction based on resource limits, so you should estimate how much the containers will use and then ensure the nodes have sufficient CPU and memory amounts. You can find more information in system requirements.

Cluster auto-scaling fails

Cluster auto-scaling can fail when you use the following Azure policy on your AKS host management cluster: "Kubernetes clusters should not use the default namespace." This happens because the policy, when implemented on the AKS host management cluster, blocks the creation of autoscaler profiles in the default namespace. To fix this issue, choose one of the following options:

When creating a new workload cluster, the error "Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded" occurs

This is a known issue with the AKS on Azure Stack HCI July Update (version 1.0.2.10723). The error occurs because the CloudAgent service times out during distribution of virtual machines across multiple cluster shared volumes in the system.

To resolve this issue, you should upgrade to the latest AKS on Azure Stack HCI release.

The Windows or Linux node count can't be seen when Get-AksHciCluster is run

If you provision an AKS cluster on Azure Stack HCI with zero Linux or Windows nodes, when you run Get-AksHciCluster, the output will be an empty string or null value.

Null is an expected return for zero nodes.

If a cluster is shut down for more than four days, the cluster will be unreachable

When you shut down a management or workload cluster for more than four days, the certificates expire and the cluster is unreachable. The certificates expire because they're rotated every 3-4 days for security reasons.

To restart the cluster, you need to manually repair the certificates before you can perform any cluster operations. To repair the certificates, run the following Repair-AksHciClusterCerts command:

Repair-AksHciClusterCerts -Name <cluster-name> -fixKubeletCredentials

Linux and Windows VMs weren't configured as highly available VMs when scaling a workload cluster

When scaling out a workload cluster, the corresponding Linux and Windows VMs were added as worker nodes, but they weren't configured as highly available VMs. When running the Get-ClusterGroup command, the newly created Linux VM wasn't configured as a Cluster Group.

This is a known issue. After a reboot, the ability to configure VMs as highly available is sometimes lost. The current workaround is to restart wssdagent on each of the Azure Stack HCI nodes. This works only for new VMs that are generated by creating node pools when performing a scale-out operation or when creating new Kubernetes clusters after restarting the wssdagent on the nodes. However, you'll have to manually add the existing VMs to the failover cluster.

When you scale down a cluster, the high-availability cluster resources are in a failed state while the VMs are removed. The workaround for this issue is to manually remove the failed resources.

Attempt to create new workload clusters failed because the AKS host was turned off for several days

An AKS cluster deployed in an Azure VM was previously working fine, but after the AKS host was turned off for several days, the Kubectl command didn't work. After running either the Kubectl get nodes or Kubectl get services command, this error message appeared: Error from server (InternalError): an error on the server ("") has prevented the request from succeeding.

This issue occurred because the AKS host was turned off for longer than four days, which caused the certificates to expire. Certificates are frequently rotated in a four-day cycle. Run Repair-AksHciClusterCerts to fix the certificate expiration issue.

In a workload cluster with static IP addresses, all pods in a node are stuck in a _ContainerCreating_ state

In a workload cluster with static IP addresses and Windows nodes, all of the pods in a node (including the daemonset pods) are stuck in a ContainerCreating state. When attempting to connect to that node using SSH, the connection fails with a Connection timed out error.

To resolve this issue, use Hyper-V Manager or Failover Cluster Manager to turn off the VM of that node. After 5 to 10 minutes, the node should have been recreated, with all the pods running.

ContainerD is unable to pull the pause image as 'kubelet' mistakenly collects the image

When kubelet is under disk pressure, it collects unused container images, which may include pause images, and when this happens, ContainerD can't pull the image.

To resolve this issue, run the following steps:

  1. Connect to the affected node using SSH, and run the following command:
sudo su
  1. To pull the image, run the following command:
crictl pull ecpacr.azurecr.io/kube-proxy:<Kubernetes version>