Resolve general issues when using AKS hybrid

This article describes some general known issues that occur when using AKS hybrid. You can also review known issues with Windows Admin Center and installation issues and errors.

When running AksHci PowerShell cmdlets, an 'Unable to Load DLL' error appears.

Antivirus software can cause this error by blocking the PowerShell binaries that are required to perform cluster operations. The following is an example of a similar error:

Deployment: Connecting to remote server localhost failed.

To resolve this issue, verify that the following processes and folders (which are required to perform AKS cluster operations) are excluded from antivirus scanning:

Processes:

  • kubectl.exe
  • kvactl.exe
  • mocctl.exe
  • nodectl.exe
  • wssdagent.exe
  • wssdcloudagent.exe
  • kubectl-adsso.exe
  • AksHciHealth.exe

Folders:

  • C:\Program Files\WindowsPowerShell\Modules\PowerShellGet\
  • C:\Program Files\WindowsPowerShell\Modules\TraceProvider\
  • C:\Program Files\WindowsPowerShell\Modules\AksHci\
  • C:\Program Files\WindowsPowerShell\Modules\Az.Accounts\
  • C:\Program Files\WindowsPowerShell\Modules\Az.Resources\
  • C:\Program Files\WindowsPowerShell\Modules\AzureAD\
  • C:\Program Files\WindowsPowerShell\Modules\DownloadSdk\
  • C:\Program Files\WindowsPowerShell\Modules\Kva\
  • C:\Program Files\WindowsPowerShell\Modules\Microsoft.SME.CredSspPolicy\
  • C:\Program Files\WindowsPowerShell\Modules\Moc\
  • C:\Program Files\WindowsPowerShell\Modules\PackageManagement\
  • C:\Program Files\AksHci\
  • C:\AksHci\
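If the antivirus product in use is Microsoft Defender Antivirus, the exclusions listed above could be scripted. The following is a minimal sketch under that assumption; other antivirus products have their own tooling. `Add-MpPreference` with `-ExclusionProcess` and `-ExclusionPath` is the Defender cmdlet, while the helper name `Add-AksHciAvExclusion` and its `-Apply` switch are hypothetical names introduced for illustration. The process and folder lists are copied from this article.

```powershell
# Processes copied from the list above.
$AksHciProcesses = @(
    'kubectl.exe', 'kvactl.exe', 'mocctl.exe', 'nodectl.exe',
    'wssdagent.exe', 'wssdcloudagent.exe', 'kubectl-adsso.exe', 'AksHciHealth.exe'
)

# Folders copied from the list above.
$ModuleRoot = 'C:\Program Files\WindowsPowerShell\Modules'
$AksHciFolders = @(
    'PowerShellGet', 'TraceProvider', 'AksHci', 'Az.Accounts', 'Az.Resources',
    'AzureAD', 'DownloadSdk', 'Kva', 'Microsoft.SME.CredSspPolicy', 'Moc',
    'PackageManagement'
) | ForEach-Object { '{0}\{1}\' -f $ModuleRoot, $_ }
$AksHciFolders += 'C:\Program Files\AksHci\', 'C:\AksHci\'

# Hypothetical helper. Without -Apply it only lists what would be excluded.
function Add-AksHciAvExclusion {
    param([switch]$Apply)

    if ($Apply) {
        # Assumes Microsoft Defender Antivirus and an elevated PowerShell session.
        Add-MpPreference -ExclusionProcess $AksHciProcesses -ExclusionPath $AksHciFolders
    }
    else {
        # Dry run: emit the process names followed by the folder paths.
        $AksHciProcesses + $AksHciFolders
    }
}
```

Running `Add-AksHciAvExclusion` with no arguments prints the entries for review; `Add-AksHciAvExclusion -Apply` adds them to the Defender exclusion lists.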

Running Remove-AksHciCluster results in the error: 'Error: unable to delete group clustergroup-spdb:...'.

When running Remove-AksHciCluster, the following error can occur, possibly because of a deadlock:

Error: unable to delete group clustergroup-spdb: failed to delete group clustergroup-spdb: rpc error: code = DeadlineExceeded desc = context deadline exceeded

To resolve this issue, restart CloudAgent.

Error: invalid_client. The provided client secret keys are expired.

This error usually occurs when the service principal (SPN) secret that you used when running the Enable-AksHciArcConnection PowerShell cmdlet has expired.

Visit the Azure portal to create a new secret for your service principal (SPN). You can also use certificate credentials for added security. For an example of using the cmdlet, see Enable-AksHciArcConnection.

Insufficient privileges to complete the operation.

This error usually occurs when the service principal (SPN) or your Azure credentials (username and password) used to connect your AKS cluster don't have sufficient privileges in the Azure subscription to perform the operation.

Review the privilege requirements in Azure requirements for AKS clusters in AKS hybrid.

Running Remove-AksHciCluster results in the error: 'A workload cluster with the name 'my-workload-cluster' was not found'.

If you encounter this error when running Remove-AksHciCluster, check that the information you supplied for removing the cluster, such as the cluster name, is correct.
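One way to check the name before removing a cluster is to compare it against the clusters that Get-AksHciCluster reports. The following is a minimal sketch; `Test-WorkloadClusterName` is a hypothetical helper, and the commented usage assumes that the AksHci module is installed and that the objects returned by Get-AksHciCluster expose a `Name` property.

```powershell
# Hypothetical helper: returns $true when $Name matches one of $KnownNames.
function Test-WorkloadClusterName {
    param(
        [Parameter(Mandatory)][string]$Name,
        [string[]]$KnownNames = @()
    )
    return $KnownNames -contains $Name
}

# Usage on an AKS hybrid host (assumptions as described above):
#
#   $names = (Get-AksHciCluster).Name
#   if (Test-WorkloadClusterName -Name 'my-workload-cluster' -KnownNames $names) {
#       Remove-AksHciCluster -name 'my-workload-cluster'
#   }
#   else {
#       "Not found. Existing clusters: $($names -join ', ')"
#   }
```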

Transport: Error while dialing dial unix /var/run/moc-kms-plugin/kmsPlugin.sock: connect: no such file or directory.

This error happens when the KMS plugin on your AKS-HCI target cluster has stopped running because of an expired KMS plugin token.

Run Repair-AksHciCerts to fix this issue.

In a workload cluster with static IP addresses, all pods in a node are stuck in a 'ContainerCreating' state.

In a workload cluster with static IP addresses and Windows nodes, all of the pods in a node (including the daemonset pods) are stuck in a ContainerCreating state. When attempting to connect to that node using SSH, the connection fails with a Connection timed out error.

To resolve this issue, use Hyper-V Manager or Failover Cluster Manager to turn off the VM of that node. After 5 to 10 minutes, the node should be recreated with all of its pods running.
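Before turning off the VM, it can help to confirm which pods on the node are stuck. The following is a minimal sketch, not an official procedure. `Get-ContainerCreatingPod` is a hypothetical helper that inspects the parsed output of `kubectl get pods -o json`. The node and VM names in the commented usage are placeholders, and `Stop-VM -TurnOff` is shown only as a possible command-line equivalent of turning off the VM in Hyper-V Manager.

```powershell
# Hypothetical helper: given the parsed output of `kubectl get pods -o json`,
# return "namespace/name" for each pod that has a container waiting with
# the reason ContainerCreating.
function Get-ContainerCreatingPod {
    param([Parameter(Mandatory)]$PodList)

    foreach ($pod in $PodList.items) {
        $reasons = @($pod.status.containerStatuses |
            ForEach-Object { $_.state.waiting.reason })
        if ($reasons -contains 'ContainerCreating') {
            '{0}/{1}' -f $pod.metadata.namespace, $pod.metadata.name
        }
    }
}

# Usage (placeholders in angle brackets):
#
#   $node = '<node-name>'
#   $json = kubectl get pods --all-namespaces --field-selector "spec.nodeName=$node" -o json | Out-String
#   Get-ContainerCreatingPod -PodList ($json | ConvertFrom-Json)
#
#   # On the Hyper-V host that runs the node VM:
#   Stop-VM -Name '<node-vm-name>' -TurnOff
```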

Next steps

If you continue to run into problems when you're using AKS hybrid, you can file bugs through GitHub.