Resolve issues and errors during an AKS Arc installation

Applies to: AKS on Azure Stack HCI, AKS on Windows Server

This article describes known issues and errors you may encounter when installing AKS Arc. You can also review known issues for upgrading AKS Arc and for using Windows Admin Center.

Error "Failed to wait for addon arc-onboarding"

This error message appears after running Install-AksHci.

Note

The error might be caused by having Private Link enabled on the setup. Currently, there is no workaround for this scenario. AKS on HCI does not work with Private Link.

If you are not using Private Link, use the following steps to resolve this issue:

  1. Open PowerShell and run Uninstall-AksHci.
  2. Open the Azure portal and navigate to the resource group you used when running Install-AksHci.
  3. Check for any connected cluster resources that appear in a Disconnected state and include a name shown as a randomly generated GUID.
  4. Delete these cluster resources.
  5. Close the PowerShell session and open a new session before running Install-AksHci again.
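
If you prefer to find and remove the stale connected cluster resources from the command line instead of the portal, the following is a minimal sketch using the Az PowerShell module; the resource group and resource names are placeholders you take from your own environment.

# List connected cluster resources in the resource group used for onboarding
Get-AzResource -ResourceGroupName "<resource group name>" -ResourceType "Microsoft.Kubernetes/connectedClusters"

# Delete a stale, GUID-named connected cluster resource
Remove-AzResource -ResourceGroupName "<resource group name>" -ResourceType "Microsoft.Kubernetes/connectedClusters" -ResourceName "<GUID-named cluster resource>" -Force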

Error: 'Install-AksHci Failed, Service returned an error. Status=403 Code="RequestDisallowedByPolicy"' when installing AKS-HCI

This error occurs when the installation process violates an Azure Policy assigned to the Azure subscription or resource group provided during the Azure Arc onboarding process. It can happen when Azure Policies are defined at the subscription or resource group level and installing AKS on Azure Stack HCI violates one of them.

To resolve this issue, read the error message to identify which Azure Policy set by your Azure administrator was violated, and then add an exemption for that policy. To learn more about policy exemptions, see Azure Policy exemption structure.

Error: Install-AksHci failed with error - [The object already exists] An error occurred while creating resource 'IPv4 Address xxx.xx.xx.xx' for the clustered role 'xx-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxx'

A previously installed feature remains in a failed state and has not been cleared. You may see the following error:

Exception [An error occurred while creating resource 'MOC Cloud Agent Service' for the clustered role 'ca-3f72bdeb-xxxx-4ae9-a721-3aa902a998f0'.]
Stacktrace [at Add-FailoverClusterGenericRole, C:\Program Files\WindowsPowerShell\Modules\Moc\1.0.20\Common.psm1: line 2987
at Install-CloudAgent, C:\Program Files\WindowsPowerShell\Modules\Moc\1.0.20\Moc.psm1: line 1310
at Install-MocAgents, C:\Program Files\WindowsPowerShell\Modules\Moc\1.0.20\Moc.psm1: line 1229
at Initialize-Cloud, C:\Program Files\WindowsPowerShell\Modules\Moc\1.0.20\Moc.psm1: line 1135
at Install-MocInternal, C:\Program Files\WindowsPowerShell\Modules\Moc\1.0.20\Moc.psm1: line 1078
at Install-Moc, C:\Program Files\WindowsPowerShell\Modules\Moc\1.0.20\Moc.psm1: line 207
at Install-AksHciInternal, C:\Program Files\WindowsPowerShell\Modules\AksHci\1.1.25\AksHci.psm1: line 3867
at Install-AksHci, C:\Program Files\WindowsPowerShell\Modules\AksHci\1.1.25\AksHci.psm1: line 778
at <ScriptBlock>, <No file>: line 1]
InnerException[The object already exists]

Or you may see:

Install-Moc failed.
Exception [Unable to save property changes for 'IPv4 Address xxx.168.18.0'.]
Stacktrace [at Add-FailoverClusterGenericRole, C:\Program Files\WindowsPowerShell\Modules\Moc\1.0.20\Common.psm1: line 2971
at Install-CloudAgent, C:\Program Files\WindowsPowerShell\Modules\Moc\1.0.20\Moc.psm1: line 1310
at Install-MocAgents, C:\Program Files\WindowsPowerShell\Modules\Moc\1.0.20\Moc.psm1: line 1229
at Initialize-Cloud, C:\Program Files\WindowsPowerShell\Modules\Moc\1.0.20\Moc.psm1: line 1135
at Install-MocInternal, C:\Program Files\WindowsPowerShell\Modules\Moc\1.0.20\Moc.psm1: line 1078
at Install-Moc, C:\Program Files\WindowsPowerShell\Modules\Moc\1.0.20\Moc.psm1: line 207
at Install-AksHciInternal, C:\Program Files\WindowsPowerShell\Modules\AksHci\1.1.25\AksHci.psm1: line 3867
at Install-AksHci, C:\Program Files\WindowsPowerShell\Modules\AksHci\1.1.25\AksHci.psm1: line 778
at <ScriptBlock>, <No file>: line 1]
InnerException[A matching cluster network for the specified IP address could not be found]

To resolve this issue, manually clean up the cluster role. You can remove the leftover resource from the failover cluster by running the following PowerShell cmdlet: Remove-ClusterResource -Name <resource name>.
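
For example, a minimal sketch that lists the leftover cluster resources and then removes the one named in your error message (take the resource name from your own error output):

# Show cluster resources that are stuck in a failed or offline state
Get-ClusterResource | Where-Object { $_.State -eq 'Failed' -or $_.State -eq 'Offline' }

# Remove the leftover resource reported in the error, for example 'MOC Cloud Agent Service'
Remove-ClusterResource -Name "<resource name>"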

Error: "GetRelease error returned by API call: File download error: Hash mismatch"

The Install-AksHci cmdlet fails with "GetRelease error returned by API call: File download error: Hash mismatch." To resolve this issue, use the following steps:

  1. Open PowerShell and run Uninstall-AksHci.
  2. Retry an installation.
  3. If the issue persists, use the -concurrentDownloads parameter with Set-AksHciConfig and set it to a number lower than the default of 10 before retrying the installation, as shown in the sketch below. Reducing the number of concurrent downloads can help sensitive networks complete large file downloads successfully. This parameter is a preview feature.
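
A minimal sketch of step 3, assuming $vnet holds the network settings object you already created and that you also pass any other Set-AksHciConfig parameters you normally use:

# Retry with fewer parallel downloads than the default of 10 (preview parameter)
Set-AksHciConfig -vnet $vnet -concurrentDownloads 4
Install-AksHci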

After deploying AKS on Azure Stack HCI 21H2, rebooting the nodes showed a failed status for billing

After deployment, when rebooting the Azure Stack HCI nodes, the AKS report showed a failed status for billing.

To resolve this issue, follow the instructions to manually rotate the token and restart the KMS plug-in.

Install-AksHci timed out with the 'waiting for API server' error

After running Install-AksHci, the installation stopped and displayed the following error message:

\kubectl.exe --kubeconfig=C:\AksHci\0.9.7.3\kubeconfig-clustergroup-management 
get akshciclusters -o json returned a non zero exit code 1 
[Unable to connect to the server: dial tcp 192.168.0.150:6443: 
connectex: A connection attempt failed because the connected party 
did not properly respond after a period of time, or established connection 
failed because connected host has failed to respond.]

There are multiple reasons why an installation might fail with the 'waiting for API server' error. The following sections outline possible causes and solutions.

Reason 1: Incorrect IP gateway configuration

If you're using static IP addresses and you received the following error message, confirm that the configuration for the IP address and gateway is correct.

Install-AksHci 
C:\AksHci\kvactl.exe create --configfile C:\AksHci\yaml\appliance.yaml  --outfile C:\AksHci\kubeconfig-clustergroup-management returned a non-zero exit code 1 [ ]

To check whether you have the right configuration for your IP address and gateway, run the following command:

ipconfig /all

In the displayed configuration settings, confirm that the IP address and gateway are correct. You can also attempt to ping the IP gateway and DNS server:

ping <DNS server>

If these methods don't work, use New-AksHciNetworkSetting to change the configuration.

Reason 2: Incorrect DNS server

If you're using static IP addresses, confirm that the DNS server is correctly configured. To check the host's DNS server address, use the following command:

((Get-NetIPConfiguration).DNSServer | ?{ $_.AddressFamily -ne 23 }).ServerAddresses

Confirm that the DNS server address is the same as the address used when running New-AksHciNetworkSetting by running the following command:

Get-MocConfig

If the DNS server has been incorrectly configured, reinstall AKS on Azure Stack HCI with the correct DNS server. For more information, see Restart, remove, or reinstall Azure Kubernetes Service on Azure Stack HCI. In a reported case, the issue was resolved after deleting the configuration and restarting the VM with a new configuration.

Error: "The process cannot access the file 'mocstack.cab' because it is being used by another process"

Install-AksHci failed with this error because another process is accessing mocstack.cab.

To resolve this issue, close all open PowerShell windows and then reopen a new PowerShell window.

Error: Install-AksHci fails with 'Install-MOC failed with the error - the process cannot access the file <path> because it is being used by another process.'

The file cannot be accessed because it is being used by another process.

You can resolve this issue by restarting your PowerShell session. Close the PowerShell window and try Install-AksHci again.

Error: "An existing connection was forcibly closed by the remote host"

Install-AksHci failed with this error because the IP pool ranges provided in the AKS on Azure Stack HCI configuration were off by one in the CIDR, which can cause CloudAgent to crash. For example, if you have the subnet 10.0.0.0/21 with an address range of 10.0.0.0 - 10.0.7.255, and you use a start address of 10.0.0.1 or an end address of 10.0.7.254, CloudAgent crashes.

To work around this issue, run New-AksHciNetworkSetting and use any other valid IP address range for your VIP pool and Kubernetes node pool. Make sure that the values you use are not off by one at the start or end of the address range.
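
A sketch of a corrected network setting for the 10.0.0.0/21 example above; the parameter names are the usual static-IP parameters for New-AksHciNetworkSetting, and the switch name, gateway, and DNS values are placeholders for your environment:

# Keep the VIP pool and Kubernetes node pool away from the first and last
# addresses of the 10.0.0.0/21 subnet (10.0.0.0 and 10.0.7.255)
$vnet = New-AksHciNetworkSetting -name myvnet -vSwitchName "External" `
    -ipAddressPrefix "10.0.0.0/21" -gateway "10.0.0.1" -dnsServers "10.0.0.10" `
    -k8sNodeIpPoolStart "10.0.4.2" -k8sNodeIpPoolEnd "10.0.4.200" `
    -vipPoolStart "10.0.5.2" -vipPoolEnd "10.0.5.200"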

Install-AksHci failed on a multi-node installation with the error 'Nodes have not reached active state'

When running Install-AksHci on a single-node setup, the installation worked. However, when setting up the failover cluster, the installation failed with the 'Nodes have not reached active state' error, even though pinging the cloud agent showed that the CloudAgent was reachable.

To ensure all nodes can resolve the CloudAgent's DNS, run the following command on each node:

Resolve-DnsName <FQDN of cloudagent>

When the step above succeeds on the nodes, make sure the nodes can reach the CloudAgent port to verify that a proxy is not trying to block this connection and the port is open. To do this, run the following command on each node:

Test-NetConnection  <FQDN of cloudagent> -Port <Cloudagent port - default 65000>
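
To run both checks on every node in one pass, a sketch like the following can be used, assuming PowerShell remoting is enabled between the cluster nodes; the cloud agent FQDN is a placeholder and 65000 is the default port:

# Check DNS resolution and the CloudAgent port from every node in the cluster
$cloudAgentFqdn = "<FQDN of cloudagent>"
Invoke-Command -ComputerName (Get-ClusterNode).Name -ScriptBlock {
    Resolve-DnsName $using:cloudAgentFqdn
    Test-NetConnection $using:cloudAgentFqdn -Port 65000
}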

The AKS on Azure Stack HCI download package fails with the error: 'msft.sme.aks couldn't load'

This error is caused by a failed download.

If you get this error, you should use the latest version of Microsoft Edge or Google Chrome and try again.

When running Set-AksHciRegistration, an error 'Unable to check registered Resource Providers' appears

This error appears after running Set-AksHciRegistration in an AKS on Azure Stack HCI installation. The error indicates the Kubernetes Resource Providers are not registered for the tenant that is currently logged in.

To resolve this issue, run either the Azure CLI or the PowerShell steps below:

az provider register --namespace Microsoft.Kubernetes
az provider register --namespace Microsoft.KubernetesConfiguration
Register-AzResourceProvider -ProviderNamespace Microsoft.Kubernetes
Register-AzResourceProvider -ProviderNamespace Microsoft.KubernetesConfiguration

The registration takes approximately 10 minutes to complete. To monitor the registration process, use the following commands.

az provider show -n Microsoft.Kubernetes -o table
az provider show -n Microsoft.KubernetesConfiguration -o table
Get-AzResourceProvider -ProviderNamespace Microsoft.Kubernetes
Get-AzResourceProvider -ProviderNamespace Microsoft.KubernetesConfiguration
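
If you'd rather wait on the registration from PowerShell instead of re-running the commands, a small polling sketch (assuming the Az PowerShell module is installed):

# Poll every 30 seconds until both resource providers report 'Registered'
do {
    Start-Sleep -Seconds 30
    $states = 'Microsoft.Kubernetes', 'Microsoft.KubernetesConfiguration' |
        ForEach-Object { (Get-AzResourceProvider -ProviderNamespace $_).RegistrationState }
} while ($states -ne 'Registered')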

Install-AksHci hangs in the 'Waiting for azure-arc-onboarding to complete' stage before timing out

Note

This issue is fixed in the May 2022 release and later.

Install-AksHci hangs at Waiting for azure-arc-onboarding to complete before timing out when:

  • A service principal is used in AKS on Azure Stack HCI Registration (Set-AksHciRegistration).
  • The Az.Accounts PowerShell module version 2.7.x is installed.

Az.Accounts 2.7.x removes the ServicePrincipalSecret and CertificatePassword from PSAzureRmAccount, which AKS on Azure Stack HCI uses for Azure Arc onboarding.

To reproduce:

  1. Install the Az.Accounts PowerShell module, version 2.7.0 or later.
  2. Set-AksHciRegistration using a service principal.
  3. Install-AksHci.

Expected behavior:

  1. The AKS on Azure Stack HCI installation hangs at Waiting for azure-arc-onboarding to complete.
  2. The azure-arc-onboarding pods go into a crash loop.
  3. The azure-arc-onboarding pods fail with the following error:
    Starting onboarding process ERROR: variable CLIENT_SECRET is required

To resolve this issue:

Uninstall Az.Accounts module versions 2.7.x by running the following cmdlet:

Uninstall-Module -Name Az.Accounts -RequiredVersion 2.7.0 -Force
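
If you're not sure which 2.7.x versions are present, the following sketch lists them and removes each one it finds (standard PowerShellGet cmdlets):

# List installed Az.Accounts versions, then uninstall every 2.7.x version found
Get-InstalledModule -Name Az.Accounts -AllVersions |
    Where-Object { [version]$_.Version -ge [version]'2.7.0' -and [version]$_.Version -lt [version]'2.8.0' } |
    ForEach-Object { Uninstall-Module -Name Az.Accounts -RequiredVersion $_.Version -Force }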

During installation, this error appears: 'unable to create appliance VM: cannot create virtual machine: rpc error = unknown desc = Exception occurred. (Generic failure)]'

This error occurs when Azure Stack HCI is out of policy. The connection status on the cluster may show that it's connected, but the event log shows a warning that the Azure Stack HCI subscription is expired: 'Run Sync-AzureStackHCI to renew the subscription.'

To resolve this error, verify that the cluster is registered with Azure using the Get-AzureStackHCI PowerShell cmdlet that's available on your machine. The Windows Admin Center dashboard also shows status information about the cluster's Azure registration.

If the cluster is already registered, then you should view the LastConnected field in the output of Get-AzureStackHCI. If the field shows it's been more than 30 days, you should attempt to resolve the situation by using the Sync-AzureStackHCI cmdlet.
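
As a sketch, you can combine those checks; this assumes the LastConnected value returned by Get-AzureStackHCI is a date/time:

# Show registration details, and sync if the cluster hasn't connected in over 30 days
$hci = Get-AzureStackHCI
$hci
if ($hci.LastConnected -and ((Get-Date) - $hci.LastConnected).Days -gt 30) {
    Sync-AzureStackHCI
}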

You can also validate whether each node of your cluster has the required license by using the following cmdlet:

Get-ClusterNode | % { Get-AzureStackHCISubscriptionStatus -ComputerName $_ }
Computer Name Subscription Name           Status   Valid To
------------- -----------------           ------   --------
MS-HCIv2-01   Azure Stack HCI             Active   12/23/2021 12:00:14 AM
MS-HCIv2-01   Windows Server Subscription Inactive

MS-HCIv2-02   Azure Stack HCI             Active   12/23/2021 12:00:14 AM
MS-HCIv2-02   Windows Server Subscription Inactive

MS-HCIv2-03   Azure Stack HCI             Active   12/23/2021 12:00:14 AM
MS-HCIv2-03   Windows Server Subscription Inactive

If the issue isn't resolved after running the Sync-AzureStackHCI cmdlet, you should reach out to Microsoft support.

After a failed installation, running Install-AksHci does not work

This issue happens because a failed installation may result in leaked resources that have to be cleaned up before you can install again.

If your installation fails using Install-AksHci, you should run Uninstall-AksHci before running Install-AksHci again.

Error: "unable to reconcile virtual network" or "Error: Install-Moc failed with error - Exception [[Moc] This machine does not appear to be configured for deployment]"

You can trigger these errors when you run Install-AksHci without running Set-AksHciConfig first.

To resolve the error, run Uninstall-AksHci and close all PowerShell windows. Open a new PowerShell session, and restart the AKS-HCI installation by following the steps to install AKS-HCI using PowerShell.

Set-AksHciConfig fails with the error "GetCatalog error returned by API call: ... proxyconnect tcp: tls: first record does not look like a TLS Handshake"

The Set-AksHciConfig PowerShell cmdlet fails with the error:

GetCatalog error returned by API call: ... proxyconnect tcp: tls: first record does not look like a TLS Handshake

If you are using AKS with a proxy server, you might have set the wrong URL for the required HTTPS proxy URL value. Both the HTTP proxy URL and HTTPS proxy URL values are required when configuring AKS with a proxy server, and it's common for both values to use the same HTTP-prefixed URL.

If this might be the case in your environment, try the following mitigation steps:

  1. Close the PowerShell window and open a new one.
  2. Run the New-AksHciNetworkSetting and New-AksHciProxySetting cmdlets again. When running New-AksHciProxySetting, set the -https parameter to the same HTTP-prefixed URL value that you set for -http (a sketch follows this list).
  3. Run Set-AksHciConfig and proceed.
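
A sketch of step 2, assuming $vnet holds your existing network settings object; the proxy address and exclusion list are placeholders, and the key point is that -http and -https get the same HTTP-prefixed URL:

# Reuse the HTTP-prefixed proxy URL for both the -http and -https values
$proxySetting = New-AksHciProxySetting -name "corpProxy" `
    -http "http://proxy.contoso.com:3128" `
    -https "http://proxy.contoso.com:3128" `
    -noProxy "localhost,127.0.0.1,.svc"
Set-AksHciConfig -vnet $vnet -proxySetting $proxySetting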

When you deploy AKS on Azure Stack HCI with a misconfigured network, deployment times out at various points

When you deploy AKS on Azure Stack HCI, the deployment may time out at different points of the process depending on where the misconfiguration occurred. You should review the error message to determine the cause and where it occurred.

For example, in the following error, the point at which the misconfiguration occurred is in Get-DownloadSdkRelease -Name "mocstack-stable":

$vnet = New-AksHciNetworkSetting
Set-AksHciConfig -vnet $vnet
Install-AksHci
VERBOSE: Initializing environment
VERBOSE: [AksHci] Importing Configuration
VERBOSE: [AksHci] Importing Configuration Completed
powershell : GetRelease - error returned by API call:
Post "https://msk8s.api.cdp.microsoft.com/api/v1.1/contents/default/namespaces/default/names/mocstack-stable/versions/0.9.7.0/files?action=generateDownloadInfo&ForegroundPriority=True":
dial tcp 52.184.220.11:443: connectex:
A connection attempt failed because the connected party did not properly
respond after a period of time, or established connection failed because
connected host has failed to respond.
At line:1 char:1
+ powershell -command { Get-DownloadSdkRelease -Name "mocstack-stable"}

This indicates that the physical Azure Stack HCI node can resolve the name of the download URL, msk8s.api.cdp.microsoft.com, but the node can't connect to the target server.

To resolve this issue, you need to determine where the breakdown occurred in the connection flow. Here are some steps to try to resolve the issue from the physical cluster node:

  1. Ping the destination DNS name: ping msk8s.api.cdp.microsoft.com.
  2. If you get a response back and no time-out, then the basic network path is working.
  3. If the connection times out, then there could be a break in the data path. For more information, see check proxy settings. Or, there could be a break in the return path, so you should check the firewall rules.
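
For example, to test the TCP path to the release endpoint directly from the Azure Stack HCI node:

# Verify name resolution and HTTPS reachability of the download endpoint
Test-NetConnection msk8s.api.cdp.microsoft.com -Port 443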

Set-AksHciConfig fails with WinRM errors, but shows WinRM is configured correctly

When running Set-AksHciConfig, you might encounter the following error:

WinRM service is already running on this machine.
WinRM is already set up for remote management on this computer.
Powershell remoting to TK5-3WP08R0733 was not successful.
At C:\Program Files\WindowsPowerShell\Modules\Moc\0.2.23\Moc.psm1:2957 char:17
+ ...             throw "Powershell remoting to "+$env:computername+" was n ...
+                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : OperationStopped: (Powershell remo...not successful.:String) [], RuntimeException
    + FullyQualifiedErrorId : Powershell remoting to TK5-3WP08R0733 was not successful.

This error usually occurs as a result of a change in the user's security token (due to a change in group membership), a password change, or an expired password. In most cases, the issue can be remediated by logging off from the computer and logging back in. If it still fails, you can file an issue at GitHub AKS HCI issues.

Moc agent log rotation is failing

Moc agents are expected to keep only the last 100 agent logs and delete older ones. However, log rotation is not happening, so logs keep accumulating and consuming disk space.

To reproduce: install AKS-HCI and keep a cluster running until the number of agent logs exceeds 100. When the nth log is created, the agents are expected to delete the (n-100)th log, if it exists.

To resolve the issue:

  1. Modify the cloud agent and node agents' logconf files. The cloud agent logconf file is located at:
    (Get-MocConfig).cloudConfigLocation+"\log\logconf"
    The node agent logconf file is located at:
    (Get-MocConfig).nodeConfigLocation+"\log\logconf"

  2. Change the value of Limit to 100 and Slots to 100 and save the configuration files.

  3. Restart the cloud agent and node agents to register these changes.

These steps start the log rotation only after 100 new logs are generated from the agent restart. If there are already n agent logs at the time of restart, log rotation will start only after n+100 logs are generated.

Cloud agent may fail to successfully start when using path names with spaces in them

When using Set-AksHciConfig to specify -imageDir, -workingDir, -cloudConfigLocation, or -nodeConfigLocation parameters with a path name that contains a space character, such as D:\Cloud Share\AKS HCI, the cloud agent cluster service will fail to start with the following (or similar) error message:

Failed to start the cloud agent generic cluster service in failover cluster. The cluster resource group os in the 'failed' state. Resources in 'failed' or 'pending' states: 'MOC Cloud Agent Service'

To work around this issue, use a path that does not include spaces, for example, C:\CloudShare\AKS-HCI.

Error: 'Install-Moc failed with error - Exception [CloudAgent is unreachable. MOC CloudAgent might be unreachable for the following reasons]'

This error may occur when there's an infrastructure misconfiguration.

Use the following steps to resolve this error:

  1. Check the host DNS server configuration and gateway settings:

    1. Confirm that the DNS server is correctly configured. To check the host's DNS server address, run the following command:
      ((Get-NetIPConfiguration).DNSServer | ?{ $_.AddressFamily -ne 23}).ServerAddresses
      
    2. To check whether your IP address and gateway configuration are correct, run the command ipconfig /all.
    3. Attempt to ping the IP gateway and the DNS server.
  2. Check the CloudAgent service to make sure it's running:

    1. Ping the CloudAgent service to ensure it's reachable.
    2. Make sure all nodes can resolve the CloudAgent's DNS by running the following command on each node:
      Resolve-DnsName <FQDN of cloudagent>
      
    3. When the previous step succeeds on the nodes, make sure the nodes can reach the CloudAgent port to verify that a proxy is not trying to block this connection and the port is open. To do this, run the following command on each node:
      Test-NetConnection <FQDN of cloudagent> -Port <Cloudagent port - default 65000>
      
    4. To check if the cluster service is running for a failover cluster, you can also run the following command:
      Get-ClusterGroup -Name (Get-AksHciConfig).Moc['clusterRoleName']
      

If the cloud agent generic cluster service fails to start, this typically indicates that the Cluster Name Object (CNO) representing your underlying failover cluster in Active Directory Domain Services (AD DS) does not have permissions to create a Virtual Computer Object (VCO) in the Organizational Unit (OU) or in the container where the cluster resides.

If you are not a domain administrator, you can ask one to grant the CNO permissions to the OU or prestage a VCO for the cloud agent generic cluster service.

If you are a domain administrator, it is still possible that your OU or container does not have the required permissions. For example, Enforcement mode, introduced in KB5008383, may be enabled in Active Directory. Try the following before attempting a reinstall.

  1. Navigate to Active Directory Users and Computers.
  2. Right-click on the OU or container where the cluster resides.
  3. Select Delegate Control... to open the Delegation of Control Wizard.
  4. Click Next > Click Add... to open the Select Users, Computers, or Groups window.
  5. Select your choice of group or users to whom you want to delegate control > Click OK.
  6. Select Create a custom task to delegate > Click Next to move on to the Active Directory Object Type page.
  7. Select Only the following objects in the folder > Select Computer objects > Select Create selected objects in this folder and Delete selected objects in this folder > Click Next to move on to the Permissions page.
  8. Select Create all Child Objects and Delete All Child Objects from the list of permissions > Click Next > Finish

If a reinstall fails, retry the above with the following changes to Steps 7 and 8:

  • Step 7: Select This folder, existing objects in this folder, and creation of new objects in this folder > Click Next.
  • Step 8: Select Read, Write, Create All Child Objects, and Delete All Child Objects from the list of permissions > Click Next > Click Finish.

Error: Install-AksHci fails with 'Install-Moc failed. Logs are available C:\Users\xxx\AppData\Local\Temp\v0eoltcc.a10'

You may receive this error when running Install-AksHci.

You can get more information by running Install-AksHci and then, after it fails, inspecting $error[0].Exception.InnerException in the same PowerShell session.

PowerShell deployment doesn't check for available memory before creating a new workload cluster

The Aks-Hci PowerShell commands do not validate the available memory on the host server before creating Kubernetes nodes. This issue can lead to memory exhaustion and virtual machines that do not start. This failure is currently not handled gracefully, and the deployment will stop responding with no clear error message.

If you have a deployment that stops responding, open Event Viewer and check for a Hyper-V-related error message indicating there's not enough memory to start the VM.

An 'Unable to acquire token' error appears when running Set-AksHciRegistration

This error can occur when you have multiple tenants on your Azure account.

Use $tenantId = (Get-AzContext).Tenant.Id to set the right tenant. Then, include this tenant as a parameter while running Set-AksHciRegistration.
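
A minimal sketch; the subscription ID and resource group are placeholders, and -subscriptionId, -resourceGroupName, and -tenantId are the registration parameters assumed here:

# Pin the registration to the intended tenant
$tenantId = (Get-AzContext).Tenant.Id
Set-AksHciRegistration -subscriptionId "<subscription ID>" -resourceGroupName "<resource group>" -tenantId $tenantId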

Error: 'Waiting for pod 'Cloud Operator' to be ready'

When attempting to deploy an AKS cluster on an Azure VM, the installation was stuck at Waiting for pod 'Cloud Operator' to be ready..., and then failed and timed out after two hours. Attempts to troubleshoot by checking the gateway and DNS server showed they were working appropriately, and checks for IP or MAC address conflicts found none. The logs didn't show the VIP pool, and pulling the container image with sudo docker pull ecpacr.azurecr.io/kube-vip:0.3.4 failed with a Transport Layer Security (TLS) timeout rather than an 'unauthorized' error.

To resolve this issue, follow these steps:

  1. Start to deploy your cluster.
  2. When the cluster is deployed, connect to your management cluster VM through SSH, as shown below:
    ssh -i (Get-MocConfig)['sshPrivateKey'] clouduser@<IP Address>
  3. Change the maximum transmission unit (MTU) setting. Make this change early; if you make it too late, the deployment fails. Modifying the MTU setting helps unblock the container image pull.
    sudo ifconfig eth0 mtu 1300
  4. To view the status of your containers, run the following command:
    sudo docker ps -a

After you perform these steps, the container image pull should be unblocked.

Error: 'Install-Moc failed with error - Exception [Could not create the failover cluster generic role.]'

This error indicates that the cloud service's IP address is not a part of the cluster network and doesn't match any of the cluster networks that have the client and cluster communication role enabled.

To resolve this issue, on one of the cluster nodes, run Get-ClusterNetwork and look at the networks whose Role is ClusterAndClient. Check the name, address, and address mask of each to verify that the IP address provided for the -cloudServiceIP parameter of New-AksHciNetworkSetting matches one of the displayed networks, as sketched below.
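
A sketch of that check, run from one of the cluster nodes; on older builds the Role value may display as the number 3 instead of ClusterAndClient:

# List cluster networks that allow both cluster and client communication, then
# compare their ranges with the -cloudServiceIP value passed to New-AksHciNetworkSetting
Get-ClusterNetwork | Where-Object { "$($_.Role)" -in 'ClusterAndClient', '3' } |
    Select-Object Name, Address, AddressMask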

Next steps

If you continue to run into problems when you're using AKS Arc, you can file bugs through GitHub.