Scaling AKS node pool, instances stuck in "Creating" state

Nicolas ESPIAU 56 Reputation points
2022-12-09T16:10:17.18+00:00

Hello,

I'm trying to scale out a node pool in a private AKS cluster, and the new nodes get stuck indefinitely in the "Creating" state.

If I go to the nodes, I see their status is "Running", but when I run the "New support request" helper I get this message:
"We found the following details of your deployment failure: the resource operation completed with terminal provisioning state 'failed'.. For general troubleshooting, use the following guides which cover the most common Azure deployment scenarios."

In the scale set overview, the provisioning state stays "Creating" forever.

Meanwhile, the new nodes never appear in my cluster (either with kubectl or in the Azure portal).

I've tried scaling down to 0 and back up to 1, deleting the instances manually, and restarting them, but whatever I try they stay stuck in the "Creating" state indefinitely, and I can't find any insight into why.

Thank you for your help.

Azure Kubernetes Service
An Azure service that provides serverless Kubernetes, an integrated continuous integration and continuous delivery experience, and enterprise-grade security and governance.
Azure Virtual Machine Scale Sets
Azure compute resources that are used to create and manage groups of heterogeneous load-balanced virtual machines.

1 answer

  1. Randa M. Al-Qudah 6 Reputation points Microsoft Employee
    2022-12-11T14:30:15.05+00:00

    Hello @Nicolas ESPIAU ,

    Usually, when nodes are stuck in "Creating" but still show as "Running" at the VMSS level, this indicates a problem with node provisioning/extensions; in other words, an issue onboarding these nodes to AKS.

    Node provisioning errors usually occur due to an issue with outbound connectivity to the required FQDNs/ports that are mentioned here.
    The most important endpoint for node provisioning is "mcr.microsoft.com" over port 443, and it's specifically mentioned here:

    Required to access images in Microsoft Container Registry (MCR). This registry contains first-party images/charts (for example, coreDNS, etc.). These images are required for the correct creation and functioning of the cluster, including scale and upgrade operations.
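    As a quick first check of that requirement, the sketch below (assumptions: run from a Linux host on the same network path as the nodes, e.g. a VM in the same VNet/subnet, since the block may only exist there) tests whether "mcr.microsoft.com" resolves and port 443 is reachable, using bash's built-in /dev/tcp so nothing like telnet needs to be installed:

    ```shell
    # Hedged sketch: check DNS resolution + TCP reachability of the MCR
    # endpoint from the node's network, using bash's /dev/tcp redirection.
    host=mcr.microsoft.com

    # Try to open a TCP connection to port 443 with a 5-second timeout.
    if timeout 5 bash -c "exec 3<>/dev/tcp/${host}/443" 2>/dev/null; then
      result=OPEN
    else
      result=BLOCKED   # DNS failure, or a firewall/NSG dropping the traffic
    fi
    echo "tcp/443 to ${host}: ${result}"
    ```

    If this prints BLOCKED from inside the node's subnet, the scale operation will keep failing until the path to mcr.microsoft.com:443 is allowed.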

    Suggested Actions:
    To investigate this further, we can:

    1. Check nodes' extension details from the portal.
      • Go to the Scale Set > Instances, click on one of the VMs, then click on the status of the VM.
      • Check the extension status and find the exit code to identify the cause. For example, exit code "50" refers to a failure in outbound connectivity, as indicated in this link.
    2. Verify MCR reachability and name resolution from the impacted nodes using "az vmss run-command invoke":
      • az vmss run-command invoke -g <MC_RG> -n <VMSS> --command-id RunShellScript --instance-id <Node-Instance-ID> --query 'value[0].message' -o tsv --scripts "telnet mcr.microsoft.com 443"
      • az vmss run-command invoke -g <MC_RG> -n <VMSS> --command-id RunShellScript --instance-id <Node-Instance-ID> --query 'value[0].message' -o tsv --scripts "nslookup mcr.microsoft.com"
    3. If you have a firewall/NSG filtering outbound traffic or a custom DNS server, please make sure that the required FQDNs are resolvable/allowed.
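    Step 2 can be scripted across several instances. The sketch below is a dry run: it only prints the "az vmss run-command invoke" calls it would make, so you can review them before executing; the resource group, scale set name, and instance IDs are placeholders for your own values.

    ```shell
    rg="MC_myResourceGroup"   # placeholder: the cluster's node resource group
    vmss="aks-nodepool-vmss"  # placeholder: the node pool's scale set name
    instance_ids="0 1"        # placeholder: instance IDs to check

    # Build one "az vmss run-command invoke" call per check, per instance.
    build_check() {  # $1 = instance id, $2 = script to run on that node
      echo "az vmss run-command invoke -g $rg -n $vmss --command-id RunShellScript --instance-id $1 --query 'value[0].message' -o tsv --scripts \"$2\""
    }

    cmds=""
    for id in $instance_ids; do
      cmds+="$(build_check "$id" "nslookup mcr.microsoft.com")"$'\n'
      cmds+="$(build_check "$id" "telnet mcr.microsoft.com 443")"$'\n'
    done

    # Dry run: review the generated commands, then execute the ones you want.
    printf '%s' "$cmds"
    ```

    Running the printed commands requires a logged-in Azure CLI with access to the node resource group.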

    I hope this proves useful to you!

    1 person found this answer helpful.
