Hello @Nicolas ESPIAU,
Usually, when nodes are stuck in "Creating" but are still shown as "Running" at the VMSS level, this indicates a problem with node provisioning/extensions (in other words, an issue onboarding these nodes to AKS).
Node provisioning errors are usually caused by an issue with outbound connectivity to the required FQDNs/ports that are listed here.
The most important endpoint for node provisioning is "mcr.microsoft.com" over port 443, and it is specifically called out here:
> Required to access images in Microsoft Container Registry (MCR). This registry contains first-party images/charts (for example, coreDNS, etc.). These images are required for the correct creation and functioning of the cluster, including scale and upgrade operations.
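Before touching the nodes themselves, a quick sanity check from any machine that shares the cluster's outbound path (for example, a jumpbox in the same VNet) can tell you whether that endpoint is reachable at all. This is only an illustrative sketch, not the actual AKS provisioning check; the `check_tcp` helper name is mine, and it assumes bash plus coreutils' `timeout`:

```shell
#!/usr/bin/env bash
# check_tcp: report whether a TCP connection to <host> <port> succeeds.
# Illustrative helper only -- not the actual AKS provisioning logic.
check_tcp() {
  local host="$1" port="$2"
  # bash's /dev/tcp pseudo-device opens a TCP connection without needing telnet.
  if timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "OK: ${host}:${port} reachable"
    return 0
  else
    echo "FAIL: ${host}:${port} unreachable"
    return 1
  fi
}

# The key endpoint for node provisioning:
check_tcp mcr.microsoft.com 443 || echo "Investigate firewall/NSG/UDR rules on the outbound path"
```

The same helper works for any other FQDN/port pair in the required egress list.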
Suggested Actions:
To investigate this further, we can:
- Check the nodes' extension details from the portal:
  - Go to the Scale Set > Instances > click on one of the VMs, then click on the status of the VM.
  - Check the extension status and find the exit code to identify the cause. For example, exit code "50" indicates a failure in outbound connectivity, as described in this link.
- Verify MCR reachability and name resolution from the impacted nodes using "az vmss run-command invoke":
  ```shell
  az vmss run-command invoke -g <MC_RG> -n <VMSS> --command-id RunShellScript --instance-id <Node-Instance-ID> --query 'value[0].message' -o tsv --scripts "telnet mcr.microsoft.com 443"
  az vmss run-command invoke -g <MC_RG> -n <VMSS> --command-id RunShellScript --instance-id <Node-Instance-ID> --query 'value[0].message' -o tsv --scripts "nslookup mcr.microsoft.com"
  ```
- If a firewall/NSG filters your outbound traffic, or you use a custom DNS server, please make sure the required FQDNs are resolvable and allowed.
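Putting the DNS and connectivity checks together, the script below is a sketch of what you could pass to run-command (for example via `--scripts @check.sh`) or run from any VM behind the same firewall. It uses `getent` and bash's `/dev/tcp` because `telnet` is often not installed on AKS nodes; the FQDN list here is illustrative only, and the authoritative list is in the documentation linked above:

```shell
#!/usr/bin/env bash
# Sketch of a node-side outbound diagnostic. Illustrative FQDN list only;
# consult the AKS egress documentation for the authoritative list.
FQDNS="mcr.microsoft.com management.azure.com login.microsoftonline.com"

for fqdn in $FQDNS; do
  # DNS resolution via the node's configured resolver.
  if timeout 5 getent hosts "$fqdn" >/dev/null; then
    echo "DNS OK:   $fqdn"
  else
    echo "DNS FAIL: $fqdn (check custom DNS forwarding)"
    continue
  fi
  # TCP 443 reachability without needing telnet.
  if timeout 5 bash -c "exec 3<>/dev/tcp/${fqdn}/443" 2>/dev/null; then
    echo "TCP OK:   $fqdn:443"
  else
    echo "TCP FAIL: $fqdn:443 (check firewall/NSG/UDR)"
  fi
done
```

A "DNS FAIL" line points at the custom DNS server, while "DNS OK" plus "TCP FAIL" points at the firewall/NSG/route table.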
I hope this proves useful to you!