Azure Kubernetes Service does not deploy, either with Terraform or manually

Gabor Varga 26 Reputation points
2023-10-25T18:05:22.71+00:00

Hi,

I am trying to create a Kubernetes cluster in Azure. The whole infrastructure must be coded in Terraform, which is fine. However, when I deploy the AKS cluster, the VMSS creation always fails with the following error in the activity log:

{
	"status": "Failed",
	"error": {
		"code": "ResourceOperationFailure",
		"message": "The resource operation completed with terminal provisioning state 'Failed'.",
		"details": [
			{
				"code": "VMExtensionProvisioningError",
				"target": "0",		"message": "VM has reported a failure when processing extension 'vmssCSE' (publisher 'Microsoft.Azure.Extensions' and type 'CustomScript'). Error message: \"Enable failed: failed to execute command: command terminated with exit status=124\\\\n[stdout]\\\\n{ \"ExitCode\": \"124\", \"Output\": \"0i.privatelink.westeurope.azmk8s.io 443\\\\\\\\n+ '[' 62 -eq 100 ']'\\\\\\\\n+ sleep 1\\\\\\\\n+ for i in $(seq 1 $retries)\\\\\\\\n+ timeout 10 nc -vz clus-non-prod-cluster-cdyrgl0i.privatelink.westeurope.azmk8s.io 443\\\\\\\\n+ '[' 63 -eq 100 ']'\\\\\\\\n+ sleep 1\\\\\\\\n+ for i in $(seq 1 $retries)\\\\\\\\n+ timeout 10 nc -vz clus-non-prod-cluster-cdyrgl0i.privatelink.westeurope.azmk8s.io 443\\\\\\\\n+ '[' 64 -eq 100 ']'\\\\\\\\n+ sleep 1\\\\\\\\n+ for i in $(seq 1 $retries)\\\\\\\\n+ timeout 10 nc -vz clus-non-prod-cluster-cdyrgl0i.privatelink.westeurope.azmk8s.io 443\\\\\\\\n+ '[' 65 -eq 100 ']'\\\\\\\\n+ sleep 1\\\\\\\\n+ for i in $(seq 1 $retries)\\\\\\\\n+ timeout 10 nc -vz clus-non-prod-cluster-cdyrgl0i.privatelink.westeurope.azmk8s.io 443\\\\\\\\n+ '[' 66 -eq 100 ']'\\\\\\\\n+ sleep 1\\\\\\\\n+ for i in $(seq 1 $retries)\\\\\\\\n+ timeout 10 nc -vz clus-non-prod-cluster-cdyrgl0i.privatelink.westeurope.azmk8s.io 443\\\\\\\\n+ '[' 67 -eq 100 ']'\\\\\\\\n+ sleep 1\\\\\\\\n+ for i in $(seq 1 $retries)\\\\\\\\n+ timeout 10 nc -vz clus-non-prod-cluster-cdyrgl0i.privatelink.westeurope.azmk8s.io 443\\\\\\\\n+ '[' 68 -eq 100 ']'\\\\\\\\n+ sleep 1\\\\\\\\n+ for i in $(seq 1 $retries)\\\\\\\\n+ timeout 10 nc -vz clus-non-prod-cluster-cdyrgl0i.privatelink.westeurope.azmk8s.io 443\\\\\\\\n+ '[' 69 -eq 100 ']'\\\\\\\\n+ sleep 1\\\\\\\\n+ for i in $(seq 1 $retries)\\\\\\\\n+ timeout 10 nc -vz clus-non-prod-cluster-cdyrgl0i.privatelink.westeurope.azmk8s.io 443\\\\\\\\n+ '[' 70 -eq 100 ']'\\\\\\\\n+ sleep 1\\\\\\\\n+ for i in $(seq 1 $retries)\\\\\\\\n+ timeout 10 nc -vz clus-non-prod-cluster-cdyrgl0i.privatelink.westeurope.azmk8s.io 443\\\\\\\\n+ '[' 71 -eq 100 
']'\\\\\\\\n+ sleep 1\\\\\\\\n+ for i in $(seq 1 $retries)\\\\\\\\n+ timeout 10 nc -vz clus-non-prod-cluster-cdyrgl0i.privatelink.westeurope.azmk8s.io 443\\\\\\\\n+ '[' 72 -eq 100 ']'\\\\\\\\n+ sleep 1\\\\\\\\n+ for i in $(seq 1 $retries)\\\\\\\\n+ timeout 10 nc -vz clus-non-prod-cluster-cdyrgl0i.privatelink.westeurope.azmk8s.io 443\\\\\\\\n+ '[' 73 -eq 100 ']'\\\\\\\\n+ sleep 1\\\\\\\\n+ for i in $(seq 1 $retries)\\\\\\\\n+ timeout 10 nc -vz clus-non-prod-cluster-cdyrgl0i.privatelink.westeurope.azmk8s.io 443\\\\\\\\n+ '[' 74 -eq 100 ']'\\\\\\\\n+ sleep 1\\\\\\\\n+ for i in $(seq 1 $retries)\\\\\\\\n+ timeout 10 nc -vz clus-non-prod-cluster-cdyrgl0i.privatelink.westeurope.azmk8s.io 443\\\\\\\\n+ '[' 75 -eq 100 ']'\\\\\\\\n+ sleep 1\\\\\\\\n+ for i in $(seq 1 $retries)\\\\\\\\n+ timeout 10 nc -vz clus-non-prod-cluster-cdyrgl0i.privatelink.westeurope.azmk8s.io 443\\\\\\\\n+ '[' 76 -eq 100 ']'\\\\\\\\n+ sleep 1\\\\\\\\n+ for i in $(seq 1 $retries)\\\\\\\\n+ timeout 10 nc -vz clus-non-prod-cluster-cdyrgl0i.privatelink.westeurope.azmk8s.io 443\\\\\\\\n+ '[' 77 -eq 100 ']'\\\\\\\\n+ sleep 1\\\\\\\\n+ for i in $(seq 1 $retries)\\\\\\\\n+ timeout 10 nc -vz clus-non-prod-cluster-cdyrgl0i.privatelink.westeurope.azmk8s.io 443\\\\\\\\n+ '[' 78 -eq 100 ']'\\\\\\\\n+ sleep 1\\\\\\\\n+ for i in $(seq 1 $retries)\\\\\\\\n+ timeout 10 nc -vz clus-non-prod-cluster-cdyrgl0i.privatelink.westeurope.azmk8s.io 443\\\\\\\\n+ '[' 79 -eq 100 ']'\\\\\\\\n+ sleep 1\\\\\\\\n+ for i in $(seq 1 $retries)\\\\\\\\n+ timeout 10 nc -vz clus-non-prod-cluster-cdyrgl0i.privatelink.westeurope.azmk8s.io 443\\\\\\\\n+ '[' 80 -eq 100 ']'\\\\\\\\n+ sleep 1\\\\\\\\n+ for i in $(seq 1 $retries)\\\\\\\\n+ timeout 10 nc -vz clus-non-prod-cluster-cdyrgl0i.privatelink.westeurope.azmk8s.io 443\\\\\\\\n+ '[' 81 -eq 100 ']'\\\\\\\\n+ sleep 1\\\\\\\\n+ for i in $(seq 1 $retries)\\\\\\\\n+ timeout 10 nc -vz clus-non-prod-cluster-cdyrgl0i.privatelink.westeurope.azmk8s.io 443\", \"Error\": \"\", \"ExecDuration\": 
\"900\", \"KernelStartTime\": \"Wed 2023-10-25 16:32:41 UTC\", \"CloudInitLocalStartTime\": \"Wed 2023-10-25 16:32:45 UTC\", \"CloudInitStartTime\": \"Wed 2023-10-25 16:32:48 UTC\", \"CloudFinalStartTime\": \"Wed 2023-10-25 16:32:56 UTC\", \"NetworkdStartTime\": \"Wed 2023-10-25 16:32:46 UTC\", \"CSEStartTime\": \"Wed Oct 25 16:33:02 UTC 2023\", \"GuestAgentStartTime\": \"Wed 2023-10-25 16:32:55 UTC\", \"SystemdSummary\": \"Startup finished in 2.546s (kernel) + 1min 30.260s (userspace) = 1min 32.807s \\\\\\\\ngraphical.target reached after 12.209s in userspace\", \"BootDatapoints\": { \"KernelStartTime\": \"Wed 2023-10-25 16:32:41 UTC\", \"CSEStartTime\": \"Wed Oct 25 16:33:02 UTC 2023\", \"GuestAgentStartTime\": \"Wed 2023-10-25 16:32:55 UTC\", \"KubeletStartTime\": \"Wed 2023-10-25 16:33:05 UTC\" } }\\\\n\\\\n[stderr]\\\\n\". More information on troubleshooting is available at https://aka.ms/VMExtensionCSELinuxTroubleshoot. "
			}
		]
	}
}

The same issue occurs when I create the cluster manually, without Terraform.

Some details about the infrastructure:

  • The Terraform deployment uses the userDefinedRouting outbound type, while the manual one uses a load balancer.
  • The default outbound route goes through an Azure Firewall, which allows all traffic toward the internet. During the deployment, a lot of traffic from the AKS nodes is visible in the Azure Firewall log.
  • DNS name resolution works fine. I tested it from a Linux machine that I created manually for testing in the same VNet where AKS is deployed.
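
The retry loop visible in the extension output (`timeout 10 nc -vz <fqdn> 443` followed by `sleep 1`) can be approximated from a test VM in the same VNet. The following is only a sketch, not the actual AKS provisioning script: the function name `check_endpoint`, its return codes, and the default of 5 retries are made up for illustration, and it assumes `getent`, `timeout`, and `nc` are available on the machine. It checks DNS first so that a resolution failure is not confused with a routing failure.

```shell
# Hypothetical helper: resolve a host, then retry a TCP connect test.
# Arguments: host, port (default 443), retries (default 5).
check_endpoint() {
    host="$1"
    port="${2:-443}"
    retries="${3:-5}"

    # Step 1: DNS. Return 2 so the caller can tell DNS apart from TCP failures.
    if ! getent hosts "$host" >/dev/null 2>&1; then
        echo "DNS resolution failed for $host"
        return 2
    fi

    # Step 2: TCP connect with a 10-second timeout per attempt,
    # mirroring the loop shown in the error output.
    for i in $(seq 1 "$retries"); do
        if timeout 10 nc -vz "$host" "$port" >/dev/null 2>&1; then
            echo "$host:$port reachable (attempt $i)"
            return 0
        fi
        sleep 1
    done

    echo "$host:$port not reachable after $retries attempts"
    return 1
}
```

For example, `check_endpoint clus-non-prod-cluster-cdyrgl0i.privatelink.westeurope.azmk8s.io 443` would test the endpoint named in the log above. A return code of 1 with working DNS points toward routing or firewall rules rather than name resolution.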

Does anyone have any idea?

Thanks!

G

Azure Kubernetes Service (AKS)

2 answers

  1. AlaaBarqawi_MSFT 942 Reputation points Microsoft Employee
    2023-10-26T08:49:05.44+00:00

    Hi @Gabor Varga

    It seems there is an outbound connectivity issue from the worker nodes that prevents the VMs from being provisioned.

    Do you have a route table and a firewall on the AKS node subnet?

    Can you run these commands from Cloud Shell?

    # Get the VMSS instance IDs.
    az vmss list-instances --resource-group <mc-resource-group-name> \
        --name <vmss-name> \
        --output table
    
    # Use an instance ID to test outbound connectivity.
    az vmss run-command invoke --resource-group <mc-resource-group-name> \
        --name <vmss-name> \
        --command-id RunShellScript \
        --instance-id <vmss-instance-id> \
        --output json \
        --scripts "nc -vz mcr.microsoft.com 443"
    

    Please send the results.

    Refer to:

    https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/error-code-outboundconnfailvmextensionerror

    https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/error-code-vmextensionprovisioningtimeout
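
    The error output in the question shows the nodes timing out against the private API server FQDN rather than `mcr.microsoft.com`, so it may also help to point the same run-command at that endpoint. A minimal sketch, assuming the failed VMSS instance still accepts run-commands; the wrapper name `test_node_to_api_server` and its argument order are hypothetical, while the `az` flags are the ones used in the commands above.

    ```shell
    # Hypothetical wrapper around the run-command shown above.
    # Arguments: node resource group, VMSS name, instance ID, API server FQDN.
    test_node_to_api_server() {
        az vmss run-command invoke --resource-group "$1" \
            --name "$2" \
            --command-id RunShellScript \
            --instance-id "$3" \
            --output json \
            --scripts "nc -vz $4 443"
    }
    ```

    Usage: `test_node_to_api_server <mc-resource-group-name> <vmss-name> <vmss-instance-id> <api-server-fqdn>`.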


  2. Gabor Varga 26 Reputation points
    2023-10-26T11:53:25.2933333+00:00

    Hi,

    In the meantime, I found the issue.

    Root cause: I had added the following route to the subnet's route table:

    • Address prefix: VNet address space
    • Next hop type: Virtual Network

    With this route configured, the nodes cannot connect to the cluster API server.

    The default route (0.0.0.0/0) points directly to the Azure Firewall.

    The DNS server is also the Azure Firewall.

    I added the route above so that all VNet-internal traffic bypasses the firewall, but it seems to break access to the API server.

    All other services work fine with this setup, except the AKS API server.
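
    For anyone checking their own setup, the routes on the subnet's route table can be listed and an offending one removed with the Azure CLI. This is a sketch only: the function names `list_routes` and `delete_route` are made up, and the resource group, route table, and route names are placeholders to be supplied. Note that the CLI reports the portal's "Virtual network" next hop type as `VnetLocal`.

    ```shell
    # Hypothetical helper: list all routes in a route table.
    # Arguments: resource group, route table name.
    list_routes() {
        az network route-table route list \
            --resource-group "$1" \
            --route-table-name "$2" \
            --output table
    }

    # Hypothetical helper: delete a single route by name.
    # Arguments: resource group, route table name, route name.
    delete_route() {
        az network route-table route delete \
            --resource-group "$1" \
            --route-table-name "$2" \
            --name "$3"
    }
    ```

    Usage: run `list_routes <resource-group> <route-table-name>` first, look for a route whose address prefix is the VNet address space with next hop type `VnetLocal`, then pass its name to `delete_route`.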

    G