Description of the Problem
Hello,
I am using the AKS-Construction helper (https://github.com/Azure/AKS-Construction) to deploy a public Azure Kubernetes Service (AKS) cluster. At the end of the wizard, I get the following deployment command and parameters:
az deployment group create -g alfred-butler-dev-rg --template-uri https://github.com/Azure/AKS-Construction/releases/download/0.10.7/main.json --parameters \
resourceName=my-aks-dev \
managedNodeResourceGroup=my-managed-aks-dev-rg \
kubernetesVersion=1.32.4 \
agentCount=5 \
upgradeChannel=stable \
agentCountMax=10 \
osDiskType=Managed \
osDiskSizeGB=32 \
custom_vnet=true \
enable_aad=true \
AksDisableLocalAccounts=true \
enableAzureRBAC=true \
adminPrincipalId=$(az ad signed-in-user show --query id --out tsv) \
registries_sku=Basic \
acrPushRolePrincipalId=$(az ad signed-in-user show --query id --out tsv) \
azureFirewalls=true \
certManagerFW=true \
enableNodePublicIP=true \
omsagent=true \
retentionInDays=90 \
containerLogsV2BasicLogs=true \
serviceMeshProfile=NaN \
azurepolicy=audit \
podCidr=10.240.100.0/24 \
cniDynamicIpAllocation=true \
dnsZoneId=/subscriptions/72467751-07c0-46cb-8104-517f3e9cfd63/resourceGroups/civilia-dev-dns-rg/providers/Microsoft.Network/dnszones/dev-civilia.ca \
ingressApplicationGateway=true \
appGWcount=8 \
appGWsku=WAF_v2 \
appGWmaxCount=20 \
appgwKVIntegration=true \
keyVaultKmsCreate=true \
keyVaultKmsOfficerRolePrincipalId=$(az ad signed-in-user show --query id --out tsv) \
keyVaultAksCSI=true \
keyVaultCreate=true \
keyVaultOfficerRolePrincipalId=$(az ad signed-in-user show --query id --out tsv) \
oidcIssuer=true \
workloadIdentity=true \
enableSysLog=true
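For completeness, this is roughly how I connect to the cluster afterwards (Azure AD auth via kubelogin, since local accounts are disabled; the cluster name is taken from the deployment output):

# Pull the kubeconfig for the new cluster (substitute the actual cluster name from the deployment output)
az aks get-credentials -g alfred-butler-dev-rg -n <cluster-name>

# Switch the kubeconfig to Azure CLI token auth, since AksDisableLocalAccounts=true and enableAzureRBAC=true
kubelogin convert-kubeconfig -l azurecli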
When I connect to the created cluster and check the pods in the kube-system namespace, I get the following:
kubectl get pods -o wide --namespace kube-system
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
aks-secrets-store-csi-driver-75kgs 2/3 CrashLoopBackOff 28 (3m42s ago) 54m 10.240.100.95 aks-userpool01-68104822-vmss000001 <none> <none>
aks-secrets-store-csi-driver-hx8h5 1/3 CrashLoopBackOff 29 (3m17s ago) 58m 10.240.100.19 aks-agentpool-30311657-vmss000000 <none> <none>
aks-secrets-store-csi-driver-qx4rg 2/3 CrashLoopBackOff 28 (4m25s ago) 54m 10.240.100.58 aks-userpool01-68104822-vmss000000 <none> <none>
aks-secrets-store-csi-driver-wpgn6 2/3 CrashLoopBackOff 28 (4m37s ago) 54m 10.240.100.84 aks-userpool01-68104822-vmss000003 <none> <none>
aks-secrets-store-csi-driver-x6dt8 1/3 CrashLoopBackOff 27 (3m44s ago) 53m 10.240.100.106 aks-userpool01-68104822-vmss000004 <none> <none>
aks-secrets-store-csi-driver-x8bq4 2/3 CrashLoopBackOff 28 (4m17s ago) 54m 10.240.100.51 aks-userpool01-68104822-vmss000002 <none> <none>
aks-secrets-store-provider-azure-c57hs 1/1 Running 0 54m 10.240.0.9 aks-userpool01-68104822-vmss000001 <none> <none>
aks-secrets-store-provider-azure-h2rl5 1/1 Running 0 54m 10.240.0.7 aks-userpool01-68104822-vmss000002 <none> <none>
aks-secrets-store-provider-azure-l5scf 1/1 Running 0 58m 10.240.0.4 aks-agentpool-30311657-vmss000000 <none> <none>
aks-secrets-store-provider-azure-txqkz 1/1 Running 0 54m 10.240.0.6 aks-userpool01-68104822-vmss000003 <none> <none>
aks-secrets-store-provider-azure-wzsmn 1/1 Running 0 53m 10.240.0.5 aks-userpool01-68104822-vmss000004 <none> <none>
aks-secrets-store-provider-azure-xbn44 1/1 Running 0 54m 10.240.0.8 aks-userpool01-68104822-vmss000000 <none> <none>
ama-logs-5r878 2/2 Running 0 54m 10.240.100.76 aks-userpool01-68104822-vmss000003 <none> <none>
ama-logs-c8j4w 2/2 Running 0 53m 10.240.100.109 aks-userpool01-68104822-vmss000004 <none> <none>
ama-logs-drpzj 2/2 Running 0 57m 10.240.100.30 aks-agentpool-30311657-vmss000000 <none> <none>
ama-logs-l66xn 2/2 Running 0 54m 10.240.100.94 aks-userpool01-68104822-vmss000001 <none> <none>
ama-logs-rs-78d658c9-tfkh7 1/1 Running 0 57m 10.240.100.13 aks-agentpool-30311657-vmss000000 <none> <none>
ama-logs-tpxdj 2/2 Running 0 54m 10.240.100.62 aks-userpool01-68104822-vmss000000 <none> <none>
ama-logs-zz7xd 2/2 Running 0 54m 10.240.100.39 aks-userpool01-68104822-vmss000002 <none> <none>
azure-cns-8z68p 1/1 Running 0 54m 10.240.0.7 aks-userpool01-68104822-vmss000002 <none> <none>
azure-cns-9ztdr 1/1 Running 0 54m 10.240.0.6 aks-userpool01-68104822-vmss000003 <none> <none>
azure-cns-fdmtr 1/1 Running 0 54m 10.240.0.8 aks-userpool01-68104822-vmss000000 <none> <none>
azure-cns-g6ssq 1/1 Running 0 58m 10.240.0.4 aks-agentpool-30311657-vmss000000 <none> <none>
azure-cns-jdgpt 1/1 Running 0 54m 10.240.0.9 aks-userpool01-68104822-vmss000001 <none> <none>
azure-cns-tlgf5 1/1 Running 0 53m 10.240.0.5 aks-userpool01-68104822-vmss000004 <none> <none>
azure-policy-86887785db-k885d 1/1 Running 0 60m 10.240.100.24 aks-agentpool-30311657-vmss000000 <none> <none>
azure-policy-webhook-6fbc657745-9jddk 0/1 CrashLoopBackOff 9 (4m21s ago) 27m 10.240.100.104 aks-userpool01-68104822-vmss000004 <none> <none>
azure-wi-webhook-controller-manager-58bf869886-6gdll 0/1 CrashLoopBackOff 15 (2m33s ago) 57m 10.240.100.25 aks-agentpool-30311657-vmss000000 <none> <none>
azure-wi-webhook-controller-manager-58bf869886-tjb4d 0/1 CrashLoopBackOff 15 (2m26s ago) 57m 10.240.100.22 aks-agentpool-30311657-vmss000000 <none> <none>
cloud-node-manager-d2lf8 1/1 Running 0 54m 10.240.0.6 aks-userpool01-68104822-vmss000003 <none> <none>
cloud-node-manager-pj9sq 1/1 Running 0 58m 10.240.0.4 aks-agentpool-30311657-vmss000000 <none> <none>
cloud-node-manager-s5j29 1/1 Running 0 54m 10.240.0.7 aks-userpool01-68104822-vmss000002 <none> <none>
cloud-node-manager-tsjvv 1/1 Running 0 54m 10.240.0.8 aks-userpool01-68104822-vmss000000 <none> <none>
cloud-node-manager-z2zpg 1/1 Running 0 54m 10.240.0.9 aks-userpool01-68104822-vmss000001 <none> <none>
cloud-node-manager-zc2mw 1/1 Running 0 53m 10.240.0.5 aks-userpool01-68104822-vmss000004 <none> <none>
coredns-77f74c584-2slsm 0/1 Running 0 60m 10.240.100.7 aks-agentpool-30311657-vmss000000 <none> <none>
coredns-autoscaler-79bcb4fd6b-l9k2q 1/1 Running 22 (5m28s ago) 60m 10.240.100.14 aks-agentpool-30311657-vmss000000 <none> <none>
csi-azuredisk-node-fc5b9 3/3 Running 0 54m 10.240.0.7 aks-userpool01-68104822-vmss000002 <none> <none>
csi-azuredisk-node-fw45w 3/3 Running 0 54m 10.240.0.8 aks-userpool01-68104822-vmss000000 <none> <none>
csi-azuredisk-node-jvtmx 3/3 Running 0 53m 10.240.0.5 aks-userpool01-68104822-vmss000004 <none> <none>
csi-azuredisk-node-nk7gk 3/3 Running 0 58m 10.240.0.4 aks-agentpool-30311657-vmss000000 <none> <none>
csi-azuredisk-node-ttpj8 3/3 Running 0 54m 10.240.0.9 aks-userpool01-68104822-vmss000001 <none> <none>
csi-azuredisk-node-zkzpv 3/3 Running 0 54m 10.240.0.6 aks-userpool01-68104822-vmss000003 <none> <none>
csi-azurefile-node-7fdzf 3/3 Running 0 54m 10.240.0.6 aks-userpool01-68104822-vmss000003 <none> <none>
csi-azurefile-node-8mhbt 3/3 Running 0 54m 10.240.0.7 aks-userpool01-68104822-vmss000002 <none> <none>
csi-azurefile-node-8w2pd 3/3 Running 0 58m 10.240.0.4 aks-agentpool-30311657-vmss000000 <none> <none>
csi-azurefile-node-kcrsv 3/3 Running 0 54m 10.240.0.9 aks-userpool01-68104822-vmss000001 <none> <none>
csi-azurefile-node-l5nkt 3/3 Running 0 54m 10.240.0.8 aks-userpool01-68104822-vmss000000 <none> <none>
csi-azurefile-node-w6mpb 3/3 Running 0 53m 10.240.0.5 aks-userpool01-68104822-vmss000004 <none> <none>
ingress-appgw-deployment-5c9c7bbdbc-jwgsm 0/1 CrashLoopBackOff 19 (4m37s ago) 60m 10.240.100.16 aks-agentpool-30311657-vmss000000 <none> <none>
konnectivity-agent-7ccd6c6945-vfl72 0/1 Running 2 (2m27s ago) 8m27s 10.240.100.54 aks-userpool01-68104822-vmss000000 <none> <none>
konnectivity-agent-autoscaler-844df78bbd-bl2w7 0/1 CrashLoopBackOff 4 (47s ago) 3m27s 10.240.100.33 aks-agentpool-30311657-vmss000000 <none> <none>
kube-proxy-chrrl 1/1 Running 0 58m 10.240.0.4 aks-agentpool-30311657-vmss000000 <none> <none>
kube-proxy-kslf5 1/1 Running 0 53m 10.240.0.5 aks-userpool01-68104822-vmss000004 <none> <none>
kube-proxy-lbf4z 1/1 Running 0 54m 10.240.0.9 aks-userpool01-68104822-vmss000001 <none> <none>
kube-proxy-lsrqm 1/1 Running 0 54m 10.240.0.8 aks-userpool01-68104822-vmss000000 <none> <none>
kube-proxy-mfkgf 1/1 Running 0 54m 10.240.0.7 aks-userpool01-68104822-vmss000002 <none> <none>
kube-proxy-txpsl 1/1 Running 0 54m 10.240.0.6 aks-userpool01-68104822-vmss000003 <none> <none>
metrics-server-6c47bdccfd-fwgjc 1/2 CrashLoopBackOff 15 (2m32s ago) 60m 10.240.100.8 aks-agentpool-30311657-vmss000000 <none> <none>
metrics-server-6c47bdccfd-kjdsz 1/2 CrashLoopBackOff 15 (2m13s ago) 60m 10.240.100.10 aks-agentpool-30311657-vmss000000 <none> <none>
It turns out all the failing deployments and their pods are crashing because of failing readiness and liveness probes, as shown by the following events from the ingress-appgw pod:
Events:
Type Reason Age From Message
Warning FailedScheduling 59m (x17 over 62m) default-scheduler no nodes available to schedule pods
Warning FailedScheduling 59m default-scheduler 0/1 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
Normal Scheduled 59m default-scheduler Successfully assigned kube-system/ingress-appgw-deployment-5c9c7bbdbc-jwgsm to aks-agentpool-30311657-vmss000000
Normal Pulling 59m kubelet Pulling image "mcr.microsoft.com/azure-application-gateway/kubernetes-ingress:1.8.1"
Normal Pulled 59m kubelet Successfully pulled image "mcr.microsoft.com/azure-application-gateway/kubernetes-ingress:1.8.1" in 1.586s (1.586s including waiting). Image size: 27686229 bytes.
Warning Unhealthy 56m (x8 over 59m) kubelet Liveness probe failed: Get "http://10.240.100.16:8123/health/alive": dial tcp 10.240.100.16:8123: connect: connection refused
Normal Created 54m (x6 over 59m) kubelet Created container: ingress-appgw-container
Normal Started 54m (x6 over 59m) kubelet Started container ingress-appgw-container
Normal Killing 14m (x17 over 58m) kubelet Container ingress-appgw-container failed liveness probe, will be restarted
Warning Unhealthy 14m (x114 over 59m) kubelet Readiness probe failed: Get "http://10.240.100.16:8123/health/ready": dial tcp 10.240.100.16:8123: connect: connection refused
Warning BackOff 4m15s (x166 over 53m) kubelet Back-off restarting failed container ingress-appgw-container in pod ingress-appgw-deployment-5c9c7bbdbc-jwgsm_kube-system(522153dc-3909-4fb8-af56-44e2e6d90c91)
Normal Pulled 70s (x20 over 58m) kubelet Container image "mcr.microsoft.com/azure-application-gateway/kubernetes-ingress:1.8.1" already present on machine
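For reference, the events above come straight from describing the pod:

kubectl describe pod ingress-appgw-deployment-5c9c7bbdbc-jwgsm --namespace kube-system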
I also noticed that the following command fails for every pod (I have to describe the pod and rely on its events instead):
kubectl logs ingress-appgw-deployment-5c9c7bbdbc-jwgsm --namespace kube-system
Error from server: Get "https://10.240.0.4:10250/containerLogs/kube-system/ingress-appgw-deployment-5c9c7bbdbc-jwgsm/ingress-appgw-container": EOF
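In case it's relevant, the EOF above is the API server failing to reach the kubelet on port 10250, which goes through the konnectivity tunnel, and the konnectivity pods are also unhealthy in the listing above. A quick sanity check one could run (just a sketch; the label selector and node name are taken from the output above and may differ):

# List the konnectivity tunnel agents (they carry API-server -> kubelet traffic such as log requests)
kubectl get pods -n kube-system -l app=konnectivity-agent -o wide

# Ask the API server to proxy a healthz call to the kubelet; an EOF here points at the same tunnel problem
kubectl get --raw "/api/v1/nodes/aks-agentpool-30311657-vmss000000/proxy/healthz"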
Has anyone run into these issues? I am confused and stuck trying to find a solution. Thanks.