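Environment-related issues, such as network configuration problems, can be hard to pinpoint and can cause AKS cluster creation to fail. The diagnostic checker is a PowerShell-based tool that helps identify AKS cluster creation failures caused by potential issues in the environment.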
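Note
You can use the diagnostic checker tool only if an AKS cluster was created but is in a failed state. You can't use the tool if you don't see the AKS cluster in the Azure portal. If AKS cluster creation fails before the Azure Resource Manager resource is created, file a support request.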
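Before you begin
Before you begin, make sure you meet the following prerequisites. If you can't meet the requirements for running the diagnostic checker tool, file a support request:
- Direct access to the Azure Local cluster where you created the AKS cluster. This access can be through Remote Desktop (RDP), or you can sign in to one of the Azure Local physical nodes.
- Familiarity with the networking concepts for creating an AKS cluster and with the AKS cluster architecture.
- The name of the logical network attached to the AKS cluster.
- The SSH private key for the AKS cluster, used to sign in to the AKS cluster control plane node VM.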
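Obtain the control plane node VM IP of the AKS cluster
Run the following command from any one of the physical nodes in your Azure Local cluster. Make sure that $cluster_name holds the name of the AKS cluster, not the Azure Resource Manager ID: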
invoke-command -computername (get-clusternode) -script {get-vmnetworkadapter -vmname *} | Where-Object {$_.Name -like "$cluster_name*control-plane-*"} | select vmname, ipaddresses
Expected output:
VMName IPAddresses
------ -----------
<cluster-name>-XXXXXX-control-plane-XXXXXX {172.16.0.10, 172.16.0.4, fe80::ec:d3ff:fea0:1}
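If you don't see a control plane VM as shown in the preceding output, file a support request.
If you see a control plane VM and it has:
- 0 IPv4 addresses: file a support request.
- 1 IPv4 address: use that IPv4 address as the input for the vmIP parameter.
- 2 IPv4 addresses: use either one of the IPv4 addresses as the input for the vmIP parameter in the diagnostic checker.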
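The selection rule above can be sketched in PowerShell. This is a minimal illustration, not part of the diagnostic checker itself: the $ipAddresses array is hypothetical sample data copied from the expected output, and the variable names $ipv4Addresses and $vmIP are assumptions made for this example.

```powershell
# Hypothetical sample: the IPAddresses value reported for one control plane VM.
$ipAddresses = @('172.16.0.10', '172.16.0.4', 'fe80::ec:d3ff:fea0:1')

# Keep only IPv4 entries; IPv6 addresses report AddressFamily 'InterNetworkV6'.
$ipv4Addresses = @($ipAddresses | Where-Object {
    ([System.Net.IPAddress]$_).AddressFamily -eq 'InterNetwork'
})

if ($ipv4Addresses.Count -eq 0) {
    # No IPv4 address: the tool can't be used, so a support request is the next step.
    Write-Warning 'No IPv4 address found on the control plane VM. File a support request.'
} else {
    # Either IPv4 address is acceptable; this sketch simply takes the first one.
    $vmIP = $ipv4Addresses[0]
    Write-Output "Use -vmIP $vmIP"
}
```

With the sample data, the sketch keeps the two IPv4 addresses and picks the first one as the value for the vmIP parameter.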
Run the diagnostic checker script
Copy the following PowerShell script, named run_diagnostic.ps1, to any one node of your Azure Local cluster:
<#
.SYNOPSIS
    Runs diagnostic checker tool in target cluster control plane VM and returns the result.
    This script runs the following tests from target cluster control plane VM:
    1. cloud-agent-connectivity-test: Checks whether the DNS server can resolve the MOC cloud agent FQDN and that the cloud agent is reachable from the control plane node VM. Cloud agent is created using one of the IP addresses from the management IP pool, on port 55000. The control plane node VM is given an IP address from the Arc VM logical network.
    2. gateway-icmp-ping-test: Checks whether the gateway specified in the logical network attached to the AKS cluster is reachable from the AKS cluster control plane node VM.
    3. http-connectivity-required-url-test: Checks whether the required URLs are reachable from the AKS cluster control plane node VM.

.DESCRIPTION
    This script copies a test configuration file to the target cluster control plane VM using SCP (Secure Copy Protocol), runs the diagnostic checker on that VM over SSH, and copies the results back to the local machine for display.

.PARAMETER lnetName
    The name of the LNET used for the cluster.

.PARAMETER sshPath
    The path to the private SSH key for the target cluster.

.PARAMETER vmIP
    IP of the target cluster control plane VM.

.EXAMPLE
    .\run_diagnostic.ps1 -lnetName lnet1 -sshPath C:\Users\test\.ssh\test-ssh.pem -vmIP "172.16.0.10"
    This example runs diagnostic checker tool in the VM with IP 172.16.0.10 using ssh key C:\Users\test\.ssh\test-ssh.pem and outputs the result.
#>
param (
    [Parameter(Mandatory=$true)]
    [string]$lnetName,
    [Parameter(Mandatory=$true)]
    [string]$sshPath,
    [Parameter(Mandatory=$true)]
    [string]$vmIP
)

$urlArray = @(
    "https://management.azure.com",
    "https://eastus.dp.kubernetesconfiguration.azure.com",
    "https://login.microsoftonline.com",
    "https://eastus.login.microsoft.com",
    "https://login.windows.net",
    "https://mcr.microsoft.com",
    "https://gbl.his.arc.azure.com",
    "https://k8connecthelm.download.prss.microsoft.com",
    "https://guestnotificationservice.azure.com",
    "https://sts.windows.net",
    "https://graph.microsoft.com"
)
$urlList=$urlArray -join ","
# check vm is reachable
try {
    $pingResult = Test-Connection -ComputerName $vmIP -Count 1 -ErrorAction Stop
    if ($pingResult.StatusCode -eq 0) {
        Write-Host "Connection to $vmIP succeeded."
    } else {
        Write-Host "Connection to AKS cluster control plane VM $vmIP failed with status code: $($pingResult.StatusCode). Please make sure AKS cluster control plane VM $vmIP is reachable from the host"
        exit
    }
} catch {
    Write-Host "Connection to AKS cluster control plane VM $vmIP failed. Please make sure AKS cluster control plane VM $vmIP is reachable from the host"
    Write-Host "Exception message: $_"
    exit
}
# retrieving LNET
$lnet=get-mocvirtualnetwork -group Default_Group -name $lnetName
# getting gateway address from LNET
$gateway=$lnet.properties.subnets[0].properties.routeTable.properties.routes[0].properties.nextHopIpAddress
if (-not $gateway) {
    Write-Error "Check Gateway address in the AKS logical network $lnetName"
    exit
}
# getting cloudfqdn from archciconfig
$arcHCIConfig=get-archciconfig
$cloudFqdn="http://"+$arcHCIConfig.Item('cloudFQDN')+":55000"
$configContent = @"
checks:
- metadata:
    creationTimestamp: null
    name: cloud-agent-connectivity-test
  parameters:
    hostnames: <CLOUD_FQDN>
    skipeof: "true"
  type: HTTPConnectivity
- metadata:
    annotations:
      skip-error-on-failure: "true"
    creationTimestamp: null
    name: gateway-icmp-ping-test
  parameters:
    ips: <GATEWAY>
    packetLossThreshold: "20"
  type: ICMPPing
- metadata:
    creationTimestamp: null
    name: http-connectivity-required-url-test
  parameters:
    hostnames: <URL_LIST>
  type: HTTPConnectivity
exports:
- metadata:
    creationTimestamp: null
  parameters:
    filelocation: /home/clouduser/results.yaml
  type: FileSystem
metadata:
  creationTimestamp: null
"@
# update config file with the values of cloud fqdn, gateway and required URL list
$configContent = $configContent.replace("<CLOUD_FQDN>", $cloudFqdn)
$configContent = $configContent.replace("<GATEWAY>", $gateway)
$configContent = $configContent.replace("<URL_LIST>", $urlList)
$filePath = "config.yaml"
# Write to config.yaml
Set-Content -Path $filePath -Value $configContent
$dest = 'clouduser@' + $vmIP + ":config.yaml"
# Copy the config file to target cluster VM
Write-Host "Copying test config file to target cluster VM...."
$command = "scp -i $sshPath -o StrictHostKeyChecking=no -o BatchMode=yes config.yaml $dest"
try {
    $output=invoke-expression $command
    if ($LASTEXITCODE -ne 0) {
        Write-Error "Couldn't ssh to AKS cluster control plane VM $vmIP. Please check the ssh key"
        exit
    }
} catch {
    Write-Host "Couldn't ssh to AKS cluster control plane VM $vmIP. Please check the ssh key"
    Write-Host "Exception message: $_"
    exit
}
Write-Output "Copied config.yaml successfully."
$runScriptContent = @"
sudo su - root -c "/usr/bin/diagnostics-checker -c /home/clouduser/config.yaml"
"@
$filePath = "run_diag.sh"
Set-Content -Path $filePath -Value $runScriptContent
$dest = 'clouduser@' + $vmIP + ":run_diag.sh"
scp -i $sshPath -o StrictHostKeyChecking=no -o BatchMode=yes run_diag.sh $dest
$dest = 'clouduser@' + $vmIP
ssh -i $sshPath $dest -o StrictHostKeyChecking=no -o BatchMode=yes 'chmod +x run_diag.sh'
$sedCommand="sed -i -e 's/\r$//' run_diag.sh"
ssh -i $sshPath -o StrictHostKeyChecking=no -o BatchMode=yes $dest $sedCommand
# remove any stale results file from a previous run
if (Test-Path -Path "results.yaml") {
    Remove-Item results.yaml
}
ssh -i $sshPath -o StrictHostKeyChecking=no -o BatchMode=yes $dest './run_diag.sh'
ssh -i $sshPath -o StrictHostKeyChecking=no -o BatchMode=yes $dest "sudo su - root -c 'chmod a+r /home/clouduser/results.yaml'"
$src= 'clouduser@' + $vmIP + ":results.yaml"
scp -i $sshPath -o StrictHostKeyChecking=no -o BatchMode=yes $src results.yaml
if (-Not (Test-Path -Path "results.yaml")) {
    write-host "Test failed to perform"
    exit
}
Install-Module powershell-yaml
$resultContent = Get-Content -path results.yaml | ConvertFrom-Yaml
$testResults = @()
$cloudAgentRecommendation = @"
Make sure that the logical network IP addresses can connect to all the management IP pool addresses on the required ports. Check AKS network port and cross vlan requirements for detailed list of ports that need to be opened.
"@
$gatewayRecommendation = @"
- Ensure gateway is operational
- Verify routing configurations
- Adjust firewall rules to allow ICMP traffic
"@
$urlRecommendation = @"
Ensure that the logical network IP addresses have outbound internet access. If there's a firewall, ensure that AKS required URLs are accessible from Arc VM logical network.
"@
foreach ($check in $resultContent.spec.checks) {
    if ($check.result.outcome -like "Success") {
        $recommendation=""
    } elseif ($check.metadata.name -like "cloud-agent-connectivity-test") {
        $recommendation=$cloudAgentRecommendation
    } elseif ($check.metadata.name -like "gateway-icmp-ping-test") {
        $recommendation=$gatewayRecommendation
    } elseif ($check.metadata.name -like "http-connectivity-required-url-test") {
        $recommendation=$urlRecommendation
    } else {
        # unknown test name: don't reuse the recommendation from the previous iteration
        $recommendation=""
    }
    $testResults += [PSCustomObject]@{
        TestName=$check.metadata.name
        Outcome= $check.result.outcome
        Recommendation = $recommendation
    }
}
$testResults | Format-Table -Wrap -AutoSize
Sample output:
TestName Outcome Recommendation
-------- ------- --------------
cloud-agent-connectivity-test Success
gateway-icmp-ping-test Success
http-connectivity-required-url-test Failure Ensure that the logical network IP addresses have outbound internet access. If there's a firewall, ensure that AKS required URLs are accessible from Arc VM logical network.
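When several tests run, it can help to surface only the failures. The following is a minimal sketch, assuming result objects shaped like the ones the script builds (TestName, Outcome, Recommendation). The $testResults array here is hypothetical sample data that mirrors the sample output, and $failedTests is a name introduced only for this illustration.

```powershell
# Hypothetical sample data mirroring the objects the script collects in $testResults.
$testResults = @(
    [PSCustomObject]@{ TestName = 'cloud-agent-connectivity-test'; Outcome = 'Success'; Recommendation = '' }
    [PSCustomObject]@{ TestName = 'gateway-icmp-ping-test'; Outcome = 'Success'; Recommendation = '' }
    [PSCustomObject]@{
        TestName       = 'http-connectivity-required-url-test'
        Outcome        = 'Failure'
        Recommendation = 'Ensure that the logical network IP addresses have outbound internet access.'
    }
)

# Keep only the tests whose outcome is not Success.
$failedTests = @($testResults | Where-Object { $_.Outcome -ne 'Success' })

if ($failedTests.Count -eq 0) {
    Write-Output 'All diagnostic tests passed.'
} else {
    # Format-List avoids truncating long recommendation text.
    $failedTests | Format-List TestName, Outcome, Recommendation
}
```

The same filter could be appended to the end of run_diagnostic.ps1, where $testResults is already populated, if a failures-only view is preferred.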
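Analyze diagnostic checker output
The following table summarizes each test performed by the script, including possible causes of failure and recommendations for mitigation:
Test name | Description | Causes of failure | Mitigation recommendations |
---|---|---|---|
cloud-agent-connectivity-test | Checks whether the DNS server can resolve the MOC cloud agent FQDN and whether the cloud agent is reachable from the control plane node VM. The cloud agent is created using one of the IP addresses from the management IP pool, on port 55000. The control plane node VM is given an IP address from the Arc VM logical network. | The logical network IP addresses can't connect to the management IP pool addresses for one of these reasons: incorrect DNS server resolution; firewall rules; the logical network is in a different VLAN than the management IP pool and there is no cross-VLAN connectivity. | Make sure that the logical network IP addresses can connect to all the management IP pool addresses on the required ports. Check the AKS network port and cross-VLAN requirements for a detailed list of ports that need to be opened. |
gateway-icmp-ping-test | Checks whether the gateway specified in the logical network attached to the AKS cluster is reachable from the AKS cluster control plane node VM. | The gateway is down or unreachable; there are network routing issues between the AKS cluster control plane node VM and the gateway; a firewall is blocking ICMP traffic. | Ensure that the gateway is operational; verify routing configurations; adjust firewall rules to allow ICMP traffic. |
http-connectivity-required-url-test | Checks whether the required URLs are reachable from the AKS cluster control plane node VM. | The control plane node VM has no outbound internet access; the firewall doesn't allow the required URLs through. | Ensure that the logical network IP addresses have outbound internet access. If there's a firewall, ensure that the AKS required URLs are accessible from the Arc VM logical network. |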