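This guide helps you identify and resolve the uncommon problems that you might encounter with the API server in large Microsoft Azure Kubernetes Service (AKS) deployments.
Microsoft has tested the reliability and performance of the API server at a scale of 5,000 nodes and 200,000 pods. The cluster that contains the API server can automatically scale out and deliver the Kubernetes service level objectives (SLOs). If you experience high latency or time-outs, the likely cause is either a resource leak in etcd (the distributed key-value store that backs the cluster) or an offending client that makes excessive API calls.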

Top 10 user agents by request count (resource-specific AKSAudit table):

AKSAudit
| where TimeGenerated between(now(-1h)..now()) // When you experienced the problem
| summarize count() by UserAgent
| top 10 by count_
| project UserAgent, count_

Top 10 users by request count (AzureDiagnostics table):

AzureDiagnostics
| where TimeGenerated between(now(-1h)..now()) // When you experienced the problem
| where Category == "kube-audit"
| extend event = parse_json(log_s)
| extend User = tostring(event.user.username)
| summarize count() by User
| top 10 by count_
| project User, count_
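
P99 latency per HTTP method and resource for a single user agent (resource-specific AKSAudit table):

AKSAudit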
| where TimeGenerated between(now(-1h)..now()) // When you experienced the problem
| extend HttpMethod = Verb
| extend Resource = tostring(ObjectRef.resource)
| where UserAgent == "DUMMYUSERAGENT" // Filter by name of the useragent you are interested in
| where Resource != ""
| extend start_time = RequestReceivedTime
| extend end_time = StageReceivedTime
| extend latency = datetime_diff('millisecond', end_time, start_time)
| summarize p99latency=percentile(latency, 99) by HttpMethod, Resource
| render table

P99 latency per HTTP method and resource for a single user (AzureDiagnostics table):

AzureDiagnostics
| where TimeGenerated between(now(-1h)..now()) // When you experienced the problem
| where Category == "kube-audit"
| extend event = parse_json(log_s)
| extend HttpMethod = tostring(event.verb)
| extend Resource = tostring(event.objectRef.resource)
| extend User = tostring(event.user.username)
| where User == "DUMMYUSERAGENT" // Filter by the name of the user you are interested in
| where Resource != ""
| extend start_time = todatetime(event.requestReceivedTimestamp)
| extend end_time = todatetime(event.stageTimestamp)
| extend latency = datetime_diff('millisecond', end_time, start_time)
| summarize p99latency=percentile(latency, 99) by HttpMethod, Resource
| render table
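
The results of this query can help you identify the kinds of API calls that fail the upstream Kubernetes SLOs. In most cases, an offending client is making too many LIST calls, either on a very large set of objects or on objects that are themselves too large. Unfortunately, no hard scalability limits exist to guide users about API server scalability. The scalability limits of the API server and etcd depend on various factors that are explained in Kubernetes Scalability thresholds.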
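
To make the latency calculation in the queries above concrete, the following is a minimal Python sketch that derives the same figure from raw audit events. It assumes each input line is one JSON audit event carrying the `verb`, `objectRef.resource`, `requestReceivedTimestamp`, and `stageTimestamp` fields that the AzureDiagnostics query reads. The function name `p99_latency_ms` and the nearest-rank percentile method are choices made for this illustration; the `percentile()` function in the query service uses its own estimation, so results on real data may differ slightly.

```python
import json
import math
from collections import defaultdict
from datetime import datetime, timedelta


def parse_ts(value):
    # Audit timestamps are RFC 3339 strings such as "2024-01-01T00:00:00.100000Z".
    # Swap the trailing "Z" for an explicit offset so fromisoformat accepts it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def p99_latency_ms(audit_lines):
    """Return {(verb, resource): p99 latency in whole milliseconds}.

    Hypothetical helper for illustration; uses the nearest-rank method.
    """
    samples = defaultdict(list)
    for line in audit_lines:
        event = json.loads(line)
        resource = (event.get("objectRef") or {}).get("resource", "")
        if not resource:
            continue  # same intent as `where Resource != ""` in the query
        start = parse_ts(event["requestReceivedTimestamp"])
        end = parse_ts(event["stageTimestamp"])
        latency = (end - start) // timedelta(milliseconds=1)
        samples[(event["verb"], resource)].append(latency)

    result = {}
    for key, values in samples.items():
        values.sort()
        rank = math.ceil(0.99 * len(values))  # 1-based nearest rank
        result[key] = values[rank - 1]
    return result
```

A verb and resource pair whose P99 value is far above the others, such as `list` on `pods`, is the first place to look for an offending client.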
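
Cause 1: A network rule blocks the traffic from agent nodes to the API server
A network rule can block traffic between the agent nodes and the API server.
To verify whether a misconfigured network policy is blocking communication between the API server and the agent nodes, run the following kubectl-aks commands: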
kubectl aks config import \
    --subscription <mySubscriptionID> \
    --resource-group <myResourceGroup> \
    --cluster-name <myAKSCluster>
kubectl aks check-apiserver-connectivity --node <myNode>
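
As a rough complement to the commands above, the following Python sketch shows the kind of check involved: it tries to open a TCP connection to a host and port and reports whether that succeeded. This is not how kubectl-aks is implemented. The function name `can_reach` is hypothetical, port 443 is an assumption about where your API server listens, and the result is meaningful only if the script runs on the agent node itself, because a network rule that blocks the node may not affect your workstation. A successful TCP connection also says nothing about TLS or authentication.

```python
import socket


def can_reach(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port opens within the timeout.

    Hypothetical helper for illustration; checks layer-4 reachability only.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, time-outs, and DNS resolution failures.
        return False
```

For example, calling `can_reach` with your cluster's API server FQDN from a shell on the node returns `False` when a rule drops or rejects the traffic.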