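Query logs from Container insights
Container insights collects performance metrics, inventory data, and health state information from container hosts and containers. The data is collected every three minutes and forwarded to a Log Analytics workspace in Azure Monitor, where it is available for log queries through Log Analytics.
You can apply this data to scenarios such as migration planning, capacity analysis, discovery, and on-demand performance troubleshooting. Azure Monitor Logs helps you find trends, diagnose bottlenecks, forecast, and correlate data to determine whether the current cluster configuration is performing optimally.
For information on using these queries, see Using queries in Azure Monitor Log Analytics. For a complete tutorial on using Log Analytics to run queries and work with their results, see the Log Analytics tutorial.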
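Open Log Analytics
There are several ways to start Log Analytics, and each one starts with a different scope. To access all data in the workspace, select Logs on the Monitor menu. To limit the data to a single Kubernetes cluster, select Logs from that cluster's menu.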
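Existing log queries
You don't need to know how to write a log query to use Log Analytics. You can choose from several prebuilt queries and either run them unmodified or use them as the starting point for a custom query. Select Queries at the top of the Log Analytics screen and view the queries with a Resource type of Kubernetes Services.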
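Container tables
For a list of the tables that Container insights uses and their detailed descriptions, see the Azure Monitor table reference. All of these tables are available for log queries.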
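Example log queries
It's often useful to start from one or two example queries and then modify them to fit your requirements. You can experiment with the following sample queries to help you build more advanced ones.
List all of a container's lifecycle information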
ContainerInventory
| project Computer, Name, Image, ImageTag, ContainerState, CreatedTime, StartedTime, FinishedTime
| render table
Kubernetes events
KubeEvents
| where not(isempty(Namespace))
| sort by TimeGenerated desc
| render table
Container CPU
Perf
| where ObjectName == "K8SContainer" and CounterName == "cpuUsageNanoCores"
| summarize AvgCPUUsageNanoCores = avg(CounterValue) by bin(TimeGenerated, 30m), InstanceName
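The counter name cpuUsageNanoCores indicates that the values are in nanocores, which can be hard to read at a glance. The following is a minimal sketch, assuming you want millicores instead (1 millicore = 1,000,000 nanocores). The datatable rows and the container names are hypothetical stand-ins for the Perf table so that the query can run without a workspace.

```kusto
// Hypothetical sample rows standing in for the filtered Perf table.
datatable(TimeGenerated: datetime, InstanceName: string, CounterValue: real)
[
    datetime(2024-01-01 00:05:00), "container-a", 250000000.0,
    datetime(2024-01-01 00:20:00), "container-a", 750000000.0,
    datetime(2024-01-01 00:10:00), "container-b", 100000000.0
]
// Same aggregation as the query above.
| summarize AvgCPUUsageNanoCores = avg(CounterValue) by bin(TimeGenerated, 30m), InstanceName
// Convert nanocores to millicores: divide by 1,000,000.
| extend AvgCPUUsageMillicores = AvgCPUUsageNanoCores / 1000000.0
```

With these sample rows, container-a averages 500 millicores and container-b averages 100 millicores in the 00:00 bin.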
Container memory
Perf
| where ObjectName == "K8SContainer" and CounterName == "memoryRssBytes"
| summarize AvgUsedRssMemoryBytes = avg(CounterValue) by bin(TimeGenerated, 30m), InstanceName
Requests per minute with custom metrics
InsightsMetrics
| where Name == "requests_count"
| summarize Val=any(Val) by TimeGenerated=bin(TimeGenerated, 1m)
| sort by TimeGenerated asc
| project RequestsPerMinute = Val - prev(Val), TimeGenerated
| render barchart
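The query above treats requests_count as a cumulative counter and derives a per-minute value by subtracting the previous row with prev(). The sketch below isolates that step on hypothetical sample values. It also filters out the first row, whose prev() value is null because it has no predecessor; that filter is an addition for illustration and isn't part of the original query.

```kusto
// Hypothetical one-minute samples of a cumulative request counter.
datatable(TimeGenerated: datetime, Val: real)
[
    datetime(2024-01-01 00:00:00), 100.0,
    datetime(2024-01-01 00:01:00), 130.0,
    datetime(2024-01-01 00:02:00), 190.0
]
// sort serializes the rows, which prev() requires.
| sort by TimeGenerated asc
// prev(Val) is null on the first row.
| extend PrevVal = prev(Val)
| extend RequestsPerMinute = Val - PrevVal
// Drop the first row, which has no previous value to subtract.
| where isnotnull(RequestsPerMinute)
| project TimeGenerated, RequestsPerMinute
```

With these values, the result is 30 requests for 00:01 and 60 requests for 00:02.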
Pods by name and namespace
let startTimestamp = ago(1h);
KubePodInventory
| where TimeGenerated > startTimestamp
| project ContainerID, PodName=Name, Namespace
| where PodName contains "name" and Namespace startswith "namespace"
| distinct ContainerID, PodName
| join
(
ContainerLog
| where TimeGenerated > startTimestamp
)
on ContainerID
// at this point before the next pipe, columns from both tables are available to be "projected". Due to both
// tables having a "Name" column, we assign an alias as PodName to one column which we actually want
| project TimeGenerated, PodName, LogEntry, LogEntrySource
| summarize by TimeGenerated, LogEntry
| order by TimeGenerated desc
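Pod scale-out (HPA)
This query returns the number of scaled-out replicas in each deployment. It calculates the scale-out percentage from the maximum number of replicas configured in the Horizontal Pod Autoscaler (HPA).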
let _minthreshold = 70; // minimum threshold goes here if you want to setup as an alert
let _maxthreshold = 90; // maximum threshold goes here if you want to setup as an alert
let startDateTime = ago(60m);
KubePodInventory
| where TimeGenerated >= startDateTime
| where Namespace !in('default', 'kube-system') // List of non system namespace filter goes here.
| extend labels = todynamic(PodLabel)
| extend deployment_hpa = reverse(substring(reverse(ControllerName), indexof(reverse(ControllerName), "-") + 1))
| distinct tostring(deployment_hpa)
| join kind=inner (InsightsMetrics
| where TimeGenerated > startDateTime
| where Name == 'kube_hpa_status_current_replicas'
| extend pTags = todynamic(Tags) //parse the tags for values
| extend ns = todynamic(pTags.k8sNamespace) //parse namespace value from tags
| extend deployment_hpa = todynamic(pTags.targetName) //parse HPA target name from tags
| extend max_reps = todynamic(pTags.spec_max_replicas) // Parse maximum replica settings from HPA deployment
| extend desired_reps = todynamic(pTags.status_desired_replicas) // Parse desired replica settings from HPA deployment
| summarize arg_max(TimeGenerated, *) by tostring(ns), tostring(deployment_hpa), Cluster=toupper(tostring(split(_ResourceId, '/')[8])), toint(desired_reps), toint(max_reps), scale_out_percentage=(desired_reps * 100 / max_reps)
//| where scale_out_percentage > _minthreshold and scale_out_percentage <= _maxthreshold
)
on deployment_hpa
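The deployment_hpa expression in the query above removes everything after the last hyphen in ControllerName by reversing the string, finding the first hyphen, and reversing the remainder again. The sketch below shows that step in isolation. The controller names are hypothetical and assume the form <deployment>-<hash>.

```kusto
// Hypothetical controller names of the form <deployment>-<hash>.
datatable(ControllerName: string)
[
    "my-app-7d9f8c6b5",
    "frontend-web-5c4d7b9f8"
]
| extend reversed = reverse(ControllerName)
// The first "-" in the reversed string is the last "-" in the original.
| extend lastDash = indexof(reversed, "-")
// Keep what follows that hyphen, then restore the original order.
| extend deployment_hpa = reverse(substring(reversed, lastDash + 1))
| project ControllerName, deployment_hpa
```

These rows yield my-app and frontend-web. If a name contains no hyphen, indexof returns -1, so the substring starts at position 0 and the whole name is kept.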
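Node pool scale-out
This query returns the number of active nodes in each node pool. It compares the number of active nodes with the maximum node count configured in the autoscaler settings to determine the scale-out percentage. See the commented lines in the query to use it for a "number of results" alert rule.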
let nodepoolMaxnodeCount = 10; // the maximum number of nodes in your auto scale setting goes here.
let _minthreshold = 20;
let _maxthreshold = 90;
let startDateTime = 60m;
KubeNodeInventory
| where TimeGenerated >= ago(startDateTime)
| extend nodepoolType = todynamic(Labels) //Parse the labels to get the list of node pool types
| extend nodepoolName = todynamic(nodepoolType[0].agentpool) // parse the label to get the nodepool name or set the specific nodepool name (like nodepoolName = 'agentpool')
| summarize nodeCount = count(Computer) by ClusterName, tostring(nodepoolName), TimeGenerated
//(Uncomment the below two lines to set this as a log search alert)
//| extend scaledpercent = iff(((nodeCount * 100 / nodepoolMaxnodeCount) >= _minthreshold and (nodeCount * 100 / nodepoolMaxnodeCount) < _maxthreshold), "warn", "normal")
//| where scaledpercent == 'warn'
| summarize arg_max(TimeGenerated, *) by nodeCount, ClusterName, tostring(nodepoolName)
| project ClusterName,
TotalNodeCount= strcat("Total Node Count: ", nodeCount),
ScaledOutPercentage = (nodeCount * 100 / nodepoolMaxnodeCount),
TimeGenerated,
nodepoolName
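The percentage and the commented-out alert condition in the query above come down to simple arithmetic. The following sketch applies the same expressions to hypothetical node counts, so the pool names and counts are assumptions for illustration only.

```kusto
let nodepoolMaxnodeCount = 10; // maximum node count from the autoscale setting
let _minthreshold = 20;
let _maxthreshold = 90;
// Hypothetical per-pool node counts.
datatable(nodepoolName: string, nodeCount: long)
[
    "pool-a", 1,
    "pool-b", 7,
    "pool-c", 10
]
// Integer arithmetic: the result is truncated to a whole number.
| extend ScaledOutPercentage = nodeCount * 100 / nodepoolMaxnodeCount
// Same condition as the commented-out alert lines: warn only inside [min, max).
| extend scaledpercent = iff(ScaledOutPercentage >= _minthreshold and ScaledOutPercentage < _maxthreshold, "warn", "normal")
```

With these values, pool-a is at 10 percent (normal), pool-b at 70 percent (warn), and pool-c at 100 percent (normal). A pool at or above _maxthreshold is not flagged by this condition, so adjust the comparison if you want fully scaled-out pools to raise the alert.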
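System containers (replica set) availability
This query returns the system containers that run in replica sets and reports the percentage that are unavailable. See the commented lines in the query to use it for a "number of results" alert rule.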
let startDateTime = 5m; // the minimum time interval goes here
let _minalertThreshold = 50; //Threshold for minimum and maximum unavailable or not running containers
let _maxalertThreshold = 70;
KubePodInventory
| where TimeGenerated >= ago(startDateTime)
| distinct ClusterName, TimeGenerated
| summarize Clustersnapshot = count() by ClusterName
| join kind=inner (
KubePodInventory
| where TimeGenerated >= ago(startDateTime)
| where Namespace in('default', 'kube-system') and ControllerKind == 'ReplicaSet' // the system namespace filter goes here
| distinct ClusterName, Computer, PodUid, TimeGenerated, PodStatus, ServiceName, PodLabel, Namespace, ContainerStatus
| summarize arg_max(TimeGenerated, *), TotalPODCount = count(), podCount = sumif(1, PodStatus == 'Running' or PodStatus != 'Running'), containerNotrunning = sumif(1, ContainerStatus != 'running')
by ClusterName, TimeGenerated, ServiceName, PodLabel, Namespace
)
on ClusterName
| project ClusterName, ServiceName, podCount, containerNotrunning, containerNotrunningPercent = (containerNotrunning * 100 / podCount), TimeGenerated, PodStatus, PodLabel, Namespace, Environment = tostring(split(ClusterName, '-')[3]), Location = tostring(split(ClusterName, '-')[4]), ContainerStatus
//Uncomment the below line to set for automated alert
//| where PodStatus == "Running" and containerNotrunningPercent > _minalertThreshold and containerNotrunningPercent < _maxalertThreshold
| summarize arg_max(TimeGenerated, *), c_entry=count() by PodLabel, ServiceName, ClusterName
//Below lines are to parse the labels to identify the impacted service/component name
| extend parseLabel = replace(@'k8s-app', @'k8sapp', PodLabel)
| extend parseLabel = replace(@'app.kubernetes.io/component', @'appkubernetesiocomponent', parseLabel)
| extend parseLabel = replace(@'app.kubernetes.io/instance', @'appkubernetesioinstance', parseLabel)
| extend tags = todynamic(parseLabel)
| extend tag01 = todynamic(tags[0].app)
| extend tag02 = todynamic(tags[0].k8sapp)
| extend tag03 = todynamic(tags[0].appkubernetesiocomponent)
| extend tag04 = todynamic(tags[0].aadpodidbinding)
| extend tag05 = todynamic(tags[0].appkubernetesioinstance)
| extend tag06 = todynamic(tags[0].component)
| project ClusterName, TimeGenerated,
ServiceName = strcat( ServiceName, tag01, tag02, tag03, tag04, tag05, tag06),
ContainerUnavailable = strcat("Unavailable Percentage: ", containerNotrunningPercent),
PodStatus = strcat("PodStatus: ", PodStatus),
ContainerStatus = strcat("Container Status: ", ContainerStatus)
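System containers (daemon set) availability
This query returns the system containers that run in daemon sets and reports the percentage that are unavailable. See the commented lines in the query to use it for a "number of results" alert rule.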
let startDateTime = 5m; // the minimum time interval goes here
let _minalertThreshold = 50; //Threshold for minimum and maximum unavailable or not running containers
let _maxalertThreshold = 70;
KubePodInventory
| where TimeGenerated >= ago(startDateTime)
| distinct ClusterName, TimeGenerated
| summarize Clustersnapshot = count() by ClusterName
| join kind=inner (
KubePodInventory
| where TimeGenerated >= ago(startDateTime)
| where Namespace in('default', 'kube-system') and ControllerKind == 'DaemonSet' // the system namespace filter goes here
| distinct ClusterName, Computer, PodUid, TimeGenerated, PodStatus, ServiceName, PodLabel, Namespace, ContainerStatus
| summarize arg_max(TimeGenerated, *), TotalPODCount = count(), podCount = sumif(1, PodStatus == 'Running' or PodStatus != 'Running'), containerNotrunning = sumif(1, ContainerStatus != 'running')
by ClusterName, TimeGenerated, ServiceName, PodLabel, Namespace
)
on ClusterName
| project ClusterName, ServiceName, podCount, containerNotrunning, containerNotrunningPercent = (containerNotrunning * 100 / podCount), TimeGenerated, PodStatus, PodLabel, Namespace, Environment = tostring(split(ClusterName, '-')[3]), Location = tostring(split(ClusterName, '-')[4]), ContainerStatus
//Uncomment the below line to set for automated alert
//| where PodStatus == "Running" and containerNotrunningPercent > _minalertThreshold and containerNotrunningPercent < _maxalertThreshold
| summarize arg_max(TimeGenerated, *), c_entry=count() by PodLabel, ServiceName, ClusterName
//Below lines are to parse the labels to identify the impacted service/component name
| extend parseLabel = replace(@'k8s-app', @'k8sapp', PodLabel)
| extend parseLabel = replace(@'app.kubernetes.io/component', @'appkubernetesiocomponent', parseLabel)
| extend parseLabel = replace(@'app.kubernetes.io/instance', @'appkubernetesioinstance', parseLabel)
| extend tags = todynamic(parseLabel)
| extend tag01 = todynamic(tags[0].app)
| extend tag02 = todynamic(tags[0].k8sapp)
| extend tag03 = todynamic(tags[0].appkubernetesiocomponent)
| extend tag04 = todynamic(tags[0].aadpodidbinding)
| extend tag05 = todynamic(tags[0].appkubernetesioinstance)
| extend tag06 = todynamic(tags[0].component)
| project ClusterName, TimeGenerated,
ServiceName = strcat( ServiceName, tag01, tag02, tag03, tag04, tag05, tag06),
ContainerUnavailable = strcat("Unavailable Percentage: ", containerNotrunningPercent),
PodStatus = strcat("PodStatus: ", PodStatus),
ContainerStatus = strcat("Container Status: ", ContainerStatus)
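Both availability queries rename label keys such as k8s-app before parsing PodLabel, because dot notation can't address keys that contain hyphens, dots, or slashes. As an alternative sketch, bracket notation on a dynamic value reads such keys directly, without the replace steps. The label values below are hypothetical and assume the same array-of-one-object shape that the queries rely on with tags[0].

```kusto
// Hypothetical PodLabel values in the array-of-one-object shape used above.
datatable(PodLabel: string)
[
    '[{"k8s-app":"kube-dns","pod-template-hash":"5d6c7f8b9"}]',
    '[{"app.kubernetes.io/component":"controller"}]'
]
| extend tags = todynamic(PodLabel)
// Bracket notation accepts keys that dot notation can't express.
| extend k8sapp = tostring(tags[0]["k8s-app"])
| extend component = tostring(tags[0]["app.kubernetes.io/component"])
| project k8sapp, component
```

The first row returns kube-dns in k8sapp and the second returns controller in component.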
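Resource logs
Resource logs for AKS are stored in the AzureDiagnostics table. Use the Category column to distinguish between the different logs. For a description of each category, see AKS reference resource logs. The following examples require a diagnostic setting that sends the resource logs of the AKS cluster to a Log Analytics workspace. For more information, see Configure monitoring.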
API server logs
AzureDiagnostics
| where Category == "kube-apiserver"
Count logs for each category
AzureDiagnostics
| where ResourceType == "MANAGEDCLUSTERS"
| summarize count() by Category
Query Prometheus metrics data
The following example is a Prometheus metrics query that shows disk reads per second, per disk, per node.
InsightsMetrics
| where Namespace == 'container.azm.ms/diskio'
| where TimeGenerated > ago(1h)
| where Name == 'reads'
| extend Tags = todynamic(Tags)
| extend HostName = tostring(Tags.hostName), Device = Tags.name
| extend NodeDisk = strcat(Device, "/", HostName)
| order by NodeDisk asc, TimeGenerated asc
| serialize
| extend PrevVal = iif(prev(NodeDisk) != NodeDisk, 0.0, prev(Val)), PrevTimeGenerated = iif(prev(NodeDisk) != NodeDisk, datetime(null), prev(TimeGenerated))
| where isnotnull(PrevTimeGenerated) and PrevTimeGenerated != TimeGenerated
| extend Rate = iif(PrevVal > Val, Val / (datetime_diff('Second', TimeGenerated, PrevTimeGenerated) * 1), iif(PrevVal == Val, 0.0, (Val - PrevVal) / (datetime_diff('Second', TimeGenerated, PrevTimeGenerated) * 1)))
| where isnotnull(Rate)
| project TimeGenerated, NodeDisk, Rate
| render timechart
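The Rate expression in the query above handles two cases: a normally increasing counter, and a counter that has reset to a lower value, in which case the new value itself is treated as the increase. The following is a simplified sketch of that logic on hypothetical samples for a single disk. Because there is only one NodeDisk value, it omits the per-disk boundary checks and the equal-value branch that the full query includes.

```kusto
// Hypothetical cumulative read counts for one disk; the counter resets at 00:02.
datatable(TimeGenerated: datetime, NodeDisk: string, Val: real)
[
    datetime(2024-01-01 00:00:00), "sda/node-1", 1000.0,
    datetime(2024-01-01 00:01:00), "sda/node-1", 1600.0,
    datetime(2024-01-01 00:02:00), "sda/node-1", 300.0,
    datetime(2024-01-01 00:03:00), "sda/node-1", 900.0
]
// Ordering serializes the rows so prev() can be used.
| order by NodeDisk asc, TimeGenerated asc
| extend PrevVal = prev(Val), PrevTimeGenerated = prev(TimeGenerated)
// The first row has no predecessor, so no rate can be computed for it.
| where isnotnull(PrevTimeGenerated)
| extend Seconds = datetime_diff('second', TimeGenerated, PrevTimeGenerated)
// If the counter dropped, assume a reset and count from zero.
| extend Rate = iif(PrevVal > Val, Val / Seconds, (Val - PrevVal) / Seconds)
| project TimeGenerated, NodeDisk, Rate
```

With these samples, the rates are 10 reads per second at 00:01, 5 at 00:02 (the reset), and 10 at 00:03.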
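To view Prometheus metrics scraped by Azure Monitor, filter the Namespace column on "prometheus". The following sample query lists the Prometheus metric names that were collected and the number of records for each. It parses Tags into a tags column but doesn't filter on it, so the results aren't limited to a single Kubernetes namespace such as default.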
InsightsMetrics
| where Namespace == "prometheus"
| extend tags=parse_json(Tags)
| summarize count() by Name
Prometheus data can also be queried directly by name.
InsightsMetrics
| where Namespace == "prometheus"
| where Name contains "some_prometheus_metric"
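Query configuration or scraping errors
To investigate configuration or scraping errors, the following example query returns the events in the KubeMonAgentEvents table whose Level is anything other than Info.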
KubeMonAgentEvents | where Level != "Info"
The output shows results similar to the following example: [screenshot of the query results not included]
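Next steps
Container insights doesn't include a predefined set of alerts. To learn how to create recommended alerts for high CPU and memory utilization in support of your DevOps or operational processes and procedures, see Create performance alerts with Container insights.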