你当前正在访问 Microsoft Azure Global Edition 技术文档网站。 如果需要访问由世纪互联运营的 Microsoft Azure 中国技术文档网站,请访问 https://docs.azure.cn

从容器见解创建日志搜索警报

容器见解可监视部署到托管或自托管 Kubernetes 群集的容器工作负荷的性能。 本文介绍如何在以下场景中使用 Azure Kubernetes 服务 (AKS) 群集创建基于日志的警报,以便针对重要内容发出警报:

  • 当群集节点上的 CPU 或内存利用率超过阈值时
  • 当控制器中任何容器上的 CPU 或内存利用率超过阈值时(与相应资源中设置的限制相比)
  • NotReady 状态节点计数
  • FailedPendingUnknownRunningSucceeded Pod 阶段计数
  • 当群集节点上的可用磁盘空间超过阈值时

若要针对群集节点上的 CPU 或内存利用率过高或可用磁盘空间不足发出警报,请使用提供的查询来创建指标警报或指标度量警报。 指标警报的延迟比日志搜索警报低,但日志搜索警报提供了高级查询和更复杂的功能。 日志搜索警报查询通过使用 now 运算符并将时间往过去推一个小时,将某个日期/时间与当前时间进行比较。 (容器见解以协调世界时 [UTC] 格式存储所有日期。)

重要

本文中的查询取决于容器见解收集并存储在 Log Analytics 工作区中的数据。 如果你修改了默认数据收集设置,查询可能不会返回预期结果。 最值得注意的是,如果你在为群集启用 Prometheus 指标后禁用了性能数据收集,则任何使用 Perf 表的查询都不会返回结果。

请参阅使用数据收集规则在容器见解中配置数据收集,了解预设配置(包括禁用性能数据收集)。 如需更多数据收集选项,请参阅使用 ConfigMap 在容器见解中配置数据收集

如果你不熟悉 Azure Monitor 警报,请在开始之前参阅 Microsoft Azure 中的警报概述。 若要详细了解使用日志查询的警报,请参阅 Azure Monitor 中的日志搜索警报。 有关指标警报的详细信息,请参阅 Azure Monitor 中的指标警报

日志查询度量

日志搜索警报可以衡量两种不同的内容,可用于监视不同方案中的虚拟机:

  • 结果计数:计算查询返回的行数,可用于处理 Windows 事件日志、Syslog、应用程序异常等事件。
  • 值的计算:基于数字列进行计算,可用于包含任意数量的资源。 例如 CPU 百分比。

以资源和维度为目标

可以使用一个规则通过维度监视多个实例的值。 例如,如果你想要监视运行网站或应用的多个实例上的 CPU 使用率,并针对 CPU 使用率超过 80% 的情况创建警报,则可以使用维度。

若要为订阅或资源组大规模创建以资源为中心的警报,可以按维度进行拆分。 如果要在多个 Azure 资源上监视相同的条件,按维度进行拆分会通过使用数字或字符串列对唯一组合进行分组,将警报拆分为独立的警报。 对 Azure 资源 ID 列进行拆分会使指定的资源进入警报目标。

在需要范围内的多个资源的条件时,你也可能会决定不进行拆分。 例如,你可能希望在资源组范围中至少有五台计算机的 CPU 使用率超过 80% 时创建警报。

屏幕截图:按维度拆分的新日志搜索警报规则。

你可能想要查看受影响计算机的警报列表。 你可以使用自定义工作簿,该工作簿使用自定义 Resource Graph 提供此视图。 请使用以下查询显示警报,并使用工作簿中的 Azure Resource Graph 数据源。

创建日志搜索警报规则

若要使用门户创建日志搜索警报规则,请参阅此日志搜索警报示例,其中提供了完整的演练。 可以遵循与此相同的过程,使用类似于本文的查询为 AKS 群集创建预警规则。

若要使用 Azure 资源管理器 (ARM) 模板创建查询警报规则,请参阅 Azure Monitor 中日志搜索警报规则的资源管理器模板示例。 可以使用这些相同的过程为本文中的日志查询创建 ARM 模板。

资源利用率

每分钟的平均 CPU 利用率,即成员节点的平均 CPU 利用率(指标度量):

let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'cpuCapacityNanoCores';
let usageCounterName = 'cpuUsageNanoCores';
KubeNodeInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
// cluster filter would go here if multiple clusters are reporting to the same Log Analytics workspace
| distinct ClusterName, Computer
| join hint.strategy=shuffle (
  Perf
  | where TimeGenerated < endDateTime
  | where TimeGenerated >= startDateTime
  | where ObjectName == 'K8SNode'
  | where CounterName == capacityCounterName
  | summarize LimitValue = max(CounterValue) by Computer, CounterName, bin(TimeGenerated, trendBinSize)
  | project Computer, CapacityStartTime = TimeGenerated, CapacityEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer
| join kind=inner hint.strategy=shuffle (
  Perf
  | where TimeGenerated < endDateTime + trendBinSize
  | where TimeGenerated >= startDateTime - trendBinSize
  | where ObjectName == 'K8SNode'
  | where CounterName == usageCounterName
  | project Computer, UsageValue = CounterValue, TimeGenerated
) on Computer
| where TimeGenerated >= CapacityStartTime and TimeGenerated < CapacityEndTime
| project ClusterName, Computer, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize), ClusterName

每分钟的平均内存利用率,即成员节点的平均内存利用率(指标度量):

let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'memoryCapacityBytes';
let usageCounterName = 'memoryRssBytes';
KubeNodeInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
// cluster filter would go here if multiple clusters are reporting to the same Log Analytics workspace
| distinct ClusterName, Computer
| join hint.strategy=shuffle (
  Perf
  | where TimeGenerated < endDateTime
  | where TimeGenerated >= startDateTime
  | where ObjectName == 'K8SNode'
  | where CounterName == capacityCounterName
  | summarize LimitValue = max(CounterValue) by Computer, CounterName, bin(TimeGenerated, trendBinSize)
  | project Computer, CapacityStartTime = TimeGenerated, CapacityEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer
| join kind=inner hint.strategy=shuffle (
  Perf
  | where TimeGenerated < endDateTime + trendBinSize
  | where TimeGenerated >= startDateTime - trendBinSize
  | where ObjectName == 'K8SNode'
  | where CounterName == usageCounterName
  | project Computer, UsageValue = CounterValue, TimeGenerated
) on Computer
| where TimeGenerated >= CapacityStartTime and TimeGenerated < CapacityEndTime
| project ClusterName, Computer, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize), ClusterName

重要

以下查询使用占位符值 <your-cluster-name> 和 <your-controller-name> 来分别表示群集和控制器。 设置警报时,请将这些占位符替换为环境特定的值。

每分钟控制器中所有容器的平均 CPU 利用率,即控制器中每个容器实例的平均 CPU 利用率(指标度量):

let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'cpuLimitNanoCores';
let usageCounterName = 'cpuUsageNanoCores';
let clusterName = '<your-cluster-name>';
let controllerName = '<your-controller-name>';
KubePodInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where ClusterName == clusterName
| where ControllerName == controllerName
| extend InstanceName = strcat(ClusterId, '/', ContainerName),
         ContainerName = strcat(controllerName, '/', tostring(split(ContainerName, '/')[1]))
| distinct Computer, InstanceName, ContainerName
| join hint.strategy=shuffle (
    Perf
    | where TimeGenerated < endDateTime
    | where TimeGenerated >= startDateTime
    | where ObjectName == 'K8SContainer'
    | where CounterName == capacityCounterName
    | summarize LimitValue = max(CounterValue) by Computer, InstanceName, bin(TimeGenerated, trendBinSize)
    | project Computer, InstanceName, LimitStartTime = TimeGenerated, LimitEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer, InstanceName
| join kind=inner hint.strategy=shuffle (
    Perf
    | where TimeGenerated < endDateTime + trendBinSize
    | where TimeGenerated >= startDateTime - trendBinSize
    | where ObjectName == 'K8SContainer'
    | where CounterName == usageCounterName
    | project Computer, InstanceName, UsageValue = CounterValue, TimeGenerated
) on Computer, InstanceName
| where TimeGenerated >= LimitStartTime and TimeGenerated < LimitEndTime
| project Computer, ContainerName, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize) , ContainerName

每分钟控制器中所有容器的平均内存利用率,即控制器中每个容器实例的平均内存利用率(指标度量):

let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'memoryLimitBytes';
let usageCounterName = 'memoryRssBytes';
let clusterName = '<your-cluster-name>';
let controllerName = '<your-controller-name>';
KubePodInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where ClusterName == clusterName
| where ControllerName == controllerName
| extend InstanceName = strcat(ClusterId, '/', ContainerName),
         ContainerName = strcat(controllerName, '/', tostring(split(ContainerName, '/')[1]))
| distinct Computer, InstanceName, ContainerName
| join hint.strategy=shuffle (
    Perf
    | where TimeGenerated < endDateTime
    | where TimeGenerated >= startDateTime
    | where ObjectName == 'K8SContainer'
    | where CounterName == capacityCounterName
    | summarize LimitValue = max(CounterValue) by Computer, InstanceName, bin(TimeGenerated, trendBinSize)
    | project Computer, InstanceName, LimitStartTime = TimeGenerated, LimitEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer, InstanceName
| join kind=inner hint.strategy=shuffle (
    Perf
    | where TimeGenerated < endDateTime + trendBinSize
    | where TimeGenerated >= startDateTime - trendBinSize
    | where ObjectName == 'K8SContainer'
    | where CounterName == usageCounterName
    | project Computer, InstanceName, UsageValue = CounterValue, TimeGenerated
) on Computer, InstanceName
| where TimeGenerated >= LimitStartTime and TimeGenerated < LimitEndTime
| project Computer, ContainerName, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize) , ContainerName

资源可用性

处于“就绪”和“未就绪”状态的节点和计数(指标度量):

let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let clusterName = '<your-cluster-name>';
KubeNodeInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| distinct ClusterName, Computer, TimeGenerated
| summarize ClusterSnapshotCount = count() by bin(TimeGenerated, trendBinSize), ClusterName, Computer
| join hint.strategy=broadcast kind=inner (
    KubeNodeInventory
    | where TimeGenerated < endDateTime
    | where TimeGenerated >= startDateTime
    | summarize TotalCount = count(), ReadyCount = sumif(1, Status contains ('Ready'))
                by ClusterName, Computer,  bin(TimeGenerated, trendBinSize)
    | extend NotReadyCount = TotalCount - ReadyCount
) on ClusterName, Computer, TimeGenerated
| project   TimeGenerated,
            ClusterName,
            Computer,
            ReadyCount = todouble(ReadyCount) / ClusterSnapshotCount,
            NotReadyCount = todouble(NotReadyCount) / ClusterSnapshotCount
| order by ClusterName asc, Computer asc, TimeGenerated desc

以下查询返回基于所有阶段(FailedPendingUnknownRunningSucceeded)的 Pod 阶段计数。

let endDateTime = now(); 
let startDateTime = ago(1h);
let trendBinSize = 1m;
let clusterName = '<your-cluster-name>';
KubePodInventory
    | where TimeGenerated < endDateTime
    | where TimeGenerated >= startDateTime
    | where ClusterName == clusterName
    | distinct ClusterName, TimeGenerated
    | summarize ClusterSnapshotCount = count() by bin(TimeGenerated, trendBinSize), ClusterName
    | join hint.strategy=broadcast (
        KubePodInventory
        | where TimeGenerated < endDateTime
        | where TimeGenerated >= startDateTime
        | summarize PodStatus=any(PodStatus) by TimeGenerated, PodUid, ClusterName
        | summarize TotalCount = count(),
                    PendingCount = sumif(1, PodStatus =~ 'Pending'),
                    RunningCount = sumif(1, PodStatus =~ 'Running'),
                    SucceededCount = sumif(1, PodStatus =~ 'Succeeded'),
                    FailedCount = sumif(1, PodStatus =~ 'Failed')
                by ClusterName, bin(TimeGenerated, trendBinSize)
    ) on ClusterName, TimeGenerated
    | extend UnknownCount = TotalCount - PendingCount - RunningCount - SucceededCount - FailedCount
    | project TimeGenerated,
              TotalCount = todouble(TotalCount) / ClusterSnapshotCount,
              PendingCount = todouble(PendingCount) / ClusterSnapshotCount,
              RunningCount = todouble(RunningCount) / ClusterSnapshotCount,
              SucceededCount = todouble(SucceededCount) / ClusterSnapshotCount,
              FailedCount = todouble(FailedCount) / ClusterSnapshotCount,
              UnknownCount = todouble(UnknownCount) / ClusterSnapshotCount
| summarize AggValue = avg(PendingCount) by bin(TimeGenerated, trendBinSize)

注意

若要针对特定的 Pod 阶段(例如“Pending”、“Failed”或“Unknown”)发出警报,请修改查询的最后一行。 例如,若要针对 FailedCount 发出警报,请使用 | summarize AggValue = avg(FailedCount) by bin(TimeGenerated, trendBinSize)

以下查询返回可用空间超过 90% 的已用群集节点磁盘。 若要获取群集 ID,请首先运行以下查询并从 ClusterId 属性中复制值:

InsightsMetrics
| extend Tags = todynamic(Tags)            
| project ClusterId = Tags['container.azm.ms/clusterId']   
| distinct tostring(ClusterId)   
let clusterId = '<cluster-id>';
let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
InsightsMetrics
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where Origin == 'container.azm.ms/telegraf'            
| where Namespace == 'container.azm.ms/disk'            
| extend Tags = todynamic(Tags)            
| project TimeGenerated, ClusterId = Tags['container.azm.ms/clusterId'], Computer = tostring(Tags.hostName), Device = tostring(Tags.device), Path = tostring(Tags.path), DiskMetricName = Name, DiskMetricValue = Val   
| where ClusterId =~ clusterId       
| where DiskMetricName == 'used_percent'
| summarize AggValue = max(DiskMetricValue) by bin(TimeGenerated, trendBinSize)
| where AggValue >= 90

当过去 10 分钟内的单个系统容器重启计数超过阈值时,会发出单个容器重启次数(结果数量)警报:

let _threshold = 10m; 
let _alertThreshold = 2;
let Timenow = (datetime(now) - _threshold); 
let starttime = ago(5m); 
KubePodInventory
| where TimeGenerated >= starttime
| where Namespace in ('default', 'kube-system') // the namespace filter goes here
| where ContainerRestartCount > _alertThreshold
| extend Tags = todynamic(ContainerLastStatus)
| extend startedAt = todynamic(Tags.startedAt)
| where startedAt >= Timenow
| summarize arg_max(TimeGenerated, *) by Name

后续步骤