Create log search alerts from Container insights

Article
10/15/2024

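Container insights monitors the performance of container workloads that are deployed to managed or self-managed Kubernetes clusters. To alert on issues as they occur, this article describes how to create log-based alerts for the following situations with Azure Kubernetes Service (AKS) clusters: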

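When CPU or memory utilization on cluster nodes exceeds a threshold
When CPU or memory utilization on any container within a controller exceeds a threshold, compared to the limit that's set on the corresponding resource
NotReady status node counts
Failed, Pending, Unknown, Running, or Succeeded pod-phase counts
When free disk space on cluster nodes exceeds a threshold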

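To alert on high CPU or memory utilization, or on low free disk space, on cluster nodes, use the queries that are provided to create a metric alert or a metric measurement alert. Metric alerts have lower latency than log search alerts, but log search alerts provide advanced querying and greater sophistication. Log search alert queries compare a datetime to the present by using the now operator and going back one hour. (Container insights stores all dates in Coordinated Universal Time [UTC] format.)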
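
The one-hour lookback can be illustrated outside Kusto. The following Python sketch is only an illustration and isn't part of Container insights: the function names `alert_window` and `in_window` and the sample timestamps are hypothetical. It builds the half-open UTC window that corresponds to the `TimeGenerated >= startDateTime` and `TimeGenerated < endDateTime` filters used by the queries in this article.

```python
from datetime import datetime, timedelta, timezone


def alert_window(now, lookback=timedelta(hours=1)):
    """Return (start, end) of the lookback window that ends at `now`, in UTC."""
    end = now.astimezone(timezone.utc)
    return end - lookback, end


def in_window(timestamp, start, end):
    """Half-open check: start <= timestamp < end, as in the query filters."""
    return start <= timestamp < end


# Hypothetical "current time" so that the example is reproducible.
now = datetime(2024, 10, 15, 12, 0, tzinfo=timezone.utc)
start, end = alert_window(now)
sample = datetime(2024, 10, 15, 11, 30, tzinfo=timezone.utc)
print(start.isoformat(), end.isoformat(), in_window(sample, start, end))
```

Because every timestamp is kept in UTC, the comparison doesn't depend on the time zone of the machine that evaluates it.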

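Important

The queries in this article depend on data that Container insights collects and stores in a Log Analytics workspace. If you've modified the default data collection settings, the queries might not return the expected results. Most notably, if you've disabled collection of performance data because you enabled Prometheus metrics for the cluster, any queries that use the Perf table won't return results.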

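See Configure data collection in Container insights using data collection rule for preset configurations, including disabling performance data collection. See Configure data collection in Container insights using ConfigMap for further data collection options.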

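If you aren't familiar with Azure Monitor alerts, see Overview of alerts in Microsoft Azure before you start. To learn more about alerts that use log queries, see Log search alerts in Azure Monitor. To learn more about metric alerts, see Metric alerts in Azure Monitor.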

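Log query measurements

Log search alerts can measure two different things, which can be used to monitor virtual machines in different scenarios:

Result count: Counts the number of rows returned by the query. It can be used to work with events such as Windows event logs, Syslog, and application exceptions.
Calculation of a value: Makes a calculation based on a numeric column. It can be used to include any number of resources. An example is CPU percentage.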
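
The difference between the two measurements can be sketched in a few lines of Python. This is a hypothetical illustration, not how Azure Monitor evaluates a rule: the rows, the column name `CpuPercent`, and both helper functions are assumptions made for the example.

```python
from statistics import mean

# Hypothetical rows, shaped like the output of a log query.
rows = [
    {"Computer": "node-1", "CpuPercent": 91.0},
    {"Computer": "node-2", "CpuPercent": 42.0},
    {"Computer": "node-3", "CpuPercent": 86.0},
]


def result_count(rows):
    """Result count: how many rows the query returned."""
    return len(rows)


def calculated_value(rows, column):
    """Calculation of a value: here, the average of a numeric column."""
    return mean(row[column] for row in rows)


count_fires = result_count(rows) > 0                     # any row at all
value_fires = calculated_value(rows, "CpuPercent") > 80  # average above 80%
print(count_fires, value_fires)
```

With this data, a result-count condition fires because rows exist, while the calculated average (73%) stays below the 80% threshold.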

Target resources and dimensions

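You can use one rule to monitor the values of multiple instances by using dimensions. For example, you can monitor CPU usage on multiple instances that run your website or app and create an alert when CPU usage exceeds 80%.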

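To create resource-centric alerts at scale for a subscription or resource group, you can split by dimensions. When you want to monitor the same condition on multiple Azure resources, splitting by dimensions divides the alert into separate alerts by grouping unique combinations of values in numerical or string columns. Splitting on an Azure resource ID column makes the specified resource the alert target.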

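You might also decide not to split when you want a single condition on multiple resources in the scope. For example, you might want to create an alert if at least five machines in the resource group scope have CPU usage over 80%.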
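
The two approaches can be contrasted with a small Python sketch. It's an illustration under stated assumptions, not the Azure Monitor implementation: the row shape, the `CpuPercent` column, and the function names are hypothetical, while the 80% threshold and the five-machine minimum come from the examples above.

```python
from collections import defaultdict
from statistics import mean

THRESHOLD = 80.0   # percent CPU, as in the example in the text
MIN_COMPUTERS = 5  # "at least five machines"


def split_by_dimension(rows, dimension="Computer"):
    """Evaluate the condition once per dimension value; return the breaching values."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[dimension]].append(row["CpuPercent"])
    return sorted(key for key, values in grouped.items() if mean(values) > THRESHOLD)


def unsplit_condition(rows):
    """Single condition over the whole scope: enough machines above the threshold."""
    return len(split_by_dimension(rows)) >= MIN_COMPUTERS


# Hypothetical data: six machines, five of them above 80%.
rows = [
    {"Computer": f"vm-{index}", "CpuPercent": percent}
    for index, percent in enumerate([95, 90, 85, 82, 81, 40])
]
print(split_by_dimension(rows))  # one alert per listed machine
print(unsplit_condition(rows))   # one alert for the whole scope
```

Splitting yields one alert per breaching machine. Without splitting, the rule yields a single alert that describes the scope as a whole.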

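You might want to see a list of the alerts by affected computer. You can use a custom workbook that uses a custom resource graph to provide this view. To display the alerts, query them in the workbook with Azure Resource Graph as the data source.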

Create a log search alert rule

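To create a log search alert rule by using the portal, see this example of a log search alert, which provides a complete walkthrough. You can use the same process to create alert rules for AKS clusters by using queries similar to the ones in this article.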

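To create a query alert rule by using an Azure Resource Manager (ARM) template, see Resource Manager template samples for log search alert rules in Azure Monitor. You can use the same processes to create ARM templates for the log queries in this article.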

Resource utilization

Average CPU utilization as an average of member nodes' CPU utilization every minute (metric measurement):

let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'cpuCapacityNanoCores';
let usageCounterName = 'cpuUsageNanoCores';
KubeNodeInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
// cluster filter would go here if multiple clusters are reporting to the same Log Analytics workspace
| distinct ClusterName, Computer
| join hint.strategy=shuffle (
  Perf
  | where TimeGenerated < endDateTime
  | where TimeGenerated >= startDateTime
  | where ObjectName == 'K8SNode'
  | where CounterName == capacityCounterName
  | summarize LimitValue = max(CounterValue) by Computer, CounterName, bin(TimeGenerated, trendBinSize)
  | project Computer, CapacityStartTime = TimeGenerated, CapacityEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer
| join kind=inner hint.strategy=shuffle (
  Perf
  | where TimeGenerated < endDateTime + trendBinSize
  | where TimeGenerated >= startDateTime - trendBinSize
  | where ObjectName == 'K8SNode'
  | where CounterName == usageCounterName
  | project Computer, UsageValue = CounterValue, TimeGenerated
) on Computer
| where TimeGenerated >= CapacityStartTime and TimeGenerated < CapacityEndTime
| project ClusterName, Computer, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize), ClusterName
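
The arithmetic behind this query can be sketched in Python. The sketch is a simplified illustration rather than a translation of the query: it assumes that capacity has already been reduced to one value per node and 1-minute bin, and the function name, data shapes, and sample values are all hypothetical. It matches each usage sample to its node's capacity, computes `usage * 100.0 / capacity`, and averages the percentages per bin and cluster.

```python
from collections import defaultdict
from statistics import mean


def average_usage_percent(usage_samples, capacity_by_bin):
    """Average node usage percent per (minute bin, cluster).

    usage_samples: iterable of (cluster, computer, minute_bin, usage_nanocores)
    capacity_by_bin: {(computer, minute_bin): capacity_nanocores}
    """
    percents = defaultdict(list)
    for cluster, computer, minute_bin, usage in usage_samples:
        capacity = capacity_by_bin.get((computer, minute_bin))
        if not capacity:
            # As with the joins: skip samples that have no capacity for that bin.
            continue
        percents[(minute_bin, cluster)].append(usage * 100.0 / capacity)
    return {key: mean(values) for key, values in percents.items()}


# Hypothetical data: two 2-core nodes in one cluster, one 1-minute bin.
capacity = {("node-a", 0): 2_000_000_000, ("node-b", 0): 2_000_000_000}
usage = [
    ("my-cluster", "node-a", 0, 1_000_000_000),  # 50% of capacity
    ("my-cluster", "node-b", 0, 1_500_000_000),  # 75% of capacity
    ("my-cluster", "node-c", 0, 500_000_000),    # no capacity record, dropped
]
result = average_usage_percent(usage, capacity)
print(result)
```

The value stored for each bin plays the role of `AggValue`, which is the number that the alert threshold is compared against.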

Average memory utilization as an average of member nodes' memory utilization every minute (metric measurement):

let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'memoryCapacityBytes';
let usageCounterName = 'memoryRssBytes';
KubeNodeInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
// cluster filter would go here if multiple clusters are reporting to the same Log Analytics workspace
| distinct ClusterName, Computer
| join hint.strategy=shuffle (
  Perf
  | where TimeGenerated < endDateTime
  | where TimeGenerated >= startDateTime
  | where ObjectName == 'K8SNode'
  | where CounterName == capacityCounterName
  | summarize LimitValue = max(CounterValue) by Computer, CounterName, bin(TimeGenerated, trendBinSize)
  | project Computer, CapacityStartTime = TimeGenerated, CapacityEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer
| join kind=inner hint.strategy=shuffle (
  Perf
  | where TimeGenerated < endDateTime + trendBinSize
  | where TimeGenerated >= startDateTime - trendBinSize
  | where ObjectName == 'K8SNode'
  | where CounterName == usageCounterName
  | project Computer, UsageValue = CounterValue, TimeGenerated
) on Computer
| where TimeGenerated >= CapacityStartTime and TimeGenerated < CapacityEndTime
| project ClusterName, Computer, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize), ClusterName

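Important

The following queries use the placeholder values <your-cluster-name> and <your-controller-name> to represent your cluster and controller. Replace them with values specific to your environment when you set up alerts.

Average CPU utilization of all containers in a controller as an average of the CPU utilization of every container instance in the controller every minute (metric measurement):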

let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'cpuLimitNanoCores';
let usageCounterName = 'cpuUsageNanoCores';
let clusterName = '<your-cluster-name>';
let controllerName = '<your-controller-name>';
KubePodInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where ClusterName == clusterName
| where ControllerName == controllerName
| extend InstanceName = strcat(ClusterId, '/', ContainerName),
         ContainerName = strcat(controllerName, '/', tostring(split(ContainerName, '/')[1]))
| distinct Computer, InstanceName, ContainerName
| join hint.strategy=shuffle (
    Perf
    | where TimeGenerated < endDateTime
    | where TimeGenerated >= startDateTime
    | where ObjectName == 'K8SContainer'
    | where CounterName == capacityCounterName
    | summarize LimitValue = max(CounterValue) by Computer, InstanceName, bin(TimeGenerated, trendBinSize)
    | project Computer, InstanceName, LimitStartTime = TimeGenerated, LimitEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer, InstanceName
| join kind=inner hint.strategy=shuffle (
    Perf
    | where TimeGenerated < endDateTime + trendBinSize
    | where TimeGenerated >= startDateTime - trendBinSize
    | where ObjectName == 'K8SContainer'
    | where CounterName == usageCounterName
    | project Computer, InstanceName, UsageValue = CounterValue, TimeGenerated
) on Computer, InstanceName
| where TimeGenerated >= LimitStartTime and TimeGenerated < LimitEndTime
| project Computer, ContainerName, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize) , ContainerName

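Average memory utilization of all containers in a controller as an average of the memory utilization of every container instance in the controller every minute (metric measurement):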

let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'memoryLimitBytes';
let usageCounterName = 'memoryRssBytes';
let clusterName = '<your-cluster-name>';
let controllerName = '<your-controller-name>';
KubePodInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where ClusterName == clusterName
| where ControllerName == controllerName
| extend InstanceName = strcat(ClusterId, '/', ContainerName),
         ContainerName = strcat(controllerName, '/', tostring(split(ContainerName, '/')[1]))
| distinct Computer, InstanceName, ContainerName
| join hint.strategy=shuffle (
    Perf
    | where TimeGenerated < endDateTime
    | where TimeGenerated >= startDateTime
    | where ObjectName == 'K8SContainer'
    | where CounterName == capacityCounterName
    | summarize LimitValue = max(CounterValue) by Computer, InstanceName, bin(TimeGenerated, trendBinSize)
    | project Computer, InstanceName, LimitStartTime = TimeGenerated, LimitEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer, InstanceName
| join kind=inner hint.strategy=shuffle (
    Perf
    | where TimeGenerated < endDateTime + trendBinSize
    | where TimeGenerated >= startDateTime - trendBinSize
    | where ObjectName == 'K8SContainer'
    | where CounterName == usageCounterName
    | project Computer, InstanceName, UsageValue = CounterValue, TimeGenerated
) on Computer, InstanceName
| where TimeGenerated >= LimitStartTime and TimeGenerated < LimitEndTime
| project Computer, ContainerName, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize) , ContainerName

Resource availability

Nodes and counts that have a status of Ready and NotReady (metric measurement):

let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let clusterName = '<your-cluster-name>';
KubeNodeInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| distinct ClusterName, Computer, TimeGenerated
| summarize ClusterSnapshotCount = count() by bin(TimeGenerated, trendBinSize), ClusterName, Computer
| join hint.strategy=broadcast kind=inner (
    KubeNodeInventory
    | where TimeGenerated < endDateTime
    | where TimeGenerated >= startDateTime
    | summarize TotalCount = count(), ReadyCount = sumif(1, Status contains ('Ready'))
                by ClusterName, Computer,  bin(TimeGenerated, trendBinSize)
    | extend NotReadyCount = TotalCount - ReadyCount
) on ClusterName, Computer, TimeGenerated
| project   TimeGenerated,
            ClusterName,
            Computer,
            ReadyCount = todouble(ReadyCount) / ClusterSnapshotCount,
            NotReadyCount = todouble(NotReadyCount) / ClusterSnapshotCount
| order by ClusterName asc, Computer asc, TimeGenerated desc

The following query returns pod phase counts based on all phases: Failed, Pending, Unknown, Running, or Succeeded.

let endDateTime = now(); 
let startDateTime = ago(1h);
let trendBinSize = 1m;
let clusterName = '<your-cluster-name>';
KubePodInventory
    | where TimeGenerated < endDateTime
    | where TimeGenerated >= startDateTime
    | where ClusterName == clusterName
    | distinct ClusterName, TimeGenerated
    | summarize ClusterSnapshotCount = count() by bin(TimeGenerated, trendBinSize), ClusterName
    | join hint.strategy=broadcast (
        KubePodInventory
        | where TimeGenerated < endDateTime
        | where TimeGenerated >= startDateTime
        | summarize PodStatus=any(PodStatus) by TimeGenerated, PodUid, ClusterName
        | summarize TotalCount = count(),
                    PendingCount = sumif(1, PodStatus =~ 'Pending'),
                    RunningCount = sumif(1, PodStatus =~ 'Running'),
                    SucceededCount = sumif(1, PodStatus =~ 'Succeeded'),
                    FailedCount = sumif(1, PodStatus =~ 'Failed')
                by ClusterName, bin(TimeGenerated, trendBinSize)
    ) on ClusterName, TimeGenerated
    | extend UnknownCount = TotalCount - PendingCount - RunningCount - SucceededCount - FailedCount
    | project TimeGenerated,
              TotalCount = todouble(TotalCount) / ClusterSnapshotCount,
              PendingCount = todouble(PendingCount) / ClusterSnapshotCount,
              RunningCount = todouble(RunningCount) / ClusterSnapshotCount,
              SucceededCount = todouble(SucceededCount) / ClusterSnapshotCount,
              FailedCount = todouble(FailedCount) / ClusterSnapshotCount,
              UnknownCount = todouble(UnknownCount) / ClusterSnapshotCount
| summarize AggValue = avg(PendingCount) by bin(TimeGenerated, trendBinSize)

Note

To alert on certain pod phases, such as Pending, Failed, or Unknown, modify the last line of the query. For example, to alert on FailedCount, use | summarize AggValue = avg(FailedCount) by bin(TimeGenerated, trendBinSize).
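
The way the pod phase query normalizes its counts can be sketched in Python. This is a simplified illustration: it assumes one record per pod per inventory snapshot within a single time bin, and the function name and sample records are hypothetical. As in the query, the Unknown count is whatever remains after the four named phases are subtracted from the total, and every count is divided by the number of snapshots in the bin.

```python
KNOWN_PHASES = ("Pending", "Running", "Succeeded", "Failed")


def phase_counts(records):
    """Average pods per snapshot for each phase.

    records: (snapshot_time, pod_uid, phase) tuples that fall in one time bin.
    """
    if not records:
        return {}
    snapshots = {snapshot_time for snapshot_time, _, _ in records}
    totals = {phase: 0 for phase in KNOWN_PHASES}
    for _, _, phase in records:
        for known in KNOWN_PHASES:
            if known.lower() == phase.lower():  # case-insensitive, like =~
                totals[known] += 1
    # Unknown is the remainder, as in the query's extend step.
    totals["Unknown"] = len(records) - sum(totals[p] for p in KNOWN_PHASES)
    return {phase: count / len(snapshots) for phase, count in totals.items()}


# Hypothetical data: three pods observed in two snapshots of the same bin.
records = [
    ("t0", "pod-a", "Running"), ("t0", "pod-b", "Pending"), ("t0", "pod-c", "Failed"),
    ("t1", "pod-a", "Running"), ("t1", "pod-b", "Running"), ("t1", "pod-c", "Unknown"),
]
counts = phase_counts(records)
print(counts)
```

Choosing which entry of the result to alert on corresponds to changing the column in the last line of the query, as the note describes.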

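The disk query in this section returns cluster node disks on which more than 90% of the space is used. It requires the cluster ID. To get the cluster ID, first run the following query and copy the value from the ClusterId property: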

InsightsMetrics
| extend Tags = todynamic(Tags)            
| project ClusterId = Tags['container.azm.ms/clusterId']   
| distinct tostring(ClusterId)

Then use that value for the <cluster-id> placeholder in the disk usage query:

let clusterId = '<cluster-id>';
let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
InsightsMetrics
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where Origin == 'container.azm.ms/telegraf'            
| where Namespace == 'container.azm.ms/disk'            
| extend Tags = todynamic(Tags)            
| project TimeGenerated, ClusterId = Tags['container.azm.ms/clusterId'], Computer = tostring(Tags.hostName), Device = tostring(Tags.device), Path = tostring(Tags.path), DiskMetricName = Name, DiskMetricValue = Val   
| where ClusterId =~ clusterId       
| where DiskMetricName == 'used_percent'
| summarize AggValue = max(DiskMetricValue) by bin(TimeGenerated, trendBinSize)
| where AggValue >= 90
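
The last two steps of the disk query, taking the maximum `used_percent` per time bin and keeping only bins at or above 90, can be sketched in Python. The function name and the sample values are hypothetical, and the sketch assumes that the samples are already filtered to one cluster.

```python
def disks_over_threshold(samples, threshold=90.0):
    """Return {minute_bin: max_used_percent} for bins whose maximum is >= threshold.

    samples: iterable of (minute_bin, used_percent) for one cluster.
    """
    peak = {}
    for minute_bin, used_percent in samples:
        peak[minute_bin] = max(peak.get(minute_bin, used_percent), used_percent)
    return {b: value for b, value in peak.items() if value >= threshold}


# Hypothetical samples: several disks reporting in three 1-minute bins.
samples = [(0, 72.5), (0, 91.0), (1, 88.0), (2, 90.0)]
over = disks_over_threshold(samples)
print(over)
```

Because the maximum is taken across all disks in a bin, one nearly full disk is enough for that bin to appear in the result.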

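The following query alerts on individual container restarts (number of results) when the restart count of an individual system container exceeds a threshold within the last 10 minutes: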

let _threshold = 10m; 
let _alertThreshold = 2;
let Timenow = (datetime(now) - _threshold); 
let starttime = ago(5m); 
KubePodInventory
| where TimeGenerated >= starttime
| where Namespace in ('default', 'kube-system') // the namespace filter goes here
| where ContainerRestartCount > _alertThreshold
| extend Tags = todynamic(ContainerLastStatus)
| extend startedAt = todynamic(Tags.startedAt)
| where startedAt >= Timenow
| summarize arg_max(TimeGenerated, *) by Name

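Next steps

View log query examples to see predefined queries and examples that you can evaluate or customize for alerting on, visualizing, or analyzing your clusters.
To learn more about Azure Monitor and how to monitor other aspects of your Kubernetes cluster, see View Kubernetes cluster performance and View Kubernetes cluster health.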
