
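Create log search alerts from Container insights

Container insights monitors the performance of container workloads that are deployed to managed or self-managed Kubernetes clusters. To alert on problems as they occur, this article describes how to create log-based alerts for the following situations with Azure Kubernetes Service (AKS) clusters: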

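  • When CPU or memory utilization on cluster nodes exceeds a threshold
  • When CPU or memory utilization on any container within a controller exceeds a threshold (compared to the limit set on the corresponding resource)
  • NotReady status node counts
  • Failed, Pending, Unknown, Running, or Succeeded pod-phase counts
  • When used disk space on cluster nodes exceeds a threshold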

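To alert on high CPU or memory utilization, or on low free disk space, on cluster nodes, use the provided queries to create a metric alert or a metric measurement alert. Metric alerts have lower latency than log search alerts, but log search alerts provide advanced querying and greater sophistication. Log search alert queries compare a datetime to the present by using the now operator and going back one hour. (Container insights stores all dates in Coordinated Universal Time [UTC] format.)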
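As a minimal sketch of the one-hour lookback described above, the following query only prints the window boundaries that the alert queries in this article build with now() and ago(). The column names WindowStart, WindowEnd, and WindowLength are illustrative and not part of any Container insights schema.

```kusto
// Same variable names as the alert queries in this article.
let endDateTime = now();        // current time, in UTC
let startDateTime = ago(1h);    // one hour before the current time
// Print the boundaries so you can confirm the window before you use it in an alert rule.
print WindowStart = startDateTime,
      WindowEnd = endDateTime,
      WindowLength = endDateTime - startDateTime
```

Because both values are in UTC, they can be compared directly with the TimeGenerated column.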

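Important

The queries in this article depend on data that's collected by Container insights and stored in a Log Analytics workspace. If you've modified the default data collection settings, the queries might not return the expected results. Most notably, if you've disabled collection of performance data because you've enabled Prometheus metrics for the cluster, any queries that use the Perf table won't return results.

See Configure data collection in Container insights using data collection rule for preset configurations, including disabling performance data collection. See Configure data collection in Container insights using ConfigMap for further data collection options.

If you aren't familiar with Azure Monitor alerts, see Overview of alerts in Microsoft Azure before you start. To learn more about alerts that use log queries, see Log search alerts in Azure Monitor. For more about metric alerts, see Metric alerts in Azure Monitor.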

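Log query measurements

Log search alerts can measure two different things, which can be used to monitor virtual machines in different scenarios:

  • Result count: Counts the number of rows returned by the query. Use it to work with events such as Windows event logs, Syslog, and application exceptions.
  • Calculation of a value: Makes a calculation based on a numeric column. Use it to include any number of resources. An example is CPU percentage.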
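The two measurement types can be illustrated with tables that this article already uses. The following queries are sketches only: the five-minute and one-hour windows, the 'Failed' filter, and the choice of counter are assumptions picked for illustration.

```kusto
// Result count: each returned row is one pod that reported the Failed phase.
// The alert rule would compare the number of rows against a threshold.
KubePodInventory
| where TimeGenerated >= ago(5m)
| where PodStatus =~ 'Failed'
| distinct ClusterName, Name
```

```kusto
// Calculation of a value: AggValue is a number computed from a numeric column.
// The alert rule would compare AggValue against a threshold.
Perf
| where TimeGenerated >= ago(1h)
| where ObjectName == 'K8SNode'
| where CounterName == 'cpuUsageNanoCores'
| summarize AggValue = avg(CounterValue) by bin(TimeGenerated, 5m), Computer
```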

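Target resources and dimensions

You can use one rule to monitor the values of multiple instances by using dimensions. For example, you would use dimensions if you wanted to monitor CPU usage on multiple instances that run your website or app, and create an alert for CPU usage of over 80%.

To create resource-centric alerts at scale for a subscription or resource group, you can split by dimensions. When you want to monitor the same condition on multiple Azure resources, splitting by dimensions splits the alerts into separate alerts by grouping unique combinations that use numerical or string columns. Splitting on the Azure resource ID column makes the specified resource the target of the alert.

You might also decide not to split when you want a condition on multiple resources in the scope. For example, you might want to create an alert if at least five machines in the resource group scope have CPU usage over 80%.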

Screenshot that shows a new log search alert rule with split by dimensions.
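A query can only be split on columns that it returns. As a hedged sketch, the following query keeps both Computer and the standard _ResourceId column in its output so that either one could be selected as a dimension in the alert rule. The counter and the five-minute bin are assumptions chosen for illustration.

```kusto
// Keep the columns you plan to split on in the "by" clause of summarize.
Perf
| where TimeGenerated >= ago(1h)
| where ObjectName == 'K8SNode'
| where CounterName == 'cpuUsageNanoCores'
| summarize AggValue = avg(CounterValue)
          by bin(TimeGenerated, 5m), Computer, _ResourceId
```

Splitting on Computer would produce one alert per node, while leaving the dimensions unselected would evaluate the condition across all returned rows together.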

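You might want to see a list of the alerts with the affected computers. You can use a custom workbook that uses a custom resource graph to provide this view. Use the following query to display alerts, and use the data source Azure Resource Graph in the workbook.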
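A minimal sketch of such an Azure Resource Graph query might look like the following. It assumes the alertsmanagementresources table and the properties.essentials fields that Azure Resource Graph exposes for alerts. The output column names and the 'Fired' filter are illustrative choices, not requirements.

```kusto
// Run against the Azure Resource Graph data source, not a Log Analytics workspace.
alertsmanagementresources
| where type =~ 'microsoft.alertsmanagement/alerts'
// Assumption: only alerts that are currently fired are of interest.
| where tostring(properties.essentials.monitorCondition) =~ 'Fired'
| project AlertName = name,
          Severity = tostring(properties.essentials.severity),
          MonitorCondition = tostring(properties.essentials.monitorCondition),
          TargetResource = tostring(properties.essentials.targetResource),
          StartTime = todatetime(properties.essentials.startDateTime)
| order by StartTime desc
```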

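Create a log search alert rule

To create a log search alert rule by using the portal, see this example of a log search alert, which provides a complete walkthrough. You can use the same process to create alert rules for AKS clusters by using queries similar to the ones in this article.

To create a query alert rule by using an Azure Resource Manager (ARM) template, see Resource Manager template samples for log search alert rules in Azure Monitor. You can use the same process to create ARM templates for the log queries in this article.

Resource utilization

Average CPU utilization as an average of member nodes' CPU utilization every minute (metric measurement):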

let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'cpuCapacityNanoCores';
let usageCounterName = 'cpuUsageNanoCores';
KubeNodeInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
// cluster filter would go here if multiple clusters are reporting to the same Log Analytics workspace
| distinct ClusterName, Computer
| join hint.strategy=shuffle (
  Perf
  | where TimeGenerated < endDateTime
  | where TimeGenerated >= startDateTime
  | where ObjectName == 'K8SNode'
  | where CounterName == capacityCounterName
  | summarize LimitValue = max(CounterValue) by Computer, CounterName, bin(TimeGenerated, trendBinSize)
  | project Computer, CapacityStartTime = TimeGenerated, CapacityEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer
| join kind=inner hint.strategy=shuffle (
  Perf
  | where TimeGenerated < endDateTime + trendBinSize
  | where TimeGenerated >= startDateTime - trendBinSize
  | where ObjectName == 'K8SNode'
  | where CounterName == usageCounterName
  | project Computer, UsageValue = CounterValue, TimeGenerated
) on Computer
| where TimeGenerated >= CapacityStartTime and TimeGenerated < CapacityEndTime
| project ClusterName, Computer, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize), ClusterName

Average memory utilization as an average of member nodes' memory utilization every minute (metric measurement):

let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'memoryCapacityBytes';
let usageCounterName = 'memoryRssBytes';
KubeNodeInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
// cluster filter would go here if multiple clusters are reporting to the same Log Analytics workspace
| distinct ClusterName, Computer
| join hint.strategy=shuffle (
  Perf
  | where TimeGenerated < endDateTime
  | where TimeGenerated >= startDateTime
  | where ObjectName == 'K8SNode'
  | where CounterName == capacityCounterName
  | summarize LimitValue = max(CounterValue) by Computer, CounterName, bin(TimeGenerated, trendBinSize)
  | project Computer, CapacityStartTime = TimeGenerated, CapacityEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer
| join kind=inner hint.strategy=shuffle (
  Perf
  | where TimeGenerated < endDateTime + trendBinSize
  | where TimeGenerated >= startDateTime - trendBinSize
  | where ObjectName == 'K8SNode'
  | where CounterName == usageCounterName
  | project Computer, UsageValue = CounterValue, TimeGenerated
) on Computer
| where TimeGenerated >= CapacityStartTime and TimeGenerated < CapacityEndTime
| project ClusterName, Computer, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize), ClusterName
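The two node queries above divide usage by capacity for each one-minute bin. To preview which nodes would breach a threshold before you create the alert rule, a simplified variant such as the following can help. It is a sketch under stated assumptions: cpuThresholdPercent is a hypothetical value, and the query reads capacity and usage in a single pass over Perf instead of the joins used above, so its results can differ slightly from the full query.

```kusto
// Hypothetical threshold, for preview only. The alert rule normally applies the threshold.
let cpuThresholdPercent = 80.0;
Perf
| where TimeGenerated >= ago(1h)
| where ObjectName == 'K8SNode'
| where CounterName in ('cpuCapacityNanoCores', 'cpuUsageNanoCores')
// One row per node and minute, with capacity and average usage side by side.
| summarize Capacity = maxif(CounterValue, CounterName == 'cpuCapacityNanoCores'),
            Usage = avgif(CounterValue, CounterName == 'cpuUsageNanoCores')
          by Computer, bin(TimeGenerated, 1m)
| extend UsagePercent = Usage * 100.0 / Capacity
// Bins without a capacity sample produce a null percentage and are dropped here.
| where UsagePercent > cpuThresholdPercent
```

For memory, the same shape applies with the memoryCapacityBytes and memoryRssBytes counters.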

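Important

The following queries use the placeholder values <your-cluster-name> and <your-controller-name> to represent your cluster and controller. Replace them with values specific to your environment when you set up alerts.

Average CPU utilization of all containers in a controller as an average of CPU utilization of every container instance in a controller every minute (metric measurement):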

let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'cpuLimitNanoCores';
let usageCounterName = 'cpuUsageNanoCores';
let clusterName = '<your-cluster-name>';
let controllerName = '<your-controller-name>';
KubePodInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where ClusterName == clusterName
| where ControllerName == controllerName
| extend InstanceName = strcat(ClusterId, '/', ContainerName),
         ContainerName = strcat(controllerName, '/', tostring(split(ContainerName, '/')[1]))
| distinct Computer, InstanceName, ContainerName
| join hint.strategy=shuffle (
    Perf
    | where TimeGenerated < endDateTime
    | where TimeGenerated >= startDateTime
    | where ObjectName == 'K8SContainer'
    | where CounterName == capacityCounterName
    | summarize LimitValue = max(CounterValue) by Computer, InstanceName, bin(TimeGenerated, trendBinSize)
    | project Computer, InstanceName, LimitStartTime = TimeGenerated, LimitEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer, InstanceName
| join kind=inner hint.strategy=shuffle (
    Perf
    | where TimeGenerated < endDateTime + trendBinSize
    | where TimeGenerated >= startDateTime - trendBinSize
    | where ObjectName == 'K8SContainer'
    | where CounterName == usageCounterName
    | project Computer, InstanceName, UsageValue = CounterValue, TimeGenerated
) on Computer, InstanceName
| where TimeGenerated >= LimitStartTime and TimeGenerated < LimitEndTime
| project Computer, ContainerName, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize) , ContainerName

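Average memory utilization of all containers in a controller as an average of memory utilization of every container instance in a controller every minute (metric measurement):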

let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'memoryLimitBytes';
let usageCounterName = 'memoryRssBytes';
let clusterName = '<your-cluster-name>';
let controllerName = '<your-controller-name>';
KubePodInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where ClusterName == clusterName
| where ControllerName == controllerName
| extend InstanceName = strcat(ClusterId, '/', ContainerName),
         ContainerName = strcat(controllerName, '/', tostring(split(ContainerName, '/')[1]))
| distinct Computer, InstanceName, ContainerName
| join hint.strategy=shuffle (
    Perf
    | where TimeGenerated < endDateTime
    | where TimeGenerated >= startDateTime
    | where ObjectName == 'K8SContainer'
    | where CounterName == capacityCounterName
    | summarize LimitValue = max(CounterValue) by Computer, InstanceName, bin(TimeGenerated, trendBinSize)
    | project Computer, InstanceName, LimitStartTime = TimeGenerated, LimitEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer, InstanceName
| join kind=inner hint.strategy=shuffle (
    Perf
    | where TimeGenerated < endDateTime + trendBinSize
    | where TimeGenerated >= startDateTime - trendBinSize
    | where ObjectName == 'K8SContainer'
    | where CounterName == usageCounterName
    | project Computer, InstanceName, UsageValue = CounterValue, TimeGenerated
) on Computer, InstanceName
| where TimeGenerated >= LimitStartTime and TimeGenerated < LimitEndTime
| project Computer, ContainerName, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize) , ContainerName

Resource availability

Nodes and counts that have a status of Ready and NotReady (metric measurement):

let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let clusterName = '<your-cluster-name>';
KubeNodeInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where ClusterName == clusterName
| distinct ClusterName, Computer, TimeGenerated
| summarize ClusterSnapshotCount = count() by bin(TimeGenerated, trendBinSize), ClusterName, Computer
| join hint.strategy=broadcast kind=inner (
    KubeNodeInventory
    | where TimeGenerated < endDateTime
    | where TimeGenerated >= startDateTime
    | summarize TotalCount = count(), ReadyCount = sumif(1, Status contains ('Ready'))
                by ClusterName, Computer,  bin(TimeGenerated, trendBinSize)
    | extend NotReadyCount = TotalCount - ReadyCount
) on ClusterName, Computer, TimeGenerated
| project   TimeGenerated,
            ClusterName,
            Computer,
            ReadyCount = todouble(ReadyCount) / ClusterSnapshotCount,
            NotReadyCount = todouble(NotReadyCount) / ClusterSnapshotCount
| order by ClusterName asc, Computer asc, TimeGenerated desc

The following query returns pod phase counts based on all phases: Failed, Pending, Unknown, Running, or Succeeded.

let endDateTime = now(); 
let startDateTime = ago(1h);
let trendBinSize = 1m;
let clusterName = '<your-cluster-name>';
KubePodInventory
    | where TimeGenerated < endDateTime
    | where TimeGenerated >= startDateTime
    | where ClusterName == clusterName
    | distinct ClusterName, TimeGenerated
    | summarize ClusterSnapshotCount = count() by bin(TimeGenerated, trendBinSize), ClusterName
    | join hint.strategy=broadcast (
        KubePodInventory
        | where TimeGenerated < endDateTime
        | where TimeGenerated >= startDateTime
        | summarize PodStatus=any(PodStatus) by TimeGenerated, PodUid, ClusterName
        | summarize TotalCount = count(),
                    PendingCount = sumif(1, PodStatus =~ 'Pending'),
                    RunningCount = sumif(1, PodStatus =~ 'Running'),
                    SucceededCount = sumif(1, PodStatus =~ 'Succeeded'),
                    FailedCount = sumif(1, PodStatus =~ 'Failed')
                by ClusterName, bin(TimeGenerated, trendBinSize)
    ) on ClusterName, TimeGenerated
    | extend UnknownCount = TotalCount - PendingCount - RunningCount - SucceededCount - FailedCount
    | project TimeGenerated,
              TotalCount = todouble(TotalCount) / ClusterSnapshotCount,
              PendingCount = todouble(PendingCount) / ClusterSnapshotCount,
              RunningCount = todouble(RunningCount) / ClusterSnapshotCount,
              SucceededCount = todouble(SucceededCount) / ClusterSnapshotCount,
              FailedCount = todouble(FailedCount) / ClusterSnapshotCount,
              UnknownCount = todouble(UnknownCount) / ClusterSnapshotCount
| summarize AggValue = avg(PendingCount) by bin(TimeGenerated, trendBinSize)

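Note

To alert on certain pod phases, such as Pending, Failed, or Unknown, modify the last line of the query. For example, to alert on FailedCount, use | summarize AggValue = avg(FailedCount) by bin(TimeGenerated, trendBinSize).

The following query returns cluster node disks whose used space exceeds 90%. To get the cluster ID, first run the following query and copy the value from the ClusterId property: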

InsightsMetrics
| extend Tags = todynamic(Tags)            
| project ClusterId = Tags['container.azm.ms/clusterId']   
| distinct tostring(ClusterId)   

Then replace <cluster-id> in the following query with the copied value:

let clusterId = '<cluster-id>';
let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
InsightsMetrics
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where Origin == 'container.azm.ms/telegraf'            
| where Namespace == 'container.azm.ms/disk'            
| extend Tags = todynamic(Tags)            
| project TimeGenerated, ClusterId = Tags['container.azm.ms/clusterId'], Computer = tostring(Tags.hostName), Device = tostring(Tags.device), Path = tostring(Tags.path), DiskMetricName = Name, DiskMetricValue = Val   
| where ClusterId =~ clusterId       
| where DiskMetricName == 'used_percent'
| summarize AggValue = max(DiskMetricValue) by bin(TimeGenerated, trendBinSize)
| where AggValue >= 90
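The query above reports one maximum value per minute for the whole cluster. If you want the alert to identify the affected node and device, a variant such as the following keeps them as columns so that they can be used as dimensions. This is a sketch that reuses the tags from the query above. Filtering on Name before projecting is a stylistic assumption, not a change in meaning.

```kusto
let clusterId = '<cluster-id>';
InsightsMetrics
| where TimeGenerated >= ago(1h)
| where Origin == 'container.azm.ms/telegraf'
| where Namespace == 'container.azm.ms/disk'
| where Name == 'used_percent'
| extend Tags = todynamic(Tags)
| where tostring(Tags['container.azm.ms/clusterId']) =~ clusterId
// Keep the host and device so each disk is evaluated separately.
| summarize AggValue = max(Val)
          by bin(TimeGenerated, 1m),
             Computer = tostring(Tags.hostName),
             Device = tostring(Tags.device)
| where AggValue >= 90
```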

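Individual container restarts (number of results) alert when the individual system container restart count exceeds a threshold for the last 10 minutes: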

let _threshold = 10m; 
let _alertThreshold = 2;
let Timenow = (datetime(now) - _threshold); 
let starttime = ago(5m); 
KubePodInventory
| where TimeGenerated >= starttime
| where Namespace in ('default', 'kube-system') // the namespace filter goes here
| where ContainerRestartCount > _alertThreshold
| extend Tags = todynamic(ContainerLastStatus)
| extend startedAt = todynamic(Tags.startedAt)
| where startedAt >= Timenow
| summarize arg_max(TimeGenerated, *) by Name

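Next steps

  • View log query examples to see predefined queries and examples that you can evaluate or customize for alerting, visualizing, or analyzing your clusters.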