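Create log search alerts from Container insights
Container insights monitors the performance of container workloads that are deployed to managed or self-managed Kubernetes clusters. To alert on what matters, this article describes how to create log-based alerts for the following situations with Azure Kubernetes Service (AKS) clusters: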
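- When CPU or memory utilization on cluster nodes exceeds a threshold.
- When CPU or memory utilization on any container within a controller exceeds a threshold, as compared to the limit that's set on the corresponding resource.
- NotReady status node counts.
- Failed, Pending, Unknown, Running, or Succeeded pod-phase counts.
- When free disk space on cluster nodes drops below a threshold.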
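To alert on high CPU or memory utilization, or on low free disk space on cluster nodes, use the queries provided here to create a metric alert or a metric measurement alert. Metric alerts have lower latency than log search alerts, but log search alerts provide advanced querying and greater sophistication. The log search alert queries compare a datetime to the present by using the now() function and going back one hour. (Container insights stores all dates in Coordinated Universal Time [UTC] format.)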
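As an illustration of the one-hour window described above, here is a minimal Python sketch. It is not part of Container insights; the helper names `query_window` and `in_window` are hypothetical. It assumes the half-open range used by the queries in this article (`TimeGenerated >= startDateTime` and `TimeGenerated < endDateTime`), evaluated in UTC.

```python
from datetime import datetime, timedelta, timezone

def query_window(now=None, lookback=timedelta(hours=1)):
    """Return (start, end) for a window that ends at `now` (UTC) and goes back `lookback`."""
    end = now or datetime.now(timezone.utc)
    return end - lookback, end

def in_window(ts, start, end):
    # Mirrors `TimeGenerated >= startDateTime` and `TimeGenerated < endDateTime`.
    return start <= ts < end

# A fixed "now" keeps the example deterministic.
fixed_now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
start, end = query_window(fixed_now)
print(start.isoformat(), end.isoformat())
```

The start of the window is included and the end is excluded, so a record stamped exactly at `end` falls outside the window.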
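Important
The queries in this article depend on data that Container insights collects and stores in a Log Analytics workspace. If you've modified the default data collection settings, the queries might not return the expected results. Most notably, if you've disabled collection of performance data because you've enabled Prometheus metrics for the cluster, any queries that use the Perf table won't return results.
For the default settings, including how to disable performance data collection, see Configure data collection in Container insights using data collection rules. For further data collection options, see Configure data collection in Container insights using ConfigMap.
If you aren't familiar with Azure Monitor alerts, see Overview of alerts in Microsoft Azure before you start. To learn more about alerts that use log queries, see Log search alerts in Azure Monitor. For more information about metric alerts, see Metric alerts in Azure Monitor.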
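Log query measurements
Log search alerts can measure two different things, which can be used to monitor virtual machines in different scenarios:
- Result count: Counts the number of rows returned by the query. Use it to work with events such as Windows event logs, Syslog, and application exceptions.
- Calculation of a value: Makes a calculation based on a numeric column. Use it to include any number of resources. An example is CPU percentage.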
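To make the difference between the two measurements concrete, the following Python sketch evaluates both over the same small result set. The rows and the names `result_count` and `average_value` are made up for illustration; they are not Azure Monitor APIs.

```python
# Hypothetical query results: one row per node with its CPU percentage.
rows = [
    {"Computer": "node-1", "CpuPercent": 91.0},
    {"Computer": "node-2", "CpuPercent": 42.0},
    {"Computer": "node-3", "CpuPercent": 88.0},
]

def result_count(rows, predicate):
    """Result count: how many rows match a condition."""
    return sum(1 for row in rows if predicate(row))

def average_value(rows, column):
    """Calculation of a value: aggregate a numeric column."""
    values = [row[column] for row in rows]
    return sum(values) / len(values)

hot_rows = result_count(rows, lambda row: row["CpuPercent"] > 80)
avg_cpu = average_value(rows, "CpuPercent")
print(hot_rows, round(avg_cpu, 2))
```

A result-count rule would compare `hot_rows` with a threshold, while a value-based rule would compare `avg_cpu` with one.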
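Target resources and dimensions
You can use one rule to monitor the values of multiple instances by using dimensions. For example, use dimensions if you want to monitor CPU usage on multiple instances that run your website or app, and create an alert for CPU usage over 80%.
To create resource-centric alerts at scale for a subscription or resource group, you can split by dimensions. When you want to monitor the same condition on multiple Azure resources, splitting by dimensions divides the alert into separate alerts by grouping unique combinations of values from numeric or string columns. Splitting on the Azure resource ID column makes the specified resource the alert target.
You might also decide not to split when you want a condition that spans multiple resources in the scope. For example, you might want to create an alert if at least five machines in the resource group scope have CPU usage over 80%.
You might want to see a list of the alerts together with the affected computers. You can provide this view with a custom workbook that uses a custom resource graph. To display the alerts, query them in the workbook with Azure Resource Graph as the data source.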
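The difference between splitting and not splitting can be sketched in Python as follows. This is only a model of the evaluation logic under simplified assumptions, not how Azure Monitor implements it; the sample data and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical (resource ID, CPU percent) samples from one evaluation period.
samples = [("vm-a", 85.0), ("vm-a", 95.0), ("vm-b", 40.0), ("vm-c", 90.0)]

def average_by_resource(samples):
    """Group samples by resource ID and average each group."""
    grouped = defaultdict(list)
    for resource_id, value in samples:
        grouped[resource_id].append(value)
    return {rid: sum(vals) / len(vals) for rid, vals in grouped.items()}

def split_alerts(samples, threshold):
    """Split by resource ID: one alert per resource whose average exceeds the threshold."""
    averages = average_by_resource(samples)
    return sorted(rid for rid, avg in averages.items() if avg > threshold)

def unsplit_alert(samples, threshold, min_resources):
    """No split: a single alert when at least `min_resources` resources exceed the threshold."""
    return len(split_alerts(samples, threshold)) >= min_resources

print(split_alerts(samples, 80.0))
print(unsplit_alert(samples, 80.0, min_resources=2))
```

With splitting, each offending resource becomes its own alert target. Without splitting, the rule fires once for the whole scope.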
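Create a log search alert rule
To create a log search alert rule by using the portal, see this example of a log search alert, which provides a complete walkthrough. You can use the same process to create alert rules for AKS clusters by using queries similar to the ones in this article.
To create a query alert rule by using an Azure Resource Manager (ARM) template, see Resource Manager template samples for log search alert rules in Azure Monitor. You can use the same process to create ARM templates for the log queries in this article.
Resource utilization
Average CPU utilization as an average of member nodes' CPU utilization every minute (metric measurement):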
let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'cpuCapacityNanoCores';
let usageCounterName = 'cpuUsageNanoCores';
KubeNodeInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
// cluster filter would go here if multiple clusters are reporting to the same Log Analytics workspace
| distinct ClusterName, Computer
| join hint.strategy=shuffle (
Perf
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where ObjectName == 'K8SNode'
| where CounterName == capacityCounterName
| summarize LimitValue = max(CounterValue) by Computer, CounterName, bin(TimeGenerated, trendBinSize)
| project Computer, CapacityStartTime = TimeGenerated, CapacityEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer
| join kind=inner hint.strategy=shuffle (
Perf
| where TimeGenerated < endDateTime + trendBinSize
| where TimeGenerated >= startDateTime - trendBinSize
| where ObjectName == 'K8SNode'
| where CounterName == usageCounterName
| project Computer, UsageValue = CounterValue, TimeGenerated
) on Computer
| where TimeGenerated >= CapacityStartTime and TimeGenerated < CapacityEndTime
| project ClusterName, Computer, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize), ClusterName
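The arithmetic at the end of the query above (usage as a percentage of capacity, averaged per one-minute bin and cluster) can be sketched in Python. This is a simplified model: it assumes one fixed capacity per node, whereas the query takes the maximum capacity per bin. The sample values and function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone

def floor_to_minute(ts):
    """Equivalent in spirit to bin(TimeGenerated, 1m)."""
    return ts.replace(second=0, microsecond=0)

def average_usage_percent(usage_samples, capacity_by_node):
    """usage_samples: (cluster, node, timestamp, usage); capacity_by_node: node -> capacity."""
    bins = defaultdict(list)
    for cluster, node, ts, usage in usage_samples:
        percent = usage * 100.0 / capacity_by_node[node]  # UsageValue * 100.0 / LimitValue
        bins[(floor_to_minute(ts), cluster)].append(percent)
    # AggValue = avg(UsagePercent) by bin(TimeGenerated, 1m), ClusterName
    return {key: sum(vals) / len(vals) for key, vals in bins.items()}

def utc(minute, second):
    return datetime(2024, 1, 1, 12, minute, second, tzinfo=timezone.utc)

capacity = {"node-1": 2_000_000_000, "node-2": 4_000_000_000}  # nanocores
samples = [
    ("aks-demo", "node-1", utc(0, 10), 1_000_000_000),  # 50%
    ("aks-demo", "node-2", utc(0, 40), 1_000_000_000),  # 25%
    ("aks-demo", "node-1", utc(1, 5), 1_500_000_000),   # 75%
]
result = average_usage_percent(samples, capacity)
for (bin_start, cluster), value in sorted(result.items()):
    print(bin_start.isoformat(), cluster, value)
```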
Average memory utilization as an average of member nodes' memory utilization every minute (metric measurement):
let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'memoryCapacityBytes';
let usageCounterName = 'memoryRssBytes';
KubeNodeInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
// cluster filter would go here if multiple clusters are reporting to the same Log Analytics workspace
| distinct ClusterName, Computer
| join hint.strategy=shuffle (
Perf
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where ObjectName == 'K8SNode'
| where CounterName == capacityCounterName
| summarize LimitValue = max(CounterValue) by Computer, CounterName, bin(TimeGenerated, trendBinSize)
| project Computer, CapacityStartTime = TimeGenerated, CapacityEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer
| join kind=inner hint.strategy=shuffle (
Perf
| where TimeGenerated < endDateTime + trendBinSize
| where TimeGenerated >= startDateTime - trendBinSize
| where ObjectName == 'K8SNode'
| where CounterName == usageCounterName
| project Computer, UsageValue = CounterValue, TimeGenerated
) on Computer
| where TimeGenerated >= CapacityStartTime and TimeGenerated < CapacityEndTime
| project ClusterName, Computer, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize), ClusterName
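Important
The following queries use the placeholder values <your-cluster-name> and <your-controller-name> to represent your cluster and controller. Replace them with values specific to your environment when you set up alerts.
Average CPU utilization of all containers in a controller as an average of the CPU utilization of every container instance in the controller every minute (metric measurement):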
let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'cpuLimitNanoCores';
let usageCounterName = 'cpuUsageNanoCores';
let clusterName = '<your-cluster-name>';
let controllerName = '<your-controller-name>';
KubePodInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where ClusterName == clusterName
| where ControllerName == controllerName
| extend InstanceName = strcat(ClusterId, '/', ContainerName),
ContainerName = strcat(controllerName, '/', tostring(split(ContainerName, '/')[1]))
| distinct Computer, InstanceName, ContainerName
| join hint.strategy=shuffle (
Perf
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where ObjectName == 'K8SContainer'
| where CounterName == capacityCounterName
| summarize LimitValue = max(CounterValue) by Computer, InstanceName, bin(TimeGenerated, trendBinSize)
| project Computer, InstanceName, LimitStartTime = TimeGenerated, LimitEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer, InstanceName
| join kind=inner hint.strategy=shuffle (
Perf
| where TimeGenerated < endDateTime + trendBinSize
| where TimeGenerated >= startDateTime - trendBinSize
| where ObjectName == 'K8SContainer'
| where CounterName == usageCounterName
| project Computer, InstanceName, UsageValue = CounterValue, TimeGenerated
) on Computer, InstanceName
| where TimeGenerated >= LimitStartTime and TimeGenerated < LimitEndTime
| project Computer, ContainerName, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize) , ContainerName
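Average memory utilization of all containers in a controller as an average of the memory utilization of every container instance in the controller every minute (metric measurement):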
let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'memoryLimitBytes';
let usageCounterName = 'memoryRssBytes';
let clusterName = '<your-cluster-name>';
let controllerName = '<your-controller-name>';
KubePodInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where ClusterName == clusterName
| where ControllerName == controllerName
| extend InstanceName = strcat(ClusterId, '/', ContainerName),
ContainerName = strcat(controllerName, '/', tostring(split(ContainerName, '/')[1]))
| distinct Computer, InstanceName, ContainerName
| join hint.strategy=shuffle (
Perf
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where ObjectName == 'K8SContainer'
| where CounterName == capacityCounterName
| summarize LimitValue = max(CounterValue) by Computer, InstanceName, bin(TimeGenerated, trendBinSize)
| project Computer, InstanceName, LimitStartTime = TimeGenerated, LimitEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer, InstanceName
| join kind=inner hint.strategy=shuffle (
Perf
| where TimeGenerated < endDateTime + trendBinSize
| where TimeGenerated >= startDateTime - trendBinSize
| where ObjectName == 'K8SContainer'
| where CounterName == usageCounterName
| project Computer, InstanceName, UsageValue = CounterValue, TimeGenerated
) on Computer, InstanceName
| where TimeGenerated >= LimitStartTime and TimeGenerated < LimitEndTime
| project Computer, ContainerName, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize) , ContainerName
Resource availability
Nodes and counts that have a status of Ready and NotReady (metric measurement):
let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let clusterName = '<your-cluster-name>';
KubeNodeInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| distinct ClusterName, Computer, TimeGenerated
| summarize ClusterSnapshotCount = count() by bin(TimeGenerated, trendBinSize), ClusterName, Computer
| join hint.strategy=broadcast kind=inner (
KubeNodeInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
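// 'has' matches the whole term 'Ready'; a substring match with 'contains' would also count 'NotReady' as ready
| summarize TotalCount = count(), ReadyCount = sumif(1, Status has 'Ready')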
by ClusterName, Computer, bin(TimeGenerated, trendBinSize)
| extend NotReadyCount = TotalCount - ReadyCount
) on ClusterName, Computer, TimeGenerated
| project TimeGenerated,
ClusterName,
Computer,
ReadyCount = todouble(ReadyCount) / ClusterSnapshotCount,
NotReadyCount = todouble(NotReadyCount) / ClusterSnapshotCount
| order by ClusterName asc, Computer asc, TimeGenerated desc
The following query returns pod phase counts based on all phases: Failed, Pending, Unknown, Running, or Succeeded.
let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let clusterName = '<your-cluster-name>';
KubePodInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where ClusterName == clusterName
| distinct ClusterName, TimeGenerated
| summarize ClusterSnapshotCount = count() by bin(TimeGenerated, trendBinSize), ClusterName
| join hint.strategy=broadcast (
KubePodInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| summarize PodStatus=any(PodStatus) by TimeGenerated, PodUid, ClusterName
| summarize TotalCount = count(),
PendingCount = sumif(1, PodStatus =~ 'Pending'),
RunningCount = sumif(1, PodStatus =~ 'Running'),
SucceededCount = sumif(1, PodStatus =~ 'Succeeded'),
FailedCount = sumif(1, PodStatus =~ 'Failed')
by ClusterName, bin(TimeGenerated, trendBinSize)
) on ClusterName, TimeGenerated
| extend UnknownCount = TotalCount - PendingCount - RunningCount - SucceededCount - FailedCount
| project TimeGenerated,
TotalCount = todouble(TotalCount) / ClusterSnapshotCount,
PendingCount = todouble(PendingCount) / ClusterSnapshotCount,
RunningCount = todouble(RunningCount) / ClusterSnapshotCount,
SucceededCount = todouble(SucceededCount) / ClusterSnapshotCount,
FailedCount = todouble(FailedCount) / ClusterSnapshotCount,
UnknownCount = todouble(UnknownCount) / ClusterSnapshotCount
| summarize AggValue = avg(PendingCount) by bin(TimeGenerated, trendBinSize)
Note
To alert on certain pod phases, such as Pending, Failed, or Unknown, modify the last line of the query. For example, to alert on FailedCount, use | summarize AggValue = avg(FailedCount) by bin(TimeGenerated, trendBinSize).
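The counting step of the pod phase query can be sketched in Python as follows. This simplified model counts the four named phases case-insensitively (like `=~`) and derives the Unknown count as the remainder, as the query does. It leaves out the division by the snapshot count, and the names are hypothetical.

```python
KNOWN_PHASES = ("Pending", "Running", "Succeeded", "Failed")

def phase_counts(pod_statuses):
    """Count pods per phase; anything that is not a named phase ends up in Unknown."""
    canonical = {phase.lower(): phase for phase in KNOWN_PHASES}
    counts = {phase: 0 for phase in KNOWN_PHASES}
    for status in pod_statuses:
        phase = canonical.get(status.lower())  # case-insensitive match
        if phase is not None:
            counts[phase] += 1
    # UnknownCount = TotalCount - PendingCount - RunningCount - SucceededCount - FailedCount
    counts["Unknown"] = len(pod_statuses) - sum(counts[p] for p in KNOWN_PHASES)
    return counts

statuses = ["Running", "running", "Pending", "Failed", "Evicted"]
counts = phase_counts(statuses)
print(counts)
```

To alert on a single phase, a rule would compare one of these counts (for example `counts["Failed"]`) with its threshold.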
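The second query below returns cluster node disks that have used more than 90% of their space. To get the cluster ID that it needs, first run this query and copy the value from the ClusterId property: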
InsightsMetrics
| extend Tags = todynamic(Tags)
| project ClusterId = Tags['container.azm.ms/clusterId']
| distinct tostring(ClusterId)
Then replace <cluster-id> in the following query with the value you copied:
let clusterId = '<cluster-id>';
let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
InsightsMetrics
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where Origin == 'container.azm.ms/telegraf'
| where Namespace == 'container.azm.ms/disk'
| extend Tags = todynamic(Tags)
| project TimeGenerated, ClusterId = Tags['container.azm.ms/clusterId'], Computer = tostring(Tags.hostName), Device = tostring(Tags.device), Path = tostring(Tags.path), DiskMetricName = Name, DiskMetricValue = Val
| where ClusterId =~ clusterId
| where DiskMetricName == 'used_percent'
| summarize AggValue = max(DiskMetricValue) by bin(TimeGenerated, trendBinSize)
| where AggValue >= 90
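The last two steps of the disk query (the maximum `used_percent` per time bin, then a filter at 90) can be sketched in Python. The sample values and the function name are hypothetical.

```python
def disks_over_threshold(samples, threshold=90.0):
    """samples: (time_bin, used_percent). Keep the bins whose peak usage is at or above the threshold."""
    peaks = {}
    for time_bin, used_percent in samples:
        # AggValue = max(DiskMetricValue) by bin(TimeGenerated, trendBinSize)
        peaks[time_bin] = max(used_percent, peaks.get(time_bin, used_percent))
    # where AggValue >= 90
    return {time_bin: peak for time_bin, peak in peaks.items() if peak >= threshold}

samples = [("12:00", 70.0), ("12:00", 93.5), ("12:01", 89.9), ("12:02", 90.0)]
over = disks_over_threshold(samples)
print(over)
```

Because the comparison is `>=`, a bin that peaks at exactly 90% is included.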
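Individual container restarts (number of results) alert for when the restart count of an individual system container exceeds a threshold in the last 10 minutes: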
let _threshold = 10m;
let _alertThreshold = 2;
let Timenow = (datetime(now) - _threshold);
let starttime = ago(5m);
KubePodInventory
| where TimeGenerated >= starttime
| where Namespace in ('default', 'kube-system') // the namespace filter goes here
| where ContainerRestartCount > _alertThreshold
| extend Tags = todynamic(ContainerLastStatus)
| extend startedAt = todynamic(Tags.startedAt)
| where startedAt >= Timenow
| summarize arg_max(TimeGenerated, *) by Name
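The filters in the restart query can be sketched in Python as follows. This is an illustrative model with hypothetical records and a hypothetical function name: it keeps recent records from the selected namespaces whose restart count exceeds the threshold and whose container started within the last 10 minutes, then keeps the latest record per pod name (the role of `arg_max`).

```python
from datetime import datetime, timedelta, timezone

def restart_alerts(records, now, restart_threshold=2,
                   started_window=timedelta(minutes=10),
                   record_lookback=timedelta(minutes=5),
                   namespaces=("default", "kube-system")):
    started_cutoff = now - started_window
    record_cutoff = now - record_lookback
    latest = {}
    for rec in records:
        if rec["TimeGenerated"] < record_cutoff:
            continue  # where TimeGenerated >= starttime
        if rec["Namespace"] not in namespaces:
            continue  # namespace filter
        if rec["ContainerRestartCount"] <= restart_threshold:
            continue  # where ContainerRestartCount > _alertThreshold
        if rec["startedAt"] < started_cutoff:
            continue  # where startedAt >= Timenow
        prev = latest.get(rec["Name"])
        if prev is None or rec["TimeGenerated"] > prev["TimeGenerated"]:
            latest[rec["Name"]] = rec  # arg_max(TimeGenerated, *) by Name
    return latest

def at(hour, minute):
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)

now = at(12, 0)
records = [
    {"Name": "pod-a", "Namespace": "kube-system", "ContainerRestartCount": 3,
     "startedAt": at(11, 55), "TimeGenerated": at(11, 58)},
    {"Name": "pod-a", "Namespace": "kube-system", "ContainerRestartCount": 4,
     "startedAt": at(11, 57), "TimeGenerated": at(11, 59)},
    {"Name": "pod-b", "Namespace": "default", "ContainerRestartCount": 1,
     "startedAt": at(11, 58), "TimeGenerated": at(11, 59)},
    {"Name": "pod-c", "Namespace": "apps", "ContainerRestartCount": 9,
     "startedAt": at(11, 58), "TimeGenerated": at(11, 59)},
    {"Name": "pod-d", "Namespace": "default", "ContainerRestartCount": 5,
     "startedAt": at(11, 30), "TimeGenerated": at(11, 59)},
]
alerts = restart_alerts(records, now)
print(sorted(alerts))
```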
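Next steps
- View log query examples to see predefined queries and examples that you can evaluate or customize for alerting, visualizing, or analyzing your clusters.
- To learn more about Azure Monitor and how to monitor other aspects of your Kubernetes cluster, see View Kubernetes cluster performance and View Kubernetes cluster health.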