Create log search alerts from Container insights

10/15/2024

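Container insights monitors the performance of container workloads that are deployed to managed or self-managed Kubernetes clusters. To alert on what matters, this article describes how to create log-based alerts for the following situations with Azure Kubernetes Service (AKS) clusters:

When CPU or memory utilization on cluster nodes exceeds a threshold
When CPU or memory utilization on any container within a controller exceeds a threshold, as compared to the limit that's set on the corresponding resource
NotReady status node counts
Failed, Pending, Unknown, Running, or Succeeded pod-phase counts
When free disk space on cluster nodes falls below a threshold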

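To alert on high CPU or memory utilization, or on low free disk space on cluster nodes, use the queries provided in this article to create a metric alert or a metric measurement alert. Metric alerts have lower latency than log search alerts, but log search alerts support advanced queries and more sophisticated logic. Log search alert queries compare a datetime with the present by using the now operator and going back one hour. (Container insights stores all dates in Coordinated Universal Time (UTC) format.)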
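
As a minimal sketch of that one-hour lookback, the following query bounds a table by now() and ago(1h). It reuses the KubeNodeInventory table and the endDateTime and startDateTime names from the queries later in this article. The output column names (Records, Oldest, Newest) are illustrative assumptions.

```kusto
// Look back one hour from the current time. All timestamps are in UTC.
let endDateTime = now();
let startDateTime = ago(1h);
KubeNodeInventory
| where TimeGenerated >= startDateTime and TimeGenerated < endDateTime
// Summarize the window to confirm that data falls inside it.
| summarize Records = count(), Oldest = min(TimeGenerated), Newest = max(TimeGenerated)
```

If Oldest and Newest come back empty, no node inventory records were collected during the last hour, and the alert queries that use the same window would also return nothing.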

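Important

The queries in this article depend on data that's collected by Container insights and stored in a Log Analytics workspace. If you've modified the default data collection settings, the queries might not return the expected results. In particular, if you disabled collection of performance data after enabling Prometheus metrics for the cluster, queries that use the Perf table won't return results.

For preset configurations, including disabling performance data collection, see "Configure data collection in Container insights using data collection rule." For further data collection options, see "Configure data collection in Container insights using ConfigMap."

If you aren't familiar with Azure Monitor alerts, see the overview of alerts in Microsoft Azure before you start. To learn more about alerts that use log queries, see the article on log search alerts in Azure Monitor. For more about metric alerts, see the article on metric alerts in Azure Monitor.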

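Log query measurements

Log search alerts can measure two different things, which can be used to monitor virtual machines in different scenarios:

Result count: Counts the number of rows returned by the query. It can be used to work with events such as Windows event logs, Syslog, and application exceptions.
Calculation of a value: Makes a calculation based on a numeric column. It can be used to include any number of resources. An example is CPU percentage.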
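
The two measurement types can be sketched with tables that the queries in this article already use. These are illustrative examples only, not queries from the alerting samples: the 5-minute bin size and the choice of columns are assumptions. Run each query separately.

```kusto
// Result count: every returned row is one pod record in the Failed phase.
// An alert rule would compare the number of rows with a threshold.
KubePodInventory
| where TimeGenerated >= ago(1h)
| where PodStatus =~ 'Failed'

// Calculation of a value: aggregate a numeric column into AggValue.
// An alert rule would compare AggValue with a threshold.
Perf
| where TimeGenerated >= ago(1h)
| where ObjectName == 'K8SNode' and CounterName == 'cpuUsageNanoCores'
| summarize AggValue = avg(CounterValue) by bin(TimeGenerated, 5m), Computer
```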

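Target resource and dimensions

Dimensions let you use a single rule to monitor values for multiple instances. For example, use dimensions if you want to monitor CPU usage on multiple instances that run your website or app, and create an alert when CPU usage exceeds 80%.

To create resource-centric alerts at scale for a subscription or resource group, you can use Split by dimensions. When you monitor the same condition on multiple Azure resources, splitting by dimensions groups unique combinations of numeric or string column values and breaks the alert into a separate alert for each combination. Splitting on the Azure resource ID column makes the specified resource the alert target.

You might also decide not to split when you want a condition that spans multiple resources in the scope. For example, you might want to create an alert if at least five machines in the resource group scope have CPU usage over 80%.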
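
The difference between splitting and not splitting can be sketched as follows. The query is a simplified assumption that averages node CPU over the whole hour instead of per minute. The 80% and five-node figures come from the examples above, and NodesOver80 is a hypothetical column name.

```kusto
// Per-node CPU percentage over the last hour.
// Computer can serve as the dimension to split by, giving one alert per node.
let capacity = Perf
    | where TimeGenerated >= ago(1h)
    | where ObjectName == 'K8SNode' and CounterName == 'cpuCapacityNanoCores'
    | summarize LimitValue = max(CounterValue) by Computer;
Perf
| where TimeGenerated >= ago(1h)
| where ObjectName == 'K8SNode' and CounterName == 'cpuUsageNanoCores'
| summarize UsageValue = avg(CounterValue) by Computer
| join kind=inner (capacity) on Computer
| project Computer, AggValue = UsageValue * 100.0 / LimitValue
// Without splitting: uncomment the next two lines to return a single count of
// nodes above 80%, and compare that count with a threshold such as 5.
// | where AggValue > 80
// | summarize NodesOver80 = dcount(Computer)
```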

You might want to see a list of alerts together with the affected computers. A custom workbook that queries Azure Resource Graph as its data source can provide this view.
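
A minimal sketch of such a workbook query follows. It's an assumption about what a suitable query could look like, not a query taken from this article. It reads the alertsmanagementresources table in Azure Resource Graph, and the fields under properties.essentials should be verified against the alerts in your own environment.

```kusto
// Azure Resource Graph query. Run it from a workbook that uses the
// Azure Resource Graph data source, not from a Log Analytics workspace.
alertsmanagementresources
| where type =~ 'microsoft.alertsmanagement/alerts'
| extend essentials = properties.essentials
| project AlertName = name,
          Severity = tostring(essentials.severity),
          MonitorCondition = tostring(essentials.monitorCondition),
          TargetResource = tostring(essentials.targetResource),
          StartTime = todatetime(essentials.startDateTime)
| order by StartTime desc
```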

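Create a log search alert rule

To create a log search alert rule by using the portal, see the log search alert example, which provides a complete walkthrough. You can follow the same process to create alert rules for AKS clusters by using queries similar to the ones in this article.

To create a query alert rule by using an Azure Resource Manager (ARM) template, see "Resource Manager template samples for log search alert rules in Azure Monitor." You can follow the same process to create ARM templates for the log queries in this article.

Resource utilization

Average CPU utilization, as an average of the member nodes' CPU utilization every minute (metric measurement):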

let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'cpuCapacityNanoCores';
let usageCounterName = 'cpuUsageNanoCores';
KubeNodeInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
// cluster filter would go here if multiple clusters are reporting to the same Log Analytics workspace
| distinct ClusterName, Computer
| join hint.strategy=shuffle (
  Perf
  | where TimeGenerated < endDateTime
  | where TimeGenerated >= startDateTime
  | where ObjectName == 'K8SNode'
  | where CounterName == capacityCounterName
  | summarize LimitValue = max(CounterValue) by Computer, CounterName, bin(TimeGenerated, trendBinSize)
  | project Computer, CapacityStartTime = TimeGenerated, CapacityEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer
| join kind=inner hint.strategy=shuffle (
  Perf
  | where TimeGenerated < endDateTime + trendBinSize
  | where TimeGenerated >= startDateTime - trendBinSize
  | where ObjectName == 'K8SNode'
  | where CounterName == usageCounterName
  | project Computer, UsageValue = CounterValue, TimeGenerated
) on Computer
| where TimeGenerated >= CapacityStartTime and TimeGenerated < CapacityEndTime
| project ClusterName, Computer, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize), ClusterName

Average memory utilization, as an average of the member nodes' memory utilization every minute (metric measurement):

let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'memoryCapacityBytes';
let usageCounterName = 'memoryRssBytes';
KubeNodeInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
// cluster filter would go here if multiple clusters are reporting to the same Log Analytics workspace
| distinct ClusterName, Computer
| join hint.strategy=shuffle (
  Perf
  | where TimeGenerated < endDateTime
  | where TimeGenerated >= startDateTime
  | where ObjectName == 'K8SNode'
  | where CounterName == capacityCounterName
  | summarize LimitValue = max(CounterValue) by Computer, CounterName, bin(TimeGenerated, trendBinSize)
  | project Computer, CapacityStartTime = TimeGenerated, CapacityEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer
| join kind=inner hint.strategy=shuffle (
  Perf
  | where TimeGenerated < endDateTime + trendBinSize
  | where TimeGenerated >= startDateTime - trendBinSize
  | where ObjectName == 'K8SNode'
  | where CounterName == usageCounterName
  | project Computer, UsageValue = CounterValue, TimeGenerated
) on Computer
| where TimeGenerated >= CapacityStartTime and TimeGenerated < CapacityEndTime
| project ClusterName, Computer, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize), ClusterName

Important

The following queries use the placeholder values <your-cluster-name> and <your-controller-name> to represent your cluster and controller. Replace them with values specific to your environment when you set up alerts.
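
If you don't know which values to use for the placeholders, a query along the following lines can list the candidates. This is an illustrative sketch that assumes the KubePodInventory columns used by the queries in this article.

```kusto
// List the cluster and controller names reported during the last hour.
KubePodInventory
| where TimeGenerated >= ago(1h)
| where isnotempty(ControllerName)
| distinct ClusterName, ControllerName
| order by ClusterName asc, ControllerName asc
```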

Average CPU utilization of all containers in a controller, as an average of the CPU utilization of every container instance in the controller every minute (metric measurement):

let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'cpuLimitNanoCores';
let usageCounterName = 'cpuUsageNanoCores';
let clusterName = '<your-cluster-name>';
let controllerName = '<your-controller-name>';
KubePodInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where ClusterName == clusterName
| where ControllerName == controllerName
| extend InstanceName = strcat(ClusterId, '/', ContainerName),
         ContainerName = strcat(controllerName, '/', tostring(split(ContainerName, '/')[1]))
| distinct Computer, InstanceName, ContainerName
| join hint.strategy=shuffle (
    Perf
    | where TimeGenerated < endDateTime
    | where TimeGenerated >= startDateTime
    | where ObjectName == 'K8SContainer'
    | where CounterName == capacityCounterName
    | summarize LimitValue = max(CounterValue) by Computer, InstanceName, bin(TimeGenerated, trendBinSize)
    | project Computer, InstanceName, LimitStartTime = TimeGenerated, LimitEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer, InstanceName
| join kind=inner hint.strategy=shuffle (
    Perf
    | where TimeGenerated < endDateTime + trendBinSize
    | where TimeGenerated >= startDateTime - trendBinSize
    | where ObjectName == 'K8SContainer'
    | where CounterName == usageCounterName
    | project Computer, InstanceName, UsageValue = CounterValue, TimeGenerated
) on Computer, InstanceName
| where TimeGenerated >= LimitStartTime and TimeGenerated < LimitEndTime
| project Computer, ContainerName, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize) , ContainerName

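Average memory utilization of all containers in a controller, as an average of the memory utilization of every container instance in the controller every minute (metric measurement):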

let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'memoryLimitBytes';
let usageCounterName = 'memoryRssBytes';
let clusterName = '<your-cluster-name>';
let controllerName = '<your-controller-name>';
KubePodInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where ClusterName == clusterName
| where ControllerName == controllerName
| extend InstanceName = strcat(ClusterId, '/', ContainerName),
         ContainerName = strcat(controllerName, '/', tostring(split(ContainerName, '/')[1]))
| distinct Computer, InstanceName, ContainerName
| join hint.strategy=shuffle (
    Perf
    | where TimeGenerated < endDateTime
    | where TimeGenerated >= startDateTime
    | where ObjectName == 'K8SContainer'
    | where CounterName == capacityCounterName
    | summarize LimitValue = max(CounterValue) by Computer, InstanceName, bin(TimeGenerated, trendBinSize)
    | project Computer, InstanceName, LimitStartTime = TimeGenerated, LimitEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer, InstanceName
| join kind=inner hint.strategy=shuffle (
    Perf
    | where TimeGenerated < endDateTime + trendBinSize
    | where TimeGenerated >= startDateTime - trendBinSize
    | where ObjectName == 'K8SContainer'
    | where CounterName == usageCounterName
    | project Computer, InstanceName, UsageValue = CounterValue, TimeGenerated
) on Computer, InstanceName
| where TimeGenerated >= LimitStartTime and TimeGenerated < LimitEndTime
| project Computer, ContainerName, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize) , ContainerName

Resource availability

Nodes and counts that have a status of Ready and NotReady (metric measurement):

let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let clusterName = '<your-cluster-name>';
KubeNodeInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| distinct ClusterName, Computer, TimeGenerated
| summarize ClusterSnapshotCount = count() by bin(TimeGenerated, trendBinSize), ClusterName, Computer
| join hint.strategy=broadcast kind=inner (
    KubeNodeInventory
    | where TimeGenerated < endDateTime
    | where TimeGenerated >= startDateTime
    | summarize TotalCount = count(), ReadyCount = sumif(1, Status contains ('Ready'))
                by ClusterName, Computer,  bin(TimeGenerated, trendBinSize)
    | extend NotReadyCount = TotalCount - ReadyCount
) on ClusterName, Computer, TimeGenerated
| project   TimeGenerated,
            ClusterName,
            Computer,
            ReadyCount = todouble(ReadyCount) / ClusterSnapshotCount,
            NotReadyCount = todouble(NotReadyCount) / ClusterSnapshotCount
| order by ClusterName asc, Computer asc, TimeGenerated desc

The following query returns pod phase counts based on all phases: Failed, Pending, Unknown, Running, or Succeeded.

let endDateTime = now(); 
let startDateTime = ago(1h);
let trendBinSize = 1m;
let clusterName = '<your-cluster-name>';
KubePodInventory
    | where TimeGenerated < endDateTime
    | where TimeGenerated >= startDateTime
    | where ClusterName == clusterName
    | distinct ClusterName, TimeGenerated
    | summarize ClusterSnapshotCount = count() by bin(TimeGenerated, trendBinSize), ClusterName
    | join hint.strategy=broadcast (
        KubePodInventory
        | where TimeGenerated < endDateTime
        | where TimeGenerated >= startDateTime
        | summarize PodStatus=any(PodStatus) by TimeGenerated, PodUid, ClusterName
        | summarize TotalCount = count(),
                    PendingCount = sumif(1, PodStatus =~ 'Pending'),
                    RunningCount = sumif(1, PodStatus =~ 'Running'),
                    SucceededCount = sumif(1, PodStatus =~ 'Succeeded'),
                    FailedCount = sumif(1, PodStatus =~ 'Failed')
                by ClusterName, bin(TimeGenerated, trendBinSize)
    ) on ClusterName, TimeGenerated
    | extend UnknownCount = TotalCount - PendingCount - RunningCount - SucceededCount - FailedCount
    | project TimeGenerated,
              TotalCount = todouble(TotalCount) / ClusterSnapshotCount,
              PendingCount = todouble(PendingCount) / ClusterSnapshotCount,
              RunningCount = todouble(RunningCount) / ClusterSnapshotCount,
              SucceededCount = todouble(SucceededCount) / ClusterSnapshotCount,
              FailedCount = todouble(FailedCount) / ClusterSnapshotCount,
              UnknownCount = todouble(UnknownCount) / ClusterSnapshotCount
| summarize AggValue = avg(PendingCount) by bin(TimeGenerated, trendBinSize)

Note

To alert on certain pod phases, such as Pending, Failed, or Unknown, modify the last line of the query. For example, to alert on FailedCount, use | summarize AggValue = avg(FailedCount) by bin(TimeGenerated, trendBinSize).

The following queries return cluster node disks whose used space exceeds 90%. To get the cluster ID, first run the following query and copy the value of the ClusterId property.

InsightsMetrics
| extend Tags = todynamic(Tags)            
| project ClusterId = Tags['container.azm.ms/clusterId']   
| distinct tostring(ClusterId)

Then replace <cluster-id> in the following query with the value you copied.

let clusterId = '<cluster-id>';
let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
InsightsMetrics
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where Origin == 'container.azm.ms/telegraf'            
| where Namespace == 'container.azm.ms/disk'            
| extend Tags = todynamic(Tags)            
| project TimeGenerated, ClusterId = Tags['container.azm.ms/clusterId'], Computer = tostring(Tags.hostName), Device = tostring(Tags.device), Path = tostring(Tags.path), DiskMetricName = Name, DiskMetricValue = Val   
| where ClusterId =~ clusterId       
| where DiskMetricName == 'used_percent'
| summarize AggValue = max(DiskMetricValue) by bin(TimeGenerated, trendBinSize)
| where AggValue >= 90

Individual container restarts (number of results): alerts when the restart count of an individual system container exceeds a threshold during the last 10 minutes.

let _threshold = 10m; 
let _alertThreshold = 2;
let Timenow = (datetime(now) - _threshold); 
let starttime = ago(5m); 
KubePodInventory
| where TimeGenerated >= starttime
| where Namespace in ('default', 'kube-system') // the namespace filter goes here
| where ContainerRestartCount > _alertThreshold
| extend Tags = todynamic(ContainerLastStatus)
| extend startedAt = todynamic(Tags.startedAt)
| where startedAt >= Timenow
| summarize arg_max(TimeGenerated, *) by Name

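Next steps

View the log query examples to see predefined queries and examples that you can evaluate or customize for alerting, visualizing, or analyzing your clusters.
To learn more about Azure Monitor and how to monitor other aspects of your Kubernetes cluster, see the articles on viewing Kubernetes cluster performance and viewing Kubernetes cluster health.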
