Identify errors in AKS logs

Akeem Rajifuja 20 Reputation points
2024-05-28T23:06:36.67+00:00

I am trying to find KQL queries to identify errors in logs related to an AKS cluster. The tables I am working with are KubePodInventory, ContainerLog, StorageFileLogs, and AzureDiagnostics, which all point to a log analytics workspace. The errors I am looking for include "OOMKilled," "Error," "Unknown," and "CrashLoopBackOff." I understand that the queries might return blank if there's no row that meets the criteria, but I need help identifying the correct query to use so I can set up alerts. Here's an example of a query I've tried using:

KubePodInventory
| where ContainerLastStatus has_any ("Error", "Unknown", "OOMKilled", "CrashLoopBackOff", "DeadlineExceeded")


Accepted answer
  1. Prrudram-MSFT 28,281 Reputation points Moderator
    2024-05-29T09:53:53.0533333+00:00

    Hello @Akeem Rajifuja,

    Thank you for reaching out to the Microsoft Q&A platform.

    Your query looks good and should return results for the specified errors in the KubePodInventory table. However, you can also try the following queries to identify errors in logs related to an AKS cluster:
    To identify OOMKilled errors in the ContainerLog table:

    ContainerLog | where LogEntry contains "OOMKilled"

    To identify errors in the StorageFileLogs table (note that this table has no LogEntry column; failed operations are reported through the status columns):
    StorageFileLogs | where StatusText != "Success"

    To identify Unknown errors in the AzureDiagnostics table (for AKS, Category holds values such as kube-apiserver or kube-audit, and the log text is in the log_s column):
    AzureDiagnostics | where Category startswith "kube" | where log_s contains "Unknown"

    To identify CrashLoopBackOff errors in the KubePodInventory table (the reason string is in ContainerStatusReason; ContainerStatus only holds the state, such as waiting or running):
    KubePodInventory | where ContainerStatusReason contains "CrashLoopBackOff"

    You can also combine these queries into a single query that searches across all tables. Note that each table stores its message in a different column, so the filter must cover each column; rows from tables that lack a column simply won't match on it:
    union KubePodInventory, ContainerLog, StorageFileLogs, AzureDiagnostics
    | where ContainerLastStatus has_any ("Error", "Unknown", "OOMKilled", "CrashLoopBackOff", "DeadlineExceeded")
        or LogEntry has_any ("Error", "Unknown", "OOMKilled", "CrashLoopBackOff", "DeadlineExceeded")
        or log_s has_any ("Error", "Unknown", "OOMKilled", "CrashLoopBackOff", "DeadlineExceeded")
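    A simpler cross-table option is the search operator, which scans every string column of the listed tables for the given terms. This is only a sketch: search is convenient for exploration but can be slow on large workspaces, so prefer column-specific filters for alert rules:

    search in (KubePodInventory, ContainerLog, StorageFileLogs, AzureDiagnostics) "OOMKilled" or "CrashLoopBackOff" or "DeadlineExceeded"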

    References:

    https://learn.microsoft.com/en-us/azure/aks/monitor-aks-reference

    https://learn.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-log-query

    https://learn.microsoft.com/en-us/azure/azure-monitor/reference/tables/azurediagnostics

    https://learn.microsoft.com/en-us/azure/storage/files/storage-files-monitoring

    https://learn.microsoft.com/en-us/azure/azure-monitor/reference/tables/storagefilelogs

    https://learn.microsoft.com/en-us/azure/azure-monitor/logs/scope

    Hope this helps!

    If I have answered your query, please click "Accept as answer" as a token of appreciation

    1 person found this answer helpful.

1 additional answer

  1. AlaaBarqawi_MSFT 942 Reputation points Microsoft Employee
    2024-05-29T06:49:23.7233333+00:00

    Hi @Akeem Rajifuja, can you try this query for OOMKilled:

    KubePodInventory 
    | where PodStatus != "running"
    | extend ContainerLastStatusJSON = parse_json(ContainerLastStatus)
    | extend FinishedAt = todatetime(ContainerLastStatusJSON.finishedAt)
    | where ContainerLastStatusJSON.reason == "OOMKilled"
    | distinct PodUid, ControllerName, ContainerLastStatus, FinishedAt
    | order by FinishedAt asc
    

    CrashLoopBackOff / Error pods:

    //Determines whether Pods/Containers has Crash-Loop phase
    KubePodInventory
    | where ContainerStatus  == 'waiting' 
    | where ContainerStatusReason == 'CrashLoopBackOff' or ContainerStatusReason == 'Error'
    | extend ContainerLastStatus=todynamic(ContainerLastStatus)
    | summarize RestartCount = arg_max(ContainerRestartCount, Computer, Namespace, ContainerLastStatus.reason) by Name
    
    

    https://learn.microsoft.com/en-us/azure/azure-monitor/reference/queries/kubepodinventory

    Based on your data, you can filter for the required errors and set up alerts on top of them.
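
    For a log search alert, the query usually needs to return something the alert rule can compare against a threshold. A minimal sketch, assuming the default Container insights schema (the 15-minute window and the grouping columns are assumptions to adjust):

    // Count distinct failing containers per reason in the last 15 minutes;
    // alert when the count exceeds your chosen threshold
    KubePodInventory
    | where TimeGenerated > ago(15m)
    | where ContainerStatusReason in ("CrashLoopBackOff", "Error", "OOMKilled")
    | summarize FailingContainers = dcount(ContainerName) by ContainerStatusReason, Namespace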

    Pending Pods

    //Check Pods that cannot be started and its pending time
    KubePodInventory
    | where PodStatus == 'Pending'
    | project PodCreationTimeStamp, Namespace, PodStartTime, PodStatus, Name, ContainerStatus
    | summarize Start = any(PodCreationTimeStamp), arg_max(PodStartTime, Namespace) by Name
    | extend PodStartTime = iff(isnull(PodStartTime), now(), PodStartTime)
    | extend PendingTime = PodStartTime - Start
    | project Name, Namespace, PendingTime
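
    To alert only on pods that have been stuck in Pending beyond some grace period, the same idea can be reduced to a threshold check. A sketch; the 10-minute cutoff is an assumption to tune:

    // Pods still Pending whose creation time is older than the grace period
    KubePodInventory
    | where PodStatus == 'Pending'
    | summarize PendingSince = min(PodCreationTimeStamp) by Name, Namespace
    | where PendingSince < ago(10m)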
    
    1 person found this answer helpful.
