Exercise - List recently active virtual machines that stopped sending logs

Here, you'll write KQL queries to retrieve and transform data from the Heartbeat table to obtain insights about the status of machines in your environment.

1. Set goals

Your first log analysis goal is to make sure you're getting data about all active virtual machines in your network. To keep full visibility, you want to identify machines that stop sending data.

To determine which machines have stopped sending data, you need information about:

  • All machines that have recently logged data, but haven't logged data as expected in the past few minutes.
  • The virtual machine agent running on each machine, which is useful for deeper analysis.

2. Assess logs

Azure Monitor uses Azure Monitor Agent to collect data about activities and operating system processes running inside virtual machines.

Note

Some of the older machines in your environment still use the legacy Log Analytics Windows and Linux agents, which Azure Monitor is deprecating.

Azure Monitor Agent and Log Analytics Agent send virtual machine health data to the Heartbeat table once a minute.

Let's run a simple take 10 query on the Heartbeat table to see the type of data each one of its columns holds:

Heartbeat
| take 10

The TimeGenerated, Computer, Category, and OSType columns all have data that's relevant to our analysis.

Screenshot that shows the results of a take 10 query on the Heartbeat table with the TimeGenerated, Computer, Category, and OSType columns highlighted.
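
To focus on the columns used in the rest of this exercise, a sketch like the following narrows the sample with the project operator. Including the Version column is a choice made here for illustration, because it becomes relevant in the challenge at the end of the exercise.

```kusto
Heartbeat // The table you're querying
| take 10 // Returns an arbitrary sample of 10 records
| project TimeGenerated, Computer, Category, OSType, Version // Keeps only the columns relevant to this analysis
```

Because take returns an arbitrary sample, the rows you see can differ from run to run.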

Now let's assess how we can use this data and which KQL operations can help extract and transform the data:

Column: TimeGenerated
Description: Indicates when the virtual machine generated each log.
Analysis goal:
  • Identify recently active machines.
  • Find the last log generated for each machine and check whether it was generated in the last few minutes.
Related KQL operations:
  • where TimeGenerated > ago(48h)
  • summarize max(TimeGenerated)
  • where max_TimeGenerated < ago(5m)
  For more information, see where operator, summarize operator, ago(), and max() (aggregation function).

Column: Computer
Description: Unique identifier of the machine.
Analysis goal:
  • Summarize results by machine.
  • Group machines by distinct agent versions.
Related KQL operations:
  • summarize by Computer
  • summarize ComputersList=make_set(Computer)
  For more information, see summarize operator and make_set() (aggregation function).

Column: Category
Description: The agent type:
  • Azure Monitor Agent, or
  • Direct Agent, which represents the Log Analytics agents. The Log Analytics agent for Windows is also called MMA. The Log Analytics agent for Linux is also called OMS.
Analysis goal: Identify the agent running on the machine.
Related KQL operations: To simplify the results and facilitate further analysis, such as filtering:
  • Rename the column to AgentType (AgentType=Category).
  • Change the Direct Agent value to MMA for Windows machines (AgentType = iif(AgentType == "Direct Agent" and OSType == "Windows", "MMA", AgentType)).
  • Change the Direct Agent value to OMS for Linux machines (AgentType = iif(AgentType == "Direct Agent" and OSType == "Linux", "OMS", AgentType)).
  For more information, see iff() and == (equals) operator. The iif() function used here is equivalent to iff().

Column: OSType
Description: The type of operating system running on the virtual machine.
Analysis goal: Identify agent type for Log Analytics agents, which are different for Windows and Linux.
Related KQL operations: summarize by... OSType
  For more information, see summarize operator.

Column: Version
Description: The version number of the agent monitoring the virtual machine.
Analysis goal: Identify the agent version on each machine.
Related KQL operations: Rename the column to AgentVersion (AgentVersion=Version).
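
The agent-type relabeling described for the Category column can be tried without touching workspace data. The following is a minimal sketch that uses a datatable with made-up rows (the computer names are hypothetical) and the case() function as a single-step alternative to the two iif() calls; it is not the form used in the rest of this exercise.

```kusto
// Hypothetical sample rows; the computer names are invented for illustration
datatable(Computer:string, Category:string, OSType:string)
[
    "vm-win-01", "Direct Agent", "Windows",
    "vm-lnx-01", "Direct Agent", "Linux",
    "vm-win-02", "Azure Monitor Agent", "Windows"
]
| extend AgentType = case(
    Category == "Direct Agent" and OSType == "Windows", "MMA", // Log Analytics agent on Windows
    Category == "Direct Agent" and OSType == "Linux", "OMS",   // Log Analytics agent on Linux
    Category)                                                  // Any other value is kept unchanged
| project Computer, AgentType, OSType
```

With these sample rows, the two Direct Agent machines are labeled MMA and OMS respectively, and the third row keeps the value Azure Monitor Agent.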

3. Write your query

Write a query that lists the machines that have been active in the past 48 hours, but haven't logged data to the Heartbeat table in the last five minutes.

  1. Retrieve all logs from the past 48 hours:

    Heartbeat // The table you’re querying
    | where TimeGenerated >ago(48h) // Time range for the query - in this case, logs generated in the past 48 hours
    

    The result set of this query includes logs from all of the machines that sent log data in the past 48 hours. These results likely include numerous logs for each active machine.

    Screenshot that shows the results of a query on the Heartbeat table for all records generated in the past 48 hours.

    To understand which machines haven't recently sent logs, you only need the last log each machine sent.

  2. Find the last log generated by each machine and summarize by computer, agent type, and operating system:

    Heartbeat // The table you’re querying
    | where TimeGenerated >ago(48h) // Time range for the query - in this case, logs generated in the past 48 hours
    | summarize max(TimeGenerated) by Computer, AgentType=Category, OSType // Retrieves the last record generated by each computer and provides information about computer, agent type, and operating system
    

    You now have one log from each machine that logged data in the past 48 hours - the last log each machine sent.

    In the summarize line, you've renamed the Category column to AgentType, which better describes the information you're looking at in the column as part of this analysis.

    Screenshot that shows the results of a query for the last log generated by each machine.

  3. To see which machines haven't sent logs in the last five minutes, filter out every machine whose most recent log was generated in the last five minutes:

    Heartbeat // The table you’re querying
    | where TimeGenerated >ago(48h) // Time range for the query - in this case, logs generated in the past 48 hours
    | summarize max(TimeGenerated) by Computer, AgentType=Category, OSType // Retrieves the last record generated by each computer and provides information about computer, agent type, and operating system
    | where max_TimeGenerated < ago(5m) // Filters away all records generated in the last five minutes
    

    The result set of this query includes the last log generated by all machines that logged data in the past 48 hours, but doesn't include logs generated in the past five minutes. In other words, any machine that logged data in the last five minutes isn't included in the result set.

    Screenshot that shows the results of a query that filters away all records generated in the last five minutes.

    You now have the data you're looking for: a list of all machines that logged data in the last 48 hours, but haven't been logging data as expected in the last five minutes. The result set consists of the set of computers you want to investigate further.

  4. Manipulate the query results to present the information more clearly.

    For example, you can organize the logs by time generated - from the oldest to the newest - to see which computers have gone the longest time without logging data.

    The Direct Agent value in the AgentType column tells you that the Log Analytics Agent is running on the machine. Since the Log Analytics Agent for Windows is also called MMA and the agent for Linux is also called OMS, renaming the Direct Agent value to MMA for Windows machines and OMS for Linux machines simplifies the results and facilitates further analysis, such as filtering.

    Heartbeat // The table you’re querying
    | where TimeGenerated > ago(48h) // Time range for the query - in this case, logs generated in the past 48 hours
    | summarize max(TimeGenerated) by Computer, AgentType=Category, OSType // Retrieves the last record generated by each computer and provides information about computer, agent type, and operating system
    | where max_TimeGenerated < ago(5m) // Filters away all records generated in the last five minutes
    | extend AgentType = iif(AgentType == "Direct Agent" and OSType == "Windows", "MMA", AgentType) // Changes the AgentType value from "Direct Agent" to "MMA" for Windows machines
    | extend AgentType = iif(AgentType == "Direct Agent" and OSType == "Linux", "OMS", AgentType) // Changes the AgentType value from "Direct Agent" to "OMS" for Linux machines
    | order by max_TimeGenerated asc // Sorts results by max_TimeGenerated from oldest to newest
    | project-reorder max_TimeGenerated, Computer, AgentType, OSType // Reorganizes the order of columns in the result set
    

    Tip

    Use max_TimeGenerated to correlate the last heartbeat of the machine that stopped reporting with machine logs or other environmental events that occurred around the same time. Correlating logs in this way can help in finding the root cause of the issue you are investigating.

    Screenshot that shows the results of a query that changes the AgentType values to MMA for Windows machines and to OMS for Linux machines.
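
The correlation idea in the tip can be sketched as a join between the machines that stopped reporting and another log table. The example below assumes your workspace collects Windows event logs into the Event table; both that table and the 10-minute window around the last heartbeat are assumptions chosen for illustration, so substitute whichever table and window suit your environment.

```kusto
// Machines that were active in the past 48 hours but silent for the last five minutes
let StaleMachines = Heartbeat
    | where TimeGenerated > ago(48h)
    | summarize LastHeartbeat = max(TimeGenerated) by Computer
    | where LastHeartbeat < ago(5m);
Event // Assumption: Windows event logs are collected in this workspace
| where TimeGenerated > ago(48h)
| join kind=inner StaleMachines on Computer // Keeps only events from machines that stopped reporting
| where TimeGenerated between ((LastHeartbeat - 10m) .. (LastHeartbeat + 10m)) // Events close to the last heartbeat
| project Computer, LastHeartbeat, TimeGenerated, EventLevelName, Source, RenderedDescription
| order by Computer asc, TimeGenerated asc
```

An empty result doesn't necessarily mean nothing happened. It can also mean the machine doesn't send data to the table you joined against.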

Challenge: Group machines by monitoring agent and agent version

Understanding which agents and agent versions are running on your machines can help you analyze the root cause of problems and identify which machines you need to update to a new agent or new agent version.

Can you think of a couple of quick tweaks you can make to the query you developed above to get this information?

Consider this:

  • Which additional information do you need to extract from your logs?
  • Which KQL operation can you use to group machines by the agent version they're running?

Solution:

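  1. Copy the query you developed above without the where max_TimeGenerated < ago(5m), order by, and project-reorder lines, so that all recently active machines are included. Then add the Version column to the summarize line of the query to extract agent version information: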

    Heartbeat // The table you’re querying
    | where TimeGenerated > ago(48h) // Time range for the query - in this case, logs generated in the past 48 hours
    | summarize max(TimeGenerated) by Computer, AgentType=Category, OSType, Version // Retrieves the last record generated by each computer and provides information about computer, agent type, operating system, and agent version
    | extend AgentType = iif(AgentType == "Direct Agent" and OSType == "Windows", "MMA", AgentType) // Changes the AgentType value from "Direct Agent" to "MMA" for Windows machines
    | extend AgentType = iif(AgentType == "Direct Agent" and OSType == "Linux", "OMS", AgentType) // Changes the AgentType value from "Direct Agent" to "OMS" for Linux machines
    

    Screenshot that shows the results of the query we've built up in this exercise, without the five-minute filter and with the Version column added to the summarize line to add agent version information to the results.

  2. Rename the Version column to AgentVersion for clarity. Then add another summarize line that finds unique combinations of agent type, agent version, and operating system type, and uses the KQL make_set() aggregation function to list all computers running each combination:

    Heartbeat // The table you’re querying
    | where TimeGenerated > ago(48h) // Time range for the query - in this case, logs generated in the past 48 hours
    | summarize max(TimeGenerated) by Computer, AgentType=Category, OSType, Version // Retrieves the last record generated by each computer and provides information about computer, agent type, operating system, and agent version
    | extend AgentType = iif(AgentType == "Direct Agent" and OSType == "Windows", "MMA", AgentType) // Changes the AgentType value from "Direct Agent" to "MMA" for Windows machines
    | extend AgentType = iif(AgentType == "Direct Agent" and OSType == "Linux", "OMS", AgentType) // Changes the AgentType value from "Direct Agent" to "OMS" for Linux machines
    | summarize ComputersList=make_set(Computer) by AgentVersion=Version, AgentType, OSType // Summarizes the result set by unique combination of agent type, agent version, and operating system, and lists the set of all machines running the specific agent version
    

    You now have the data you're looking for: a list of unique combinations of agent type and agent version and the set of all recently active machines that are running a specific version of each agent.

    Screenshot that shows the results of a query that creates a list of all machines running each unique combination of agent type, agent version, and operating system.
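
If you also want to see at a glance how many machines run each combination, one possible extension is to count the entries in each set and sort by that count. The ComputerCount column name is introduced here for illustration and isn't part of the exercise.

```kusto
Heartbeat // The table you're querying
| where TimeGenerated > ago(48h) // Logs generated in the past 48 hours
| summarize max(TimeGenerated) by Computer, AgentType = Category, OSType, Version
| extend AgentType = iif(AgentType == "Direct Agent" and OSType == "Windows", "MMA", AgentType)
| extend AgentType = iif(AgentType == "Direct Agent" and OSType == "Linux", "OMS", AgentType)
| summarize ComputersList = make_set(Computer) by AgentVersion = Version, AgentType, OSType
| extend ComputerCount = array_length(ComputersList) // Number of distinct machines in each set
| order by ComputerCount desc // Most common combinations first
```

Sorting by count puts the most widely deployed agent versions at the top, which can help when deciding which upgrades to prioritize.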