How to monitor cluster availability with Azure Monitor logs in HDInsight

HDInsight clusters include Azure Monitor logs integration, which provides queryable metrics and logs, as well as configurable alerts. This article shows how to use Azure Monitor to monitor your cluster.

Azure Monitor logs integration

Azure Monitor logs enable data generated by multiple resources, such as HDInsight clusters, to be collected and aggregated in one place to achieve a unified monitoring experience.

As a prerequisite, you'll need a Log Analytics Workspace to store the collected data. If you haven't already created one, you can follow instructions here: Create a Log Analytics Workspace.

Enable HDInsight Azure Monitor logs integration

From the HDInsight cluster resource page in the portal, select Azure Monitor. Then select Enable and choose your Log Analytics workspace from the drop-down.

HDInsight Operations Management Suite.

By default, this installs the OMS agent on all cluster nodes except edge nodes. Because edge nodes have no OMS agent, Log Analytics contains no telemetry from edge nodes by default.

Query metrics and logs tables

Once Azure Monitor logs integration is enabled (this may take a few minutes), navigate to your Log Analytics Workspace resource and select Logs.

Log Analytics workspace logs.

The Logs view lists a number of sample queries, such as:

Query name | Description
Computers availability today | Chart the number of computers sending logs, each hour
List heartbeats | List all computer heartbeats from the last hour
Last heartbeat of each computer | Show the last heartbeat sent by each computer
Unavailable computers | List all known computers that didn't send a heartbeat in the last 5 hours
Availability rate | Calculate the availability rate of each connected computer

As an example, run the Availability rate sample query by selecting Run on that query, as shown in the screenshot above. This will show the availability rate of each node in your cluster as a percentage. If you have enabled multiple HDInsight clusters to send metrics to the same Log Analytics workspace, you'll see the availability rate for all nodes (excluding edge nodes) in those clusters displayed.

Log Analytics workspace logs 'availability rate' sample query.

Note

Availability rate is measured over a 24-hour period, so your cluster will need to run for at least 24 hours before you see accurate availability rates.
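To make the 24-hour window concrete, here is a minimal Python sketch of how such a rate could be computed from raw heartbeats. It assumes the rate is the share of hourly buckets in the last 24 hours that contain at least one heartbeat; the sample query's exact definition may differ. The `availability_rates` function and the node names are hypothetical and are not part of any Azure SDK.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def availability_rates(heartbeats, now, window_hours=24):
    """Return {computer: percent of hourly buckets in the window with a heartbeat}."""
    start = now - timedelta(hours=window_hours)
    seen_hours = defaultdict(set)
    for computer, ts in heartbeats:
        if start <= ts < now:
            # Index of the hourly bucket this heartbeat falls into.
            bucket = (ts - start) // timedelta(hours=1)
            seen_hours[computer].add(bucket)
    return {c: 100.0 * len(b) / window_hours for c, b in seen_hours.items()}


now = datetime(2024, 1, 2, 0, 0, tzinfo=timezone.utc)
heartbeats = []
# Hypothetical head node that reported in every hour of the last day.
for h in range(24):
    heartbeats.append(("hn0-example", now - timedelta(hours=h, minutes=30)))
# Hypothetical worker node that was only started 6 hours ago.
for h in range(6):
    heartbeats.append(("wn0-example", now - timedelta(hours=h, minutes=30)))

rates = availability_rates(heartbeats, now)
```

Under this assumption, the node that has been running for only 6 hours reports a rate of 25% even though it never missed a heartbeat, which is why the rates are only meaningful after a full day.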

You can pin this table to a shared dashboard by clicking Pin in the upper-right corner. If you don't have any writable shared dashboards, you can see how to create one here: Create and share dashboards in the Azure portal.

Azure Monitor alerts

You can also set up Azure Monitor alerts that trigger when the value of a metric or the results of a query meet certain conditions. As an example, let's create an alert that sends an email when one or more nodes haven't sent a heartbeat in 5 hours (that is, they are presumed to be unavailable).

From Logs, run the Unavailable computers sample query by selecting Run on that query, as shown below.

Log Analytics workspace logs 'unavailable computers' sample.
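The idea behind the Unavailable computers query can be sketched in a few lines of Python: find the most recent heartbeat per computer and keep only the computers whose last heartbeat is older than 5 hours. This is an illustration of the logic as described in the table above, not the query itself; the `unavailable_computers` function and the node names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def unavailable_computers(heartbeats, now, max_silence=timedelta(hours=5)):
    """Return computers whose most recent heartbeat is older than max_silence."""
    last_seen = {}
    for computer, ts in heartbeats:
        # Keep only the latest heartbeat timestamp for each computer.
        if computer not in last_seen or ts > last_seen[computer]:
            last_seen[computer] = ts
    return sorted(c for c, ts in last_seen.items() if now - ts > max_silence)


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
heartbeats = [
    ("hn0-example", now - timedelta(minutes=1)),
    ("wn0-example", now - timedelta(hours=6)),
    ("wn0-example", now - timedelta(hours=7)),
    ("wn1-example", now - timedelta(hours=8)),
    ("wn1-example", now - timedelta(minutes=2)),  # recovered recently
]

down = unavailable_computers(heartbeats, now)
```

Note that a computer must have sent at least one heartbeat to be "known"; a node that never reported does not appear in the result at all.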

If all nodes are available, this query should return zero results for now. Click New alert rule to begin configuring your alert for this query.

Log Analytics workspace new alert rule.

There are three components to an alert: the resource for which to create the rule (the Log Analytics workspace in this case), the condition to trigger the alert, and the action groups that determine what will happen when the alert is triggered. Click the condition title, as shown below, to finish configuring the signal logic.

Portal alert create rule condition.

This will open Configure signal logic.

Set the Alert logic section as follows:

Based on: Number of results, Condition: Greater than, Threshold: 0.

Since this query only returns unavailable nodes as results, if the number of results is ever greater than 0, the alert should fire.
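As a minimal sketch, the alert logic above amounts to comparing the number of rows returned by the query against the threshold. The condition labels in the mapping and the `alert_fires` function are assumptions made for illustration and do not come from an Azure API.

```python
import operator

# Assumed mapping from condition labels to comparisons (labels are illustrative).
CONDITIONS = {
    "Greater than": operator.gt,
    "Less than": operator.lt,
    "Equal to": operator.eq,
}


def alert_fires(result_count, condition="Greater than", threshold=0):
    """Return True when the number of query results satisfies the alert logic."""
    return CONDITIONS[condition](result_count, threshold)


all_healthy = alert_fires(0)    # query returned no unavailable nodes
one_node_down = alert_fires(1)  # query returned one unavailable node
```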

In the Evaluated based on section, set the period and frequency based on how often you want to check for unavailable nodes.

For this alert, set Period equal to Frequency. More information about period, frequency, and other alert parameters can be found here.
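One way to picture the relationship between the two settings is to lay out the time window that each evaluation examines. The sketch below assumes that every evaluation looks back exactly one period from the moment it runs and that evaluations run once per frequency interval; these are simplifying assumptions, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def evaluation_windows(start, end, period, frequency):
    """Return the (window_start, window_end) pair examined by each evaluation."""
    windows = []
    run_at = start + period  # first run once a full period of data exists
    while run_at <= end:
        windows.append((run_at - period, run_at))
        run_at += frequency
    return windows


def classify(windows):
    """Describe how consecutive evaluation windows relate to each other."""
    kinds = set()
    for (_, prev_end), (next_start, _) in zip(windows, windows[1:]):
        if next_start < prev_end:
            kinds.add("overlap")
        elif next_start > prev_end:
            kinds.add("gap")
    return kinds or {"contiguous"}


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
end = start + timedelta(hours=3)
hour = timedelta(hours=1)

equal = classify(evaluation_windows(start, end, period=hour, frequency=hour))
longer_period = classify(evaluation_windows(start, end, period=2 * hour, frequency=hour))
shorter_period = classify(evaluation_windows(start, end, period=hour / 2, frequency=hour))
```

Under these assumptions, matching the two values makes the windows tile the timeline exactly once, a longer period makes consecutive windows overlap, and a shorter period leaves stretches of time that no evaluation examines.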

Select Done when you're finished configuring the signal logic.

Alert rule configures signal logic.

If you don't already have an existing action group, click Create New under the Action Groups section.

Alert rule creates new action group.

This will open Add action group. Choose an Action group name, Short name, Subscription, and Resource group. Under the Actions section, choose an Action Name and select Email/SMS/Push/Voice as the Action Type.

Note

Besides Email/SMS/Push/Voice, an alert can trigger several other actions, such as an Azure Function, Logic App, Webhook, ITSM, and Automation Runbook. Learn More.

This will open Email/SMS/Push/Voice. Choose a Name for the recipient, check the Email box, and type an email address to which you want the alert sent. Select OK in Email/SMS/Push/Voice, then in Add action group to finish configuring your action group.

Alert rule creates add action group.

After these blades close, you should see your action group listed under the Action Groups section. Finally, complete the Alert Details section by typing an Alert Rule Name and Description and choosing a Severity. Click Create Alert Rule to finish.

Portal creates alert rule finish.
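The rule assembled in the steps above can be summarized as a small data model: a resource, a condition, and one or more action groups, plus the alert details. The classes below are a hypothetical illustration only and do not correspond to an Azure SDK or ARM schema; the period and frequency of 60 minutes, the severity, the names, and the email address are placeholder values chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Condition:
    query_name: str
    based_on: str
    operator: str
    threshold: int
    period_minutes: int      # placeholder value, choose your own
    frequency_minutes: int   # kept equal to the period for this alert


@dataclass
class EmailAction:
    name: str
    email: str


@dataclass
class ActionGroup:
    name: str
    short_name: str
    actions: list = field(default_factory=list)


@dataclass
class AlertRule:
    name: str
    description: str
    severity: int
    resource: str
    condition: Condition
    action_groups: list = field(default_factory=list)


rule = AlertRule(
    name="Unavailable nodes",
    description="One or more nodes sent no heartbeat in the last 5 hours",
    severity=1,  # placeholder severity
    resource="my-log-analytics-workspace",  # hypothetical workspace name
    condition=Condition(
        query_name="Unavailable computers",
        based_on="Number of results",
        operator="Greater than",
        threshold=0,
        period_minutes=60,
        frequency_minutes=60,
    ),
    action_groups=[
        ActionGroup(
            name="cluster-operators",
            short_name="ops",
            actions=[EmailAction(name="Notify operator", email="ops@example.com")],
        )
    ],
)
```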

Tip

The ability to specify Severity is a powerful tool when creating multiple alerts. For example, you could create one alert that raises a Warning (severity 2) if a single head node goes down and another alert that raises Critical (severity 0) in the unlikely event that both head nodes go down.
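The tiering described in this tip can be sketched as a simple mapping from the number of unavailable head nodes to a severity. The sketch assumes the Azure Monitor convention in which severity 0 is Critical and severity 2 is Warning; the `head_node_severity` function is hypothetical, and in practice each tier would be its own alert rule with its own query.

```python
# Azure Monitor severity levels, from most to least severe.
SEVERITY_NAMES = {
    0: "Critical",
    1: "Error",
    2: "Warning",
    3: "Informational",
    4: "Verbose",
}


def head_node_severity(head_nodes_down):
    """Return the severity to raise, or None when no alert is needed."""
    if head_nodes_down >= 2:
        return 0  # both head nodes are down
    if head_nodes_down == 1:
        return 2  # one head node is down, the other is still serving
    return None


one_down = SEVERITY_NAMES[head_node_severity(1)]
both_down = SEVERITY_NAMES[head_node_severity(2)]
```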

When the condition for this alert is met, the alert will fire and you'll receive an email with the alert details like this:

Azure Monitor alert email example.

You can also view all alerts that have fired, grouped by severity, by going to Alerts in your Log Analytics Workspace.

Log Analytics workspace alerts.

When you select a severity group (for example, severity 1, as highlighted above), you'll see records for all fired alerts of that severity, as shown below:

Screenshot showing Log Analytics workspace severity one alert.

Next steps