How to integrate cluster performance metrics into 3rd-party alerting/monitoring software (New Relic/Datadog/etc.)?

Kenneth Huddleston 145 Reputation points
2023-05-08T23:21:01.02+00:00

I would like to be able to export performance metrics from Azure Databricks to external tooling. Metrics include CPU utilization, network activity, etc. There are a number of open-source / 3rd party agents that attempt to address this problem, but they all seem to be experimental. I would be reluctant to use any of them in a production environment.

What 'supported' and 'reliable' options are available to export performance metrics from Azure Databricks?

  • Is there a way to get these metrics from the Databricks API?
  • Is there a safe way to incorporate these metrics into Log Analytics, Azure Monitor, or a SQL database?
  • Any other options?

I appreciate the support! Really looking to hear what the 'safest' or 'recommended' options are for pulling data like this.


Accepted answer
  1. PRADEEPCHEEKATLA-MSFT 82,671 Reputation points Microsoft Employee
    2023-05-10T02:39:35.7+00:00

    @Kenneth Huddleston - Thanks for the question and using MS Q&A platform.

    Azure Databricks provides several options to export performance metrics to external tooling. Here are some of the recommended options:

    Azure Monitor: You can use Azure Monitor to collect and analyze performance metrics from Azure Databricks. It provides a centralized platform for monitoring and alerting across your entire Azure environment, and it can route metrics such as CPU utilization, network activity, and memory usage from Azure Databricks clusters to external tooling such as Log Analytics or a SQL database.
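    One concrete way to wire Azure Databricks into Azure Monitor/Log Analytics is a diagnostic setting on the workspace. The sketch below uses the `az` CLI; every resource ID is a placeholder, and the `clusters` log category is one example — check your workspace's available categories before relying on it. The command is built as a string so it can be reviewed before execution:

    ```shell
    #!/bin/bash
    # Sketch: route Azure Databricks diagnostic logs to a Log Analytics workspace.
    # All resource IDs below are placeholders -- substitute your own.
    # Built as a string first so it can be inspected before running.
    DIAG_CMD=$(cat <<'EOF'
    az monitor diagnostic-settings create \
      --name databricks-to-log-analytics \
      --resource "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Databricks/workspaces/<workspace>" \
      --workspace "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.OperationalInsights/workspaces/<log-analytics>" \
      --logs '[{"category": "clusters", "enabled": true}]'
    EOF
    )
    echo "$DIAG_CMD"
    ```

    Note that diagnostic settings carry Databricks audit/diagnostic logs; VM-level counters for the worker nodes come from Azure Monitor on the underlying VMs themselves.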

    Databricks REST API: You can use the Databricks REST API to programmatically retrieve cluster information from Azure Databricks. The Clusters API exposes cluster state and lifecycle events, which you can poll and forward to external tooling such as New Relic or Datadog.
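    A minimal sketch of polling the Clusters API, assuming `DATABRICKS_HOST` (e.g. `https://adb-<id>.azuredatabricks.net`) and a personal access token in `DATABRICKS_TOKEN` are set in the environment:

    ```shell
    #!/bin/bash
    # Sketch: pull cluster metadata and events from the Databricks REST API.
    # DATABRICKS_HOST and DATABRICKS_TOKEN are assumed to be set.

    # List clusters in the workspace (id, state, node types, ...).
    list_clusters() {
      curl -sf -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
        "${DATABRICKS_HOST}/api/2.0/clusters/list"
    }

    # Fetch lifecycle events (resizes, terminations, ...) for one cluster.
    cluster_events() {
      curl -sf -X POST -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
        -d "{\"cluster_id\": \"$1\"}" \
        "${DATABRICKS_HOST}/api/2.0/clusters/events"
    }

    if [ -n "${DATABRICKS_HOST:-}" ] && [ -n "${DATABRICKS_TOKEN:-}" ]; then
      list_clusters
    else
      echo "Set DATABRICKS_HOST and DATABRICKS_TOKEN to call the API."
    fi
    ```

    A scheduled job could run functions like these and push the JSON into your telemetry platform's ingest API.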

    Databricks Monitoring: Databricks provides built-in monitoring capabilities that allow you to monitor the performance of your clusters in real-time. You can use the Databricks monitoring UI to view metrics such as CPU utilization, network activity, and memory usage. You can also configure alerts to notify you when performance metrics exceed certain thresholds.

    Databricks Metrics Export: Databricks provides a built-in metrics export feature that allows you to export performance metrics to external tooling such as Prometheus or Graphite. You can use the metrics export feature to export metrics such as CPU utilization, network activity, and memory usage.
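    One route to a Graphite-style sink is Spark's own metrics system, configured through the cluster's Spark config. The keys below are standard Apache Spark metrics properties (the host is a placeholder); treat this as a sketch rather than a turnkey setup, and verify sink support against your Databricks runtime:

    ```
    spark.metrics.conf.*.sink.graphite.class org.apache.spark.metrics.sink.GraphiteSink
    spark.metrics.conf.*.sink.graphite.host <graphite-host>
    spark.metrics.conf.*.sink.graphite.port 2003
    spark.metrics.conf.*.sink.graphite.period 10
    spark.metrics.conf.*.sink.graphite.unit seconds
    ```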

    In terms of the safest and recommended options, Azure Monitor is the best-supported choice: it is a reliable, centrally managed service for monitoring and alerting across your Azure environment. Additionally, the Databricks REST API is a flexible, customizable way to feed cluster data into external tooling.

    For more details, refer to the below links:

    Monitoring Azure Databricks

    Monitor Model Serving endpoints with Prometheus and Datadog

    Monitor Databricks with Datadog

    Hope this helps. Do let us know if you have any further queries.



    1 person found this answer helpful.

1 additional answer

  1. Kenneth Huddleston 145 Reputation points
    2023-09-20T14:50:48.85+00:00

    Thanks for the help!

    My solution was to use an init script (bash) together with an infrastructure agent from my telemetry provider (New Relic, in this case). The script runs on cluster creation and installs the New Relic infrastructure agent, which then harvests utilization metrics and sends them to New Relic. I imagine similar infrastructure agents exist for Datadog and other telemetry platforms.

    An added benefit is that the infrastructure agent also lets me harvest logs from the nodes on which it runs.
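    A hedged sketch of such a cluster-scoped init script, based on New Relic's published Linux install steps (repo URL, distro codename, and package name may change — verify against New Relic's current docs), assuming `NEW_RELIC_LICENSE_KEY` is supplied as a cluster environment variable or secret:

    ```shell
    #!/bin/bash
    # Sketch: init script that installs the New Relic infrastructure agent
    # on each cluster node. Verify repo URL/codename against current docs.

    install_newrelic() {
      # Add New Relic's apt repo and signing key (Ubuntu-based runtime assumed).
      curl -fsSL https://download.newrelic.com/infrastructure_agent/gpg/newrelic-infra.gpg \
        | gpg --dearmor -o /etc/apt/trusted.gpg.d/newrelic-infra.gpg
      echo "deb https://download.newrelic.com/infrastructure_agent/linux/apt jammy main" \
        > /etc/apt/sources.list.d/newrelic-infra.list
      apt-get update -y && apt-get install -y newrelic-infra
      # Point the agent at the account; it then harvests host metrics and logs.
      echo "license_key: ${NEW_RELIC_LICENSE_KEY}" > /etc/newrelic-infra.yml
      systemctl restart newrelic-infra || service newrelic-infra restart
    }

    if [ -n "${NEW_RELIC_LICENSE_KEY:-}" ]; then
      install_newrelic
    else
      echo "NEW_RELIC_LICENSE_KEY not set; skipping agent install."
    fi
    ```

    Attach the script to the cluster as an init script so every node installs the agent at startup.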
