How to integrate cluster performance metrics into 3rd-party alerting/monitoring software (New Relic/Datadog/etc.)?

Kenneth Huddleston 145 Reputation points
2023-05-08T23:21:01.02+00:00

I would like to be able to export performance metrics from Azure Databricks to external tooling. Metrics include CPU utilization, network activity, etc. There are a number of open-source / 3rd party agents that attempt to address this problem, but they all seem to be experimental. I would be reluctant to use any of them in a production environment.

What 'supported' and 'reliable' options are available to export performance metrics from Azure Databricks?

  • Is there a way to get these metrics from the Databricks API?
  • Is there a safe way to incorporate these metrics into Log Analytics, Azure Monitor, or a SQL database?
  • Any other options?

I appreciate the support! Really looking to hear what the 'safest' or 'recommended' options are for pulling data like this.


Accepted answer
  1. PRADEEPCHEEKATLA-MSFT 82,671 Reputation points Microsoft Employee
    2023-05-10T02:39:35.7+00:00

    @Kenneth Huddleston - Thanks for the question and using MS Q&A platform.

    Azure Databricks provides several options to export performance metrics to external tooling. Here are some of the recommended options:

    Azure Monitor: You can use Azure Monitor to collect and analyze performance metrics from Azure Databricks. It provides a centralized platform for monitoring and alerting across your entire Azure environment, and it can route metrics such as CPU utilization, network activity, and memory usage from Azure Databricks clusters to external tooling such as Log Analytics or a SQL database.
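    One concrete way to wire Azure Databricks into Azure Monitor/Log Analytics is a diagnostic setting on the workspace. The sketch below uses the `az` CLI; every resource ID is a placeholder, and the `clusters` log category is one example — check your workspace's available categories before relying on it. The command is built as a string so it can be reviewed before execution:

    ```shell
    #!/bin/bash
    # Sketch: route Azure Databricks diagnostic logs to a Log Analytics workspace.
    # All resource IDs below are placeholders -- substitute your own.
    # Built as a string first so it can be inspected before running.
    DIAG_CMD=$(cat <<'EOF'
    az monitor diagnostic-settings create \
      --name databricks-to-log-analytics \
      --resource "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Databricks/workspaces/<workspace>" \
      --workspace "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.OperationalInsights/workspaces/<log-analytics>" \
      --logs '[{"category": "clusters", "enabled": true}]'
    EOF
    )
    echo "$DIAG_CMD"
    ```

    Note that diagnostic settings carry Databricks audit/diagnostic logs; VM-level counters for the worker nodes come from Azure Monitor on the underlying VMs themselves.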

    Databricks REST API: You can use the Databricks REST API to programmatically retrieve cluster information from Azure Databricks. The Clusters API exposes cluster state and lifecycle events, which you can poll and forward to external tooling such as New Relic or Datadog.
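    A minimal sketch of polling the Clusters API, assuming `DATABRICKS_HOST` (e.g. `https://adb-<id>.azuredatabricks.net`) and a personal access token in `DATABRICKS_TOKEN` are set in the environment:

    ```shell
    #!/bin/bash
    # Sketch: pull cluster metadata and events from the Databricks REST API.
    # DATABRICKS_HOST and DATABRICKS_TOKEN are assumed to be set.

    # List clusters in the workspace (id, state, node types, ...).
    list_clusters() {
      curl -sf -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
        "${DATABRICKS_HOST}/api/2.0/clusters/list"
    }

    # Fetch lifecycle events (resizes, terminations, ...) for one cluster.
    cluster_events() {
      curl -sf -X POST -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
        -d "{\"cluster_id\": \"$1\"}" \
        "${DATABRICKS_HOST}/api/2.0/clusters/events"
    }

    if [ -n "${DATABRICKS_HOST:-}" ] && [ -n "${DATABRICKS_TOKEN:-}" ]; then
      list_clusters
    else
      echo "Set DATABRICKS_HOST and DATABRICKS_TOKEN to call the API."
    fi
    ```

    A scheduled job could run functions like these and push the JSON into your telemetry platform's ingest API.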

    Databricks Monitoring: Databricks provides built-in monitoring capabilities that allow you to monitor the performance of your clusters in real-time. You can use the Databricks monitoring UI to view metrics such as CPU utilization, network activity, and memory usage. You can also configure alerts to notify you when performance metrics exceed certain thresholds.

    Databricks Metrics Export: Databricks provides a built-in metrics export feature that allows you to export performance metrics to external tooling such as Prometheus or Graphite. You can use the metrics export feature to export metrics such as CPU utilization, network activity, and memory usage.
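    One route to a Graphite-style sink is Spark's own metrics system, configured through the cluster's Spark config. The keys below are standard Apache Spark metrics properties (the host is a placeholder); treat this as a sketch rather than a turnkey setup, and verify sink support against your Databricks runtime:

    ```
    spark.metrics.conf.*.sink.graphite.class org.apache.spark.metrics.sink.GraphiteSink
    spark.metrics.conf.*.sink.graphite.host <graphite-host>
    spark.metrics.conf.*.sink.graphite.port 2003
    spark.metrics.conf.*.sink.graphite.period 10
    spark.metrics.conf.*.sink.graphite.unit seconds
    ```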

    In terms of the safest and recommended options, Azure Monitor is the best-supported choice: it is a reliable, centrally managed service for monitoring and alerting across your Azure environment. Additionally, the Databricks REST API is a flexible, customizable way to feed cluster data into external tooling.

    For more details, refer to the below links:

    Monitoring Azure Databricks

    Monitor Model Serving endpoints with Prometheus and Datadog

    Monitor Databricks with Datadog

    Hope this helps. Do let us know if you have any further queries.



    1 person found this answer helpful.

1 additional answer

  1. Kenneth Huddleston 145 Reputation points
    2023-09-20T14:50:48.85+00:00

    Thanks for the help!

    My solution was to use an init script (bash) together with an infrastructure agent from my telemetry provider (New Relic, in this case). The script runs on cluster creation and installs the New Relic infrastructure agent, which then harvests utilization metrics and sends them to New Relic. I imagine similar infrastructure agents exist for Datadog and other telemetry platforms.

    An added benefit is that the infrastructure agent also lets me harvest logs from the nodes on which it runs.
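    A hedged sketch of such a cluster-scoped init script, based on New Relic's published Linux install steps (repo URL, distro codename, and package name may change — verify against New Relic's current docs), assuming `NEW_RELIC_LICENSE_KEY` is supplied as a cluster environment variable or secret:

    ```shell
    #!/bin/bash
    # Sketch: init script that installs the New Relic infrastructure agent
    # on each cluster node. Verify repo URL/codename against current docs.

    install_newrelic() {
      # Add New Relic's apt repo and signing key (Ubuntu-based runtime assumed).
      curl -fsSL https://download.newrelic.com/infrastructure_agent/gpg/newrelic-infra.gpg \
        | gpg --dearmor -o /etc/apt/trusted.gpg.d/newrelic-infra.gpg
      echo "deb https://download.newrelic.com/infrastructure_agent/linux/apt jammy main" \
        > /etc/apt/sources.list.d/newrelic-infra.list
      apt-get update -y && apt-get install -y newrelic-infra
      # Point the agent at the account; it then harvests host metrics and logs.
      echo "license_key: ${NEW_RELIC_LICENSE_KEY}" > /etc/newrelic-infra.yml
      systemctl restart newrelic-infra || service newrelic-infra restart
    }

    if [ -n "${NEW_RELIC_LICENSE_KEY:-}" ]; then
      install_newrelic
    else
      echo "NEW_RELIC_LICENSE_KEY not set; skipping agent install."
    fi
    ```

    Attach the script to the cluster as an init script so every node installs the agent at startup.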
