Observability for Azure Databricks

Azure Databricks is a robust analytics service built on Apache Spark. It empowers the swift development and deployment of extensive big data analytics solutions. Effective monitoring and troubleshooting of performance issues are paramount for seamless operation of production workloads on Azure Databricks. This article delineates diverse approaches to observability, offering comprehensive insights into each aspect. For detailed information on specific articles, refer to the [For more information](#For more information) section.

The organizational structure of Databricks' observability categorization is logs, metrics, and traces.

Analyzing logs

Logs are text-based records of events that occur while an application is running. The following listing shows some of the logs gathered for Databricks.

Analyze cluster event logs

Cluster event logs provide insights into crucial events related to the lifecycle of a cluster. Events, triggered by user actions or automatically by Azure Databricks, contain information such as, timestamps, event types, and details specific to an event. These logs are instrumental in understanding and analyzing the overall operation of a cluster and its affect on running jobs.

  • What is captured: Cluster event logs capture events such as cluster creation, starting, restarting, terminating, and other lifecycle events. For a comprehensive list of supported event types, refer to the ClusterEventType.

  • How to access: Access cluster event logs from the Databricks workspace UI, specifically from the Event Log tab on the cluster detail page. Although exporting logs to Azure Monitor isn't supported by default, external access is possible through the Cluster Events REST API endpoint.

  • When to use: Cluster event logs are valuable for analyzing cluster lifecycle events, including terminations, resizing, and the success or failure of Init script execution.

Analyze cluster driver and worker logs

The logs from the Spark driver and worker nodes provide direct print and log statements from notebooks, jobs, and libraries used in Spark applications.

  • What is captured: Driver logs capture standard output, standard error, and Log4j logs. Spark worker logs, available in the Spark UI, offer insights into the log information and error messages printed by Spark developers during application execution.

  • How to access: Access driver logs from the Driver logs tab on the cluster details page. Configure Spark worker logs for delivery to DBFS using Cluster Log Delivery.

  • When to use: Use these logs when you need to access information and error messages printed by Spark developers during the execution of Spark applications.

Analyze cluster init script logs

Init script logs generate during the startup of each cluster node. Init scripts and shell scripts execute before the Spark driver or worker JVM starts, performing tasks like installing packages, modifying the JVM system classpath, and setting system properties.

  • Installing packages and libraries.
  • Modifying the JVM system classpath.
  • Setting system properties and environment variables.
  • Modifying Spark configuration parameters.

The logs generated from the execution of the init scripts are referred to as init script logs. These logs are useful to troubleshoot cluster initialization errors.

  • What is captured: Init script logs capture information generated during the execution of init scripts.

  • How to access: When cluster log delivery isn't configured, logs write to /databricks/init_scripts. Here's an example of using a notebook to list and view the logs:

    ls /databricks/init_scripts/
    cat /databricks/init_scripts/<timestamp>_<log-id>_<init-script-name>.sh.stdout.log

    If cluster log delivery has been configured, the init script logs write to /[cluster-log-path]/[cluster-id]/init_scripts. Logs for each container in the cluster write to a subdirectory called init_scripts/<cluster_id>_<container_ip>.

    For more information about init script logging, refer to Cluster node initialization scripts > Logging

  • When to use: Init script logs are useful for troubleshooting cluster initialization errors.

Monitor Azure diagnostic logs

Azure diagnostic logs for Azure Databricks capture information about activities and events within the Databricks environment.

  • What is captured: Diagnostic logs cover events from various Databricks services/categories, including clusters, Databricks file system (DBFS), jobs, and notebooks. The category and operationName properties identify events in a log record. For a list of each of these types of events and the associated services, see Events. Some events emit to audit logs only if verbose audit logs have been enabled for the workspace.

  • How to access: Access diagnostic logs based on configured log delivery. Enable logs through the Azure portal or Azure CLI, with options to archive to a storage account, send to Log Analytics, or stream to an Event Hub.

  • When to use: Diagnostic logs are essential for governance, monitoring, and auditing Databricks deployment activities. For example, they can be used to monitor who created or deleted the cluster, created a notebook, attached a notebook to the cluster, scheduled a job start or end, and a lot more.

Analyze metrics

Metrics are numerical values to analyze. Use them to observe a system in real time (or close to real time) or to analyze performance trends over time. Listed here are some of the metrics gathered for Databricks.

Analyze cluster Metrics via Ganglia

Ganglia metrics provide insights into Databricks cluster performance, monitoring telemetry related to CPU, memory usage, network, and more.

  • What is captured: Ganglia collects various system metrics, including CPU, memory, disk, network, and process-related values. Listed here are a few common metrics captured via Ganglia:
Metric name Reporting units Description Type
cpu_idle Percent Percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk IO request CPU
cpu_user Percent Percentage of CPU utilization that occurred while executing at the user level CPU
cpu_num Count Total number of CPUs (collected once) CPU
disk_total Gb Total available disk space, aggregated over all partitions Disk
disk_free Gb Total free disk space, aggregated over all partitions Disk
mem_total Kb Total amount of memory Memory
  • How to access: Access Ganglia metrics directly within Azure Databricks, by navigating to the Metrics tab on the cluster details page.

  • When to use: Ganglia metrics aid in understanding cluster configuration and performance, and help determine if a cluster is under-provisioned, over-provisioned, or appropriately provisioned.

Analyzing cluster metrics via OMS agent

An important facet of monitoring is to understand the resource utilization in Azure Databricks clusters. This information is useful in arriving at the correct cluster and VM sizes. Each VM does have a set of limits (cores/disk throughput/network throughput) which play an important role in determining the performance profile of an Azure Databricks job. Utilization metrics for an Azure Databricks cluster is obtained by streaming VM metrics to an Azure Log Analytics Workspace. To achieve this goal, the Log Analytics Agent needs to be installed on each node of the cluster.

  • What is captured: It captures metrics on the utilization of different resources (memory, processor, disk, etc.) of the cluster VMs.

  • How to access: To access the data from Azure Monitor, you must first successfully install the OMS Agent on the cluster VM.

    For details about how to install the OMS Agent in Databricks, see the Installation instructions.

  • When to use: Cluster Metrics via OMS Agent serves as an alternative to Ganglia, especially for teams interested in using Azure Monitor.

Analyzing traces

Traces, also called operations, connect the steps of a single request across multiple calls within and across services. They can provide structured observability into the interactions of system components.

Using the Spark monitoring library

The Spark Monitoring Library enhances core monitoring in Azure Databricks, sending streaming query event information to Azure Monitor. For more information,see Monitoring Azure Databricks using Spark Monitoring Library.

  • What is captured: The library logs Azure Databricks service metrics and Apache Spark structure streaming query event metrics, providing insights into Spark metrics, application-level logging, and performance.

  • How to access: The Spark monitoring library sends logs to a Log Analytics workspace. Tables such as SparkMetric_CL, SparkLoggingEvent_CL, and SparkListenerEvent_CL store information. Grafana provides dashboards for monitoring.

  • When to use: Spark developers benefit from the Spark Monitoring Library for detailed insights into Spark application-specific metrics. There are minor overlaps between Cluster Metrics and Spark Monitoring Library, but this library takes a deeper look into the Spark application-specific metrics. We use both Azure Monitor and the Grafana dashboards to troubleshoot performance issues in Spark Applications.

Creating Alerts

Azure Monitor allows the creation of alert rules based on metrics or log data sources. Alerts play a critical role in proactively identifying and resolving issues related to infrastructure or applications before they have an effect on users. By utilizing Azure Monitor, users can create alerts triggered based on different metrics or log data sources.

Monitoring alerts in Azure Monitor

Alert rules in Azure Monitor are based on visualization criteria, target resources, metrics, and filter dimensions. When alert rules are triggered, users receive notifications via email, SMS, voice calls, or push notifications.

  • When to use: Use alerts for proactive identification and resolution of issues before they impact user experience. Define alert rules based on specific queries, such as error logs or CPU time exceeding a threshold. For example, if notifications of errors in jobs, an alert rule should be defined using the following query:

    | where Level == 'ERROR'

    This query filters the logs in the SparkLoggingEvent_CL table sent by the Spark monitoring Library to only include entries with a log level of ERROR. By creating an alert rule based on this query, when it detects new error logs in the specified table, users receive notifications.

    Another example is monitoring the CPU time of a cluster using the following query:

    | where name_s contains "executor.cpuTime"
    | extend sname = split(name_s, ".")
    | extend executor = strcat(sname[0], ".", sname[1])
    | project TimeGenerated, cpuTime = count_d / 100000

    This query filters the logs in the SparkMetric_CL table to include only entries related to executor CPU time. This query monitors and analyzes the CPU time of the cluster's executors.

In addition, you can create alert rules based on this query to receive notifications when the CPU time exceeds a certain threshold. This rule enables proactive identification and addressing of any issues related to high CPU usage in the Spark cluster.

For more information