Microsoft Azure Databricks all-purpose cluster: jobs failing again and again

Prasant Kumar Das 20 Reputation points
2024-09-05T12:12:00.5233333+00:00

Jobs on our all-purpose Databricks cluster keep failing with "The Spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached."

In the event log it says: Event_type=DRIVER_NOT_RESPONDING, Message="Driver is up but is not responsive, likely due to GC."

After restarting the cluster, the jobs run fine.

Please help me fix this once and for all.

How do I monitor my Azure Databricks workspace cluster logs?


Accepted answer
  1. Vinodh247 23,756 Reputation points MVP
    2024-09-05T13:35:49.66+00:00

    Hi Prasant Kumar Das,

    Thanks for reaching out to Microsoft Q&A.

    This issue is typically linked to memory management problems in the Spark driver. By tuning the memory and GC settings, improving job logic, and using log monitoring tools, you can stabilize your Databricks clusters and detect issues early.

    • The error suggests that the Spark driver is running out of memory, leading to long GC pauses. Increasing the driver memory (or choosing a larger driver node type) can alleviate this; you can adjust it in the cluster configuration settings.
    • Review and optimize your Spark configurations, particularly those related to memory management. Parameters such as spark.executor.memory, spark.driver.memory, and spark.memory.fraction control how memory is allocated (see the configuration sketch after this list).
    • Use the Databricks UI to monitor performance metrics and logs. The Metrics tab on the compute details page shows real-time and historical metrics, which can help identify the specific jobs or operations that make the driver unresponsive.
    • If you are using init scripts, ensure they execute correctly. Problems in init scripts can cause cluster startup and job failures; review the logs for errors during their execution.
    • Schedule regular restarts of your clusters so they run with the latest images and configurations. This mitigates issues caused by long-running processes, such as memory leaks or gradual performance degradation.
    • If your workloads are primarily batch jobs, consider using job clusters instead of all-purpose clusters. Job clusters are optimized for running jobs and provide better reliability and isolation for scheduled tasks.

    You can monitor Databricks logs using the following methods:

    • Spark UI: Access the Spark UI from the Azure Databricks workspace to track job metrics, memory usage, and driver logs.
    • Cluster event logs: On your cluster's page in Databricks, you can view detailed event logs that capture issues like driver restarts and failures (including the DRIVER_NOT_RESPONDING event you are seeing).
    • Azure diagnostic logs: Enable Azure diagnostic logging for your Databricks workspace to send cluster, job, and notebook logs to a Log Analytics workspace, a storage account, or an event hub.
    • Azure Monitor: With diagnostic logs routed to Log Analytics, you can set up Azure Monitor alerts based on log metrics or custom KQL queries (see the query sketch below).
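
    As a minimal sketch, assuming you run it in a Databricks notebook where the spark session is already available, the snippet below prints what the memory-related settings currently resolve to. Note that spark.driver.memory and spark.executor.memory are static settings: set them in the cluster configuration (Compute > your cluster > Edit > Advanced options > Spark config) and restart the cluster for them to take effect.

    ```python
    # Minimal sketch for a Databricks notebook: print the memory-related
    # settings discussed above. The values in the comments are placeholder
    # examples, not recommendations; size them to your driver/worker VM types.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # already defined in Databricks notebooks
    conf = spark.sparkContext.getConf()

    for key in (
        "spark.driver.memory",    # e.g. "16g" -- set via the cluster Spark config
        "spark.executor.memory",  # e.g. "8g"  -- set via the cluster Spark config
        "spark.memory.fraction",  # defaults to 0.6 when not set
    ):
        print(key, "=", conf.get(key, "(not set, Databricks default applies)"))
    ```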

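    Once diagnostic logs are routed to a Log Analytics workspace, you can also query them programmatically. The following is a minimal sketch using the azure-monitor-query Python SDK; the workspace ID is a placeholder, and the DatabricksClusters table and its columns are assumptions based on the diagnostic log categories, so adjust the KQL to the schema you actually see in your workspace.

    ```python
    # Minimal sketch: pull recent Databricks cluster events from Log Analytics.
    # Requires: pip install azure-identity azure-monitor-query
    # LOG_ANALYTICS_WORKSPACE_ID is a placeholder; the DatabricksClusters table
    # and column names are assumptions -- verify them in your own workspace.
    from datetime import timedelta

    from azure.identity import DefaultAzureCredential
    from azure.monitor.query import LogsQueryClient

    LOG_ANALYTICS_WORKSPACE_ID = "<your-log-analytics-workspace-guid>"

    client = LogsQueryClient(DefaultAzureCredential())

    # Look for cluster events (e.g. driver restarts) in the last 24 hours.
    query = """
    DatabricksClusters
    | where TimeGenerated > ago(24h)
    | project TimeGenerated, OperationName, RequestParams
    | order by TimeGenerated desc
    """

    response = client.query_workspace(
        workspace_id=LOG_ANALYTICS_WORKSPACE_ID,
        query=query,
        timespan=timedelta(days=1),
    )

    # Print whatever rows come back; wire this into an alert or a scheduled
    # check once the query returns the events you care about.
    for table in response.tables:
        for row in table.rows:
            print(list(row))
    ```
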
    Please 'Upvote' (thumbs-up) and 'Accept as answer' if the reply was helpful. This will benefit other community members who face the same issue.

    1 person found this answer helpful.

0 additional answers
