Databricks Cluster Error: Run failed with error message Could not reach driver of cluster

Meet Vanani
2025-05-16T00:15:35.87+00:00

All of my notebooks fail at the start with the error message:

Run failed with error message Could not reach driver of cluster

At first I thought it was because of my vCPU quota, but it still fails even after I increased the quota from 36 to 40. I don't know what is causing this issue.

Any help?

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

Accepted answer
  1. Krupal Bandari, Microsoft External Staff Moderator
    2025-05-16T05:59:02.5766667+00:00

    Hi @Meet Vanani,

    I'm glad that you were able to resolve your issue and thank you for posting your solution so that others experiencing the same thing can easily reference this!

    Since the Microsoft Q&A community has a policy that "The question author cannot accept their own answer. They can only accept answers by others", I'll repost your solution in case you'd like to "Accept" the answer.

    Issue:

    Databricks Cluster Error: Run failed with error message Could not reach driver of cluster

    Solution:

    After ruling out quotas, network, and VM availability, we discovered the driver was crashing on startup due to a binary mismatch between cluster-installed NumPy/Pandas wheels and the Databricks runtime’s built-in C extensions. The quick fix was to uninstall any custom numpy/pandas at the cluster level and then reinstall the exact versions that ship with our runtime (e.g. numpy==1.23.5 and pandas==2.0.3) via an init script. Once the mismatched wheels were removed, the driver spun up normally, the REPL attached successfully, and our notebooks ran without the “Could not reach driver” error.
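    For reference, a minimal sketch of what such a cluster-scoped init script could look like, using the versions named above. The /databricks/python/bin/pip path is the runtime's standard Python environment; verify the pins against your own runtime's release notes before copying them:

        #!/bin/bash
        # Sketch of a cluster-scoped init script: remove any custom wheels,
        # then re-pin numpy/pandas to the versions bundled with the runtime.
        # The pinned versions below match the example in this answer; they
        # are only correct for runtimes that actually ship those versions.
        set -euo pipefail
        /databricks/python/bin/pip uninstall -y numpy pandas || true
        /databricks/python/bin/pip install --no-cache-dir numpy==1.23.5 pandas==2.0.3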

    Please click Accept Answer and kindly upvote it so that other people who face a similar issue may benefit from it.


1 additional answer

  1. Sina Salam, Volunteer Moderator
    2025-05-17T16:52:43.78+00:00

    Hello Meet Vanani,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    Based on all evidence, the Databricks workspace driver node is likely unreachable due to outbound network configuration issues in a private VNet setup.

    To clear up one point of confusion first: allowing outbound TCP 443 in an NSG does not by itself mean internet access; without a route to the internet, the traffic has nowhere to go. Explicitly, a private subnet without a NAT gateway or Azure Firewall will not have internet access, even if NSG rules allow the traffic.
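    A quick way to verify this (a sketch using the Azure CLI; the resource group and NIC names are placeholders) is to inspect the effective routes applied to a NIC in the cluster subnet:

        # Placeholder names: replace MyRG / my-worker-nic with your own.
        # Lists the routes actually applied to the NIC, including any UDRs.
        # (The NIC must belong to a running VM.)
        az network nic show-effective-route-table \
            --resource-group MyRG \
            --name my-worker-nic \
            --output table
        # Check the next hop for 0.0.0.0/0: "Internet" or a NAT gateway means
        # outbound traffic can leave; "None" means it is silently dropped.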

    Based on these facts, to resolve the issue:

    • Attach a NAT gateway to the subnet (see the CLI sketch after this list): https://learn.microsoft.com/en-us/azure/virtual-network/nat-gateway/nat-overview
      • Go to Azure Portal > Networking > NAT Gateways.
      • Create a NAT Gateway in the same region.
      • Attach it to the subnet where the Databricks cluster is deployed.
    • Since a user-defined route (UDR) for 0.0.0.0/0 without an internet-bound next hop (such as Internet or a NAT gateway) will block outbound traffic, verify your route table (https://learn.microsoft.com/en-us/azure/virtual-network/virtual-networks-udr-overview):
      • Navigate to Route Tables in the Azure Portal.
      • Confirm if there’s a UDR with 0.0.0.0/0.
        • If yes, check the next hop type.
        • If None, traffic is dropped.
        • If Virtual Appliance, ensure it routes correctly.
        • Ideally, use a NAT gateway or set the next hop to Internet (for testing).
        • If there is no UDR, Azure uses default system routes, which are fine with a NAT gateway.
    • You will need to validate DNS resolution, because Databricks depends on control-plane DNS; in particular, a failure to resolve .azuredatabricks.net can prevent the cluster driver from communicating (https://learn.microsoft.com/en-us/azure/databricks/administration-guide/cloud-configurations/azure/udr#dns-considerations):
      • From a VM in the same subnet, run:
              nslookup <your-workspace>.azuredatabricks.net
        
        • If it fails, you may be using custom DNS or missing a Private DNS zone for Private Link.
        • If using Private Link, check that the private DNS zone privatelink.azuredatabricks.net is linked and resolving properly.
    • This is optional but very valuable: an isolation test helps confirm whether the issue is specific to your VNet or a platform issue.
      • Deploy a new Databricks workspace in a default VNet (without custom routing or DNS).
      • Test if notebook launches successfully.
      • If yes, this confirms your original VNet is misconfigured.
    • After implementing all of the above:
      • Run curl https://<workspace>.azuredatabricks.net/api/2.0/workspace/get-status again from the VM.
      • If it responds, the issue is resolved.
      • Launch a test cluster in the original workspace.
      • Monitor if the driver node becomes reachable.
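    To make the NAT gateway steps above concrete, here is a rough Azure CLI sketch. All resource names (MyRG, my-pip, my-natgw, my-vnet, databricks-public) are placeholders; substitute your own, and if your workspace is VNet-injected, repeat the subnet update for both the host and container subnets:

        # 1. Create a static public IP and a NAT gateway in the same region.
        az network public-ip create --resource-group MyRG --name my-pip \
            --sku Standard --allocation-method Static
        az network nat gateway create --resource-group MyRG --name my-natgw \
            --public-ip-addresses my-pip

        # 2. Attach the NAT gateway to the subnet where the cluster is deployed.
        az network vnet subnet update --resource-group MyRG \
            --vnet-name my-vnet --name databricks-public \
            --nat-gateway my-natgw

        # 3. From a VM in the same subnet, confirm DNS resolution and
        #    control-plane reachability (replace <your-workspace>).
        nslookup <your-workspace>.azuredatabricks.net
        curl -sS https://<your-workspace>.azuredatabricks.net/api/2.0/workspace/get-status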

    I hope this is helpful! Do not hesitate to let me know if you have any other questions or need clarification.


    Please don't forget to close the thread by upvoting and accepting this as an answer if it is helpful.

