Databricks multinode clusters on one subscription cannot perform any Spark operations (executors lost, quota errors, infinite waits)
We are experiencing a critical issue on a single Azure subscription where Databricks multinode clusters are unable to run any Spark operations such as display() or count() on external tables. The same workloads, configurations, and external locations work correctly on two other subscriptions.
The symptoms:
Any Spark action (e.g., df.count()) runs indefinitely; a minimal repro is shown after this list.
Standard Python code executes normally; only PySpark operations hang.
Event logs consistently show executor loss and VM launch failures.
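For illustration, this is roughly what a notebook cell looks like on the affected clusters (the table name is a placeholder, not one of our real identifiers; spark and display are the objects provided by the Databricks notebook runtime):

# Plain Python completes immediately on all subscriptions.
print(sum(range(1_000_000)))

# Any PySpark action hangs indefinitely on multi-node clusters in the affected
# subscription; the same cell finishes in seconds on the other two subscriptions.
df = spark.table("main.sales.orders")   # placeholder external table name
df.count()                              # never returns
display(df.limit(10))                   # never returns either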
Example events from the affected clusters:
{
  "current_num_workers": 1,
  "target_num_workers": 2,
  "reason": {
    "code": "COMMUNICATION_LOST",
    "type": "CLOUD_FAILURE",
    "parameters": {
      "instance_id": "8bb1bdd8da024833b7d5321cc26ee4a3",
      "databricks_error_message": "The instance was detected with a lost executor. This usually stems from issues where the networking rules weren't set properly. Please double check your networking configurations to ensure they are correct."
    }
  }
}
And additional failures:
{
  "reason": {
    "code": "UNEXPECTED_LAUNCH_FAILURE",
    "type": "SERVICE_FAULT",
    "parameters": {
      "databricks_error_message": "The VM launch failed due to transient cloud provider error. LOST_EXECUTOR_DETECTED",
      "azure_error_code": "LOST_EXECUTOR_DETECTED"
    }
  }
}
Key observations:
Single-node clusters work.
Multinode clusters fail to start properly and lose executors.
External tables are stored on Azure external locations (same pattern as on working subscriptions); the access pattern is sketched after this list.
Other subscriptions with identical architecture do not exhibit this problem.
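For context, this is the pattern we use to register and read the external tables on every subscription (the catalog/schema/table names and the ABFSS path below are placeholders, not our actual identifiers):

# External table registered in Unity Catalog against an Azure external location
# (placeholder names and storage path).
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.sales.orders
    USING DELTA
    LOCATION 'abfss://data@storageaccount.dfs.core.windows.net/sales/orders'
""")

# Reads like this hang on the affected subscription but complete normally elsewhere.
spark.table("main.sales.orders").count()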