Databricks multinode clusters on one subscription cannot perform any Spark operations (executors lost, quota errors, infinite waits)
We are experiencing a critical issue on a single Azure subscription where Databricks multinode clusters are unable to run any Spark operations such as display() or count() on external tables. The same workloads, configurations, and external locations work correctly on two other subscriptions.
The symptoms:
Any Spark action (e.g., df.count()) runs indefinitely; a minimal repro is shown after this list.
Standard Python code executes normally; only PySpark operations hang.
Event logs consistently show executor loss and VM launch failures.
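For illustration, this is roughly what a notebook cell looks like on the affected clusters (the table name is a placeholder, not one of our real identifiers; spark and display are the objects provided by the Databricks notebook runtime):

# Plain Python completes immediately on all subscriptions.
print(sum(range(1_000_000)))

# Any PySpark action hangs indefinitely on multi-node clusters in the affected
# subscription; the same cell finishes in seconds on the other two subscriptions.
df = spark.table("main.sales.orders")   # placeholder external table name
df.count()                              # never returns
display(df.limit(10))                   # never returns either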
Example events from the affected clusters:
{
  "current_num_workers": 1,
  "target_num_workers": 2,
  "reason": {
    "code": "COMMUNICATION_LOST",
    "type": "CLOUD_FAILURE",
    "parameters": {
      "instance_id": "8bb1bdd8da024833b7d5321cc26ee4a3",
      "databricks_error_message": "The instance was detected with a lost executor. This usually stems from issues where the networking rules weren't set properly. Please double check your networking configurations to ensure they are correct."
    }
  }
}
And additional failures:
{
  "reason": {
    "code": "UNEXPECTED_LAUNCH_FAILURE",
    "type": "SERVICE_FAULT",
    "parameters": {
      "databricks_error_message": "The VM launch failed due to transient cloud provider error. LOST_EXECUTOR_DETECTED",
      "azure_error_code": "LOST_EXECUTOR_DETECTED"
    }
  }
}
Key observations:
Single-node clusters work.
Multinode clusters fail to start properly and lose executors.
External tables are stored on Azure external locations (same pattern as on working subscriptions); the access pattern is sketched after this list.
Other subscriptions with identical architecture do not exhibit this problem.
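For context, this is the pattern we use to register and read the external tables on every subscription (the catalog/schema/table names and the ABFSS path below are placeholders, not our actual identifiers):

# External table registered in Unity Catalog against an Azure external location
# (placeholder names and storage path).
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.sales.orders
    USING DELTA
    LOCATION 'abfss://data@storageaccount.dfs.core.windows.net/sales/orders'
""")

# Reads like this hang on the affected subscription but complete normally elsewhere.
spark.table("main.sales.orders").count()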