Azure Databricks - Timeout error after 60 minutes when launching an Azure Databricks cluster

Question

When I attempt to start a cluster through the Azure Databricks portal/UI, after 30 minutes I receive the following error in the event log:

Failed to add 3 containers to the compute. Will attempt retry: true. Reason: Cloud provider launch failure Azure error message: [id: InstanceId(54b48303e9344528ab7c6219b346908a), status: INSTANCE_LAUNCHING, workerEnvId:WorkerEnvId(workerenv-5859734321987372), lastStatusChangeTime: 1720754572233, groupIdOpt Some(0),requestIdOpt Some(0125-050522-pyr1pe3z-e1454531-81ae-4c47-9),version 0] timeout after 1208958 milliseconds with threshold 1200 seconds

Databricks then retries starting the cluster, but the retry also terminates about 30 minutes after the first failure, with the same error message.

These clusters were working fine last night when they were started by Azure Data Factory jobs. The clusters are configured to use Databricks Runtime Version 12.2 LTS, and the worker and driver types are both set to Standard_D4ads_v5. The Azure Databricks resources are all premium.

I've attempted recreating a basic clean cluster with no settings edited, as suggested here (https://learn.microsoft.com/en-us/answers/questions/1659188/how-to-fix-timeout-error-when-clreating-compute-cl). I tried to start that cluster, but it also resulted in the same error.

What could be causing the cluster launch to time out? Does anyone have any ideas on how I can diagnose the issue?

Answer

@Michael Pugliese - Thanks for the question and for using the Microsoft Q&A platform.

It seems that you are facing a timeout error while launching an Azure Databricks cluster. The error message indicates that an instance stayed in the INSTANCE_LAUNCHING state for longer than the 1200-second threshold, so Databricks failed to add 3 containers to the compute and will retry. However, as you observed, the retry terminates about 30 minutes later with the same error message.
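To make the numbers in that message easier to read, here is a minimal Python sketch that extracts the elapsed time and the threshold and converts both to minutes. The function name parse_launch_timeout is hypothetical, and the regular expression is fitted only to the message quoted in the question; other event-log messages may be worded differently.

```python
import re

# Tail of the event-log message quoted in the question.
EVENT_MESSAGE = (
    "version 0] timeout after 1208958 milliseconds "
    "with threshold 1200 seconds"
)

# Pattern fitted to the quoted message only (an assumption, not a documented format).
TIMEOUT_PATTERN = re.compile(
    r"timeout after (\d+) milliseconds with threshold (\d+) seconds"
)


def parse_launch_timeout(message):
    """Return elapsed and threshold times in minutes, or None if not found."""
    match = TIMEOUT_PATTERN.search(message)
    if match is None:
        return None
    elapsed_ms = int(match.group(1))
    threshold_s = int(match.group(2))
    return {
        "elapsed_minutes": elapsed_ms / 60000,
        "threshold_minutes": threshold_s / 60,
    }


timing = parse_launch_timeout(EVENT_MESSAGE)
print(timing)
```

For the quoted message, the threshold works out to 20 minutes and the instance was waited on for slightly longer than that before the launch was abandoned.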

There could be several reasons for this error, such as network connectivity issues, insufficient resources, or incorrect configuration settings. To diagnose the issue, you can try the following steps:

Check the network connectivity between your Azure Databricks workspace and the Azure resources it depends on, such as storage accounts, virtual networks, and key vaults.
Verify that you have sufficient resources available in your subscription to launch the cluster. You can check the resource utilization metrics in the Azure portal.
Check the configuration settings of your cluster, such as the number of nodes, the type of nodes, and the runtime version. Ensure that they are compatible with your workload requirements.
Try launching the cluster using the Databricks CLI or REST API instead of the Azure portal. This can help you identify any issues with the portal UI.
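As a rough illustration of the last step, the sketch below builds requests for the Databricks Clusters REST API (POST /api/2.0/clusters/start and POST /api/2.0/clusters/events) using only the Python standard library. The helper names, the workspace URL, the token, and the cluster ID are all placeholders assumed for illustration; substitute your own values. Nothing is sent until send() is called.

```python
import json
import urllib.request


def build_request(host, token, path, payload):
    """Build (but do not send) an authenticated POST request."""
    return urllib.request.Request(
        url=f"{host.rstrip('/')}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def start_cluster_request(host, token, cluster_id):
    """Request that starts an existing, terminated cluster."""
    return build_request(
        host, token, "/api/2.0/clusters/start", {"cluster_id": cluster_id}
    )


def cluster_events_request(host, token, cluster_id, limit=25):
    """Request that lists the most recent events for a cluster."""
    payload = {"cluster_id": cluster_id, "order": "DESC", "limit": limit}
    return build_request(host, token, "/api/2.0/clusters/events", payload)


def send(request):
    """Send a prepared request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        body = response.read().decode("utf-8")
    return json.loads(body) if body else {}


# Example usage (placeholders, not real values):
#   host = "https://<your-workspace>.azuredatabricks.net"
#   send(start_cluster_request(host, "<personal-access-token>", "<cluster-id>"))
#   events = send(cluster_events_request(host, "<personal-access-token>", "<cluster-id>"))
#   for event in events.get("events", []):
#       print(event.get("timestamp"), event.get("type"), event.get("details"))
```

If the cluster fails in the same way when started through the API, the portal UI is unlikely to be the cause, and the returned events give the same details as the event log in a form that is easier to share.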

If none of these steps resolve the issue, please share the region of your Azure Databricks workspace so that we can assist further.

Hope this helps. Do let us know if you have any further queries.
