Classic compute termination error codes

This article provides troubleshooting guidance for common cluster termination error codes. Use the error code from your cluster event log to find the relevant cause and recommended fix.

AZURE_OPERATION_NOT_ALLOWED_EXCEPTION

An Azure operation failed due to permission restrictions, policy violations, or account limitations.

Example error message

Encountered error while client authentication.

Troubleshooting steps

  1. Review Azure activity logs for detailed error information.
  2. Check service principal permissions on the resource group.
  3. Verify that Azure policies are not blocking the operation.
  4. Check subscription state and status.
  5. Review recent changes to permissions or policies.

Recommended fix

Grant the required permissions to the service principal, update Azure policies if they are blocking operations, verify that the subscription is active, or resolve any account restrictions. Contact Azure support for assistance with permission configuration.

AZURE_QUOTA_EXCEEDED_EXCEPTION

The cluster launch would exceed the Azure subscription quota limits for the requested VM family.

Example error message

Operation could not be completed as it results in exceeding approved standardDSv2Family Cores quota. Additional details - Deployment Model: Resource Manager, Location: westeurope, Current Limit: 350, Current Usage: 344, Additional Required: 8, (Minimum) New Limit Required: 352.

Troubleshooting steps

  1. Check current quota usage in Azure Portal under Subscriptions > Usage + quotas.
  2. Identify which VM family quota is exceeded.
  3. Review all running VMs and clusters in the subscription.
  4. Check whether recent cluster launches have increased usage.
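The quota arithmetic in the example message can be sanity-checked with a few lines (a sketch; the numbers are taken from the example error message above):

```python
def required_new_limit(current_usage: int, additional_required: int) -> int:
    """Minimum quota limit that would let the launch succeed."""
    return current_usage + additional_required

def launch_fits(current_limit: int, current_usage: int, additional_required: int) -> bool:
    """True if the requested cores fit under the current quota limit."""
    return current_usage + additional_required <= current_limit

# Values from the example error message above
print(required_new_limit(344, 8))   # 352, matching "(Minimum) New Limit Required"
print(launch_fits(350, 344, 8))     # False: 352 > 350, so the launch is rejected
```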

Recommended fix

Request a quota increase through Azure Portal, terminate unused clusters to free quota, or use different VM types with available quota. Contact Azure support for quota increase assistance.

BOOTSTRAP_TIMEOUT_DUE_TO_MISCONFIG

The VM bootstrap process timed out due to network connectivity issues, slow artifact downloads, or issues with the cloud provider. The bootstrap timeout is 700 seconds.

Example error message

[id: InstanceId([REDACTED]), status: INSTANCE_INITIALIZING, ...] with threshold 700 seconds timed out after 703891 milliseconds. Instance bootstrap inferred timeout reason: UnknownReason

Troubleshooting steps

  1. Check connectivity to Databricks artifact storage.
  2. Verify connectivity to the Databricks control plane.
  3. Check DNS resolution for Databricks endpoints.
  4. Verify firewall and security group rules.
  5. Test whether the issue is consistent or intermittent.
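The connectivity checks above can be scripted from a notebook on a VM in the same network. This is a minimal sketch; the two hostnames are placeholders you would replace with your workspace URL and artifact-storage endpoint:

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, refusals, and timeouts
        return False

# Placeholder hostnames -- substitute your workspace URL and the
# regional artifact-storage endpoint from your network documentation.
for host in ["<workspace-url>", "<artifact-storage-endpoint>"]:
    status = "reachable" if tcp_reachable(host, 443) else "UNREACHABLE"
    print(f"{host}:443 {status}")
```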

Recommended fix

Ensure network connectivity to Databricks storage and control plane. Configure service endpoints or VPC endpoints for better network performance. Review firewall, DNS, and routing configuration. Contact Databricks support if the network configuration is verified but timeouts persist.

CLOUD_PROVIDER_RESOURCE_STOCKOUT

Azure is temporarily out of capacity for the requested VM size in the selected region or zone.

Example error message

The requested VM size for resource 'Following SKUs have failed for Capacity Restrictions: Standard_DS3_v2' is currently not available in location 'westeurope'. Please try another size or deploy to a different location or different zone. See https://aka.ms/azureskunotavailable for details.

Troubleshooting steps

  1. Check the Azure Service Health dashboard for known capacity issues.
  2. Verify whether the issue affects specific zones or the entire region.
  3. Check whether the issue is specific to your VM size.
  4. Check whether the issue affects spot VMs versus on-demand VMs.

Recommended fix

Retry the cluster launch, try different availability zones, use the auto availability zone setting, try alternative VM sizes, or wait for Azure to restore capacity. Contact Azure support for estimated capacity restoration time if the issue persists.
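One way to automate the "try different zones or VM sizes" advice is a simple fallback loop. This is a sketch: `launch_cluster` is a hypothetical stand-in for whatever API call you use to start a cluster, and the stub below only simulates a stockout:

```python
def launch_with_fallback(launch_cluster, zones, vm_sizes):
    """Try each VM size in each zone until one launch succeeds."""
    last_error = None
    for vm_size in vm_sizes:
        for zone in zones:
            try:
                return launch_cluster(zone=zone, vm_size=vm_size)
            except RuntimeError as exc:  # e.g. a SkuNotAvailable/stockout error
                last_error = exc
    raise last_error

# Stubbed launcher simulating a stockout in one zone (for illustration only)
def fake_launch(zone, vm_size):
    if zone == "westeurope-1" and vm_size == "Standard_DS3_v2":
        raise RuntimeError("SkuNotAvailable: Standard_DS3_v2 in westeurope-1")
    return f"launched {vm_size} in {zone}"

print(launch_with_fallback(
    fake_launch,
    zones=["westeurope-1", "westeurope-2"],
    vm_sizes=["Standard_DS3_v2", "Standard_D4s_v3"],
))
# → launched Standard_DS3_v2 in westeurope-2
```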

CLOUD_PROVIDER_LAUNCH_FAILURE

The cloud provider failed to launch the requested VM instance. This is usually a cloud-provider-side issue.

Example error message

Reason: CLOUD_PROVIDER_LAUNCH_FAILURE (CLOUD_FAILURE). Parameters: databricks_error_message:The VM launch timed out. Possible causes include a full subnet, insufficient available public IPs, quota limits being exceeded, cloud provider throttling, resource conflict, or a transient issue with the cloud provider. [details] LAUNCH_TIMEOUT: timeout after 1200073 milliseconds with threshold 1200 seconds(OnDemand), instance_id:[REDACTED]

Troubleshooting steps

  1. Check the azure_error_message in the error parameters for the specific cloud-provider failure.
  2. Check the cloud provider status page for ongoing incidents in your region.
  3. Review quota limits and subnet capacity if the error mentions these.

Recommended fix

Try again later, as most cloud provider launch failures are transient. If the issue still occurs, contact your cloud provider's support team with the detailed error message from the cluster event log.

COMMUNICATION_LOST

The cluster was terminated because the control plane lost communication with the instance. This may be caused by unexpected instance state, instance termination, or network-level issues where the control plane cannot ping the instance for a prolonged period.

Example error message

Cluster '[REDACTED]' was terminated. Reason: COMMUNICATION_LOST (CLOUD_FAILURE). Parameters: instance_id:[REDACTED], databricks_error_message:Node health check failed.

Troubleshooting steps

  1. Check the network configuration between the Azure Databricks compute plane and the SCC relay endpoint. If there is a firewall or proxy between them, it might block the health check communication. Consult with your network administrator.
  2. Check CPU and memory usage of the node on cluster metrics. If resources are exhausted, the instance may not respond to the health check. Consider using a bigger instance type.
  3. Check with your cloud provider if the instance was terminated or impaired externally (for example, AWS instance retirement, Azure host maintenance).
  4. Review Spark driver and executor logs for errors that may have caused the instance to become unresponsive (for example, OOM or long GC pauses).

Recommended fix

Review firewall and proxy settings with your network administrator. If the error was caused by the cloud provider terminating the instance, try again later. If it was caused by resource exhaustion, consider upgrading to a larger instance type. If the issue persists, contact Databricks support.

CONTROL_PLANE_REQUEST_FAILURE_DUE_TO_MISCONFIG

VMs cannot reach the Azure Databricks control plane due to DNS resolution failures, firewall rules, or network misconfiguration.

Example error message

Network health check reported that instance is unable to reach Databricks Control Plane. Please check that instances have connectivity to the Databricks Control Plane. Instance bootstrap inferred timeout reason: NetworkHealthCheck_CP_Failed

Troubleshooting steps

  1. Decode any Base64-encoded error messages in the cluster event log.
  2. Check DNS settings in your network configuration.
  3. Review firewall rules and network security settings.
  4. Test control plane connectivity from a VM in the same network.
  5. Verify custom DNS servers are functional and reachable.

For Azure environments, also check UDR (User-Defined Routes) configuration and NSG rules on Azure Databricks subnets.
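Step 1's decoding can be done directly in a notebook. The encoded string below is a made-up sample, not a real log entry:

```python
import base64

encoded = "TmV0d29yayBoZWFsdGggY2hlY2sgZmFpbGVk"  # sample Base64 payload
decoded = base64.b64decode(encoded).decode("utf-8", errors="replace")
print(decoded)  # → Network health check failed
```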

Recommended fix

Verify DNS server configuration and reachability. Ensure firewall rules allow outbound traffic to the Azure Databricks control plane.

If using UDR with a firewall, ensure Azure Databricks service tags route to the internet. Test with Azure DNS (168.63.129.16) temporarily to isolate DNS issues.

Contact Databricks support if the network configuration appears correct, but the issue persists.

DOCKER_IMAGE_PULL_FAILURE

The cluster failed to download the Docker image from the container registry due to network, authentication, or configuration issues.

Example error message

Failed to pull docker image: authentication required

Troubleshooting steps

  1. Verify the Docker image name and tag are correct in the cluster configuration.
  2. Check network connectivity to the container registry from the workspace.
  3. Test registry access from a VM in the same network.
  4. Verify authentication credentials for private registries.
  5. Review node daemon logs for detailed error messages.

Recommended fix

Correct the Docker image configuration and verify authentication credentials. Ensure network rules allow access to the container registry.

For Azure Container Registry (ACR), configure service endpoints in your VNet.

Contact Databricks support if the configuration appears correct, but the issue persists.

DOCKER_IMAGE_TOO_LARGE_FOR_INSTANCE_EXCEPTION

The Docker image size exceeds the available disk space on the selected instance type.

Example error message

Failed to launch container as the docker image is too large for the instance.

Troubleshooting steps

  1. Check the Docker image size.
  2. Review the instance type's disk capacity.
  3. Identify unnecessary layers or files in the Docker image.
  4. Check whether multiple large images are being used.

Recommended fix

Use an instance type with a larger disk capacity, optimize the Docker image by removing unnecessary files and layers, use multi-stage builds to reduce image size, or split functionality across multiple smaller images. Contact Databricks support for assistance with image optimization.

EOS_SPARK_IMAGE

The Databricks Runtime (DBR) version configured for the cluster has reached end of support (EOS).

Example error message

Spark image release__11.0.x-snapshot-cpu-ml-scala2.12__databricks-universe__head__[REDACTED]__format-2 does not exist with exit code 2

Troubleshooting steps

  1. Check the DBR version in the cluster configuration.
  2. Review the Databricks Runtime release notes for EOS dates.
  3. Identify which DBR versions are currently supported.
  4. Check whether notebooks or jobs have DBR version dependencies.

Recommended fix

Update the cluster configuration to use a supported Databricks Runtime version. Review compatibility requirements for libraries and code before deploying to production. Contact Databricks support if you need assistance with DBR migration.

HIVEMETASTORE_CONNECTIVITY_FAILURE

The cluster cannot connect to the Hive Metastore because the required network port 3306 is not open from the classic compute plane to the Azure Databricks control plane.

Example error message

Azure Databricks classic compute plane will require connectivity to the Azure Databricks control plane over port 3306. Error message: Hive metastore connectivity check failed for <hms-url>

Troubleshooting steps

  1. If requiredNsgRules is set to AllRules on your workspaces, confirm that the Network Security Groups (NSGs) on your public or private subnets allow outbound connections from your virtual network to the AzureDatabricks service tag for port 3306.

  2. If you use back-end Private Link, confirm that your private endpoints allow connections on port 3306 if you are using an NSG on the private endpoint.

  3. Verify that all your firewall solutions allow port 3306 on the classic compute plane network of your workspaces.

  4. If you are using user-defined routes (UDRs), verify that you are using Azure Databricks service tags instead of allow-listing individual routes to the Azure Databricks service.

  5. To verify connectivity, start a single-node classic (non-serverless) compute resource using the latest Databricks Runtime version and run the following verification script in a notebook:

    import subprocess
    
    workspace_url = spark.conf.get("spark.databricks.workspaceUrl")
    host = subprocess.run(
        f"dig +short {workspace_url} | tail -n1",
        shell=True, capture_output=True, text=True
    ).stdout.strip()
    
    # Port 3306 check
    r1 = subprocess.run(
        f"nc -w2 -vz {host} 3306",
        shell=True, capture_output=True, text=True
    )
    if r1.returncode == 0:
        print("Port 3306 Connectivity: Success")
    else:
        print(f"Port 3306 Connectivity: Failure - check NSG/firewall/UDR configuration\n{r1.stdout}{r1.stderr}")
    
    # Metastore connectivity check
    r2 = subprocess.run(
        "openssl s_client -starttls mysql -tls1_2 -ignore_unexpected_eof "
        "-connect consolidated-westus-prod-metastore.mysql.database.azure.com:9207",
        shell=True, capture_output=True, text=True
    )
    if r2.returncode == 0:
        print("Metastore Connectivity: Success")
    else:
        print(f"Metastore Connectivity: Failure - check metastore connectivity\n{r2.stdout}{r2.stderr}")
    

    Both checks must show "Success". If either fails, review your NSG, firewall, and UDR configuration.

Recommended fix

Update your network configuration to allow outbound connectivity from the classic compute plane to the Azure Databricks control plane on port 3306, then rerun the validation script to confirm connectivity.

If you've completed your Unity Catalog migration or federated your Hive metastore as a foreign catalog governed by Unity Catalog, you can now disable access to the Hive metastore.

Contact your Azure Databricks account team or file a support request if the issue persists after verifying the network configuration.

INSTANCE_POOL_MAX_CAPACITY_REACHED

The instance pool has reached its configured maximum capacity limit and cannot provide additional instances.

Example error message

Instance pool is full, please consider increasing the pool size

Troubleshooting steps

  1. Check the instance pool configuration for the maximum capacity setting.
  2. Review how many instances are currently in use from the pool.
  3. Identify which clusters are using the pool.
  4. Check whether there are idle instances that can be freed.

Recommended fix

Increase the instance pool maximum capacity, create additional instance pools to distribute load, terminate idle clusters using the pool, or configure clusters to use different pools. Review pool sizing based on concurrent workload requirements.

INSTANCE_UNREACHABLE_DUE_TO_MISCONFIG

Instances are unreachable due to network misconfiguration, firewall rules, or connectivity issues.

Example error message

Bootstrap completes in the VM but control plane failed to reach the node. Please review your network configuration or firewall settings to allow Databricks to reach the node.

Troubleshooting steps

  1. Review firewall rules and network security settings for required inbound ports.
  2. Test connectivity from the control plane to the instance network.
  3. Check for asymmetric routing issues.
  4. Review firewall logs for dropped connections.
  5. Verify that instances have the correct security group assignments.

For Azure, also check NSG inbound rules on Azure Databricks subnets and verify that UDR route tables are correctly configured.

Recommended fix

Ensure security groups or NSGs allow required inbound traffic from the Azure Databricks control plane. Verify that route tables enable bidirectional communication. Contact Databricks support for assistance with network connectivity troubleshooting.

INVALID_ARGUMENT

Invalid configuration parameters, missing secrets, incorrect permissions, or misconfigured cluster settings prevented the cluster from starting.

Example error message

com.databricks.backend.manager.secret.SecretPermissionDeniedException: User does not have permission with scope: [REDACTED] and key: [REDACTED]

Troubleshooting steps

  1. Review the error message to identify the specific invalid parameter.
  2. For secret errors, verify the secret scope and key exist using the Databricks Secrets API.
  3. Check user or service principal permissions for accessing secrets.
  4. Review the cluster configuration for syntax errors.
  5. Check init scripts for configuration errors.

For Azure Key Vault-backed secrets, also verify network connectivity and DNS resolution to the Key Vault endpoint.
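For the DNS portion of that check, `socket.getaddrinfo` is enough to confirm resolution from a notebook. The Key Vault hostname below is a placeholder:

```python
import socket

def resolves(hostname: str) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, 443)) > 0
    except socket.gaierror:
        return False

# Replace with your actual Key Vault or secret-provider hostname
print(resolves("<your-vault>.vault.azure.net"))
```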

Recommended fix

Correct the invalid parameter based on the error message. For secrets, verify scope and key existence, check permissions, and ensure network connectivity to secret providers. Validate all cluster configuration against the documentation. Contact Databricks support if the configuration appears correct.

NETWORK_CHECK_CONTROL_PLANE_FAILURE

A pre-bootstrap network health check failed when attempting to reach the Azure Databricks control plane.

Example error message

Instance failed network health check before bootstrapping with fatal error: X_NHC_CONTROL_PLANE_UNREACHABLE
1 failed component(s): control_plane
Retryable: true

Troubleshooting steps

  1. Review cluster event logs for specific connection failure details.
  2. Test control plane connectivity from a VM in the same network.
  3. Check whether a firewall is intercepting or blocking traffic.

For Azure, also check NSG outbound rules on Azure Databricks subnets and verify UDR configuration. If using a firewall, ensure Azure Databricks service tags route to the internet.

Recommended fix

Verify that security group or NSG rules allow outbound traffic to the Azure Databricks control plane. If using UDR with a firewall, ensure Azure Databricks service tags route to the internet. Contact Databricks support if the network configuration has been verified as correct but the failure persists.

NETWORK_CONFIGURATION_FAILURE

A network configuration error is preventing proper VM or cluster network setup.

Example error message

Instance bootstrap failed command: AzureInvalidNic
Failure message: Azure instance did not set up route table correctly

Troubleshooting steps

  1. Review firewall and security group or NSG rules.
  2. Check route tables and routing configuration.
  3. Verify subnet configuration.
  4. Check for IP address conflicts.
  5. Verify DNS settings.

Recommended fix

Correct the network configuration based on the specific error. Ensure security group or NSG rules allow required traffic, verify that subnet CIDR ranges don't overlap, check that route tables are properly configured, and ensure DNS is functional. Contact Databricks support for network configuration review.
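The subnet-overlap check mentioned above can be done with Python's standard `ipaddress` module:

```python
import ipaddress

def cidrs_overlap(cidr_a: str, cidr_b: str) -> bool:
    """True if two CIDR ranges share any addresses."""
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))

print(cidrs_overlap("10.0.0.0/24", "10.0.0.128/25"))  # True: the /25 sits inside the /24
print(cidrs_overlap("10.0.0.0/24", "10.0.1.0/24"))    # False: disjoint ranges
```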

NPIP_TUNNEL_SETUP_FAILURE

The bootstrap script failed to set up the NPIP tunnel connection within the timeout. This occurs after the cloud provider launches the instance and the bootstrap script attempts to establish the SCC relay tunnel.

Example error message

Cluster '[REDACTED]' was terminated. Reason: NPIP_TUNNEL_SETUP_FAILURE (SERVICE_FAULT). Parameters: databricks_error_message:VM setup failed due to Ngrok setup timeout. [details] NPIP_TUNNEL_SETUP_FAILURE: Instance bootstrap failed command: WaitForNgrokTunnel Failure message: Timed out waiting for ngrok tunnel to be up(OnDemand), instance_id:[REDACTED]

Troubleshooting steps

  1. Check the network configuration between the SCC relay and the Azure Databricks compute plane subnets.
  2. Review firewall and proxy settings that might block tunnel setup on port 443 or 6666.

Recommended fix

Ensure network connectivity from the compute plane to the SCC relay endpoint. Launch an instance in the Azure Databricks compute plane VPC/VNet and check connectivity to the SCC relay:

nslookup <SCC relay fqdn>
nc -vz <SCC relay fqdn> 443

If there is a firewall or proxy, verify it allows traffic to the relay on the required ports. Consult the public network configuration documentation to confirm that the required egress rules are in place and that you can connect to the SCC relay endpoint from your VPC/VNet. If the issue persists even though your network configuration is correct, contact Databricks support.

REQUEST_THROTTLED

API requests to the cloud provider are being throttled due to rate limiting.

Example error message

TEMPORARILY_UNAVAILABLE: Too many requests from workspace [REDACTED]

Troubleshooting steps

  1. Check whether multiple clusters are launching simultaneously.
  2. Review API request rate limits for your account.
  3. Identify whether other services are making concurrent API calls.
  4. Check for automated systems making frequent requests.

Recommended fix

Reduce concurrent cluster launches, request an API rate limit increase from your cloud provider, implement exponential backoff in automation scripts, or stagger cluster launch times.
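A minimal exponential-backoff wrapper for automation scripts might look like this (a sketch; in practice, catch your API client's specific throttling exception rather than bare RuntimeError):

```python
import random
import time

def with_backoff(operation, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry operation with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except RuntimeError as exc:  # substitute your client's throttling error
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            print(f"Throttled ({exc}); retrying in {delay:.2f}s")
            time.sleep(delay)

# Demo with a stub that is throttled twice, then succeeds
attempts = {"count": 0}
def flaky_launch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("TEMPORARILY_UNAVAILABLE: Too many requests")
    return "cluster-launched"

print(with_backoff(flaky_launch, base_delay=0.1))  # → cluster-launched
```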

SPOT_INSTANCE_TERMINATION

Spot or preemptible instances were terminated by the cloud provider due to capacity needs or pricing changes.

Example error message

Server.SpotInstanceTermination: Spot instance termination

Troubleshooting steps

  1. Check the cluster event logs for the termination timestamp.
  2. Review spot pricing history in your region.
  3. Identify whether terminations occur at specific times.
  4. Check whether multiple instances were terminated simultaneously.

Recommended fix

Switch to on-demand instances for production workloads, implement job retry logic to handle interruptions, or use a mix of on-demand and spot instances. Spot instances are best for fault-tolerant workloads.

SPARK_STARTUP_FAILURE

The Spark driver failed to start within the configured timeout. This may occur when the driver daemon startup was not completed within the timeout (typically 200 seconds) on the cluster driver instance.

Example error messages

Cluster '[REDACTED]' was terminated. Reason: SPARK_STARTUP_FAILURE (SERVICE_FAULT). Parameters: databricks_error_message:Spark failed to start: DEADLINE_EXCEEDED.
Cluster '[REDACTED]' was terminated. Reason: SPARK_STARTUP_FAILURE (SERVICE_FAULT). Parameters: databricks_error_message:Spark failed to start: Timed out after 200 seconds.

Troubleshooting steps

  1. Review Spark configuration for misconfigurations (for example, invalid metastore URI or conflicting settings).
  2. Check your init scripts for potential errors that could delay or prevent driver startup.

Recommended fix

Remove custom Spark configs and init scripts to isolate the issue. Try a different instance type, as hardware slowness on smaller instances can cause driver startup timeouts. If the issue persists, contact Databricks support with the cluster ID and error details.

STORAGE_DOWNLOAD_FAILURE_SLOW

Downloading artifacts from Azure Databricks storage is failing or too slow due to network connectivity, firewall, or DNS issues.

Example error message

Instance bootstrap failed command: Command_UpdateWorker
Failure message: Trying DNS probe for: https://[REDACTED].blob.core.windows.net/update/worker-artifacts/...

Troubleshooting steps

  1. Check firewall rules for Azure Databricks storage endpoints.
  2. Verify DNS resolution for storage URLs.
  3. Test download speed from a VM in the same network.
  4. Review network bandwidth utilization.
  5. Check for proxy or network inspection devices.
  6. Verify routes to storage endpoints.

Recommended fix

Ensure firewall rules allow access to Azure Databricks storage endpoints.

Configure service endpoints for Azure Storage in your VNet.

Review and optimize network inspection devices if present. Contact Databricks support if connectivity to storage endpoints is verified but downloads still fail.

SUBNET_EXHAUSTED_FAILURE

The Azure subnet has run out of available IP addresses. Each Azure Databricks instance requires one IP address in both the private and public subnets.

Example error message

Subnet /subscriptions/[REDACTED]/resourceGroups/[REDACTED]/providers/Microsoft.Network/virtualNetworks/VNET_DATA01_DEV/subnets/SN_DATABRICKS_CORE_PUBLIC with address prefix [REDACTED]/26 does not have enough capacity for 1 IP addresses.

Troubleshooting steps

  1. Check the subnet CIDR range and available addresses in Azure Portal.
  2. Review the number of network interfaces in the subnet.
  3. Verify subnet configuration for the Azure Databricks workspace.
  4. Calculate IP requirements: (number of nodes) × (2 subnets) = total IPs needed.
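The capacity math from step 4 can be checked quickly. Note that Azure reserves 5 addresses in every subnet, which reduces the usable count:

```python
AZURE_RESERVED_IPS = 5  # Azure reserves 5 addresses per subnet

def usable_ips(prefix_length: int) -> int:
    """Usable addresses in an Azure subnet with the given CIDR prefix."""
    return 2 ** (32 - prefix_length) - AZURE_RESERVED_IPS

def subnet_has_capacity(prefix_length: int, nodes_needed: int, ips_in_use: int) -> bool:
    """Each node needs one IP in this subnet; check the remaining headroom."""
    return usable_ips(prefix_length) - ips_in_use >= nodes_needed

print(usable_ips(26))                                          # 59 usable addresses in a /26
print(subnet_has_capacity(26, nodes_needed=1, ips_in_use=59))  # False, as in the example
```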

Recommended fix

Use fewer, larger instances to reduce IP consumption, clean up unused resources in the subnet, or create a new workspace with larger subnet CIDR ranges. See Update workspace network configuration for more information on updating workspace subnet configuration. If changing subnet configuration is not possible, contact Databricks support for workspace migration assistance.

WORKSPACE_CONFIGURATION_ERROR

Workspace-level misconfiguration is preventing cluster launch, including issues with IAM roles or service principal permissions.

Troubleshooting steps

  1. Review recent changes to workspace configuration.
  2. Check the cloud provider console for policy or permission changes.
  3. Verify service principal permissions on all required resource groups.

Recommended fix

Verify that the service principal has the required permissions across all resource groups. Review workspace security configuration.

Contact Databricks support if the workspace configuration appears correct or if the cross-account role setup needs verification.