Compute configuration reference

This article explains all the configuration settings available in the Create Compute UI. Most users create compute using their assigned policies, which limits the configurable settings. If you don’t see a particular setting in your UI, it’s because the policy you’ve selected does not allow you to configure that setting.

The configurations and management tools described in this article apply to both all-purpose and job compute. For more considerations on configuring job compute, see Use Azure Databricks compute with your jobs.

Policies

Policies are a set of rules used to limit the configuration options available to users when they create compute. If a user doesn’t have the Unrestricted cluster creation entitlement, then they can only create compute using their granted policies.

To create compute according to a policy, select a policy from the Policy drop-down menu.

By default, all users have access to the Personal Compute policy, allowing them to create single-machine compute resources. If you need access to Personal Compute or any additional policies, reach out to your workspace admin.

Single-node or multi-node compute

Depending on the policy, you can choose between creating a Single node compute and a Multi node compute.

Single node compute is intended for jobs that use small amounts of data or non-distributed workloads such as single-node machine learning libraries. Multi-node compute should be used for larger jobs with distributed workloads.

Single node properties

A single node compute has the following properties:

  • Runs Spark locally.
  • Driver acts as both master and worker, with no worker nodes.
  • Spawns one executor thread per logical core in the compute, minus 1 core for the driver.
  • Saves all stderr, stdout, and log4j log outputs in the driver log.
  • Can’t be converted to a multi-node compute.
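
If you create single node compute through the Clusters API rather than the UI, the commonly documented approach is to request zero workers along with the single-node Spark configuration and tag. The following is a sketch; verify the exact fields against the Clusters API reference for your workspace:

    # Sketch of Clusters API request fields for a single node compute.
    # Field names follow the commonly documented single-node settings;
    # verify them against your API version before use.
    single_node_settings = {
        "num_workers": 0,
        "spark_conf": {
            "spark.databricks.cluster.profile": "singleNode",
            "spark.master": "local[*]",
        },
        "custom_tags": {"ResourceClass": "SingleNode"},
    }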

Selecting single or multi node

Consider your use case when deciding between a single or multi-node compute:

  • Large-scale data processing will exhaust the resources on a single node compute. For these workloads, Databricks recommends using a multi-node compute.

  • Single-node compute is not designed to be shared. To avoid resource conflicts, Databricks recommends using a multi-node compute when the compute must be shared.

  • A multi-node compute can’t be scaled to 0 workers. Use a single node compute instead.

  • Single-node compute is not compatible with process isolation.

  • GPU scheduling is not enabled on single node compute.

  • On single-node compute, Spark cannot read Parquet files with a UDT column. The following error message results:

    The Spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached.
    

    To work around this problem, disable the native Parquet reader:

    spark.conf.set("spark.databricks.io.parquet.nativeReader.enabled", False)
    

Access modes

Access mode is a security feature that determines who can use the compute and what data they can access via the compute. Every compute in Azure Databricks has an access mode.

Databricks recommends that you use shared access mode for all workloads. Only use the single user access mode if your required functionality is not supported by shared access mode.

| Access Mode | Visible to user | UC Support | Supported Languages | Notes |
| --- | --- | --- | --- | --- |
| Single user | Always | Yes | Python, SQL, Scala, R | Can be assigned to and used by a single user. Referred to as Assigned access mode in some workspaces. |
| Shared | Always (Premium plan required) | Yes | Python (on Databricks Runtime 11.3 LTS and above), SQL, Scala (on Unity Catalog-enabled compute using Databricks Runtime 13.3 LTS and above) | Can be used by multiple users with data isolation among users. |
| No Isolation Shared | Admins can hide this access mode by enforcing user isolation in the admin settings page. | No | Python, SQL, Scala, R | There is a related account-level setting for No Isolation Shared compute. |
| Custom | Hidden (for all new compute) | No | Python, SQL, Scala, R | This option is shown only if you have existing compute without a specified access mode. |

You can upgrade an existing compute to meet the requirements of Unity Catalog by setting its access mode to Single User or Shared.
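
If you manage compute with the Clusters API rather than the UI, the access mode is expressed through the data_security_mode field. The values shown here are assumptions and should be verified against the Clusters API reference for your workspace:

    # Sketch: choosing an access mode in a Clusters API request.
    # Values are assumptions; verify against the Clusters API reference.
    cluster_spec = {
        "data_security_mode": "USER_ISOLATION",  # Shared access mode
        # "data_security_mode": "SINGLE_USER",   # Single user access mode
        # "data_security_mode": "NONE",          # No Isolation Shared
    }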

Note

In Databricks Runtime 13.3 LTS and above, init scripts and libraries are supported on all access modes. Requirements and support vary. See Where can init scripts be installed? and Cluster-scoped libraries.

Databricks Runtime versions

Databricks Runtime is the set of core components that run on your compute. Select the runtime using the Databricks Runtime Version drop-down menu. For details on specific Databricks Runtime versions, see Databricks Runtime release notes versions and compatibility. All versions include Apache Spark. Databricks recommends the following:

  • For all-purpose compute, use the most current version to ensure you have the latest optimizations and the most up-to-date compatibility between your code and preloaded packages.
  • For job compute running operational workloads, consider using the Long Term Support (LTS) Databricks Runtime version. Using the LTS version helps you avoid compatibility issues and gives you time to thoroughly test your workload before upgrading.
  • For data science and machine learning use cases, consider the Databricks Runtime ML version.

Use Photon acceleration

Photon is enabled by default on compute running Databricks Runtime 9.1 LTS and above.

To enable or disable Photon acceleration, select the Use Photon Acceleration checkbox. To learn more about Photon, see What is Photon?.
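
If you configure compute through the Clusters API rather than the checkbox, Photon is typically controlled with the runtime_engine field. The field name and values are assumptions; confirm them for your API version:

    # Sketch: enabling or disabling Photon in a Clusters API create/edit request.
    # The runtime_engine field and its values are assumed; verify before use.
    cluster_spec = {
        "runtime_engine": "PHOTON",  # use "STANDARD" to disable Photon
    }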

Worker and driver node types

Compute consists of one driver node and zero or more worker nodes. You can pick separate cloud provider instance types for the driver and worker nodes, although by default the driver node uses the same instance type as the worker node. Different families of instance types fit different use cases, such as memory-intensive or compute-intensive workloads.

You can also select a pool to use as the worker or driver node. See What are Azure Databricks pools?.

Worker type

In multi-node compute, worker nodes run the Spark executors and other services required for the compute to function properly. When you distribute your workload with Spark, all the distributed processing happens on worker nodes. Azure Databricks runs one executor per worker node. Therefore, the terms executor and worker are used interchangeably in the context of the Databricks architecture.

Tip

To run a Spark job, you need at least one worker node. If the compute has zero workers, you can run non-Spark commands on the driver node, but Spark commands will fail.

Worker node IP addresses

Azure Databricks launches worker nodes with two private IP addresses each. The node’s primary private IP address hosts Azure Databricks internal traffic. The secondary private IP address is used by the Spark container for intra-cluster communication. This model allows Azure Databricks to provide isolation between multiple compute resources in the same workspace.

Driver type

The driver node maintains state information of all notebooks attached to the compute. The driver node also maintains the SparkContext, interprets all the commands you run from a notebook or a library on the compute, and runs the Apache Spark master that coordinates with the Spark executors.

The default value of the driver node type is the same as the worker node type. You can choose a larger driver node type with more memory if you are planning to collect() a lot of data from Spark workers and analyze it in the notebook.

Tip

Since the driver node maintains all of the state information of the notebooks attached, make sure to detach unused notebooks from the driver node.

GPU instance types

For computationally challenging tasks that demand high performance, like those associated with deep learning, Azure Databricks supports compute accelerated with graphics processing units (GPUs). For more information, see GPU-enabled compute.

Azure confidential computing VMs

Azure confidential computing VM types prevent unauthorized access to data while it’s in use, including from the cloud operator. This VM type is beneficial to highly regulated industries and regions, as well as businesses with sensitive data in the cloud. For more information on Azure’s confidential computing, see Azure confidential computing.

To run your workloads using Azure confidential computing VMs, select from the DC or EC series VM types in the worker and driver node dropdowns. See Azure Confidential VM options.

Spot instances

To save cost, you can choose to use spot instances, also known as Azure Spot VMs, by checking the Spot instances checkbox.

The first instance will always be on-demand (the driver node is always on-demand) and subsequent instances will be spot instances.

If instances are evicted due to unavailability, Azure Databricks will attempt to acquire new spot instances to replace the evicted instances. If spot instances can’t be acquired, on-demand instances are deployed to replace the evicted instances. Additionally, when new nodes are added to existing compute, Azure Databricks will attempt to acquire spot instances for those nodes.
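
If you create compute through the Clusters API, spot behavior is typically configured in the azure_attributes field. The field names and values below are assumptions; check the Clusters API reference before relying on them:

    # Sketch: requesting spot instances with fallback to on-demand
    # (field names and values assumed; verify against the Clusters API).
    azure_attributes = {
        "first_on_demand": 1,                        # keep the driver on-demand
        "availability": "SPOT_WITH_FALLBACK_AZURE",  # fall back if spot capacity is unavailable
        "spot_bid_max_price": -1,                    # -1 = pay up to the on-demand price
    }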

Enable autoscaling

When Enable autoscaling is checked, you can provide a minimum and maximum number of workers for the compute. Databricks then chooses the appropriate number of workers required to run your job.

To set the minimum and the maximum number of workers your compute will autoscale between, use the Min workers and Max workers fields next to the Worker type dropdown.

If you don’t enable autoscaling, enter a fixed number of workers in the Workers field next to the Worker type dropdown.
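
The same choice exists when you configure compute through the Clusters API: provide either a fixed num_workers value or an autoscale range, as in this sketch (worker counts are illustrative):

    # Sketch: fixed size vs. autoscaling in a Clusters API request.
    fixed_size = {"num_workers": 8}
    autoscaling = {"autoscale": {"min_workers": 2, "max_workers": 8}}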

Note

When the compute is running, the compute details page displays the number of allocated workers. You can compare the number of allocated workers with the worker configuration and make adjustments as needed.

Benefits of autoscaling

With autoscaling, Azure Databricks dynamically reallocates workers to account for the characteristics of your job. Certain parts of your pipeline may be more computationally demanding than others, and Databricks automatically adds additional workers during these phases of your job (and removes them when they’re no longer needed).

Autoscaling makes it easier to achieve high utilization because you don’t need to provision the compute to match a workload. This applies especially to workloads whose requirements change over time (like exploring a dataset during the course of a day), but it can also apply to a one-time shorter workload whose provisioning requirements are unknown. Autoscaling thus offers two advantages:

  • Workloads can run faster compared to a constant-sized under-provisioned compute.
  • Autoscaling can reduce overall costs compared to a statically-sized compute.

Depending on the constant size of the compute and the workload, autoscaling gives you one or both of these benefits at the same time. The compute size can go below the minimum number of workers selected when the cloud provider terminates instances. In this case, Azure Databricks continuously retries to re-provision instances in order to maintain the minimum number of workers.

Note

Autoscaling is not available for spark-submit jobs.

Note

Compute autoscaling has limitations when scaling down cluster size for Structured Streaming workloads. Databricks recommends using Delta Live Tables with Enhanced Autoscaling for streaming workloads. See Optimize the cluster utilization of Delta Live Tables pipelines with Enhanced Autoscaling.

How autoscaling behaves

Workspaces on the Premium and Enterprise pricing plans use optimized autoscaling. Workspaces on the standard pricing plan use standard autoscaling.

Optimized autoscaling has the following characteristics:

  • Scales up from min to max in 2 steps.
  • Can scale down, even if the compute is not idle, by looking at the shuffle file state.
  • Scales down based on a percentage of current nodes.
  • On job compute, scales down if the compute is underutilized over the last 40 seconds.
  • On all-purpose compute, scales down if the compute is underutilized over the last 150 seconds.
  • The spark.databricks.aggressiveWindowDownS Spark configuration property specifies in seconds how often the compute makes down-scaling decisions. Increasing the value causes the compute to scale down more slowly. The maximum value is 600.
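
For example, to make the compute scale down more slowly, you could add a line like the following to the Spark config field (300 is an illustrative value; the documented maximum is 600):

    spark.databricks.aggressiveWindowDownS 300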

Standard autoscaling is used in standard plan workspaces. Standard autoscaling has the following characteristics:

  • Starts with adding 8 nodes. Then scales up exponentially, taking as many steps as required to reach the max.
  • Scales down when 90% of the nodes are not busy for 10 minutes and the compute has been idle for at least 30 seconds.
  • Scales down exponentially, starting with 1 node.

Autoscaling with pools

If you are attaching your compute to a pool, consider the following:

  • Make sure the compute size requested is less than or equal to the minimum number of idle instances in the pool. If it is larger, compute startup time will be equivalent to compute that doesn’t use a pool.
  • Make sure the maximum compute size is less than or equal to the maximum capacity of the pool. If it is larger, the compute creation will fail.

Autoscaling example

If you reconfigure a static compute to autoscale, Azure Databricks immediately resizes the compute within the minimum and maximum bounds and then starts autoscaling. As an example, the following table demonstrates what happens to compute with a certain initial size if you reconfigure the compute to autoscale between 5 and 10 nodes.

| Initial size | Size after reconfiguration |
| --- | --- |
| 6 | 6 |
| 12 | 10 |
| 3 | 5 |

Enable autoscaling local storage

It can often be difficult to estimate how much disk space a particular job will take. To save you from having to estimate how many gigabytes of managed disk to attach to your compute at creation time, Azure Databricks automatically enables autoscaling local storage on all Azure Databricks compute.

With autoscaling local storage, Azure Databricks monitors the amount of free disk space available on your compute’s Spark workers. If a worker begins to run too low on disk, Databricks automatically attaches a new managed disk to the worker before it runs out of disk space. Disks are attached up to a limit of 5 TB of total disk space per virtual machine (including the virtual machine’s initial local storage).

The managed disks attached to a virtual machine are detached only when the virtual machine is returned to Azure. That is, managed disks are never detached from a virtual machine as long as they are part of a running compute. To scale down managed disk usage, Azure Databricks recommends using this feature in compute configured with autoscaling or automatic termination.

Local disk encryption

Important

This feature is in Public Preview.

Some instance types you use to run compute may have locally attached disks. Azure Databricks may store shuffle data or ephemeral data on these locally attached disks. To ensure that all data at rest is encrypted for all storage types, including shuffle data that is stored temporarily on your compute’s local disks, you can enable local disk encryption.

Important

Your workloads may run more slowly because of the performance impact of reading and writing encrypted data to and from local volumes.

When local disk encryption is enabled, Azure Databricks generates an encryption key locally that is unique to each compute node and is used to encrypt all data stored on local disks. The scope of the key is local to each compute node and is destroyed along with the compute node itself. During its lifetime, the key resides in memory for encryption and decryption and is stored encrypted on the disk.

To enable local disk encryption, you must use the Clusters API. During compute creation or edit, set enable_local_disk_encryption to true.
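
Because this setting is available only through the Clusters API, a create request might look like the following sketch. The endpoint version, authentication approach, and the placeholder fields (cluster_name, spark_version, node_type_id, num_workers) are illustrative; only enable_local_disk_encryption is the setting described above:

    # Sketch: creating compute with local disk encryption enabled via the
    # Clusters API. Workspace URL, token handling, and the other fields are
    # illustrative placeholders.
    import os
    import requests

    workspace_url = "https://<your-workspace>.azuredatabricks.net"
    payload = {
        "cluster_name": "encrypted-local-disks",
        "spark_version": "<runtime-version>",
        "node_type_id": "<vm-type>",
        "num_workers": 2,
        "enable_local_disk_encryption": True,
    }
    response = requests.post(
        f"{workspace_url}/api/2.0/clusters/create",
        headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
        json=payload,
    )
    response.raise_for_status()
    print(response.json()["cluster_id"])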

Tags

Tags allow you to easily monitor the cost of cloud resources used by various groups in your organization. Specify tags as key-value pairs when you create compute, and Azure Databricks applies these tags to cloud resources like VMs and disk volumes, as well as DBU usage reports.

For compute launched from pools, the custom tags are only applied to DBU usage reports and do not propagate to cloud resources.

For detailed information about how pool and compute tag types work together, see Monitor usage using tags.

To add tags to your compute:

  1. In the Tags section, add a key-value pair for each custom tag.
  2. Click Add.
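
If you create compute through the Clusters API instead, custom tags go in the custom_tags field as a simple key-value mapping. The keys and values below are illustrative:

    # Sketch: custom tags in a Clusters API request (example keys and values).
    custom_tags = {
        "team": "data-engineering",
        "cost-center": "12345",
    }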

Spark configuration

To fine-tune Spark jobs, you can provide custom Spark configuration properties.

  1. On the compute configuration page, click the Advanced Options toggle.

  2. Click the Spark tab.

    In Spark config, enter the configuration properties as one key-value pair per line.

When you configure compute using the Clusters API, set Spark properties in the spark_conf field in the Create cluster API or Update cluster API.
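
For example, the key-value pairs you enter in the UI correspond to a mapping in the spark_conf field. This sketch reuses the Parquet reader property shown earlier in this article:

    # Sketch: Spark configuration properties in the spark_conf field of a
    # Clusters API create/update request.
    spark_conf = {
        "spark.databricks.io.parquet.nativeReader.enabled": "false",
    }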

To enforce Spark configurations on compute, workspace admins can use compute policies.

Retrieve a Spark configuration property from a secret

Databricks recommends storing sensitive information, such as passwords, in a secret instead of plaintext. To reference a secret in the Spark configuration, use the following syntax:

spark.<property-name> {{secrets/<scope-name>/<secret-name>}}

For example, to set a Spark configuration property called password to the value of the secret stored in secrets/acme_app/password:

spark.password {{secrets/acme_app/password}}

For more information, see Syntax for referencing secrets in a Spark configuration property or environment variable.

SSH access to compute

For security reasons, in Azure Databricks the SSH port is closed by default. If you want to enable SSH access to your Spark clusters, see SSH to the driver node.

Note

SSH can be enabled only if your workspace is deployed in your own Azure virtual network.

Environment variables

Configure custom environment variables that you can access from init scripts running on the compute. Databricks also provides predefined environment variables that you can use in init scripts. You cannot override these predefined environment variables.

  1. On the compute configuration page, click the Advanced Options toggle.

  2. Click the Spark tab.

  3. Set the environment variables in the Environment Variables field.

You can also set environment variables using the spark_env_vars field in the Create cluster API or Update cluster API.
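
For example, spark_env_vars is a simple key-value mapping (the variable names and values below are illustrative):

    # Sketch: environment variables in the spark_env_vars field of a
    # Clusters API request. Names and values are illustrative.
    spark_env_vars = {
        "ENVIRONMENT": "staging",
        "MY_APP_LOG_LEVEL": "INFO",
    }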

Compute log delivery

When you create compute, you can specify a location to deliver the logs for the Spark driver node, worker nodes, and events. Logs are delivered every five minutes and archived hourly in your chosen destination. When a compute is terminated, Azure Databricks guarantees to deliver all logs generated up until the compute was terminated.

The destination of the logs depends on the compute’s cluster_id. If the specified destination is dbfs:/cluster-log-delivery, compute logs for 0630-191345-leap375 are delivered to dbfs:/cluster-log-delivery/0630-191345-leap375.

To configure the log delivery location:

  1. On the compute page, click the Advanced Options toggle.
  2. Click the Logging tab.
  3. Select a destination type.
  4. Enter the compute log path.

Note

This feature is also available in the REST API. See the Clusters API.
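
For example, a cluster_log_conf entry pointing at the DBFS destination mentioned above might look like this sketch (verify the structure against the Clusters API reference):

    # Sketch: delivering compute logs to DBFS via the cluster_log_conf field.
    cluster_log_conf = {
        "dbfs": {"destination": "dbfs:/cluster-log-delivery"}
    }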