Create a cluster

Note

These instructions are for the updated create cluster UI. To switch to the legacy create cluster UI, click UI Preview at the top of the create cluster page and toggle the setting to off. For documentation on the legacy UI, see Configure clusters. For a comparison of the new and legacy cluster types, see Clusters UI changes and cluster access modes.

This article explains the configuration options available when you create and edit Azure Databricks clusters. It focuses on creating and editing clusters using the UI. For other methods, see Clusters CLI, Clusters API 2.0, and Databricks Terraform provider.

The cluster creation user interface lets you choose the cluster configuration specifics, including:

Access the cluster creation interface

To create a cluster using the user interface, either click the Create Cluster button in the Compute section or click New > Cluster in your workspace’s side navigation.

Cluster configuration UI

Note

You can also use the Azure Databricks Terraform provider to create a cluster.

Cluster policy

Cluster policies are a set of rules used to limit the configuration options available to users when they create a cluster. Cluster policies have ACLs that regulate which specific users and groups have access to certain policies.

By default, all users have access to the Personal Compute policy, allowing them to easily create single-machine compute resources. If you don’t see the Personal Compute policy as an option when you create a cluster, then you haven’t been given access to the policy. Please contact your administrator to request access to the Personal Compute policy or an appropriate equivalent policy.

To configure a cluster according to a policy, select a cluster policy from the Policy dropdown.

What is cluster access mode?

The Access mode dropdown has replaced the Security mode dropdown. Access modes are standardized as follows:

Access mode dropdown Visible to user Unity Catalog support Supported languages
Single user Always Yes Python, SQL, Scala, R
Shared Always (Premium plan required) Yes Python, SQL
No isolation shared Admins can hide this cluster type by enforcing user isolation in the admin console. Also see a related account-level setting for No Isolation Shared clusters. No Python, SQL, Scala, R
Custom This option will only be shown for existing clusters without access modes. If a cluster was created with the legacy cluster modes, for example Standard or High Concurrency, Databricks shows this value for access mode when you are using the new UI. This value is not an option for creating new clusters. No Python, SQL, Scala, R

Important

Access mode in the Clusters API is not yet supported.

Databricks Runtime version

Databricks Runtimes are the set of core components that run on your clusters. All Databricks Runtimes include Apache Spark and add components and updates that improve usability, performance, and security. For details, see Databricks runtimes.

Azure Databricks offers several types of runtimes and several versions of those runtime types. You select the cluster’s runtime using the Databricks Runtime Version dropdown when you create or edit a cluster.

Cluster node type

A cluster consists of one driver node and zero or more worker nodes. You can pick separate cloud provider instance types for the driver and worker nodes, although by default the driver node uses the same instance type as the worker node. Different families of instance types fit different use cases, such as memory-intensive or compute-intensive workloads.

Driver node

The driver node maintains state information of all notebooks attached to the cluster. The driver node also maintains the SparkContext, interprets all the commands you run from a notebook or a library on the cluster, and runs the Apache Spark master that coordinates with the Spark executors.

The default value of the driver node type is the same as the worker node type. You can choose a larger driver node type with more memory if you are planning to collect() a lot of data from Spark workers and analyze them in the notebook.

Tip

Since the driver node maintains all of the state information of the notebooks attached, make sure to detach unused notebooks from the driver node.

Worker node

Azure Databricks worker nodes run the Spark executors and other services required for proper functioning clusters. When you distribute your workload with Spark, all the distributed processing happens on worker nodes. Azure Databricks runs one executor per worker node. Therefore, the terms executor and worker are used interchangeably in the context of the Databricks architecture.

Tip

To run a Spark job, you need at least one worker node. If a cluster has zero workers, you can run non-Spark commands on the driver node, but Spark commands will fail.

Worker node IP addresses

Azure Databricks launches worker nodes with two private IP addresses each. The node’s primary private IP address hosts Azure Databricks internal traffic. The secondary private IP address is used by the Spark container for intra-cluster communication. This model allows Azure Databricks to provide isolation between multiple clusters in the same workspace.

GPU instance types

For computationally challenging tasks that demand high performance, like those associated with deep learning, Azure Databricks supports clusters accelerated with graphics processing units (GPUs). For more information, see GPU-enabled clusters.

Cluster size and autoscaling

When you create a Azure Databricks cluster, you can either provide a fixed number of workers for the cluster or provide a minimum and maximum number of workers for the cluster.

When you provide a fixed size cluster, Azure Databricks ensures that your cluster has the specified number of workers. When you provide a range for the number of workers, Databricks chooses the appropriate number of workers required to run your job. This is referred to as autoscaling.

With autoscaling, Azure Databricks dynamically reallocates workers to account for the characteristics of your job. Certain parts of your pipeline may be more computationally demanding than others, and Databricks automatically adds additional workers during these phases of your job (and removes them when they’re no longer needed).

Autoscaling makes it easier to achieve high cluster utilization, because you don’t need to provision the cluster to match a workload. This applies especially to workloads whose requirements change over time (like exploring a dataset during the course of a day), but it can also apply to a one-time shorter workload whose provisioning requirements are unknown. Autoscaling thus offers two advantages:

  • Workloads can run faster compared to a constant-sized under-provisioned cluster.
  • Autoscaling clusters can reduce overall costs compared to a statically-sized cluster.

Depending on the constant size of the cluster and the workload, autoscaling gives you one or both of these benefits at the same time. The cluster size can go below the minimum number of workers selected when the cloud provider terminates instances. In this case, Azure Databricks continuously retries to re-provision instances in order to maintain the minimum number of workers.

Note

Autoscaling is not available for spark-submit jobs.

How autoscaling behaves

  • Scales up from min to max in 2 steps.
  • Can scale down, even if the cluster is not idle, by looking at shuffle file state.
  • Scales down based on a percentage of current nodes.
  • On job clusters, scales down if the cluster is underutilized over the last 40 seconds.
  • On all-purpose clusters, scales down if the cluster is underutilized over the last 150 seconds.
  • The spark.databricks.aggressiveWindowDownS Spark configuration property specifies in seconds how often a cluster makes down-scaling decisions. Increasing the value causes a cluster to scale down more slowly. The maximum value is 600.

Enable and configure autoscaling

To allow Azure Databricks to resize your cluster automatically, you enable autoscaling for the cluster and provide the min and max range of workers.

  1. Enable autoscaling.

    • All-Purpose cluster - On the cluster creation and edit page, select the Enable autoscaling checkbox in the Autopilot Options box:

      Enable autoscaling for interactive clusters

    • Job cluster - On the cluster creation and edit page, select the Enable autoscaling checkbox in the Autopilot Options box:

      Enable autoscaling for job clusters

  2. Configure the min and max workers.

Important

If you are using an instance pool:

  • Make sure the cluster size requested is less than or equal to the minimum number of idle instances in the pool. If it is larger, cluster startup time will be equivalent to a cluster that doesn’t use a pool.
  • Make sure the maximum cluster size is less than or equal to the maximum capacity of the pool. If it is larger, the cluster creation will fail.

Autoscaling example

If you reconfigure a static cluster to be an autoscaling cluster, Azure Databricks immediately resizes the cluster within the minimum and maximum bounds and then starts autoscaling. As an example, the following table demonstrates what happens to clusters with a certain initial size if you reconfigure a cluster to autoscale between 5 and 10 nodes.

Initial size Size after reconfiguration
6 6
12 10
3 5

Autoscaling local storage

It can often be difficult to estimate how much disk space a particular job will take. To save you from having to estimate how many gigabytes of managed disk to attach to your cluster at creation time, Azure Databricks automatically enables autoscaling local storage on all Azure Databricks clusters.

With autoscaling local storage, Azure Databricks monitors the amount of free disk space available on your cluster’s Spark workers. If a worker begins to run too low on disk, Databricks automatically attaches a new managed disk to the worker before it runs out of disk space. Disks are attached up to a limit of 5 TB of total disk space per virtual machine (including the virtual machine’s initial local storage).

The managed disks attached to a virtual machine are detached only when the virtual machine is returned to Azure. That is, managed disks are never detached from a virtual machine as long as they are part of a running cluster. To scale down managed disk usage, Azure Databricks recommends using this feature in a cluster configured with Cluster size and autoscaling or Automatic termination.

Local disk encryption

Important

This feature is in Public Preview.

Some instance types you use to run clusters may have locally attached disks. Azure Databricks may store shuffle data or ephemeral data on these locally attached disks. To ensure that all data at rest is encrypted for all storage types, including shuffle data that is stored temporarily on your cluster’s local disks, you can enable local disk encryption.

Important

Your workloads may run more slowly because of the performance impact of reading and writing encrypted data to and from local volumes.

When local disk encryption is enabled, Azure Databricks generates an encryption key locally that is unique to each cluster node, and is used to encrypt all data stored on local disks. The scope of the key is local to each cluster node and is destroyed along with the cluster node itself. During its lifetime, the key resides in memory for encryption and decryption, and is stored encrypted on the disk.

To enable local disk encryption, you must use the Clusters API 2.0. During cluster creation or edit, set:

{
  "enable_local_disk_encryption": true
}

See Create and Edit in the Clusters API reference for examples of how to invoke these APIs.

Here is an example of a cluster create call that enables local disk encryption:

{
  "cluster_name": "my-cluster",
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "Standard_D3_v2",
  "enable_local_disk_encryption": true,
  "spark_conf": {
    "spark.speculation": true
  },
  "num_workers": 25
}