Install Databricks Connect for Python

Note

This article covers Databricks Connect for Databricks Runtime 13.3 LTS and above.

This article describes how to install Databricks Connect for Python. For an overview, see What is Databricks Connect? For the Scala version of this article, see Install Databricks Connect for Scala.

Requirements

To install Databricks Connect for Python, the following requirements must be met:

  • If you are connecting to serverless compute, your workspace must meet the requirements for serverless compute.

  • If you are connecting to a cluster, your target cluster must meet the cluster configuration requirements, which include Databricks Runtime version requirements.

  • You must have Python 3 installed on your development machine, and its minor version must meet the version requirements in the following table.

    Databricks Connect version   Compute type   Compatible Python version
    15.3                         Cluster        3.11
    15.2                         Cluster        3.11
    15.1                         Cluster        3.11
    15.1                         Serverless     3.10
    13.3 LTS to 14.3 LTS         Cluster        3.10

  • If you want to use PySpark UDFs, your development machine’s installed minor version of Python must match the minor version of Python that is included with the Databricks Runtime installed on the cluster or serverless compute. To find the minor Python version of your cluster, refer to the System environment section of the Databricks Runtime release notes for your cluster or serverless compute. See Databricks Runtime release notes versions and compatibility and Serverless compute release notes.
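
The version requirements above can be checked before installing anything. The following is a minimal sketch that encodes only the rows of the table in this article. The function names required_python and local_python_matches are hypothetical helpers, not part of Databricks Connect.

```python
import sys


def required_python(connect_version: str, compute_type: str):
    """Return the (major, minor) Python version listed in the table, or None.

    Only the rows shown in this article are encoded; any other combination
    returns None rather than guessing.
    """
    version = tuple(int(part) for part in connect_version.split(".")[:2])
    compute = compute_type.lower()
    if compute == "serverless":
        return (3, 10) if version == (15, 1) else None
    if compute == "cluster":
        if (15, 1) <= version <= (15, 3):
            return (3, 11)
        if (13, 3) <= version <= (14, 3):
            return (3, 10)
    return None


def local_python_matches(connect_version: str, compute_type: str, local=None) -> bool:
    """Compare the local interpreter (or an explicit tuple) with the table."""
    local = tuple(local) if local is not None else tuple(sys.version_info[:2])
    return required_python(connect_version, compute_type) == local
```

For example, local_python_matches("14.3", "cluster") reports whether the interpreter that runs the script is a Python 3.10 interpreter.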

Activate a Python virtual environment

Databricks strongly recommends that you have a Python virtual environment activated for each Python version that you use with Databricks Connect. Python virtual environments help to make sure that you are using the correct versions of Python and Databricks Connect together. For more information about these tools and how to activate them, see venv or Poetry.
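
To confirm that a virtual environment is actually active before you install the client, a short check such as the following can help. This is a sketch that relies on the standard library only. The function name in_virtual_environment is a hypothetical helper.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a virtual environment, sys.prefix points at the environment
    # directory while sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Python version:", ".".join(str(part) for part in sys.version_info[:3]))
    print("Virtual environment active:", in_virtual_environment())
```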

Install the Databricks Connect client

This section describes how to install the Databricks Connect client with venv or Poetry.

Note

If you already have the Databricks extension for Visual Studio Code installed, you do not need to follow these setup instructions, because the Databricks extension for Visual Studio Code already has built-in support for Databricks Connect for Databricks Runtime 13.3 LTS and above. Skip to Debug code by using Databricks Connect for the Databricks extension for Visual Studio Code.

Install the Databricks Connect client with venv

  1. With your virtual environment activated, uninstall PySpark, if it is already installed, by running the uninstall command. This is required because the databricks-connect package conflicts with PySpark. For details, see Conflicting PySpark installations. To check whether PySpark is already installed, run the show command.

    # Is PySpark already installed?
    pip3 show pyspark
    
    # Uninstall PySpark
    pip3 uninstall pyspark
    
  2. With your virtual environment still activated, install the Databricks Connect client by running the install command. Use the --upgrade option to upgrade any existing client installation to the specified version.

    pip3 install --upgrade "databricks-connect==14.3.*"  # Or X.Y.* to match your cluster version.
    

    Note

    Databricks recommends that you append the “dot-asterisk” notation to specify databricks-connect==X.Y.* instead of databricks-connect==X.Y, to make sure that the most recent package is installed. While this is not a requirement, it helps make sure that you can use the latest supported features for that cluster.
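
The PySpark conflict described in step 1 can also be detected from Python. The following sketch uses the standard importlib.metadata module to look up installed distributions by name. The helper names installed_version and has_pyspark_conflict are hypothetical.

```python
from importlib import metadata


def installed_version(distribution: str):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def has_pyspark_conflict() -> bool:
    """True when a standalone pyspark distribution is installed next to databricks-connect."""
    return (
        installed_version("pyspark") is not None
        and installed_version("databricks-connect") is not None
    )
```

If installed_version("pyspark") returns a version string in the environment where you plan to install the client, uninstall PySpark first as described above.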

Skip ahead to Configure connection properties.

Install the Databricks Connect client with Poetry

  1. With your virtual environment activated, uninstall PySpark, if it is already installed, by running the remove command. This is required because the databricks-connect package conflicts with PySpark. For details, see Conflicting PySpark installations. To check whether PySpark is already installed, run the show command.

    # Is PySpark already installed?
    poetry show pyspark
    
    # Uninstall PySpark
    poetry remove pyspark
    
  2. With your virtual environment still activated, install the Databricks Connect client by running the add command.

    poetry add databricks-connect@~14.3  # Or X.Y to match your cluster version.
    

    Note

    Databricks recommends that you use the “at-tilde” notation to specify databricks-connect@~14.3 instead of databricks-connect==14.3, to make sure that the most recent package is installed. While this is not a requirement, it helps make sure that you can use the latest supported features for that cluster.
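
Both the pip specifier ==X.Y.* and the Poetry specifier @~X.Y accept any release in the X.Y series, which is why the installer can pick the newest one. The following sketch illustrates that selection for purely numeric version strings. The helper names and the version numbers in the usage line are illustrative assumptions, not real release numbers.

```python
def in_minor_series(version: str, series: str) -> bool:
    """Return True if `version` (for example "14.3.2") belongs to series "14.3"."""
    wanted = series.split(".")
    return version.split(".")[: len(wanted)] == wanted


def newest_in_series(available, series: str):
    """Return the highest numeric version in the series, or None if none match."""
    matching = [version for version in available if in_minor_series(version, series)]
    return max(
        matching,
        key=lambda version: tuple(int(part) for part in version.split(".")),
        default=None,
    )
```

For example, newest_in_series(["14.3.0", "14.3.2", "15.1.0"], "14.3") selects "14.3.2" and ignores the 15.1 release.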

Configure connection properties

In this section, you configure properties to establish a connection between Databricks Connect and your Azure Databricks cluster or serverless compute, which includes the following:

Configure a connection to a cluster

To configure a connection to a cluster, you will need the ID of your cluster. You can obtain the cluster ID from the URL. See Cluster URL and ID.

You can configure the connection to your cluster in one of the following ways. Databricks Connect searches for configuration properties in the following order, and uses the first configuration it finds. For advanced configuration information, see Advanced usage of Databricks Connect for Python.

  1. The DatabricksSession class’s remote() method
  2. A Databricks configuration profile
  3. The DATABRICKS_CONFIG_PROFILE environment variable
  4. An environment variable for each configuration property
  5. A Databricks configuration profile named DEFAULT
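
The "first configuration found wins" rule can be pictured with a small model. The following sketch is only an illustration of the order listed above and is not how Databricks Connect is implemented. The function pick_config_source, its parameters, and the returned labels are hypothetical, and detecting per-property environment variables through DATABRICKS_CLUSTER_ID alone is a simplification.

```python
def pick_config_source(remote_args=None, profile_in_code=None, env=None, profiles=()):
    """Return a label for the first configuration source that is present.

    remote_args     -- values passed to DatabricksSession.builder.remote(), if any
    profile_in_code -- profile name set in code, if any
    env             -- mapping of environment variables
    profiles        -- names of the profiles found in the configuration file
    """
    env = env or {}
    if remote_args:
        return "remote()"
    if profile_in_code:
        return f"profile:{profile_in_code}"
    if env.get("DATABRICKS_CONFIG_PROFILE"):
        return f"env-profile:{env['DATABRICKS_CONFIG_PROFILE']}"
    if env.get("DATABRICKS_CLUSTER_ID"):  # simplification, see the lead-in
        return "env-properties"
    if "DEFAULT" in profiles:
        return "profile:DEFAULT"
    return None
```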

The DatabricksSession class’s remote() method

For this option, which applies to Azure Databricks personal access token authentication only, specify the workspace instance name, the Azure Databricks personal access token, and the ID of the cluster.

You can initialize the DatabricksSession class in several ways, as follows:

  • Set the host, token, and cluster_id fields in DatabricksSession.builder.remote().
  • Use the Databricks SDK’s Config class.
  • Specify a Databricks configuration profile along with the cluster_id field.
  • Set the Spark Connect connection string in DatabricksSession.builder.remote().

Instead of specifying these connection properties in your code, Databricks recommends configuring properties through environment variables or configuration files, as described throughout this section. The following code examples assume that you provide some implementation of the proposed retrieve_* functions to get the necessary properties from the user or from some other configuration store, such as Azure Key Vault.

The code for each of these approaches is as follows:

# Set the host, token, and cluster_id fields in DatabricksSession.builder.remote.
# If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
# cluster's ID, you do not also need to set the cluster_id field here.
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.remote(
   host       = f"https://{retrieve_workspace_instance_name()}",
   token      = retrieve_token(),
   cluster_id = retrieve_cluster_id()
).getOrCreate()

# Use the Databricks SDK's Config class.
# If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
# cluster's ID, you do not also need to set the cluster_id field here.
from databricks.connect import DatabricksSession
from databricks.sdk.core import Config

config = Config(
   host       = f"https://{retrieve_workspace_instance_name()}",
   token      = retrieve_token(),
   cluster_id = retrieve_cluster_id()
)

spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()

# Specify a Databricks configuration profile along with the `cluster_id` field.
# If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
# cluster's ID, you do not also need to set the cluster_id field here.
from databricks.connect import DatabricksSession
from databricks.sdk.core import Config

config = Config(
   profile    = "<profile-name>",
   cluster_id = retrieve_cluster_id()
)

spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()
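
The remaining approach in the list, the Spark Connect connection string, can be sketched as follows. The sc:// string layout and the x-databricks-cluster-id parameter are assumptions based on the commonly documented Databricks form, so verify them against the current reference before relying on them. The helper names build_connection_string and create_session are hypothetical.

```python
def build_connection_string(workspace_instance_name: str, token: str, cluster_id: str) -> str:
    """Assemble a Spark Connect connection string (assumed layout, see the lead-in)."""
    return (
        f"sc://{workspace_instance_name}:443/"
        f";token={token}"
        f";x-databricks-cluster-id={cluster_id}"
    )


def create_session(connection_string: str):
    """Create a session from a connection string."""
    # Imported here so that build_connection_string() can be used and tested
    # on a machine where the Databricks Connect client is not installed.
    from databricks.connect import DatabricksSession

    return DatabricksSession.builder.remote(connection_string).getOrCreate()
```

As with the other examples, the three arguments would come from your own retrieve_* functions rather than from literals in the code.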

A Databricks configuration profile

For this option, create or identify an Azure Databricks configuration profile containing the field cluster_id and any other fields that are necessary for the Databricks authentication type that you want to use.

The required configuration profile fields for each authentication type are as follows:

Then set the name of this configuration profile through the Config class.

You can specify cluster_id in a few ways, as follows:

  • Include the cluster_id field in your configuration profile, and then just specify the configuration profile’s name.
  • Specify the configuration profile name along with the cluster_id field.

If you have already set the DATABRICKS_CLUSTER_ID environment variable with the cluster’s ID, you do not also need to specify cluster_id.

The code for each of these approaches is as follows:

# Include the cluster_id field in your configuration profile, and then
# just specify the configuration profile's name:
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.profile("<profile-name>").getOrCreate()

# Specify the configuration profile name along with the cluster_id field.
# In this example, retrieve_cluster_id() assumes some custom implementation that
# you provide to get the cluster ID from the user or from some other
# configuration store:
from databricks.connect import DatabricksSession
from databricks.sdk.core import Config

config = Config(
   profile    = "<profile-name>",
   cluster_id = retrieve_cluster_id()
)

spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()
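
Before pointing Databricks Connect at a profile, it can be useful to confirm that the profile really contains cluster_id. The following sketch parses profile text in the INI layout used by Databricks configuration files with the standard configparser module. The profile name DEV, the placeholder values, and the helper names are assumptions for illustration.

```python
import configparser

# Hypothetical profile text; placeholders stand in for real values.
SAMPLE_CONFIG = """\
[DEV]
host = https://<workspace-instance-name>
token = <access-token>
cluster_id = <cluster-id>
"""


def profile_fields(config_text: str, profile: str) -> dict:
    """Return the fields of one profile from INI-formatted configuration text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    if profile == "DEFAULT":
        # configparser exposes the DEFAULT section through defaults().
        return dict(parser.defaults())
    if not parser.has_section(profile):
        raise KeyError(f"profile {profile!r} not found")
    return dict(parser[profile])


def has_cluster_id(config_text: str, profile: str) -> bool:
    """True when the profile defines a non-empty cluster_id field."""
    return bool(profile_fields(config_text, profile).get("cluster_id"))
```

To check a real file, read its text first and pass it to has_cluster_id together with the profile name.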

The DATABRICKS_CONFIG_PROFILE environment variable

For this option, create or identify an Azure Databricks configuration profile containing the field cluster_id and any other fields that are necessary for the Databricks authentication type that you want to use.

If you have already set the DATABRICKS_CLUSTER_ID environment variable with the cluster’s ID, you do not also need to specify cluster_id.

The required configuration profile fields for each authentication type are as follows:

Set the DATABRICKS_CONFIG_PROFILE environment variable to the name of this configuration profile. Then initialize the DatabricksSession class as follows:

from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()
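
In scripts and tests it is sometimes convenient to set DATABRICKS_CONFIG_PROFILE only for a limited scope. The following sketch does this with a context manager and restores the previous value afterwards. The name config_profile is a hypothetical helper, and the session itself would be created inside the with block.

```python
import os
from contextlib import contextmanager

PROFILE_VARIABLE = "DATABRICKS_CONFIG_PROFILE"


@contextmanager
def config_profile(name: str):
    """Temporarily set DATABRICKS_CONFIG_PROFILE, then restore the old value."""
    previous = os.environ.get(PROFILE_VARIABLE)
    os.environ[PROFILE_VARIABLE] = name
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop(PROFILE_VARIABLE, None)
        else:
            os.environ[PROFILE_VARIABLE] = previous
```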

An environment variable for each configuration property

For this option, set the DATABRICKS_CLUSTER_ID environment variable and any other environment variables that are necessary for the Databricks authentication type that you want to use.

The required environment variables for each authentication type are as follows:

Then initialize the DatabricksSession class as follows:

from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()
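
A quick pre-flight check can report which variables are still unset before the session is created. The following sketch assumes personal access token authentication, for which DATABRICKS_HOST and DATABRICKS_TOKEN are the usual variables alongside DATABRICKS_CLUSTER_ID. Other authentication types need a different set, and the helper name missing_variables is hypothetical.

```python
import os

# Assumed variable set for personal access token authentication.
REQUIRED_FOR_TOKEN_AUTH = ("DATABRICKS_HOST", "DATABRICKS_TOKEN", "DATABRICKS_CLUSTER_ID")


def missing_variables(required=REQUIRED_FOR_TOKEN_AUTH, env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

An empty list means that every variable in the assumed set has a value.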

A Databricks configuration profile named DEFAULT

For this option, create or identify an Azure Databricks configuration profile containing the field cluster_id and any other fields that are necessary for the Databricks authentication type that you want to use.

If you have already set the DATABRICKS_CLUSTER_ID environment variable with the cluster’s ID, you do not also need to specify cluster_id.

The required configuration profile fields for each authentication type are as follows:

Name this configuration profile DEFAULT.

Then initialize the DatabricksSession class as follows:

from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()

Configure a connection to serverless compute

Important

This feature is in Public Preview.

Databricks Connect supports connecting to serverless compute. To use this feature, the requirements for connecting to serverless compute must be met. See Requirements.

Important

This feature has the following limitations:

You can configure a connection to serverless compute in one of the following ways:

  • Set the local environment variable DATABRICKS_SERVERLESS_COMPUTE_ID to auto. If this environment variable is set, Databricks Connect ignores the cluster_id.

  • In a local Databricks configuration profile, set serverless_compute_id = auto, then reference that profile from your Databricks Connect Python code.

    [DEFAULT]
    host = https://my-workspace.cloud.databricks.com/
    serverless_compute_id = auto
    token = dapi123...
    
  • Alternatively, update your Databricks Connect Python code in one of the following ways:

    from databricks.connect import DatabricksSession
    
    spark = DatabricksSession.builder.serverless(True).getOrCreate()
    
    # Or pass serverless=True to remote() instead:
    from databricks.connect import DatabricksSession
    
    spark = DatabricksSession.builder.remote(serverless=True).getOrCreate()
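
The precedence stated in the first option, where DATABRICKS_SERVERLESS_COMPUTE_ID set to auto causes cluster_id to be ignored, can be modeled in a few lines. The following sketch only illustrates that rule and is not the client's implementation. The function compute_target and its return labels are hypothetical.

```python
def compute_target(env: dict) -> str:
    """Describe which compute the given environment variables would select."""
    # Serverless takes precedence: when it is set to auto, the cluster ID is ignored.
    if env.get("DATABRICKS_SERVERLESS_COMPUTE_ID") == "auto":
        return "serverless"
    cluster_id = env.get("DATABRICKS_CLUSTER_ID")
    if cluster_id:
        return f"cluster:{cluster_id}"
    return "unconfigured"
```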
    

Note

The serverless compute session times out after 10 minutes of inactivity. After this, the Python process needs to be restarted on the client side to create a new Spark session to connect to serverless compute.
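
Because the timeout is driven by inactivity, a long-running client can keep a rough local estimate of whether its session has probably expired. The following sketch tracks idle time on the client side only, and the service remains the authority on whether a session is still valid. The class IdleTracker and its method names are hypothetical.

```python
import time

# Ten minutes of inactivity, as stated in the note above.
SERVERLESS_IDLE_TIMEOUT_SECONDS = 10 * 60


class IdleTracker:
    """Client-side estimate of how long a session has been idle."""

    def __init__(self, timeout=SERVERLESS_IDLE_TIMEOUT_SECONDS, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock
        self._last_used = clock()

    def touch(self) -> None:
        """Record activity; call this after each Spark operation."""
        self._last_used = self._clock()

    def probably_expired(self) -> bool:
        """True when the idle time has reached the timeout."""
        return self._clock() - self._last_used >= self._timeout
```

When probably_expired() returns True, plan for a restart of the Python process instead of reusing the existing session.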

Validate the connection to Databricks

To validate that your environment, default credentials, and connection to compute are correctly set up for Databricks Connect, run the databricks-connect test command. The command fails with a non-zero exit code and a corresponding error message when it detects any incompatibility in the setup.

databricks-connect test
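
Because the command signals failure through its exit code, it is easy to call from a setup script or a CI job. The following sketch wraps it with the standard subprocess module. The function run_check is a hypothetical helper, and returning 127 when the executable is missing is a convention chosen here, not behavior of the tool.

```python
import subprocess
import sys


def run_check(command=("databricks-connect", "test")) -> int:
    """Run the validation command and return its exit code."""
    try:
        completed = subprocess.run(list(command), check=False)
    except FileNotFoundError:
        # The executable is not on PATH, for example because the virtual
        # environment that contains the client is not activated.
        print(f"{command[0]} was not found on PATH", file=sys.stderr)
        return 127
    return completed.returncode
```

A return value of 0 means that the check passed. Any other value means that the setup needs attention.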

Alternatively, you can use the pyspark shell that is included as part of Databricks Connect for Python, and run a simple command. For more details on the PySpark shell, see Pyspark shell.