Install Databricks Connect for Python

Article
07/19/2024

Note

This article covers Databricks Connect for Databricks Runtime 13.3 LTS and above.

This article describes how to install Databricks Connect for Python. See What is Databricks Connect?. For the Scala version of this article, see Install Databricks Connect for Scala.

Requirements

To install Databricks Connect for Python, the following requirements must be met:

If you are connecting to serverless compute, your workspace must meet the requirements for serverless compute.
If you are connecting to a cluster, your target cluster must meet the cluster configuration requirements, which include Databricks Runtime version requirements.
You must have Python 3 installed on your development machine, and the minor version of Python installed on your development machine must meet the version requirements in the table below.

If you want to use PySpark UDFs, your development machine’s installed minor version of Python must match the minor version of Python that is included with the Databricks Runtime installed on the cluster or serverless compute. To find the minor Python version of your cluster, refer to the System environment section of the Databricks Runtime release notes for your cluster or serverless compute. See Databricks Runtime release notes versions and compatibility and Serverless compute release notes.

Databricks Connect version	Compute type	Compatible Python version
15.3	Cluster	3.11
15.2	Cluster	3.11
15.1	Cluster	3.11
15.1	Serverless	3.10
13.3 LTS to 14.3 LTS	Cluster	3.10
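
As a quick local check, the table above can be encoded and compared with the running interpreter. The following is a minimal sketch, not part of Databricks Connect: the function names `required_python` and `local_python_matches` are hypothetical, and the mapping contains only the rows listed in this article.

```python
import sys

# Rows copied from the compatibility table in this article:
# (Databricks Connect version, compute type) -> compatible Python (major, minor).
COMPATIBLE_PYTHON = {
    ("15.3", "cluster"): (3, 11),
    ("15.2", "cluster"): (3, 11),
    ("15.1", "cluster"): (3, 11),
    ("15.1", "serverless"): (3, 10),
}


def required_python(connect_version, compute_type):
    """Return the (major, minor) Python version the table lists, or None if unlisted."""
    version = connect_version.replace("LTS", "").strip()
    compute = compute_type.lower()
    if (version, compute) in COMPATIBLE_PYTHON:
        return COMPATIBLE_PYTHON[(version, compute)]
    # The table maps 13.3 LTS through 14.3 LTS on a cluster to Python 3.10.
    major, minor = (int(part) for part in version.split(".")[:2])
    if compute == "cluster" and (13, 3) <= (major, minor) <= (14, 3):
        return (3, 10)
    return None


def local_python_matches(connect_version, compute_type, version_info=None):
    """Compare the local interpreter (or a supplied version tuple) with the table."""
    local = tuple((version_info or sys.version_info)[:2])
    return local == required_python(connect_version, compute_type)
```

For example, `local_python_matches("15.1", "serverless")` returns True only when the script runs under Python 3.10.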

Activate a Python virtual environment

Databricks strongly recommends that you activate a separate Python virtual environment for each Python version that you use with Databricks Connect. Python virtual environments help to make sure that you are using the correct versions of Python and Databricks Connect together. For more information about virtual environment tools and how to activate them, see venv or Poetry.

Install the Databricks Connect client

This section describes how to install the Databricks Connect client with venv or Poetry.

Note

If you already have the Databricks extension for Visual Studio Code installed, you do not need to follow these setup instructions, because the Databricks extension for Visual Studio Code already has built-in support for Databricks Connect for Databricks Runtime 13.3 LTS and above. Skip to Debug code by using Databricks Connect for the Databricks extension for Visual Studio Code.

Install the Databricks Connect client with venv

With your virtual environment activated, uninstall PySpark, if it is already installed, by running the uninstall command. This is required because the databricks-connect package conflicts with PySpark. For details, see Conflicting PySpark installations. To check whether PySpark is already installed, run the show command.
```
# Is PySpark already installed?
pip3 show pyspark

# Uninstall PySpark
pip3 uninstall pyspark
```
With your virtual environment still activated, install the Databricks Connect client by running the install command. Use the --upgrade option to upgrade any existing client installation to the specified version.
```
pip3 install --upgrade "databricks-connect==14.3.*"  # Or X.Y.* to match your cluster version.
```
Note

Databricks recommends that you append the “dot-asterisk” notation to specify databricks-connect==X.Y.* instead of databricks-connect==X.Y, to make sure that the most recent package is installed. While this is not a requirement, it helps make sure that you can use the latest supported features for that cluster.
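
After installation, the active environment can be inspected from Python to confirm that PySpark is absent and that the installed client matches the intended X.Y release. This is a minimal sketch using the standard library's `importlib.metadata`; the helper names `installed_version` and `check_install` are hypothetical and not part of Databricks Connect.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_install(expected_prefix, lookup=installed_version):
    """Return a list of problems found in the active environment.

    expected_prefix is the X.Y part of the pin, for example "14.3".
    lookup can be swapped out, which makes the check easy to exercise in tests.
    """
    problems = []
    if lookup("pyspark") is not None:
        problems.append("pyspark is installed and conflicts with databricks-connect")
    connect = lookup("databricks-connect")
    if connect is None:
        problems.append("databricks-connect is not installed")
    elif not connect.startswith(expected_prefix + "."):
        problems.append(
            f"databricks-connect {connect} does not match {expected_prefix}.*"
        )
    return problems
```

Calling `check_install("14.3")` inside the activated virtual environment returns an empty list when the setup matches the steps above.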

Skip ahead to Configure connection properties.

Install the Databricks Connect client with Poetry

With your virtual environment activated, uninstall PySpark, if it is already installed, by running the remove command. This is required because the databricks-connect package conflicts with PySpark. For details, see Conflicting PySpark installations. To check whether PySpark is already installed, run the show command.
```
# Is PySpark already installed?
poetry show pyspark

# Uninstall PySpark
poetry remove pyspark
```
With your virtual environment still activated, install the Databricks Connect client by running the add command.
```
poetry add databricks-connect@~14.3  # Or X.Y to match your cluster version.
```
Note

Databricks recommends that you use the “at-tilde” notation to specify databricks-connect@~14.3 instead of databricks-connect==14.3, to make sure that the most recent package is installed. While this is not a requirement, it helps make sure that you can use the latest supported features for that cluster.

Configure connection properties

In this section, you configure properties to establish a connection between Databricks Connect and your Azure Databricks cluster or serverless compute, which includes the following:

The Azure Databricks workspace instance name. This is the Server Hostname value for your compute. See Get connection details for an Azure Databricks compute resource.
Any other properties that are necessary for the Databricks authentication type that you want to use.

Note

OAuth user-to-machine (U2M) authentication is supported on Databricks SDK for Python 0.19.0 and above. You might need to update your code project’s installed version of the Databricks SDK for Python to 0.19.0 or above to use OAuth U2M authentication. See Get started with the Databricks SDK for Python.

For OAuth U2M authentication, you must use the Databricks CLI to authenticate before you run your Python code. See the Tutorial.
OAuth machine-to-machine (M2M) authentication is supported on Databricks SDK for Python 0.18.0 and above. You might need to update your code project’s installed version of the Databricks SDK for Python to 0.18.0 or above to use OAuth M2M authentication. See Get started with the Databricks SDK for Python.
The Databricks SDK for Python has not yet implemented Azure managed identities authentication.

Configure a connection to a cluster

To configure a connection to a cluster, you will need the ID of your cluster. You can obtain the cluster ID from the URL. See Cluster URL and ID.

You can configure the connection to your cluster in one of the following ways. Databricks Connect searches for configuration properties in the following order, and uses the first configuration it finds. For advanced configuration information, see Advanced usage of Databricks Connect for Python.

The DatabricksSession class’s remote() method.
A Databricks configuration profile
The DATABRICKS_CONFIG_PROFILE environment variable
An environment variable for each configuration property
A Databricks configuration profile named DEFAULT
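
The search order listed above can be pictured as a chain of checks that stops at the first match. The following sketch only illustrates that order; it is not the lookup code that Databricks Connect runs, the function name `first_configured_source` is hypothetical, and using `DATABRICKS_HOST` as the signal for per-property environment variables is a simplifying assumption.

```python
def first_configured_source(remote_args=None, profile=None, env=None, profile_names=()):
    """Return a label for the first configuration source found, or None.

    remote_args:   keyword arguments passed to remote(), if any
    profile:       a profile name given explicitly in code, if any
    env:           a mapping of environment variables
    profile_names: names of the profiles present in the configuration file
    """
    env = env or {}
    if remote_args:
        return "remote() arguments"
    if profile:
        return f"configuration profile {profile}"
    if env.get("DATABRICKS_CONFIG_PROFILE"):
        return "DATABRICKS_CONFIG_PROFILE environment variable"
    # Simplifying assumption: DATABRICKS_HOST signals that per-property
    # environment variables are in use.
    if env.get("DATABRICKS_HOST"):
        return "per-property environment variables"
    if "DEFAULT" in profile_names:
        return "DEFAULT configuration profile"
    return None
```

For instance, when both `DATABRICKS_CONFIG_PROFILE` and a `DEFAULT` profile exist, the sketch reports the environment variable, because it appears earlier in the list.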

The `DatabricksSession` class’s `remote()` method

For this option, which applies to Azure Databricks personal access token authentication only, specify the workspace instance name, the Azure Databricks personal access token, and the ID of the cluster.

You can initialize the DatabricksSession class in several ways, as follows:

Set the host, token, and cluster_id fields in DatabricksSession.builder.remote().
Use the Databricks SDK’s Config class.
Specify a Databricks configuration profile along with the cluster_id field.
Set the Spark Connect connection string in DatabricksSession.builder.remote().

Instead of specifying these connection properties in your code, Databricks recommends configuring properties through environment variables or configuration files, as described throughout this section. The following code examples assume that you provide some implementation of the proposed retrieve_* functions to get the necessary properties from the user or from some other configuration store, such as Azure Key Vault.

The code for each of these approaches is as follows:

```
# Set the host, token, and cluster_id fields in DatabricksSession.builder.remote.
# If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
# cluster's ID, you do not also need to set the cluster_id field here.
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.remote(
   host       = f"https://{retrieve_workspace_instance_name()}",
   token      = retrieve_token(),
   cluster_id = retrieve_cluster_id()
).getOrCreate()
```

```
# Use the Databricks SDK's Config class.
# If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
# cluster's ID, you do not also need to set the cluster_id field here.
from databricks.connect import DatabricksSession
from databricks.sdk.core import Config

config = Config(
   host       = f"https://{retrieve_workspace_instance_name()}",
   token      = retrieve_token(),
   cluster_id = retrieve_cluster_id()
)

spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()
```

```
# Specify a Databricks configuration profile along with the `cluster_id` field.
# If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
# cluster's ID, you do not also need to set the cluster_id field here.
from databricks.connect import DatabricksSession
from databricks.sdk.core import Config

config = Config(
   profile    = "<profile-name>",
   cluster_id = retrieve_cluster_id()
)

spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()
```
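
The remaining approach from the list above, setting the Spark Connect connection string, might look like the following sketch. It assumes the connection string takes the form `sc://<workspace-instance-name>:443/;token=<token>;x-databricks-cluster-id=<cluster-id>`; verify that form against the Databricks Connect reference for your version. The helper name `build_connection_string` is hypothetical.

```python
def build_connection_string(workspace_instance_name, token, cluster_id):
    """Assemble a Spark Connect connection string (assumed format, see lead-in)."""
    return (
        f"sc://{workspace_instance_name}:443/;"
        f"token={token};"
        f"x-databricks-cluster-id={cluster_id}"
    )


# Usage, with the same retrieve_* placeholders as the examples above
# (requires databricks-connect and a reachable cluster, so it is left commented out):
#
# from databricks.connect import DatabricksSession
#
# spark = DatabricksSession.builder.remote(
#     build_connection_string(
#         retrieve_workspace_instance_name(),
#         retrieve_token(),
#         retrieve_cluster_id(),
#     )
# ).getOrCreate()
```

Because the token is embedded in the string, treat the result as a secret and avoid logging it.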

A Databricks configuration profile

For this option, create or identify an Azure Databricks configuration profile containing the field cluster_id and any other fields that are necessary for the Databricks authentication type that you want to use.

The required configuration profile fields for each authentication type are as follows:

For Azure Databricks personal access token authentication: host and token.
For OAuth machine-to-machine (M2M) authentication (where supported): host, client_id, and client_secret.
For OAuth user-to-machine (U2M) authentication (where supported): host.
For Microsoft Entra ID (formerly Azure Active Directory) service principal authentication: host, azure_tenant_id, azure_client_id, azure_client_secret, and possibly azure_workspace_resource_id.
For Azure CLI authentication: host.
For Azure managed identities authentication (where supported): host, azure_use_msi, azure_client_id, and possibly azure_workspace_resource_id.

Then set the name of this configuration profile through the Config class.

You can specify cluster_id in a few ways, as follows:

Include the cluster_id field in your configuration profile, and then just specify the configuration profile’s name.
Specify the configuration profile name along with the cluster_id field.

If you have already set the DATABRICKS_CLUSTER_ID environment variable with the cluster’s ID, you do not also need to specify cluster_id.

The code for each of these approaches is as follows:

```
# Include the cluster_id field in your configuration profile, and then
# just specify the configuration profile's name:
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.profile("<profile-name>").getOrCreate()
```

```
# Specify the configuration profile name along with the cluster_id field.
# In this example, retrieve_cluster_id() assumes some custom implementation that
# you provide to get the cluster ID from the user or from some other
# configuration store:
from databricks.connect import DatabricksSession
from databricks.sdk.core import Config

config = Config(
   profile    = "<profile-name>",
   cluster_id = retrieve_cluster_id()
)

spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()
```
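
To decide which of the two approaches applies, it can help to check whether a profile already carries `cluster_id`. The following is a minimal sketch that parses configuration file text with the standard library's `configparser`. The function name `cluster_id_source` and the returned labels are hypothetical, and the order of the checks is not a statement about precedence inside Databricks Connect.

```python
import configparser


def cluster_id_source(config_text, profile, env):
    """Report where a cluster ID is available: 'profile', 'environment', or 'code'.

    'code' means neither the profile nor DATABRICKS_CLUSTER_ID supplies it, so it
    would have to be passed in code, for example through Config(cluster_id=...).
    """
    # Use a placeholder default section so that [DEFAULT] is read as an ordinary
    # profile and its values are not inherited by other profiles
    # (an assumption made for this sketch).
    parser = configparser.ConfigParser(
        interpolation=None, default_section="__no_default__"
    )
    parser.read_string(config_text)
    if profile not in parser:
        raise KeyError(f"profile {profile!r} not found")
    if parser[profile].get("cluster_id"):
        return "profile"
    if env.get("DATABRICKS_CLUSTER_ID"):
        return "environment"
    return "code"
```

In practice, `config_text` could be the contents of your local configuration file and `env` could be `os.environ`.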

The `DATABRICKS_CONFIG_PROFILE` environment variable

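For this option, create or identify an Azure Databricks configuration profile containing the field cluster_id and any other fields that are necessary for the Databricks authentication type that you want to use. If you have already set the DATABRICKS_CLUSTER_ID environment variable with the cluster’s ID, you do not also need to specify cluster_id.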

The required configuration profile fields for each authentication type are as follows:

For Azure Databricks personal access token authentication: host and token.
For OAuth machine-to-machine (M2M) authentication (where supported): host, client_id, and client_secret.
For OAuth user-to-machine (U2M) authentication (where supported): host.
For Microsoft Entra ID (formerly Azure Active Directory) service principal authentication: host, azure_tenant_id, azure_client_id, azure_client_secret, and possibly azure_workspace_resource_id.
For Azure CLI authentication: host.
For Azure managed identities authentication (where supported): host, azure_use_msi, azure_client_id, and possibly azure_workspace_resource_id.

Set the DATABRICKS_CONFIG_PROFILE environment variable to the name of this configuration profile. Then initialize the DatabricksSession class as follows:

```
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()
```

An environment variable for each configuration property

For this option, set the DATABRICKS_CLUSTER_ID environment variable and any other environment variables that are necessary for the Databricks authentication type that you want to use.

The required environment variables for each authentication type are as follows:

For Azure Databricks personal access token authentication: DATABRICKS_HOST and DATABRICKS_TOKEN.
For OAuth machine-to-machine (M2M) authentication (where supported): DATABRICKS_HOST, DATABRICKS_CLIENT_ID, and DATABRICKS_CLIENT_SECRET.
For OAuth user-to-machine (U2M) authentication (where supported): DATABRICKS_HOST.
For Microsoft Entra ID (formerly Azure Active Directory) service principal authentication: DATABRICKS_HOST, ARM_TENANT_ID, ARM_CLIENT_ID, ARM_CLIENT_SECRET, and possibly DATABRICKS_AZURE_RESOURCE_ID.
For Azure CLI authentication: DATABRICKS_HOST.
For Azure managed identities authentication (where supported): DATABRICKS_HOST, ARM_USE_MSI, ARM_CLIENT_ID, and possibly DATABRICKS_AZURE_RESOURCE_ID.

Then initialize the DatabricksSession class as follows:

```
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()
```
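
Before creating the session, you might confirm that the environment variables listed in this section are actually set. The following sketch encodes that list. The authentication-type labels (`pat`, `oauth-m2m`, and so on) and the function name `missing_env_vars` are hypothetical, and the optional `DATABRICKS_AZURE_RESOURCE_ID` variable is left out.

```python
import os

# Required variables per authentication type, as listed in this section.
# The dictionary keys are hypothetical labels chosen for this sketch.
REQUIRED_ENV = {
    "pat": ["DATABRICKS_HOST", "DATABRICKS_TOKEN"],
    "oauth-m2m": ["DATABRICKS_HOST", "DATABRICKS_CLIENT_ID", "DATABRICKS_CLIENT_SECRET"],
    "oauth-u2m": ["DATABRICKS_HOST"],
    "azure-service-principal": [
        "DATABRICKS_HOST", "ARM_TENANT_ID", "ARM_CLIENT_ID", "ARM_CLIENT_SECRET",
    ],
    "azure-cli": ["DATABRICKS_HOST"],
    "azure-msi": ["DATABRICKS_HOST", "ARM_USE_MSI", "ARM_CLIENT_ID"],
}


def missing_env_vars(auth_type, env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    required = ["DATABRICKS_CLUSTER_ID", *REQUIRED_ENV[auth_type]]
    return [name for name in required if not env.get(name)]
```

An empty list from `missing_env_vars("pat")` suggests that the variables for personal access token authentication are in place; it does not verify that their values are valid.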

A Databricks configuration profile named `DEFAULT`

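For this option, create or identify an Azure Databricks configuration profile containing the field cluster_id and any other fields that are necessary for the Databricks authentication type that you want to use. If you have already set the DATABRICKS_CLUSTER_ID environment variable with the cluster’s ID, you do not also need to specify cluster_id.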

The required configuration profile fields for each authentication type are as follows:

For Azure Databricks personal access token authentication: host and token.
For OAuth machine-to-machine (M2M) authentication (where supported): host, client_id, and client_secret.
For OAuth user-to-machine (U2M) authentication (where supported): host.
For Microsoft Entra ID (formerly Azure Active Directory) service principal authentication: host, azure_tenant_id, azure_client_id, azure_client_secret, and possibly azure_workspace_resource_id.
For Azure CLI authentication: host.
For Azure managed identities authentication (where supported): host, azure_use_msi, azure_client_id, and possibly azure_workspace_resource_id.

Name this configuration profile DEFAULT.

Then initialize the DatabricksSession class as follows:

```
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()
```

Configure a connection to serverless compute

Important

This feature is in Public Preview.

Databricks Connect supports connecting to serverless compute. To use this feature, requirements for connecting to serverless must be met. See Requirements.

Important

This feature has the following limitations:

All of the Databricks Connect for Python limitations
All of the serverless compute limitations
Only Python dependencies that are included as part of the serverless compute environment can be used for UDFs. See System environment. Additional dependencies cannot be installed.
UDFs with custom modules are not supported.

You can configure a connection to serverless compute in one of the following ways:

Set the local environment variable DATABRICKS_SERVERLESS_COMPUTE_ID to auto. If this environment variable is set, Databricks Connect ignores the cluster_id.
In a local Databricks configuration profile, set serverless_compute_id = auto, then reference that profile from your Databricks Connect Python code.
```
[DEFAULT]
host = https://my-workspace.cloud.databricks.com/
serverless_compute_id = auto
token = dapi123...
```

Alternatively, just update your Databricks Connect Python code as follows:

```
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.serverless(True).getOrCreate()
```

Or:

```
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.remote(serverless=True).getOrCreate()
```
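
The interaction between `DATABRICKS_SERVERLESS_COMPUTE_ID` and `cluster_id` described in this section can be pictured with a small sketch. It only illustrates the stated rule that the cluster ID is ignored when the serverless variable is set; it is not the resolution logic that Databricks Connect runs, and the function name `compute_target` is hypothetical.

```python
def compute_target(env, cluster_id=None):
    """Return 'serverless', 'cluster <id>', or None for the given settings."""
    # Per this section, setting the variable to "auto" selects serverless
    # compute and any cluster ID is ignored.
    if env.get("DATABRICKS_SERVERLESS_COMPUTE_ID") == "auto":
        return "serverless"
    cluster = cluster_id or env.get("DATABRICKS_CLUSTER_ID")
    return f"cluster {cluster}" if cluster else None
```

For example, with both `DATABRICKS_SERVERLESS_COMPUTE_ID=auto` and `DATABRICKS_CLUSTER_ID` set, the sketch reports serverless compute.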

Note

The serverless compute session times out after 10 minutes of inactivity. After this, the Python process needs to be restarted on the client side to create a new Spark session to connect to serverless compute.

Validate the connection to Databricks

To validate that your environment, default credentials, and connection to compute are correctly set up for Databricks Connect, run the databricks-connect test command. The command fails with a non-zero exit code and a corresponding error message when it detects any incompatibility in the setup.

```
databricks-connect test
```

Alternatively, you can use the pyspark shell that is included as part of Databricks Connect for Python, and run a simple command. For more details on the PySpark shell, see PySpark shell.
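
Because databricks-connect test reports problems through its exit code, the check can also be scripted, for example as a pre-flight step in a build. The following is a minimal sketch using the standard library's `subprocess`; the function name `validate_connection` is hypothetical.

```python
import subprocess
import sys


def validate_connection(command=("databricks-connect", "test")):
    """Run the validation command and return True if it exits with code 0."""
    try:
        result = subprocess.run(list(command), capture_output=True, text=True)
    except FileNotFoundError:
        # The executable is not on PATH, for example because the virtual
        # environment is not activated.
        print(f"command not found: {command[0]}", file=sys.stderr)
        return False
    if result.returncode != 0:
        # Surface the error message that the command printed.
        print(result.stdout + result.stderr, file=sys.stderr)
        return False
    return True
```

Calling `validate_connection()` in an activated environment returns False and prints the command's output whenever the check fails.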
