Install Databricks Connect for Scala

Article
08/20/2024

Note

This article covers Databricks Connect for Databricks Runtime 13.3 LTS and above.

This article describes how to install Databricks Connect for Scala. For background, see What is Databricks Connect? For the Python version of this article, see Install Databricks Connect for Python.

Requirements

- Your target Azure Databricks workspace and cluster must meet the requirements for Cluster configuration for Databricks Connect.
- The Java Development Kit (JDK) must be installed on your development machine. Databricks recommends that the JDK version you use matches the JDK version on your Azure Databricks cluster. To find the JDK version on your cluster, refer to the “System environment” section of the Databricks Runtime release notes for your cluster. For instance, Zulu 8.70.0.23-CA-linux64 corresponds to JDK 8. See Databricks Runtime release notes versions and compatibility.
- Scala must be installed on your development machine. Databricks recommends that the Scala version you use matches the Scala version on your Azure Databricks cluster. To find the Scala version on your cluster, refer to the “System environment” section of the Databricks Runtime release notes for your cluster. See Databricks Runtime release notes versions and compatibility.
- A Scala build tool, such as sbt, must be installed on your development machine.
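
To compare your local toolchain with the versions listed in the release notes, a small program along the following lines can print the JDK and Scala versions that your project actually runs with. This is a minimal sketch; the object name `LocalVersions` is made up for illustration.

```scala
// Prints the JDK and Scala versions of the local environment so that you can
// compare them with the "System environment" section of the release notes.
object LocalVersions {
  // Version of the JVM that runs this program. JDK 8 reports a value that
  // starts with "1.8".
  def jdkVersion: String = System.getProperty("java.version")

  // Version of the Scala standard library on the classpath.
  def scalaVersion: String = scala.util.Properties.versionNumberString

  def main(args: Array[String]): Unit = {
    println(s"JDK version:   $jdkVersion")
    println(s"Scala version: $scalaVersion")
  }
}
```

Running it through your build tool (for example `sbt run`) reports the versions the build uses, which can differ from the versions on your shell's `PATH`.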

Set up the client

After you meet the requirements for Databricks Connect, complete the following steps to set up the Databricks Connect client.

Step 1: Add a reference to the Databricks Connect client

In your Scala project’s build file, such as build.sbt for sbt, pom.xml for Maven, or build.gradle for Gradle, add the following reference to the Databricks Connect client:

sbt

```
libraryDependencies += "com.databricks" % "databricks-connect" % "14.0.0"
```

Maven

```
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>databricks-connect</artifactId>
  <version>14.0.0</version>
</dependency>
```

Gradle

```
implementation 'com.databricks:databricks-connect:14.0.0'
```

Replace 14.0.0 with the version of the Databricks Connect library that matches the Databricks Runtime version on your cluster. You can find the Databricks Connect library version numbers in the Maven central repository.
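
For sbt, one way to make the version easy to change is to keep it in a single value in build.sbt. The following is a sketch only: the project name is hypothetical, and the Scala and Databricks Connect version numbers are assumptions that you should replace with the ones that match your cluster's Databricks Runtime.

```scala
// build.sbt (sketch; project name and version numbers are assumptions)

// Match the Scala version listed for your cluster's Databricks Runtime.
ThisBuild / scalaVersion := "2.12.15"

// Keep the Databricks Connect version in one place so that it is easy to
// update when the Databricks Runtime version on the cluster changes.
val databricksConnectVersion = "14.0.0"

lazy val root = (project in file("."))
  .settings(
    name := "my-databricks-connect-app", // hypothetical project name
    libraryDependencies += "com.databricks" % "databricks-connect" % databricksConnectVersion
  )
```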

Step 2: Configure connection properties

In this section, you configure properties to establish a connection between Databricks Connect and your remote Azure Databricks cluster. These properties include settings to authenticate Databricks Connect with your cluster.

Databricks Connect for Scala for Databricks Runtime 13.3 LTS and above includes the Databricks SDK for Java. This SDK implements the Databricks client unified authentication standard, a consolidated and consistent architectural and programmatic approach to authentication. This approach makes setting up and automating authentication with Azure Databricks more centralized and predictable. It enables you to configure Azure Databricks authentication once and then use that configuration across multiple Azure Databricks tools and SDKs without further authentication configuration changes.

Note

- OAuth user-to-machine (U2M) authentication is supported on Databricks SDK for Java 0.18.0 and above. You might need to update your code project’s installed version of the Databricks SDK for Java to 0.18.0 or above to use OAuth U2M authentication. See Get started with the Databricks SDK for Java.
- For OAuth U2M authentication, you must use the Databricks CLI to authenticate before you run your Scala code. See the Tutorial.
- OAuth machine-to-machine (M2M) authentication is supported on Databricks SDK for Java 0.17.0 and above. You might need to update your code project’s installed version of the Databricks SDK for Java to 0.17.0 or above to use OAuth M2M authentication. See Get started with the Databricks SDK for Java.
- The Databricks SDK for Java has not yet implemented Azure managed identities authentication.

Collect the following configuration properties.
- The Azure Databricks workspace instance name. This is the same as the Server Hostname value for your cluster; see Get connection details for an Azure Databricks compute resource.
- The ID of your cluster. You can obtain the cluster ID from the URL. See Cluster URL and ID.
- Any other properties that are necessary for the supported Databricks authentication type. These properties are described throughout this section.

Configure the connection within your code. Databricks Connect searches for configuration properties in the following order until it finds them. Once it finds them, it stops searching through the remaining options. The details for each option appear after the following table:

| Configuration properties option | Applies to |
| --- | --- |
| 1. The `DatabricksSession` class’s `remote()` method | Azure Databricks personal access token authentication only |
| 2. An Azure Databricks configuration profile | All Azure Databricks authentication types |
| 3. The `SPARK_REMOTE` environment variable | Azure Databricks personal access token authentication only |
| 4. The `DATABRICKS_CONFIG_PROFILE` environment variable | All Azure Databricks authentication types |
| 5. An environment variable for each configuration property | All Azure Databricks authentication types |
| 6. An Azure Databricks configuration profile named `DEFAULT` | All Azure Databricks authentication types |
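
The "search in order, stop at the first hit" behavior described above can be modeled in a few lines. This is an illustrative sketch of the lookup pattern only, not the Databricks Connect implementation; `FirstMatch`, `Source`, and `resolve` are names invented for this example.

```scala
// Illustrative model of a "first match wins" lookup.
// This is not the Databricks Connect implementation.
object FirstMatch {
  // A named configuration source. `load` returns None when the source is not set.
  final case class Source(name: String, load: () => Option[Map[String, String]])

  // Walks the sources in order and returns the first one that yields properties.
  // Iterator.map is lazy, so sources after the first hit are never loaded.
  def resolve(sources: Seq[Source]): Option[(String, Map[String, String])] =
    sources.iterator
      .map(s => s.load().map(props => (s.name, props)))
      .collectFirst { case Some(hit) => hit }
}
```

The practical consequence is that a higher-priority option, once set, hides every option below it. If a session connects to an unexpected workspace or cluster, check the options in the table from top to bottom.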

The DatabricksSession class’s remote() method

For this option, which applies to Azure Databricks personal access token authentication only, specify the workspace instance name, the Azure Databricks personal access token, and the ID of the cluster.

You can initialize the DatabricksSession class in several ways, as follows:

- Set the host, token, and clusterId fields in DatabricksSession.builder.
- Use the Databricks SDK’s Config class.
- Specify a Databricks configuration profile along with the clusterId field.

Databricks does not recommend that you directly specify these connection properties in your code. Instead, Databricks recommends configuring properties through environment variables or configuration files, as described throughout this section. The following code examples assume that you provide your own implementation of the proposed retrieve* functions to get the necessary properties from the user or from some other configuration store, such as Azure Key Vault.

The code for each of these approaches is as follows:

```
// Set the host, token, and clusterId fields in DatabricksSession.builder.
// If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
// cluster's ID, you do not also need to set the clusterId field here.
import com.databricks.connect.DatabricksSession

val spark = DatabricksSession.builder()
  .host(retrieveWorkspaceInstanceName())
  .token(retrieveToken())
  .clusterId(retrieveClusterId())
  .getOrCreate()
```

```
// Use the Databricks SDK's Config class.
// If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
// cluster's ID, you do not also need to set the clusterId field here.
import com.databricks.connect.DatabricksSession
import com.databricks.sdk.core.DatabricksConfig

val config = new DatabricksConfig()
  .setHost(retrieveWorkspaceInstanceName())
  .setToken(retrieveToken())
val spark = DatabricksSession.builder()
  .sdkConfig(config)
  .clusterId(retrieveClusterId())
  .getOrCreate()
```

```
// Specify a Databricks configuration profile along with the clusterId field.
// If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
// cluster's ID, you do not also need to set the clusterId field here.
import com.databricks.connect.DatabricksSession
import com.databricks.sdk.core.DatabricksConfig

val config = new DatabricksConfig()
  .setProfile("<profile-name>")
val spark = DatabricksSession.builder()
  .sdkConfig(config)
  .clusterId(retrieveClusterId())
  .getOrCreate()
```
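
The retrieve* functions in the preceding examples are placeholders that you implement yourself. As one possible sketch, they could read application-specific environment variables and fail early when a value is missing. The `ConnectionSettings` object and the `MY_APP_*` variable names are assumptions made for this example; they are not names that Databricks Connect reads itself.

```scala
// Hypothetical implementations of the retrieve* placeholders.
// The MY_APP_* variable names are assumptions chosen for this sketch.
object ConnectionSettings {
  // Looks up a required setting and fails with a clear message when it is
  // absent or blank.
  private def required(env: Map[String, String], key: String): String =
    env.get(key).map(_.trim).filter(_.nonEmpty).getOrElse(
      throw new IllegalStateException(s"Missing required setting: $key")
    )

  // `env` defaults to the process environment; tests can pass a plain Map.
  def retrieveWorkspaceInstanceName(env: Map[String, String] = sys.env): String =
    required(env, "MY_APP_DATABRICKS_HOST")

  def retrieveToken(env: Map[String, String] = sys.env): String =
    required(env, "MY_APP_DATABRICKS_TOKEN")

  def retrieveClusterId(env: Map[String, String] = sys.env): String =
    required(env, "MY_APP_DATABRICKS_CLUSTER_ID")
}
```

After `import ConnectionSettings._`, calls such as `retrieveToken()` resolve against the process environment. A secret store client could replace the `Map` lookup without changing the call sites.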

An Azure Databricks configuration profile

For this option, create or identify an Azure Databricks configuration profile containing the field cluster_id and any other fields that are necessary for the supported Databricks authentication type that you want to use.

The required configuration profile fields for each authentication type are as follows:
- For Azure Databricks personal access token authentication: host and token.
- For OAuth machine-to-machine (M2M) authentication (where supported): host, client_id, and client_secret.
- For OAuth user-to-machine (U2M) authentication (where supported): host.
- For Microsoft Entra ID (formerly Azure Active Directory) service principal authentication: host, azure_tenant_id, azure_client_id, azure_client_secret, and possibly azure_workspace_resource_id.
- For Azure CLI authentication: host.
- For Azure managed identities authentication (where supported): host, azure_use_msi, azure_client_id, and possibly azure_workspace_resource_id.

Then set the name of this configuration profile through the DatabricksConfig class.

You can specify cluster_id in a few ways, as follows:
- Include the cluster_id field in your configuration profile, and then just specify the configuration profile’s name.
- Specify the configuration profile name along with the clusterId field.

If you have already set the DATABRICKS_CLUSTER_ID environment variable with the cluster’s ID, you do not also need to specify the cluster_id or clusterId fields.

The code for each of these approaches is as follows:
```
// Include the cluster_id field in your configuration profile, and then
// just specify the configuration profile's name:
import com.databricks.connect.DatabricksSession
import com.databricks.sdk.core.DatabricksConfig

val config = new DatabricksConfig()
  .setProfile("<profile-name>")
val spark = DatabricksSession.builder()
  .sdkConfig(config)
  .getOrCreate()
```

```
// Specify the configuration profile name along with the clusterId field.
// In this example, retrieveClusterId() assumes some custom implementation that
// you provide to get the cluster ID from the user or from some other
// configuration store:
import com.databricks.connect.DatabricksSession
import com.databricks.sdk.core.DatabricksConfig

val config = new DatabricksConfig()
  .setProfile("<profile-name>")
val spark = DatabricksSession.builder()
  .sdkConfig(config)
  .clusterId(retrieveClusterId())
  .getOrCreate()
```

The SPARK_REMOTE environment variable

For this option, which applies to Azure Databricks personal access token authentication only, set the SPARK_REMOTE environment variable to the following string, replacing the placeholders with the appropriate values.
```
sc://<workspace-instance-name>:443/;token=<access-token-value>;x-databricks-cluster-id=<cluster-id>
```
Then initialize the DatabricksSession class as follows:
```
import com.databricks.connect.DatabricksSession

val spark = DatabricksSession.builder().getOrCreate()
```
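
If you generate the SPARK_REMOTE value from a script or a launcher rather than typing it by hand, a helper like the following can assemble it in the format shown above. This is a sketch; `SparkRemote` and `connectionString` are hypothetical names, and the helper only checks that the parts are non-empty.

```scala
object SparkRemote {
  // Builds the connection string in the sc://...:443/;token=...;x-databricks-cluster-id=... format.
  // The arguments are values that you supply; nothing is checked against the workspace.
  def connectionString(workspaceInstanceName: String, token: String, clusterId: String): String = {
    require(
      workspaceInstanceName.nonEmpty && token.nonEmpty && clusterId.nonEmpty,
      "workspace instance name, token, and cluster ID must all be non-empty"
    )
    s"sc://$workspaceInstanceName:443/;token=$token;x-databricks-cluster-id=$clusterId"
  }
}
```

Because the resulting string contains the access token, avoid printing it to logs or committing it to source control.
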
To set environment variables, see your operating system’s documentation.

The DATABRICKS_CONFIG_PROFILE environment variable

For this option, create or identify an Azure Databricks configuration profile containing the field cluster_id and any other fields that are necessary for the supported Databricks authentication type that you want to use.

If you have already set the DATABRICKS_CLUSTER_ID environment variable with the cluster’s ID, you do not also need to specify cluster_id.

The required configuration profile fields for each authentication type are as follows:
- For Azure Databricks personal access token authentication: host and token.
- For OAuth machine-to-machine (M2M) authentication (where supported): host, client_id, and client_secret.
- For OAuth user-to-machine (U2M) authentication (where supported): host.
- For Microsoft Entra ID (formerly Azure Active Directory) service principal authentication: host, azure_tenant_id, azure_client_id, azure_client_secret, and possibly azure_workspace_resource_id.
- For Azure CLI authentication: host.
- For Azure managed identities authentication (where supported): host, azure_use_msi, azure_client_id, and possibly azure_workspace_resource_id.

Set the DATABRICKS_CONFIG_PROFILE environment variable to the name of this configuration profile. Then initialize the DatabricksSession class as follows:
```
import com.databricks.connect.DatabricksSession

val spark = DatabricksSession.builder().getOrCreate()
```
To set environment variables, see your operating system’s documentation.

An environment variable for each configuration property

For this option, set the DATABRICKS_CLUSTER_ID environment variable and any other environment variables that are necessary for the supported Databricks authentication type that you want to use.

The required environment variables for each authentication type are as follows:
- For Azure Databricks personal access token authentication: DATABRICKS_HOST and DATABRICKS_TOKEN.
- For OAuth machine-to-machine (M2M) authentication (where supported): DATABRICKS_HOST, DATABRICKS_CLIENT_ID, and DATABRICKS_CLIENT_SECRET.
- For OAuth user-to-machine (U2M) authentication (where supported): DATABRICKS_HOST.
- For Microsoft Entra ID (formerly Azure Active Directory) service principal authentication: DATABRICKS_HOST, ARM_TENANT_ID, ARM_CLIENT_ID, ARM_CLIENT_SECRET, and possibly DATABRICKS_AZURE_RESOURCE_ID.
- For Azure CLI authentication: DATABRICKS_HOST.
- For Azure managed identities authentication (where supported): DATABRICKS_HOST, ARM_USE_MSI, ARM_CLIENT_ID, and possibly DATABRICKS_AZURE_RESOURCE_ID.

Then initialize the DatabricksSession class as follows:
```
import com.databricks.connect.DatabricksSession

val spark = DatabricksSession.builder().getOrCreate()
```
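
Because this option depends on several variables being set together, it can help to check for missing ones before creating the session. The following sketch encodes the required variables from the list above (the "possibly" required DATABRICKS_AZURE_RESOURCE_ID is left out). `EnvCheck`, `missing`, and the authentication type labels used as map keys are invented for this example.

```scala
object EnvCheck {
  // Required variables per authentication type, taken from the list above.
  // The map keys are labels invented for this sketch.
  val requiredByAuthType: Map[String, Seq[String]] = Map(
    "pat" -> Seq("DATABRICKS_HOST", "DATABRICKS_TOKEN"),
    "oauth-m2m" -> Seq("DATABRICKS_HOST", "DATABRICKS_CLIENT_ID", "DATABRICKS_CLIENT_SECRET"),
    "oauth-u2m" -> Seq("DATABRICKS_HOST"),
    "azure-service-principal" ->
      Seq("DATABRICKS_HOST", "ARM_TENANT_ID", "ARM_CLIENT_ID", "ARM_CLIENT_SECRET"),
    "azure-cli" -> Seq("DATABRICKS_HOST"),
    "azure-msi" -> Seq("DATABRICKS_HOST", "ARM_USE_MSI", "ARM_CLIENT_ID")
  )

  // Returns the required variables that are unset or empty in `env`.
  // DATABRICKS_CLUSTER_ID is required for this option regardless of type.
  def missing(authType: String, env: Map[String, String] = sys.env): Seq[String] = {
    val required = "DATABRICKS_CLUSTER_ID" +: requiredByAuthType.getOrElse(
      authType,
      throw new IllegalArgumentException(s"Unknown authentication type: $authType")
    )
    required.filterNot(name => env.get(name).exists(_.nonEmpty))
  }
}
```

For example, `EnvCheck.missing("pat")` returns an empty sequence when DATABRICKS_CLUSTER_ID, DATABRICKS_HOST, and DATABRICKS_TOKEN are all set in the process environment.
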
To set environment variables, see your operating system’s documentation.

An Azure Databricks configuration profile named DEFAULT

For this option, create or identify an Azure Databricks configuration profile containing the field cluster_id and any other fields that are necessary for the supported Databricks authentication type that you want to use.

If you have already set the DATABRICKS_CLUSTER_ID environment variable with the cluster’s ID, you do not also need to specify cluster_id.

The required configuration profile fields for each authentication type are as follows:
- For Azure Databricks personal access token authentication: host and token.
- For OAuth machine-to-machine (M2M) authentication (where supported): host, client_id, and client_secret.
- For OAuth user-to-machine (U2M) authentication (where supported): host.
- For Microsoft Entra ID (formerly Azure Active Directory) service principal authentication: host, azure_tenant_id, azure_client_id, azure_client_secret, and possibly azure_workspace_resource_id.
- For Azure CLI authentication: host.
- For Azure managed identities authentication (where supported): host, azure_use_msi, azure_client_id, and possibly azure_workspace_resource_id.

Name this configuration profile DEFAULT.

Then initialize the DatabricksSession class as follows:
```
import com.databricks.connect.DatabricksSession

val spark = DatabricksSession.builder().getOrCreate()
```
