Advanced usage of Databricks Connect for Python

07/15/2024

Note

This article covers Databricks Connect for Databricks Runtime 14.0 and above.

This article describes topics that go beyond the basic setup of Databricks Connect.

Configure the Spark Connect connection string

In addition to connecting to your cluster using the options outlined in Configure a connection to a cluster, a more advanced option is connecting using the Spark Connect connection string. You can pass the string in the remote function or set the SPARK_REMOTE environment variable.

Note

You can use only Databricks personal access token authentication to connect using the Spark Connect connection string.

To set the connection string using the remote function:

# Set the Spark Connect connection string in DatabricksSession.builder.remote.
from databricks.connect import DatabricksSession

workspace_instance_name = retrieve_workspace_instance_name()
token                   = retrieve_token()
cluster_id              = retrieve_cluster_id()

spark = DatabricksSession.builder.remote(
   f"sc://{workspace_instance_name}:443/;token={token};x-databricks-cluster-id={cluster_id}"
).getOrCreate()

Alternatively, set the SPARK_REMOTE environment variable:

sc://<workspace-instance-name>:443/;token=<access-token-value>;x-databricks-cluster-id=<cluster-id>

Then initialize the DatabricksSession class as follows:

from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()
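The two approaches above can be combined in a small helper. The following is a minimal sketch, not an official API: `build_connection_string` is a hypothetical helper name, and the workspace, token, and cluster values are placeholders for illustration. It assembles the string in the format shown above and sets `SPARK_REMOTE` for the current process only.

```python
import os


def build_connection_string(workspace_instance_name: str, token: str, cluster_id: str) -> str:
    """Assemble a Spark Connect connection string in the format shown above.

    Hypothetical helper for illustration; not part of Databricks Connect.
    """
    return (
        f"sc://{workspace_instance_name}:443/"
        f";token={token};x-databricks-cluster-id={cluster_id}"
    )


# Placeholder values (assumptions); substitute your own and avoid
# hard-coding real tokens in source files.
connection_string = build_connection_string(
    "example.azuredatabricks.net", "example-token", "0123-456789-abcdefgh"
)

# Set SPARK_REMOTE for this process before the session is created.
os.environ["SPARK_REMOTE"] = connection_string
print(connection_string)

# With the variable set, the session can then be created as shown above:
#   from databricks.connect import DatabricksSession
#   spark = DatabricksSession.builder.getOrCreate()
```

Setting the variable through `os.environ` affects only the running Python process and its children, which keeps the token out of your shell profile.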

PySpark shell

Databricks Connect for Python ships with a pyspark binary, which is a PySpark REPL (a Spark shell) configured to use Databricks Connect. To start the REPL, run:

pyspark

When started with no additional parameters, it picks up default credentials from the environment (for example, the DATABRICKS_ environment variables or the DEFAULT configuration profile) to connect to the Azure Databricks cluster.

Once the REPL starts, the spark object is available and configured to run Apache Spark commands on the Databricks cluster.

>>> spark.range(3).show()
+---+
| id|
+---+
|  0|
|  1|
|  2|
+---+

To connect the REPL to a different remote, set the --remote parameter to a Spark Connect connection string.

pyspark --remote "sc://<workspace-instance-name>:443/;token=<access-token-value>;x-databricks-cluster-id=<cluster-id>"

To stop the shell, press Ctrl + d or Ctrl + z, or run the command quit() or exit().

Additional HTTP headers

Databricks Connect communicates with Databricks clusters via gRPC over HTTP/2.

Some advanced users may choose to install a proxy service between the client and the Azure Databricks cluster, to have better control over the requests coming from their clients.

The proxies, in some cases, may require custom headers in the HTTP requests.

Use the header() method to add custom headers to the HTTP requests.

from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.header('x-custom-header', 'value').getOrCreate()
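When a proxy requires several headers, the call above can be repeated once per header. The following is a minimal sketch: `apply_headers` is a hypothetical helper, not part of Databricks Connect, and it assumes only what the example above shows, namely that `header(name, value)` returns a builder that can be chained.

```python
from typing import Any, Mapping


def apply_headers(builder: Any, headers: Mapping[str, str]) -> Any:
    """Call builder.header(name, value) once for each entry in headers.

    Hypothetical helper; relies on header() returning a chainable builder,
    as in the single-header example above.
    """
    for name, value in headers.items():
        builder = builder.header(name, value)
    return builder


# Possible usage (header names and values are placeholders):
#   from databricks.connect import DatabricksSession
#   spark = apply_headers(
#       DatabricksSession.builder,
#       {"x-custom-header": "value", "x-another-header": "other"},
#   ).getOrCreate()
```

Keeping the headers in a mapping makes it easier to load them from configuration rather than hard-coding each call.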

Certificates

If your cluster relies on a custom SSL/TLS certificate to resolve an Azure Databricks workspace fully qualified domain name (FQDN), you must set the environment variable GRPC_DEFAULT_SSL_ROOTS_FILE_PATH on your local development machine. This environment variable must be set to the full path to the installed certificate on the cluster.

For example, you can set this environment variable in Python code as follows:

import os

os.environ["GRPC_DEFAULT_SSL_ROOTS_FILE_PATH"] = "/etc/ssl/certs/ca-bundle.crt"

For other ways to set environment variables, see your operating system’s documentation.
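A mistyped certificate path can be hard to diagnose once a connection fails, so it may help to validate the path before setting the variable. The following is a minimal sketch: `set_grpc_roots` is a hypothetical helper name, and the assumption that the variable should be set before the session is created is not stated in this article.

```python
import os
from pathlib import Path


def set_grpc_roots(cert_path: str) -> str:
    """Check that the certificate file exists, then set the variable.

    Hypothetical helper for illustration; returns the resolved path.
    """
    path = Path(cert_path).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Certificate file not found: {path}")
    os.environ["GRPC_DEFAULT_SSL_ROOTS_FILE_PATH"] = str(path)
    return str(path)


# Possible usage, assumed to run before the session is created:
#   set_grpc_roots("/etc/ssl/certs/ca-bundle.crt")
```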

Logging and debug logs

Databricks Connect for Python produces logs using standard Python logging.

Logs are emitted to the standard error stream (stderr), and by default only logs at WARN level and higher are emitted.

Setting the environment variable SPARK_CONNECT_LOG_LEVEL=debug changes this default and prints all log messages at DEBUG level and higher.
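The variable can also be set from Python instead of the shell. The following is a minimal sketch: `enable_debug_logs` is a hypothetical helper, and it assumes the variable is read when Databricks Connect initializes, so it is set before the import.

```python
import os


def enable_debug_logs() -> None:
    """Request DEBUG-level Databricks Connect logs for this process."""
    os.environ["SPARK_CONNECT_LOG_LEVEL"] = "debug"


enable_debug_logs()

# Assumption: set the variable before importing and creating the session.
#   from databricks.connect import DatabricksSession
#   spark = DatabricksSession.builder.getOrCreate()
```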
