Connect Python and pyodbc to Azure Databricks
You can connect from your local Python code through ODBC to data in an Azure Databricks cluster or SQL warehouse. To do this, you can use the open source Python module pyodbc.
Follow these instructions to install, configure, and use pyodbc.
For more information about pyodbc, see the pyodbc Wiki.
Note
Databricks offers the Databricks SQL Connector for Python as an alternative to pyodbc. The Databricks SQL Connector for Python is easier to set up and use, and has a more robust set of coding constructs, than pyodbc. However, pyodbc may have better performance when fetching query results larger than 10 MB.
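For comparison, the following is a minimal sketch of the same kind of query using the Databricks SQL Connector for Python instead of pyodbc. This sketch is not part of the pyodbc instructions; it assumes that you have installed the databricks-sql-connector package and set the DATABRICKS_SERVER_HOSTNAME, DATABRICKS_HTTP_PATH, and DATABRICKS_TOKEN environment variables described later in this article.

import os
from databricks import sql

# A minimal sketch of the Databricks SQL Connector for Python, shown
# only for comparison with pyodbc. Assumes you have run
# `pip install databricks-sql-connector`.
with sql.connect(
  server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
  http_path=os.getenv("DATABRICKS_HTTP_PATH"),
  access_token=os.getenv("DATABRICKS_TOKEN"),
) as connection:
  with connection.cursor() as cursor:
    cursor.execute("SELECT * FROM samples.nyctaxi.trips LIMIT 10")
    for row in cursor.fetchall():
      print(row)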
These instructions were tested with Databricks ODBC driver 2.7.5, pyodbc 5.0.1, and unixODBC 2.3.12.
Requirements
- A local development machine running one of the following:
  - macOS
  - Windows
  - A Unix or Linux distribution that supports .rpm or .deb files
- pip.
- For Unix, Linux, or macOS, Homebrew.
- An Azure Databricks cluster, a Databricks SQL warehouse, or both. For more information, see Compute configuration reference and Connect to a SQL warehouse.
Step 1: Download, install, and configure software
In this step, you download and install the Databricks ODBC driver, the unixodbc package, and the pyodbc module. (The pyodbc module requires the unixodbc package on Unix, Linux, and macOS.) You also configure an ODBC Data Source Name (DSN) to authenticate with and connect to your cluster or SQL warehouse.
- Download and install the Databricks ODBC driver and configure an ODBC DSN for your operating system.
- For Unix, Linux, and macOS, install the unixodbc package: from the terminal, use Homebrew to run the command brew install unixodbc. For more information, see unixodbc on the Homebrew website.
- Install the pyodbc module: from the terminal or command prompt, use pip to run the command pip install pyodbc. For more information, see pyodbc on the PyPI website and Install in the pyodbc Wiki. (A short verification sketch follows this list.)
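Before moving on, you can confirm the installation from Python: pyodbc.drivers() lists the ODBC drivers that the driver manager can see, and pyodbc.dataSources() lists the configured DSNs. A quick verification sketch:

import pyodbc

# The Databricks (Simba Spark) ODBC driver should appear in this list.
print(pyodbc.drivers())

# The DSN you configured should appear as a key in this dictionary.
print(pyodbc.dataSources())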
Step 2: Test your configuration
In this step, you write and run Python code to use your Azure Databricks cluster or Databricks SQL warehouse to query the trips table in the samples catalog’s nyctaxi schema and display the results.
Create a file named pyodbc-demo.py with the following content. Replace <dsn-name> with the name of the ODBC DSN that you created earlier, save the file, and then run the file with your Python interpreter.

import pyodbc

# Connect to the Databricks cluster by using the
# Data Source Name (DSN) that you created earlier.
conn = pyodbc.connect("DSN=<dsn-name>", autocommit=True)

# Run a SQL query by using the preceding connection.
cursor = conn.cursor()
cursor.execute("SELECT * FROM samples.nyctaxi.trips")

# Print the rows retrieved from the query.
for row in cursor.fetchall():
  print(row)
To speed up running the code, start the cluster that corresponds to the HTTPPath setting in your DSN.
Run the pyodbc-demo.py file with your Python interpreter. Information about the table’s rows is displayed.
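Instead of fetching every row at once, you can read the column names from cursor.description and stream large result sets with fetchmany; both are standard pyodbc (DB-API) features. A sketch that reuses the DSN from the preceding example:

import pyodbc

conn = pyodbc.connect("DSN=<dsn-name>", autocommit=True)
cursor = conn.cursor()
cursor.execute("SELECT * FROM samples.nyctaxi.trips")

# cursor.description holds one 7-item tuple per result column;
# the first item in each tuple is the column name.
print([column[0] for column in cursor.description])

# Fetch the results in batches instead of loading them all into memory.
while True:
  rows = cursor.fetchmany(1000)
  if not rows:
    break
  for row in rows:
    print(row)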
Next steps
- To run the Python test code against a different cluster or SQL warehouse, create a different DSN and change <dsn-name> to the DSN’s name.
- To run the Python test code with a different SQL query, change the execute command string; a parameterized sketch follows this list.
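For example, pyodbc supports qmark-style query parameters, which keep values out of the SQL string. This is a sketch, assuming that the Databricks ODBC driver accepts ? parameter markers for this query and that <dsn-name> is your DSN:

import pyodbc

conn = pyodbc.connect("DSN=<dsn-name>", autocommit=True)
cursor = conn.cursor()

# pyodbc passes the '?' placeholder and the value to the driver
# separately, so the value is never interpolated into the SQL string.
cursor.execute(
  "SELECT * FROM samples.nyctaxi.trips WHERE trip_distance > ?",
  5.0,
)
for row in cursor.fetchall():
  print(row)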
Using a DSN-less connection
As an alternative to using a DSN, you can specify the connection settings inline. The following example shows how to use a DSN-less connection string for Azure Databricks personal access token authentication. This example assumes that you have set the following environment variables:
- Set DATABRICKS_SERVER_HOSTNAME to the workspace instance name, for example adb-1234567890123456.7.azuredatabricks.net.
- Set DATABRICKS_HTTP_PATH to the HTTP Path value for the target cluster or SQL warehouse in the workspace. To get the HTTP Path value, see Get connection details for an Azure Databricks compute resource.
- Set DATABRICKS_TOKEN to the Azure Databricks personal access token for the target user. To create a personal access token, see Azure Databricks personal access tokens for workspace users.
To set environment variables, see your operating system’s documentation.
import pyodbc
import os

conn = pyodbc.connect(
  # Adjust the Driver path if your Databricks ODBC driver is installed
  # in a different location.
  "Driver=/Library/simba/spark/lib/libsparkodbc_sb64-universal.dylib;" +
  f"Host={os.getenv('DATABRICKS_SERVER_HOSTNAME')};" +
  "Port=443;" +
  f"HTTPPath={os.getenv('DATABRICKS_HTTP_PATH')};" +
  "SSL=1;" +
  "ThriftTransport=2;" +
  "AuthMech=3;" +
  "UID=token;" +
  f"PWD={os.getenv('DATABRICKS_TOKEN')}",
  autocommit=True
)
# Run a SQL query by using the preceding connection.
cursor = conn.cursor()
cursor.execute("SELECT * FROM samples.nyctaxi.trips")
# Print the rows retrieved from the query.
for row in cursor.fetchall():
print(row)
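The DSN-less examples in this section differ only in their authentication fields. As a convenience, and not part of the official instructions, you could factor the shared settings into a hypothetical helper such as build_connection_string:

import os

def build_connection_string(**auth_fields):
  # Hypothetical helper that merges the shared DSN-less settings with
  # authentication-specific fields. Adjust the Driver path to match
  # where the Databricks ODBC driver is installed on your machine.
  settings = {
    "Driver": "/Library/simba/spark/lib/libsparkodbc_sb64-universal.dylib",
    "Host": os.getenv("DATABRICKS_SERVER_HOSTNAME"),
    "Port": "443",
    "HTTPPath": os.getenv("DATABRICKS_HTTP_PATH"),
    "SSL": "1",
    "ThriftTransport": "2",
  }
  settings.update(auth_fields)
  return ";".join(f"{key}={value}" for key, value in settings.items())

For example, the personal access token connection above becomes:

import pyodbc

conn = pyodbc.connect(
  build_connection_string(
    AuthMech="3",
    UID="token",
    PWD=os.getenv("DATABRICKS_TOKEN"),
  ),
  autocommit=True,
)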
The following example uses OAuth user-to-machine (U2M) or OAuth 2.0 browser-based authentication instead of an Azure Databricks personal access token. This example assumes that you have already set the preceding DATABRICKS_SERVER_HOSTNAME and DATABRICKS_HTTP_PATH environment variables.
import pyodbc
import os

conn = pyodbc.connect(
  "Driver=/Library/simba/spark/lib/libsparkodbc_sb64-universal.dylib;" +
  f"Host={os.getenv('DATABRICKS_SERVER_HOSTNAME')};" +
  "Port=443;" +
  f"HTTPPath={os.getenv('DATABRICKS_HTTP_PATH')};" +
  "SSL=1;" +
  "ThriftTransport=2;" +
  "AuthMech=11;" +
  "Auth_Flow=2;" +
  # With Auth_Flow=2 (browser-based OAuth), PWD is an arbitrary string
  # rather than an account password.
  "PWD=1234567",
  autocommit=True
)
# Run a SQL query by using the preceding connection.
cursor = conn.cursor()
cursor.execute("SELECT * FROM samples.nyctaxi.trips")
# Print the rows retrieved from the query.
for row in cursor.fetchall():
print(row)
The following example uses OAuth machine-to-machine (M2M) or OAuth 2.0 client credentials authentication. This example assumes that you have already set the preceding DATABRICKS_SERVER_HOSTNAME and DATABRICKS_HTTP_PATH environment variables, as well as the following environment variables:
- Set ARM_CLIENT_ID to the service principal’s Application (client) ID value.
- Set DATABRICKS_OAUTH_SECRET to the service principal’s OAuth Secret value. (Microsoft Entra ID secrets are not supported for OAuth M2M or OAuth 2.0 client credentials authentication with the Databricks ODBC Driver.)
For more information, see OAuth machine-to-machine (M2M) authentication.
import pyodbc
import os

conn = pyodbc.connect(
  "Driver=/Library/simba/spark/lib/libsparkodbc_sb64-universal.dylib;" +
  f"Host={os.getenv('DATABRICKS_SERVER_HOSTNAME')};" +
  "Port=443;" +
  f"HTTPPath={os.getenv('DATABRICKS_HTTP_PATH')};" +
  "SSL=1;" +
  "ThriftTransport=2;" +
  "AuthMech=11;" +
  "Auth_Flow=1;" +
  f"Auth_Client_ID={os.getenv('ARM_CLIENT_ID')};" +
  f"Auth_Client_Secret={os.getenv('DATABRICKS_OAUTH_SECRET')}",
  autocommit=True
)
# Run a SQL query by using the preceding connection.
cursor = conn.cursor()
cursor.execute("SELECT * FROM samples.nyctaxi.trips")
# Print the rows retrieved from the query.
for row in cursor.fetchall():
print(row)
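Whichever authentication method you use, it is good practice to release the cursor and connection when you are done. The following is a minimal sketch using Python’s standard contextlib.closing; pyodbc cursors and connections both expose a close() method:

import contextlib
import pyodbc

# Works with any of the connection examples above; replace <dsn-name>
# with the ODBC DSN that you created earlier.
with contextlib.closing(pyodbc.connect("DSN=<dsn-name>", autocommit=True)) as conn:
  with contextlib.closing(conn.cursor()) as cursor:
    cursor.execute("SELECT * FROM samples.nyctaxi.trips")
    for row in cursor.fetchall():
      print(row)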
Troubleshooting
This section addresses common issues when using pyodbc with Databricks.
Unicode decode error
Issue: You receive an error message similar to the following:
<class 'pyodbc.Error'> returned a result with an error set
Traceback (most recent call last):
File "/Users/user/.pyenv/versions/3.7.5/lib/python3.7/encodings/utf_16_le.py", line 16, in decode
return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 2112-2113: illegal UTF-16 surrogate
Cause: An issue exists in pyodbc version 4.0.31 and below that could manifest with such symptoms when running queries that return columns with long names or a long error message. The issue was fixed in a newer version of pyodbc.
Solution: Upgrade your installation of pyodbc to version 4.0.32 or above.
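Not part of the original troubleshooting steps, but a quick way to check the installed version from Python:

import pyodbc

# pyodbc exposes its version as a string; anything at or below 4.0.31
# is affected by the issue described above.
print(pyodbc.version)

# To upgrade, run the following from your terminal or command prompt:
#   pip install --upgrade pyodbc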
General troubleshooting
See Issues in the mkleehammer/pyodbc repository on GitHub.