Databricks SQL Connector for Python
The Databricks SQL Connector for Python is a Python library that allows you to use Python code to run SQL commands on Azure Databricks clusters and Databricks SQL warehouses. The Databricks SQL Connector for Python is easier to set up and use than similar Python libraries such as pyodbc. This library follows PEP 249 – Python Database API Specification v2.0.
Requirements
- A development machine running Python >=3.7 and <=3.11.
- An existing cluster or SQL warehouse.
Get started
Gather the following information for the cluster or SQL warehouse that you want to use:
Cluster
- The server hostname of the cluster. You can get this from the Server Hostname value in the Advanced Options > JDBC/ODBC tab for your cluster.
- The HTTP path of the cluster. You can get this from the HTTP Path value in the Advanced Options > JDBC/ODBC tab for your cluster.
- A valid access token. You can use an Azure Databricks personal access token for the workspace. You can also use an Azure Active Directory access token.
Note
As a security best practice, you should not hard-code this information into your code. Instead, you should retrieve this information from a secure location. For example, the code examples later in this article use environment variables.
Sql warehouse
- The server hostname of the SQL warehouse. You can get this from the Server Hostname value in the Connection Details tab for your SQL warehouse.
- The HTTP path of the SQL warehouse. You can get this from the HTTP Path value in the Connection Details tab for your SQL warehouse.
- A valid access token. You can use an Azure Databricks personal access token for the workspace. You can also use an Azure Active Directory access token.
Note
As a security best practice, you should not hard-code this information into your code. Instead, you should retrieve this information from a secure location. For example, the code examples later in this article use environment variables.
Install the Databricks SQL Connector for Python library on your development machine by running
pip install databricks-sql-connector
.
Examples
The following code examples demonstrate how to use the Databricks SQL Connector for Python to query and insert data, query metadata, manage cursors and connections, and configure logging.
These code example retrieve their server_hostname
, http_path
, and access_token
connection variable values from these environment variables:
DATABRICKS_SERVER_HOSTNAME
, which represents the Server Hostname value from the requirements.DATABRICKS_HTTP_PATH
, which represents the HTTP Path value from the requirements.DATABRICKS_TOKEN
, which represents your access token from the requirements.
You can use other approaches to retrieving these connection variable values. Using environment variables is just one approach among many.
Query data
The following code example demonstrates how to call the Databricks SQL Connector for Python to run a basic SQL command on a cluster or SQL warehouse. This command returns the first two rows from the diamonds
table.
Note
The diamonds
table is included in the Sample datasets.
from databricks import sql
import os
with sql.connect(server_hostname = os.getenv("DATABRICKS_SERVER_HOSTNAME"),
http_path = os.getenv("DATABRICKS_HTTP_PATH"),
access_token = os.getenv("DATABRICKS_TOKEN")) as connection:
with connection.cursor() as cursor:
cursor.execute("SELECT * FROM default.diamonds LIMIT 2")
result = cursor.fetchall()
for row in result:
print(row)
Insert data
The following example demonstrate how to insert small amounts of data (thousands of rows):
from databricks import sql
import os
with sql.connect(server_hostname = os.getenv("DATABRICKS_SERVER_HOSTNAME"),
http_path = os.getenv("DATABRICKS_HTTP_PATH"),
access_token = os.getenv("DATABRICKS_TOKEN")) as connection:
with connection.cursor() as cursor:
cursor.execute("CREATE TABLE IF NOT EXISTS squares (x int, x_squared int)")
squares = [(i, i * i) for i in range(100)]
values = ",".join([f"({x}, {y})" for (x, y) in squares])
cursor.execute(f"INSERT INTO squares VALUES {values}")
cursor.execute("SELECT * FROM squares LIMIT 10")
result = cursor.fetchall()
for row in result:
print(row)
For large amounts of data, you should first upload the data to cloud storage and then execute the COPY INTO command.
Query metadata
There are dedicated methods for retrieving metadata. The following example retrieves metadata about columns in a sample table:
from databricks import sql
import os
with sql.connect(server_hostname = os.getenv("DATABRICKS_SERVER_HOSTNAME"),
http_path = os.getenv("DATABRICKS_HTTP_PATH"),
access_token = os.getenv("DATABRICKS_TOKEN")) as connection:
with connection.cursor() as cursor:
cursor.columns(schema_name="default", table_name="squares")
print(cursor.fetchall())
Manage cursors and connections
It is best practice to close any connections and cursors that have been finished with. This frees resources on Azure Databricks clusters and Databricks SQL warehouses.
You can use a context manager (the with
syntax used in previous examples) to manage the resources, or explicitly call close
:
from databricks import sql
import os
connection = sql.connect(server_hostname = os.getenv("DATABRICKS_SERVER_HOSTNAME"),
http_path = os.getenv("DATABRICKS_HTTP_PATH"),
access_token = os.getenv("DATABRICKS_TOKEN"))
cursor = connection.cursor()
cursor.execute("SELECT * from range(10)")
print(cursor.fetchall())
cursor.close()
connection.close()
Configure logging
The Databricks SQL Connector uses Python’s standard logging module. You can configure the logging level similar to the following:
from databricks import sql
import os, logging
logging.getLogger("databricks.sql").setLevel(logging.DEBUG)
logging.basicConfig(filename = "results.log",
level = logging.DEBUG)
connection = sql.connect(server_hostname = os.getenv("DATABRICKS_SERVER_HOSTNAME"),
http_path = os.getenv("DATABRICKS_HTTP_PATH"),
access_token = os.getenv("DATABRICKS_TOKEN"))
cursor = connection.cursor()
cursor.execute("SELECT * from range(10)")
result = cursor.fetchall()
for row in result:
logging.debug(row)
cursor.close()
connection.close()
API reference
Package
databricks-sql-connector
Usage: pip install databricks-sql-connector
See also databricks-sql-connector in the Python Package Index (PyPI).
Module
databricks.sql
Usage: from databricks import sql
Methods
connect
method
Creates a connection to a database.
Returns a Connection object.
Parameters |
---|
server_hostname Type: str The server hostname for the cluster or SQL warehouse. To get the server hostname, see the instructions earlier in this article. This parameter is required. Example: adb-1234567890123456.7.azuredatabricks.net |
http_path Type: str The HTTP path of the cluster or SQL warehouse. To get the HTTP path, see the instructions earlier in this article. This parameter is required. Example: sql/protocolv1/o/1234567890123456/1234-567890-test123 for a cluster./sql/1.0/warehouses/a1b234c567d8e9fa for a SQL warehouse. |
access_token Type: str Your Azure Databricks personal access token or Azure Active Directory token for the workspace for the cluster or SQL warehouse. To create a token, see the instructions earlier in this article. This parameter is required. Example: dapi...<the-remaining-portion-of-your-token> |
session_configuration Type: dict[str, Any] A dictionary of Spark session configuration parameters. Setting a configuration is equivalent to using the SET key=val SQL command. Run the SQL command SET -v to get a full list of available configurations.Defaults to None .This parameter is optional. Example: {"spark.sql.variable.substitute": True} |
http_headers Type: List[Tuple[str, str]]] Additional (key, value) pairs to set in HTTP headers on every RPC request the client makes. Typical usage will not set any extra HTTP headers. Defaults to None .This parameter is optional. Since version 2.0 |
catalog Type: str Initial catalog to use for the connection. Defaults to None (in which case the default catalog, typically hive_metastore , will be used).This parameter is optional. Since version 2.0 |
schema Type: str Initial schema to use for the connection. Defaults to None (in which case the default schema default will be used).This parameter is optional. Since version 2.0 |
use_cloud_fetch Type: bool True to send fetch requests directly to the cloud object store to download chunks of data. False (the default) to send fetch requests directly to Azure Databricks.If use_cloud_fetch is set to True but network access is blocked, then the fetch requests will fail.Since version 2.8 |
Classes
Connection
class
Represents a connection to a database.
Methods
close
method
Closes the connection to the database and releases all associated resources on the server. Any additional calls to this connection will throw an Error
.
No parameters.
No return value.
cursor
method
Returns a mechanism that enables traversal over the records in a database.
No parameters.
Returns a Cursor object.
Cursor
class
Attributes
arraysize
attribute
Used with the fetchmany method, specifies the internal buffer size, which is also how many rows are actually fetched from the server at a time. The default value is 10000
. For narrow results (results in which each row does not contain a lot of data), you should increase this value for better performance.
Read-write access.
description
attribute
Contains a Python list
of tuple
objects. Each of these tuple
objects contains 7 values, with the first 2 items of each tuple
object containing information describing a single result column as follows:
name
: The name of the column.type_code
: A string representing the type of the column. For example, an integer column will have a type code ofint
.
The remaining 5 items of each 7-item tuple
object are not implemented, and their values are not defined. They will typically be returned as 4 None
values followed by a single True
value.
Read-only access.
Methods
cancel
method
Interrupts the running of any database query or command that the cursor has started. To release the associated resources on the server, call the close method after calling the cancel
method.
No parameters.
No return value.
close
method
Closes the cursor and releases the associated resources on the server. Closing an already closed cursor might throw an error.
No parameters.
No return value.
execute
method
Prepares and then runs a database query or command.
No return value.
Parameters |
---|
operation Type: str The query or command to prepare and then run. This parameter is required. Example without the parameters parameter:cursor.execute( 'SELECT * FROM default.diamonds WHERE cut="Ideal" LIMIT 2' ) Example with the parameters parameter:cursor.execute( 'SELECT * FROM default.diamonds WHERE cut=%(cut_type)s LIMIT 2', { 'cut_type': 'Ideal' } ) |
parameters Type: dictionary A sequence of parameters to use with the operation parameter.This parameter is optional. The default is None . |
executemany
method
Prepares and then runs a database query or command using all parameter sequences in the seq_of_parameters
argument. Only the final result set is retained.
No return value.
Parameters |
---|
operation Type: str The query or command to prepare and then run. This parameter is required. |
seq_of_parameters Type: list of dict A sequence of many sets of parameter values to use with the operation parameter.This parameter is required. |
catalogs
method
Execute a metadata query about the catalogs. Actual results should then be fetched using fetchmany
or fetchall
.
Important fields in the result set include:
- Field name:
TABLE_CAT
. Type:str
. The name of the catalog.
No parameters.
No return value.
Since version 1.0
schemas
method
Execute a metadata query about the schemas. Actual results should then be fetched using fetchmany
or fetchall
.
Important fields in the result set include:
- Field name:
TABLE_SCHEM
. Type:str
. The name of the schema. - Field name:
TABLE_CATALOG
. Type:str
. The catalog to which the schema belongs.
No return value.
Since version 1.0
Parameters |
---|
catalog_name Type: str A catalog name to retrieve information about. The % character is interpreted as a wildcard.This parameter is optional. |
schema_name Type: str A schema name to retrieve information about. The % character is interpreted as a wildcard.This parameter is optional. |
tables
method
Execute a metadata query about tables and views. Actual results should then be fetched using fetchmany
or fetchall
.
Important fields in the result set include:
- Field name:
TABLE_CAT
. Type:str
. The catalog to which the table belongs. - Field name:
TABLE_SCHEM
. Type:str
. The schema to which the table belongs. - Field name:
TABLE_NAME
. Type:str
. The name of the table. - Field name:
TABLE_TYPE
. Type:str
. The kind of relation, for exampleVIEW
orTABLE
(applies to Databricks Runtime 10.2 and above as well as to Databricks SQL; prior versions of the Databricks Runtime return an empty string).
No return value.
Since version 1.0
Parameters |
---|
catalog_name Type: str A catalog name to retrieve information about. The % character is interpreted as a wildcard.This parameter is optional. |
schema_name Type: str A schema name to retrieve information about. The % character is interpreted as a wildcard.This parameter is optional. |
table_name Type: str A table name to retrieve information about. The % character is interpreted as a wildcard.This parameter is optional. |
table_types Type: List[str] A list of table types to match, for example TABLE or VIEW .This parameter is optional. |
columns
method
Execute a metadata query about the columns. Actual results should then be fetched using fetchmany
or fetchall
.
Important fields in the result set include:
- Field name:
TABLE_CAT
. Type:str
. The catalog to which the column belongs. - Field name:
TABLE_SCHEM
. Type:str
. The schema to which the column belongs. - Field name:
TABLE_NAME
. Type:str
. The name of the table to which the column belongs. - Field name:
COLUMN_NAME
. Type:str
. The name of the column.
No return value.
Since version 1.0
Parameters |
---|
catalog_name Type: str A catalog name to retrieve information about. The % character is interpreted as a wildcard.This parameter is optional. |
schema_name Type: str A schema name to retrieve information about. The % character is interpreted as a wildcard.This parameter is optional. |
table_name Type: str A table name to retrieve information about. The % character is interpreted as a wildcard.This parameter is optional. |
column_name Type: str A column name to retrieve information about. The % character is interpreted as a wildcard.This parameter is optional. |
fetchall
method
Gets all (or all remaining) rows of a query.
No parameters.
Returns all (or all remaining) rows of the query as a Python list
of Row
objects.
Throws an Error
if the previous call to the execute method did not return any data or no execute
call has yet been made.
fetchmany
method
Gets the next rows of a query.
Returns up to size
(or the arraysize attribute if size
is not specified) of the next rows of a query as a Python list
of Row
objects.
If there are fewer than size
rows left to be fetched, all remaining rows will be returned.
Throws an Error
if the previous call to the execute method did not return any data or no execute
call has yet been made.
Parameters |
---|
size Type: int The number of next rows to get. This parameter is optional. If not specified, the value of the arraysize attribute is used.Example: cursor.fetchmany(10) |
fetchone
method
Gets the next row of the dataset.
No parameters.
Returns the next row of the dataset as a single sequence as a Python tuple
object, or returns None
if there is no more available data.
Throws an Error
if the previous call to the execute method did not return any data or no execute
call has yet been made.
fetchall_arrow
method
Gets all (or all remaining) rows of a query, as a PyArrow Table
object.
Queries returning very large amounts of data should use fetchmany_arrow
instead to reduce memory consumption.
No parameters.
Returns all (or all remaining) rows of the query as a PyArrow table.
Throws an Error
if the previous call to the execute method did not return any data or no execute
call has yet been made.
Since version 2.0
fetchmany_arrow
method
Gets the next rows of a query as a PyArrow Table
object.
Returns up to the size
argument (or the arraysize attribute if size
is not specified) of the next rows of a query as a Python PyArrow Table
object.
Throws an Error
if the previous call to the execute method did not return any data or no execute
call has yet been made.
Since version 2.0
Parameters |
---|
size Type: int The number of next rows to get. This parameter is optional. If not specified, the value of the arraysize attribute is used.Example: cursor.fetchmany_arrow(10) |
Row
class
The row class is a tuple-like data structure that represents an individual result row.
If the row contains a column with the name "my_column"
, you can access the "my_column"
field of row
via
row.my_column
. You can also use numeric indicies to access fields, for example row[0]
.
If the column name is not allowed as an attribute method name (for example, it begins with a digit),
then you can access the field as row["1_my_column"]
.
Since version 1.0
Methods
asDict
method
Return a dictionary representation of the row, which is indexed by field names. If there are duplicate field names, one of the duplicate fields (but only one) will be returned in the dictionary. Which duplicate field is returned is not defined.
No parameters.
Returns a dict
of fields.
Type conversions
The following table maps Apache Spark SQL data types to their Python data type equivalents.
Apache Spark SQL data type | Python data type |
---|---|
array |
numpy.ndarray |
bigint |
int |
binary |
bytearray |
boolean |
bool |
date |
datetime.date |
decimal |
decimal.Decimal |
double |
float |
int |
int |
map |
str |
null |
NoneType |
smallint |
int |
string |
str |
struct |
str |
timestamp |
datetime.datetime |
tinyint |
int |
Troubleshooting
tokenAuthWrapperInvalidAccessToken: Invalid access token
message
Issue: When you run your code, you see a message similar to Error during request to server: tokenAuthWrapperInvalidAccessToken: Invalid access token
.
Possible cause: The value passed to access_token
is not a valid Azure Databricks personal access token.
Recommended fix: Check that the value passed to access_token
is correct and try again.
gaierror(8, 'nodename nor servname provided, or not known')
message
Issue: When you run your code, you see a message similar to Error during request to server: gaierror(8, 'nodename nor servname provided, or not known')
.
Possible cause: The value passed to server_hostname
is not the correct host name.
Recommended fix: Check that the value passed to server_hostname
is correct and try again.
For more information on finding the server hostname, see Retrieve the connection details.
IpAclError
message
Issue: When you run your code, you see the message Error during request to server: IpAclValidation
when you try to use the
connector on an Azure Databricks notebook.
Possible cause: You may have IP allow listing enabled for the Azure Databricks workspace. With IP allow listing, connections from Spark clusters back to the control plane are not allowed by default.
Recommended fix: Ask your administrator to add the data plane subnet to the IP allow list.
Additional resources
For more information, see:
- The Databricks SQL Connector for Python repository on GitHub
- Data types
- Built-in Types (for
bool
,bytearray
,float
,int
, andstr
) on the Python website - datetime (for
datetime.date
anddatatime.datetime
) on the Python website - decimal (for
decimal.Decimal
) on the Python website - Built-in Constants (for
NoneType
) on the Python website
Feedback
Submit and view feedback for