Missing ODBC drivers on Spark 3.3 & 3.4

Abhiram Duvvuru 231 Reputation points Microsoft Employee
2024-07-05T20:55:19.18+00:00

Hi Team - We are upgrading our Spark version from 3.2 to 3.4 and using the pyodbc module for SQL operations. However, we are encountering the following error when attempting to connect to the Azure SQL Database. Could someone please assist us?

Error:

exception <class 'pyodbc.Error'> ('01000', "[01000] [unixODBC][Driver Manager]Can't open lib 'ODBC Driver 17 for SQL Server' : file not found (0) (SQLDriverConnect)")

pyodbc version per Spark runtime:

Spark 3.4 -- 5.0.1
Spark 3.3 -- 4.0.34
Spark 3.2 -- 4.0.30

I have installed the latest version of pyodbc, 5.1.0, on Spark 3.4 and 3.3, but I am still encountering the same error.
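
The error indicates that the unixODBC driver manager cannot open the native "ODBC Driver 17 for SQL Server" library in the Spark 3.3/3.4 runtime, independent of which pyodbc version is installed. As a quick diagnostic (a sketch, assuming only that pyodbc imports in the notebook), you can list the drivers the driver manager actually sees:

import pyodbc

# Lists the ODBC drivers registered with the driver manager (odbcinst.ini).
# If 'ODBC Driver 17 for SQL Server' is missing from this list, the native
# driver library is not present in the runtime, whatever the pyodbc version.
print(pyodbc.drivers())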

Code:

import struct
import time

import pyodbc
from notebookutils import mssparkutils  # available in Synapse notebooks

# server_name and database_name are assumed to be defined elsewhere in the notebook

def get_copt_ss_access_token(access_token):
    """Create packed bytes for authenticating with SQL Server using a token."""
    tokenb = bytes(access_token, 'UTF-8')
    exptoken = b''
    for i in tokenb:
        # Interleave each token byte with a zero byte, as expected by the driver
        exptoken += bytes([i])
        exptoken += bytes(1)
    # Prefix the expanded token with its length as a 4-byte integer
    tokenstruct = struct.pack("=i", len(exptoken)) + exptoken
    return tokenstruct

def get_pyodbc_connection(access_token, connection_string):
    """Create a pyodbc connection using an access token."""
    SQL_COPT_SS_ACCESS_TOKEN = 1256
    copt_ss_at = get_copt_ss_access_token(access_token)
    conn = None
    retry_flag = True
    retry_count = 0
    max_retry_count = 5
    while retry_flag and retry_count < max_retry_count:
        try:
            conn = pyodbc.connect(connection_string, attrs_before={SQL_COPT_SS_ACCESS_TOKEN: copt_ss_at})
            retry_flag = False
            return conn
        except Exception as ex:
            print('exception {0} {1}'.format(type(ex), str(ex)))
            time.sleep(300)  # Retry after 300 seconds
            retry_count = retry_count + 1
    if retry_count >= max_retry_count:
        errormsg = 'Failed to connect to sql with retries'
        print(errormsg)
        raise Exception(errormsg)

def execute_sql_statement(sql_statement):
    """Execute a SQL statement."""
    access_token = mssparkutils.credentials.getConnectionStringOrCreds("LinkedServiceName")
    connection_string = f'DRIVER={{ODBC Driver 17 for SQL Server}};SERVER={server_name};DATABASE={database_name}'
    conn = get_pyodbc_connection(access_token, connection_string)
    cur = conn.cursor()
    try:
        cur.execute(sql_statement)
        conn.commit()
        print('Transactions Committed')
    except Exception as e:
        print('roll back', e)
        cur.rollback()
        raise e
    finally:
        print('Close the Cursor and Connection')
        cur.close()
        conn.close()

def execute_sql_stored_procedure(stored_procedure_name):
    """Execute a stored procedure."""
    execute_sql_statement('EXEC {0}'.format(stored_procedure_name))

sqlstoredprocedurename = '[dbo].[usp_test]'
execute_sql_stored_procedure(sqlstoredprocedurename)

Thanks,

Abhiram

Azure Synapse Analytics

Accepted answer
  1. PRADEEPCHEEKATLA-MSFT 85,826 Reputation points Microsoft Employee
    2024-07-11T07:49:33.02+00:00

    @Abhiram Duvvuru - Thanks for the question and using MS Q&A platform.

    There are two primary ways to install a library on a Spark pool:

    • Install a workspace library that has been uploaded as a workspace package.
    • For updating Python libraries, provide a requirements.txt or Conda environment.yml environment specification to install packages from repositories like PyPI, Conda-Forge, and more. Read the section about environment specification for further information.

    You can easily upload the driver package file as a workspace package and install it directly on the Apache Spark pool.

    Here is an example of how to add a workspace package and install it directly on the Apache Spark pool:

    Step 1: Add your .whl file to the workspace packages.
    Note: You can add artifacts such as .whl, .jar, or .tar.gz files as workspace packages.


    Step 2: Associate the package with the Spark pool as below.

    Go to the Apache Spark pool, click the More option on your Spark pool, and then select Packages.


    You will get options to upload in different ways, as below. First, enable the Allow session level packages option, then under Workspace packages select the file you uploaded to the workspace earlier and click Apply settings.


    If you don't have any .whl files, just create a requirements.txt with the required packages and upload it here under Requirements files (a minimal example is shown below). After clicking Apply settings, wait for the settings to be applied successfully. Then go to the notebook, stop the current session, and run the notebook in a new Spark session.
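
    For illustration only, a minimal requirements.txt for this scenario could contain just the pyodbc pin (the exact version is an assumption; use whatever your workload needs):

    pyodbc==5.1.0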


    Note: Make sure you run the notebook in a new session after adding packages.
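
    As a quick check once the new session starts (a sketch, assuming the pool settings were applied successfully), you can confirm which pyodbc build and which native ODBC drivers the session sees:

    import pyodbc

    # pyodbc build picked up from the pool/session packages
    print(pyodbc.version)
    # native ODBC drivers registered with the driver manager in this session
    print(pyodbc.drivers())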

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click Accept Answer and Yes for "Was this answer helpful". And if you have any further queries, do let us know.


1 additional answer

  1. Amira Bedhiafi 20,336 Reputation points
    2024-07-06T13:36:52.07+00:00

    You need to install the ODBC Driver 17 for SQL Server on each node of your Spark cluster.

    For Ubuntu/Debian:

    sudo su
    curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
    curl https://packages.microsoft.com/config/ubuntu/20.04/prod.list > /etc/apt/sources.list.d/mssql-release.list
    exit
    sudo apt-get update
    sudo ACCEPT_EULA=Y apt-get install -y msodbcsql17
    sudo apt-get install -y unixodbc-dev
    

    For Red Hat/CentOS:

    sudo su
    curl https://packages.microsoft.com/config/rhel/8/prod.repo > /etc/yum.repos.d/mssql-release.repo
    exit
    sudo yum remove unixODBC-utf16 unixODBC-utf16-devel # Ensure older versions are removed
    sudo ACCEPT_EULA=Y yum install -y msodbcsql17
    sudo yum install -y unixODBC-devel
    

    After installing the driver, verify that it is correctly installed by running the following commands:

    odbcinst -j
    odbcinst -q -d

    odbcinst -j prints the unixODBC configuration (including the location of odbcinst.ini), and odbcinst -q -d should list the installed ODBC drivers, including "ODBC Driver 17 for SQL Server".

    Then check that your connection string correctly references ODBC Driver 17 for SQL Server:

    
    connection_string = f'DRIVER={{ODBC Driver 17 for SQL Server}};SERVER={server_name};DATABASE={database_name}'
    

    After installing the driver on all nodes, restart your Spark cluster to ensure all nodes pick up the new driver installation.
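
    If you are not sure whether a given runtime ships driver version 17 or 18, a small sketch like the one below (the pick_sql_server_driver helper is hypothetical, not a pyodbc API) builds the connection string from whatever SQL Server driver pyodbc reports as installed:

    import pyodbc

    # Hypothetical helper: pick whichever SQL Server ODBC driver is actually
    # registered with the driver manager instead of hard-coding the name.
    def pick_sql_server_driver():
        drivers = [d for d in pyodbc.drivers() if 'SQL Server' in d]
        if not drivers:
            raise RuntimeError('No SQL Server ODBC driver is installed')
        return sorted(drivers)[-1]  # prefer the highest-numbered driver

    server_name = 'your_server_name'      # placeholder
    database_name = 'your_database_name'  # placeholder
    connection_string = (
        f'DRIVER={{{pick_sql_server_driver()}}};'
        f'SERVER={server_name};DATABASE={database_name}'
    )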

    Make sure that the pyodbc module is installed in the environment where your Spark job is running. You can install pyodbc using pip if it is not already installed:

    
    pip install pyodbc
    

    Run a simple connection test to ensure everything is set up correctly:

    
    import pyodbc

    connection_string = 'DRIVER={ODBC Driver 17 for SQL Server};SERVER=your_server_name;DATABASE=your_database_name'

    try:
        conn = pyodbc.connect(connection_string)
        print("Connection successful")
    except pyodbc.Error as ex:
        print("Error in connection: ", ex)