Missing ODBC drivers on Spark 3.3 & 3.4

Abhiram Duvvuru 231 Reputation points Microsoft Employee
2024-07-05T20:55:19.18+00:00

Hi Team - We are upgrading our Spark version from 3.2 to 3.4 and using the pyodbc module for SQL operations. However, we are encountering the following error when attempting to connect to the Azure SQL Database. Could someone please assist us?

Error:

exception <class 'pyodbc.Error'> ('01000', "[01000] [unixODBC][Driver Manager]Can't open lib 'ODBC Driver 17 for SQL Server' : file not found (0) (SQLDriverConnect)")

pyodbc version per Spark runtime:

Spark 3.4 -- 5.0.1
Spark 3.3 -- 4.0.34
Spark 3.2 -- 4.0.30

I have installed the latest version of pyodbc, 5.1.0, on Spark 3.4 and 3.3, but I am still encountering the same error.
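
The error indicates that the unixODBC driver manager cannot open the native "ODBC Driver 17 for SQL Server" library in the Spark 3.3/3.4 runtime, independent of which pyodbc version is installed. As a quick diagnostic (a sketch, assuming only that pyodbc imports in the notebook), you can list the drivers the driver manager actually sees:

import pyodbc

# Lists the ODBC drivers registered with the driver manager (odbcinst.ini).
# If 'ODBC Driver 17 for SQL Server' is missing from this list, the native
# driver library is not present in the runtime, whatever the pyodbc version.
print(pyodbc.drivers())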

Code:

import struct
import time

import pyodbc
from notebookutils import mssparkutils  # available in Synapse notebooks

# server_name and database_name are assumed to be defined elsewhere in the notebook

def get_copt_ss_access_token(access_token):
    """Create packed bytes for authenticating with SQL Server using a token."""
    tokenb = bytes(access_token, 'UTF-8')
    exptoken = b''
    for i in tokenb:
        # Interleave each token byte with a zero byte, as expected by the driver
        exptoken += bytes([i])
        exptoken += bytes(1)
    # Prefix the expanded token with its length as a 4-byte integer
    tokenstruct = struct.pack("=i", len(exptoken)) + exptoken
    return tokenstruct

def get_pyodbc_connection(access_token, connection_string):
    """Create a pyodbc connection using an access token."""
    SQL_COPT_SS_ACCESS_TOKEN = 1256
    copt_ss_at = get_copt_ss_access_token(access_token)
    conn = None
    retry_flag = True
    retry_count = 0
    max_retry_count = 5
    while retry_flag and retry_count < max_retry_count:
        try:
            conn = pyodbc.connect(connection_string, attrs_before={SQL_COPT_SS_ACCESS_TOKEN: copt_ss_at})
            retry_flag = False
            return conn
        except Exception as ex:
            print('exception {0} {1}'.format(type(ex), str(ex)))
            time.sleep(300)  # Retry after 300 seconds
            retry_count = retry_count + 1
    if retry_count >= max_retry_count:
        errormsg = 'Failed to connect to sql with retries'
        print(errormsg)
        raise Exception(errormsg)

def execute_sql_statement(sql_statement):
    """Execute a SQL statement."""
    access_token = mssparkutils.credentials.getConnectionStringOrCreds("LinkedServiceName")
    connection_string = f'DRIVER={{ODBC Driver 17 for SQL Server}};SERVER={server_name};DATABASE={database_name}'
    conn = get_pyodbc_connection(access_token, connection_string)
    cur = conn.cursor()
    try:
        cur.execute(sql_statement)
        conn.commit()
        print('Transactions Committed')
    except Exception as e:
        print('roll back', e)
        cur.rollback()
        raise e
    finally:
        print('Close the Cursor and Connection')
        cur.close()
        conn.close()

def execute_sql_stored_procedure(stored_procedure_name):
    """Execute a stored procedure."""
    execute_sql_statement('EXEC {0}'.format(stored_procedure_name))

sqlstoredprocedurename = '[dbo].[usp_test]'
execute_sql_stored_procedure(sqlstoredprocedurename)

Thanks,

Abhiram

Azure Synapse Analytics

Accepted answer
  1. PRADEEPCHEEKATLA-MSFT 85,826 Reputation points Microsoft Employee
    2024-07-11T07:49:33.02+00:00

    @Abhiram Duvvuru - Thanks for the question and using MS Q&A platform.

    There are two primary ways to install a library on a Spark pool:

    • Install a workspace library that has been uploaded as a workspace package.
    • For updating Python libraries, provide a requirements.txt or Conda environment.yml environment specification to install packages from repositories like PyPI, Conda-Forge, and more. Read the section about environment specification for further information.

    You can easily upload the driver package file as a workspace package and install it directly on the Apache Spark pool.

    Here is an example of how to add a workspace package and install it directly on the Apache Spark pool:

    Step 1: Add your .whl file to the workspace packages.
    Note: You can add artifacts such as .whl, .jar, or .tar.gz files as workspace packages.


    Step 2: Associate the package with the Spark pool as below.

    Go to the Apache Spark pool, click the More option on your Spark pool, and then select Packages.


    You will get options to upload in different ways, as below. First, enable the Allow session level packages option, then under Workspace packages select the file you uploaded to the workspace earlier and click Apply settings.


    If you don't have any .whl files, just create a requirements.txt with the required packages and upload it here under Requirements files (a minimal example is shown below). After clicking Apply settings, wait for the settings to be applied successfully. Then go to the notebook, stop the current session, and run the notebook in a new Spark session.
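
    For illustration only, a minimal requirements.txt for this scenario could contain just the pyodbc pin (the exact version is an assumption; use whatever your workload needs):

    pyodbc==5.1.0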


    Note: Make sure you run the notebook in a new session after adding packages.
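
    As a quick check once the new session starts (a sketch, assuming the pool settings were applied successfully), you can confirm which pyodbc build and which native ODBC drivers the session sees:

    import pyodbc

    # pyodbc build picked up from the pool/session packages
    print(pyodbc.version)
    # native ODBC drivers registered with the driver manager in this session
    print(pyodbc.drivers())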

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click Accept Answer and Yes for "Was this answer helpful". And if you have any further queries, do let us know.


1 additional answer

  1. Amira Bedhiafi 20,336 Reputation points
    2024-07-06T13:36:52.07+00:00

    You need to install the ODBC Driver 17 for SQL Server on each node of your Spark cluster.

    For Ubuntu/Debian:

    sudo su
    curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
    curl https://packages.microsoft.com/config/ubuntu/20.04/prod.list > /etc/apt/sources.list.d/mssql-release.list
    exit
    sudo apt-get update
    sudo ACCEPT_EULA=Y apt-get install -y msodbcsql17
    sudo apt-get install -y unixodbc-dev
    

    For Red Hat/CentOS:

    sudo su
    curl https://packages.microsoft.com/config/rhel/8/prod.repo > /etc/yum.repos.d/mssql-release.repo
    exit
    sudo yum remove unixODBC-utf16 unixODBC-utf16-devel # Ensure older versions are removed
    sudo ACCEPT_EULA=Y yum install -y msodbcsql17
    sudo yum install -y unixODBC-devel
    

    After installing the driver, verify that it is correctly installed by running the following commands:

    odbcinst -j
    odbcinst -q -d

    odbcinst -j prints the unixODBC configuration (including the location of odbcinst.ini), and odbcinst -q -d should list the installed ODBC drivers, including "ODBC Driver 17 for SQL Server".

    Then check that your connection string correctly references ODBC Driver 17 for SQL Server:

    
    connection_string = f'DRIVER={{ODBC Driver 17 for SQL Server}};SERVER={server_name};DATABASE={database_name}'
    

    After installing the driver on all nodes, restart your Spark cluster to ensure all nodes pick up the new driver installation.
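
    If you are not sure whether a given runtime ships driver version 17 or 18, a small sketch like the one below (the pick_sql_server_driver helper is hypothetical, not a pyodbc API) builds the connection string from whatever SQL Server driver pyodbc reports as installed:

    import pyodbc

    # Hypothetical helper: pick whichever SQL Server ODBC driver is actually
    # registered with the driver manager instead of hard-coding the name.
    def pick_sql_server_driver():
        drivers = [d for d in pyodbc.drivers() if 'SQL Server' in d]
        if not drivers:
            raise RuntimeError('No SQL Server ODBC driver is installed')
        return sorted(drivers)[-1]  # prefer the highest-numbered driver

    server_name = 'your_server_name'      # placeholder
    database_name = 'your_database_name'  # placeholder
    connection_string = (
        f'DRIVER={{{pick_sql_server_driver()}}};'
        f'SERVER={server_name};DATABASE={database_name}'
    )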

    Make sure that the pyodbc module is installed in the environment where your Spark job is running. You can install pyodbc using pip if it is not already installed:

    
    pip install pyodbc
    

    Run a simple connection test to ensure everything is set up correctly:

    
    import pyodbc

    connection_string = 'DRIVER={ODBC Driver 17 for SQL Server};SERVER=your_server_name;DATABASE=your_database_name'

    try:
        conn = pyodbc.connect(connection_string)
        print("Connection successful")
    except pyodbc.Error as ex:
        print("Error in connection: ", ex)