How can I connect to my on-premises Impala system from Azure Databricks using Python/PySpark code?

Anonymous
2023-01-25T07:00:33.1566667+00:00

Hi Team,

I'm trying to connect to an Impala EDL system from Azure Databricks using PySpark/Python code. I have installed the Cloudera ODBC jar for Impala but am still unable to connect. Can you help me connect to Impala from Databricks?

Azure Databricks

1 answer

  1. KranthiPakala-MSFT 46,647 Reputation points Microsoft Employee Moderator
    2023-01-25T23:47:22.78+00:00

    Hi Anonymous,

    Thank you for using Microsoft Q&A forum and thank you for posting your query.

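    To connect to an on-premises Impala system from Azure Databricks using Python/PySpark code, you can use Spark's built-in JDBC data source together with the Impala JDBC driver. The pyodbc library is not required for this JDBC approach. It only applies if you connect over ODBC instead, in which case you can install it with %pip install pyodbc in a Databricks notebook cell, and the Impala ODBC driver must also be present on the cluster.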

    You will also need the Impala JDBC driver installed on the Databricks cluster. Download the driver from the Cloudera website and install its JAR on your cluster as a library. Once the driver is installed, you can use PySpark's spark.read.jdbc() method to connect to the Impala server.

    Below is an example of how you can use the spark.read.jdbc() method to connect to an Impala server and read data from a table.

    Note: Replace the placeholders <hostname>, <port>, <database>, <username>, <password>, and <table_name> with the details of your on-premises Impala system.

    from pyspark.sql import SparkSession
    
    # In a Databricks notebook `spark` already exists; getOrCreate() returns that session
    spark = SparkSession.builder.appName("Impala connection").getOrCreate()
    
    # JDBC URL and connection properties
    connection_url = "jdbc:impala://<hostname>:<port>/<database>"
    properties = {
        "user": "<username>",
        "password": "<password>",
        # The driver class name depends on the driver version you installed; check the driver's documentation
        "driver": "com.cloudera.impala.jdbc41.Driver",
    }
    
    # Read data from the Impala table
    dataframe = spark.read.jdbc(url=connection_url, table="<table_name>", properties=properties)
    
    # Show the first rows of the DataFrame
    dataframe.show()
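
If you read from several tables, it may help to build the URL and properties in one place, and to wrap a SELECT statement so that Impala runs the query rather than Spark pulling the whole table. The sketch below is only an illustration: the helper names build_impala_jdbc_config and as_subquery, the alias impala_src, and the hostname in the usage comment are hypothetical, and the default driver class is simply the one from the example above.

```python
def build_impala_jdbc_config(host, port, database, user, password,
                             driver="com.cloudera.impala.jdbc41.Driver"):
    """Return (url, properties) in the shape spark.read.jdbc() accepts."""
    url = f"jdbc:impala://{host}:{int(port)}/{database}"
    properties = {"user": user, "password": password, "driver": driver}
    return url, properties


def as_subquery(sql, alias="impala_src"):
    """Wrap a SELECT statement so it can be passed as the `table` argument."""
    cleaned = sql.strip().rstrip(";").strip()
    return f"({cleaned}) AS {alias}"


# Usage inside a Databricks notebook (hypothetical host and query):
# url, props = build_impala_jdbc_config(
#     "impala.example.internal", 21050, "default", "<username>", "<password>")
# df = spark.read.jdbc(
#     url=url,
#     table=as_subquery("SELECT id, name FROM customers WHERE region = 'EU'"),
#     properties=props)
```

Rather than hard-coding the password, you would normally fetch it from a Databricks secret scope and pass it into the helper.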
    
    

    Important note: The connection will fail unless the Impala endpoint is reachable from the Databricks cluster, which is normally not the case for an on-premises system that is not exposed to the internet. You may need to deploy your Azure Databricks workspace in your own virtual network (VNet injection) that has VPN or ExpressRoute connectivity to your on-premises site, with the correct routing in place. For more information, please refer to this Databricks documentation: Connect your Azure Databricks workspace to your on-premises network
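
Before debugging the JDBC settings, it can be worth confirming that the network path is open at all. The following is a minimal sketch, not an official diagnostic: can_reach is a hypothetical helper that only tries to open a TCP connection to the Impala host and port. When run in a notebook cell it tests reachability from the driver node, and it says nothing about authentication or the JDBC driver itself.

```python
import socket


def can_reach(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures
        return False


# Usage (replace the placeholders with your Impala host and port):
# print(can_reach("<hostname>", <port>))
```

If this returns False, the problem is likely routing, DNS, or a firewall between the workspace VNet and the on-premises network rather than the Spark code.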

    Hope this info helps.

    Thank you


    Please don’t forget to Accept Answer and Up-Vote wherever the information provided helps you, this can be beneficial to other community members.
