How can I connect to my on-premises Impala system from Azure Databricks using Python/PySpark code?

Question

Anonymous

Hi Team,

I'm trying to connect to the Impala EDL system from Azure Databricks using PySpark/Python code. I have installed the Cloudera ODBC jar for Impala but am still unable to connect. Can you help me connect to Impala from Databricks?

KranthiPakala-MSFT 46,647 Reputation points Microsoft Employee Moderator

2023-02-01T23:07:48.1333333+00:00

Hi there,

We still have not heard back from you. Just wanted to check if the below suggestion was helpful? If it answers your query, please do click “Accept Answer” and/or Up-Vote, as it might be beneficial to other community members reading this thread. And, if you have any further query do let us know.

Thanks

Answer 1

Hi Anonymous,

Thank you for using Microsoft Q&A forum and thank you for posting your query.

To connect to an on-premises Impala system from Azure Databricks using Python/PySpark code, you can use Spark's built-in JDBC data source. The pyodbc library is an ODBC client and is not required for this approach; if you want it for a separate ODBC-based connection, you can install it with the command %pip install pyodbc in a Databricks notebook cell.

You will also need the Impala JDBC driver installed on the Databricks cluster. Download the Impala JDBC driver from the Cloudera website and install the JAR on your cluster as a library. Once the driver is installed, you can use PySpark's spark.read.jdbc() method to connect to the Impala server.

Below is an example of how you can use the spark.read.jdbc() method to connect to an Impala server and read data from a table:

Note: Please make sure to replace the placeholders <hostname>, <port>, <database>, <username>, <password>, and <table_name> with your on-premises Impala system details.

from pyspark.sql import SparkSession

# Get the SparkSession (Databricks notebooks already provide one as `spark`)
spark = SparkSession.builder.appName("Impala connection").getOrCreate()

# JDBC URL and connection properties for the Impala server
connection_url = "jdbc:impala://<hostname>:<port>/<database>"
properties = {
  "user": "<username>",
  "password": "<password>",
  # The driver class name depends on the driver JAR you installed; check its documentation
  "driver": "com.cloudera.impala.jdbc41.Driver"
}

# Read data from the Impala table into a Spark DataFrame
dataframe = spark.read.jdbc(url=connection_url, table="<table_name>", properties=properties)

# Show the first rows of the DataFrame
dataframe.show()
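As a variation on the read above, the same connection can be expressed through Spark's generic JDBC reader, which also accepts a SQL query instead of a whole table. The sketch below only builds the option dictionary so it can be checked without a cluster. The helper name impala_jdbc_options and the fetchsize default are hypothetical choices for illustration, and the driver class is the one from the example above, which may differ for your driver version.

```python
def impala_jdbc_options(host, port, database, user, password,
                        table=None, query=None, fetchsize=10000,
                        driver="com.cloudera.impala.jdbc41.Driver"):
    """Build options for spark.read.format("jdbc") (hypothetical helper).

    Spark's JDBC source accepts either "dbtable" or "query", not both,
    so exactly one of `table` or `query` must be given.
    """
    if (table is None) == (query is None):
        raise ValueError("Provide exactly one of table or query")

    options = {
        "url": f"jdbc:impala://{host}:{port}/{database}",
        "user": user,
        "password": password,
        "driver": driver,              # assumption: depends on the installed JAR
        "fetchsize": str(fetchsize),   # rows fetched per round trip
    }
    if table is not None:
        options["dbtable"] = table
    else:
        options["query"] = query
    return options


# Usage inside a Databricks notebook (needs the driver JAR and network access):
# opts = impala_jdbc_options("<hostname>", 21050, "<database>",
#                            "<username>", "<password>",
#                            query="SELECT * FROM <table_name> LIMIT 100")
# dataframe = spark.read.format("jdbc").options(**opts).load()
```

Passing a query lets Impala do the filtering before rows are sent to Spark, which can help when the link to the on-premises site is slow.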

Important Note: Unless your Impala server is reachable from the Databricks cluster (for example, exposed to the internet), the connection will fail. For an on-premises server, you may need to deploy your Databricks workspace in a VNet (VNet injection) that has VPN or ExpressRoute connectivity to your on-premises site, with the correct routing in place. For more information, please refer to this Databricks documentation: Connect your Azure Databricks workspace to your on-premises network
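Before debugging the JDBC driver, it can help to confirm that the cluster can open a TCP connection to the Impala host at all. The following is a minimal sketch using only the Python standard library; the function name can_reach is hypothetical, and port 21050 is Impala's usual default for JDBC/HiveServer2 clients, so use whatever port your server actually listens on.

```python
import socket


def can_reach(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts
        return False


# Usage in a notebook cell (placeholder host; 21050 is an assumed default port):
# print(can_reach("<hostname>", 21050))
```

A False result points to routing, firewall, or DNS problems between the workspace VNet and the on-premises network rather than to the driver or credentials. Note that this runs on the driver node only; executors use the same VNet but are separate machines.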

Hope this info helps.

Thank you

Please don’t forget to Accept Answer and Up-Vote wherever the information provided helps you; this can be beneficial to other community members.
