Error while trying to connect to on-prem Impala from Databricks to read a table using PySpark

29577539 0 Reputation points
2023-06-16T04:58:34.8433333+00:00

I am using the 11.3 LTS runtime of Azure Databricks, and the Impala cluster version is 3.4.0-SNAPSHOT. I have installed Impala JDBC driver version 2.6.4.1005 on the cluster, and the code I am using is below:

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:impala://hostname:portname/database")
          .option("dbtable", "tblnm")
          .option("user", usname)
          .option("password", PWD)
          .option("driver", "com.cloudera.impala.jdbc41.Driver")
          .load())

After execution it throws the error: Simba ImpalaJDBCDriver 700110 Unexpected session error: java.lang.NoClassDefFoundError

If we install another version of the Impala JDBC driver and pass option("driver", "com.cloudera.impala.jdbc.Driver") instead, it gives the error: Communication link failure.

Could you please suggest what might be causing the issue? The port is already open, and we are able to connect to Impala through pyodbc.

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

1 answer

  1. PRADEEPCHEEKATLA 90,651 Reputation points Moderator
    2023-06-19T08:29:21.21+00:00

    @29577539 - Thanks for the question and for using the MS Q&A platform.

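    The "Communication link failure" message indicates a communication issue between Databricks and your on-premises Impala instance, while java.lang.NoClassDefFoundError generally means that a class the driver needs could not be loaded at run time. Here are some steps you can take to troubleshoot and resolve the issue: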

    • Check network connectivity: Ensure that there is network connectivity between Databricks and your on-premises Impala instance. You can use tools like ping or telnet to test connectivity.
    • Check firewall settings: Ensure that the necessary ports are open in the firewall settings for your on-premises Impala instance. The default port for Impala JDBC connections is 21050, but this may vary depending on your configuration.
    • Check Impala configuration: Ensure that Impala is configured to allow connections from Databricks. You may need to add the IP address or hostname of the Databricks cluster to the Impala configuration.
    • Check JDBC driver version: Ensure that you are using the correct version of the JDBC driver for your Impala instance. You can download the latest version of the Cloudera Impala JDBC driver from the Cloudera website.
    • Check Databricks cluster configuration: Ensure that the Databricks cluster is configured to use the correct JDBC driver and connection settings for your Impala instance. You can configure these settings in the cluster configuration settings.
    • Check Impala logs: Check the Impala logs for any errors or warnings related to the connection issue. This may provide additional information on the cause of the issue.
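
    As a rough illustration of the connectivity check above, the sketch below tries to open a TCP connection from a Databricks notebook cell, which exercises the network path from the cluster driver node rather than from a local machine. The helper name can_connect and the host name in the usage comment are hypothetical placeholders, not values from this thread.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host name and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example usage (placeholder host; 21050 is the default port mentioned above):
# print(can_connect("impala-host.example.internal", 21050))
```

    A False result here, while pyodbc works from another machine, would suggest that the port is open for that machine but not for the Databricks cluster's network.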

    By following these steps, you should be able to identify and resolve the communication issue between Databricks and your on-premises Impala instance.
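
    To make the driver and connection settings easier to review, one option is to build the JDBC options in a single place before passing them to Spark. This is only a sketch: the helper name impala_jdbc_options is hypothetical, the option keys and driver class are the ones used in the question, and the right driver class depends on which driver JAR is actually installed on the cluster.

```python
def impala_jdbc_options(host, port, database, table, user, password,
                        driver="com.cloudera.impala.jdbc41.Driver"):
    """Build the option dictionary for spark.read.format("jdbc")."""
    return {
        # The port must be the numeric Impala JDBC port, not a name.
        "url": f"jdbc:impala://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        # Assumption: this class must match the installed driver JAR.
        "driver": driver,
    }


# Example usage in a notebook (placeholder values, requires a Spark session):
# opts = impala_jdbc_options("impala-host", 21050, "database", "tblnm", usname, PWD)
# df = spark.read.format("jdbc").options(**opts).load()
```

    Printing the resulting url before calling load() makes it easy to spot a malformed connection string, such as a missing quote or a port name in place of a port number.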

    Also, check out the MS Q&A thread "How can i connect to my on premise Impala system from Azure databricks using python/pyspark code", which addresses a similar issue.

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, please click "Accept Answer" and "Yes" for "Was this answer helpful".

