How to connect on-prem Hadoop Hive to Azure Databricks

Prasad Sandu 0 Reputation points
2025-07-03T19:27:41.45+00:00

How do I connect on-premises Hadoop Hive to Azure Databricks?

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

1 answer

  1. Smaran Thoomu 24,750 Reputation points Microsoft External Staff Moderator
    2025-07-03T23:15:10.4166667+00:00

    Hi Prasad Sandu,
    To connect your on-premises Hadoop Hive to Azure Databricks, you can follow one of these approaches depending on your setup and security constraints:

    Option 1: Connect via JDBC (recommended for read access)

    Install Hive JDBC Driver on your Databricks cluster:

    • Upload the Hive JDBC JAR (e.g., hive-jdbc-uber.jar) to DBFS or install it using %sh or cluster libraries.
    • Example: download the standalone JAR with a shell cell (for availability on all nodes, prefer installing it as a cluster library, e.g., the Maven coordinate org.apache.hive:hive-jdbc:3.1.2):
        %sh wget https://repo1.maven.org/maven2/org/apache/hive/hive-jdbc/3.1.2/hive-jdbc-3.1.2-standalone.jar -P /databricks/jars/
      

    Establish JDBC Connection: Use the following Spark read command in a notebook:

    jdbc_url = "jdbc:hive2://<onprem-host>:10000/default"
    df = spark.read.format("jdbc") \
        .option("url", jdbc_url) \
        .option("driver", "org.apache.hive.jdbc.HiveDriver") \
        .option("dbtable", "<your_table>") \     .load() 
    df.show()
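
    If HiveServer2 requires authentication, Spark's generic JDBC options cover the simple username/password case. A minimal sketch, assuming LDAP-style (username/password) authentication; the credential placeholders are hypothetical, and Kerberos-secured clusters need additional JAAS/keytab configuration not shown here:

    # Sketch for a username/password-authenticated HiveServer2 (e.g., LDAP).
    # <onprem-host>, <your_table>, <hive_user>, <hive_password> are placeholders.
    df = spark.read.format("jdbc") \
        .option("url", "jdbc:hive2://<onprem-host>:10000/default") \
        .option("driver", "org.apache.hive.jdbc.HiveDriver") \
        .option("dbtable", "<your_table>") \
        .option("user", "<hive_user>") \
        .option("password", "<hive_password>") \
        .load()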
    

    Network Requirements:

    • Ensure that Azure Databricks can reach your on-prem Hive server.
    • Connectivity is typically established by deploying the workspace into your own virtual network (VNet injection) and linking that VNet to on-prem via Azure VPN Gateway or ExpressRoute; DNS resolution for the on-prem host must also work from the cluster. A quick reachability check is sketched below.
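
    Before debugging JDBC errors, it helps to confirm basic TCP reachability from the cluster first. A minimal sketch using only the Python standard library, run in a notebook cell; <onprem-host> and port 10000 are placeholders for your HiveServer2 endpoint:

    import socket

    # Probe the HiveServer2 host/port from the Databricks driver node.
    try:
        with socket.create_connection(("<onprem-host>", 10000), timeout=5):
            print("HiveServer2 port is reachable from the driver")
    except OSError as exc:
        print(f"Not reachable: {exc}")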

    Option 2: Access via Azure Data Factory + SHIR

    Use ADF with a Self-Hosted Integration Runtime (SHIR) on your on-prem environment to:

    • Ingest data from Hive to Azure (e.g., Blob or ADLS).
    • Then load it into Databricks for processing (see the sketch after this list).
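
    Once ADF has landed the files, reading them back in Databricks is a standard Spark read. A minimal sketch, assuming the copy activity wrote Parquet to a hypothetical ADLS Gen2 path and the cluster is already configured to authenticate to the storage account:

    # Read the files staged by ADF from ADLS Gen2 (path is a placeholder).
    staged = "abfss://raw@<storage-account>.dfs.core.windows.net/hive/<your_table>/"
    df = spark.read.parquet(staged)
    df.createOrReplaceTempView("hive_staged")  # queryable from SQL cells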

    Option 3: Use DistCp or Hive export

    If JDBC is not ideal, you can:

    • Export the Hive table data to files (e.g., CSV or Parquet) on HDFS.
    • Use DistCp or AzCopy to move it to Azure Data Lake Storage.
    • Load it into Databricks as a Delta table or via a plain Spark read (sketched below).
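
    As a sketch of that final step, assuming the exported Parquet files were copied to a hypothetical ADLS path, converting them into a managed Delta table looks like this:

    # Load the copied export and persist it as a Delta table for downstream use.
    src = "abfss://staging@<storage-account>.dfs.core.windows.net/hive_export/<your_table>/"
    spark.read.parquet(src) \
        .write.format("delta") \
        .mode("overwrite") \
        .saveAsTable("<target_schema>.<your_table>")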

    Notes:

    • JDBC is best for small-to-medium data pulls or live lookups; larger pulls can be parallelized with Spark's partitioned JDBC options (see the sketch after this list).
    • For large-scale data, it's more efficient to stage in storage (ADLS) and process via Databricks.
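
    For pulls that are too large for a single JDBC connection but not worth a full staging pipeline, Spark's standard partitioned-read options split the query across executors. A minimal sketch; the numeric split column "id" and the bounds are assumptions to replace with values from your table:

    # Parallel JDBC read: Spark issues one range-filtered query per partition.
    df = spark.read.format("jdbc") \
        .option("url", "jdbc:hive2://<onprem-host>:10000/default") \
        .option("driver", "org.apache.hive.jdbc.HiveDriver") \
        .option("dbtable", "<your_table>") \
        .option("partitionColumn", "id") \
        .option("lowerBound", "1") \
        .option("upperBound", "1000000") \
        .option("numPartitions", "8") \
        .load()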


    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click "Accept Answer" and "Yes" for "Was this answer helpful?".

