How to read and write data to HBase on HDInsight with Databricks

Nobuyoshi Ishizuka 0 Reputation points
2023-09-21T14:38:31.71+00:00

I'm doing a proof of concept to compare which tool best suits our company, Cosmos DB or an HBase cluster on HDInsight. I'm trying to read and write data from Databricks to HBase on HDInsight. I tried to use the SHC library, but it is only available for Spark 2.1 and 2.4. I also tried the Phoenix JDBC driver, but I couldn't get it to work. Could anyone help me?

Azure HDInsight
An Azure managed cluster service for open-source analytics.
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

1 answer

  1. Amira Bedhiafi 15,521 Reputation points
    2023-09-21T15:41:58.4033333+00:00

    You didn't provide detailed information, so I am assuming the following (I checked some resources and am trying to give you a general view).

    First, check the JDBC URL format; it should look something like jdbc:phoenix:<zookeeper_quorum>:2181:/hbase-unsecure (ZooKeeper quorum, client port, and root znode; the port can often be omitted when ZooKeeper listens on the default 2181). Also make sure the Phoenix JDBC jar is installed and accessible from your Databricks cluster.

    If you want to read from HBase using Phoenix JDBC:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName("HBase_Phoenix") \
        .getOrCreate()

    # Change the JDBC URL according to your setup:
    # jdbc:phoenix:<zookeeper_quorum>:<port>:<root_znode>
    jdbc_url = "jdbc:phoenix:your_zookeeper_quorum:2181:/hbase-unsecure"

    # Read a Phoenix table through the generic Spark JDBC source
    df = spark.read \
        .format("jdbc") \
        .option("url", jdbc_url) \
        .option("dbtable", "your_table") \
        .option("driver", "org.apache.phoenix.jdbc.PhoenixDriver") \
        .load()
    df.show()
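
    For writes, be aware that Spark's generic JDBC writer emits INSERT statements, while Phoenix only supports UPSERT, so writing through the plain JDBC path usually fails. Here is a minimal sketch of the write side using the Phoenix Spark connector instead, assuming the phoenix-spark jar is attached to your cluster and your_output_table (a placeholder name) already exists in Phoenix:

    # Write via the Phoenix Spark connector (phoenix-spark jar required).
    # The connector performs UPSERTs and requires save mode "overwrite".
    df.write \
        .format("org.apache.phoenix.spark") \
        .mode("overwrite") \
        .option("table", "your_output_table") \
        .option("zkUrl", "your_zookeeper_quorum:2181:/hbase-unsecure") \
        .save()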
    

    Also, I agree with you that SHC has limited Spark version support. If your Databricks cluster is running a Spark version that is not compatible, you may have to build the SHC library from source for your specific Spark version, or use a Databricks runtime that includes a compatible Spark version. Then you can proceed with the following (I had this from an old project):

    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder \
        .appName("HBase-SHC") \
        .getOrCreate()
    
    # Catalog that maps the HBase table layout to DataFrame columns
    catalog = """{
        "table":{"namespace":"default", "name":"your_table"},
        "rowkey":"key",
        "columns":{
            "col0":{"cf":"rowkey", "col":"key", "type":"string"},
            "col1":{"cf":"your_column_family", "col":"your_column", "type":"string"}
        }
    }"""
    
    # Read data from HBase
    df = spark.read \
        .option("catalog", catalog) \
        .format('org.apache.spark.sql.execution.datasources.hbase') \
        .load()
    
    df.show()
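
    Writing back with SHC reuses the same catalog. This is a hedged sketch based on the SHC README; the newtable option (a number of regions, passed as a string) only matters if SHC has to create the table for you:

    # Write the DataFrame to HBase through SHC using the same catalog;
    # "newtable" asks SHC to create the table with 5 regions if missing.
    df.write \
        .options(catalog=catalog, newtable="5") \
        .format('org.apache.spark.sql.execution.datasources.hbase') \
        .save()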
    
    
    1 person found this answer helpful.