How to read and write data to HBase on HDInsight with Databricks

Nobuyoshi Ishizuka 0 Reputation points
2023-09-21T14:38:31.71+00:00

I'm doing a proof of concept to compare which tool best suits our company, Cosmos DB or an HBase cluster on HDInsight. I'm trying to read and write data from Databricks to HBase on HDInsight. I tried to use the SHC library, but it is only available for Spark 2.1 and 2.4. I also tried the Phoenix JDBC driver, but I couldn't get it to work. Could anyone help me?

Azure HDInsight
An Azure managed cluster service for open-source analytics.
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

1 answer

  1. Amira Bedhiafi 15,521 Reputation points
    2023-09-21T15:41:58.4033333+00:00

    You didn't provide detailed information, so I am assuming the following (I checked some resources and am trying to give you a general view).

    First, check the JDBC URL format; it should look something like jdbc:phoenix:<zookeeper_quorum>:2181:/hbase-unsecure (ZooKeeper quorum, client port, and root znode; the port can often be omitted when ZooKeeper listens on the default 2181). Also make sure the Phoenix JDBC jar is installed and accessible from your Databricks cluster.

    If you want to read from HBase using Phoenix JDBC:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName("HBase_Phoenix") \
        .getOrCreate()

    # Change the JDBC URL according to your setup:
    # jdbc:phoenix:<zookeeper_quorum>:<port>:<root_znode>
    jdbc_url = "jdbc:phoenix:your_zookeeper_quorum:2181:/hbase-unsecure"

    # Read a Phoenix table through the generic Spark JDBC source
    df = spark.read \
        .format("jdbc") \
        .option("url", jdbc_url) \
        .option("dbtable", "your_table") \
        .option("driver", "org.apache.phoenix.jdbc.PhoenixDriver") \
        .load()
    df.show()
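
    For writes, be aware that Spark's generic JDBC writer emits INSERT statements, while Phoenix only supports UPSERT, so writing through the plain JDBC path usually fails. Here is a minimal sketch of the write side using the Phoenix Spark connector instead, assuming the phoenix-spark jar is attached to your cluster and your_output_table (a placeholder name) already exists in Phoenix:

    # Write via the Phoenix Spark connector (phoenix-spark jar required).
    # The connector performs UPSERTs and requires save mode "overwrite".
    df.write \
        .format("org.apache.phoenix.spark") \
        .mode("overwrite") \
        .option("table", "your_output_table") \
        .option("zkUrl", "your_zookeeper_quorum:2181:/hbase-unsecure") \
        .save()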
    

    Also, I agree with you that SHC has limited Spark version support. If your Databricks cluster is running a Spark version that is not compatible, you may have to build the SHC library from source for your specific Spark version, or use a Databricks runtime that includes a compatible Spark version. Then you can proceed with the following (I had this from an old project):

    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder \
        .appName("HBase-SHC") \
        .getOrCreate()
    
    # Catalog that maps the HBase table layout to DataFrame columns
    catalog = """{
        "table":{"namespace":"default", "name":"your_table"},
        "rowkey":"key",
        "columns":{
            "col0":{"cf":"rowkey", "col":"key", "type":"string"},
            "col1":{"cf":"your_column_family", "col":"your_column", "type":"string"}
        }
    }"""
    
    # Read data from HBase
    df = spark.read \
        .option("catalog", catalog) \
        .format('org.apache.spark.sql.execution.datasources.hbase') \
        .load()
    
    df.show()
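
    Writing back with SHC reuses the same catalog. This is a hedged sketch based on the SHC README; the newtable option (a number of regions, passed as a string) only matters if SHC has to create the table for you:

    # Write the DataFrame to HBase through SHC using the same catalog;
    # "newtable" asks SHC to create the table with 5 regions if missing.
    df.write \
        .options(catalog=catalog, newtable="5") \
        .format('org.apache.spark.sql.execution.datasources.hbase') \
        .save()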
    
    
    1 person found this answer helpful.