Access Azure Cosmos DB for Apache Cassandra from Spark on YARN with HDInsight
Article
APPLIES TO:
Cassandra
This article covers how to access Azure Cosmos DB for Apache Cassandra from Spark on YARN with HDInsight-Spark from spark-shell. HDInsight is Microsoft's Hortonworks Hadoop PaaS on Azure. It uses object storage for HDFS and comes in several flavors, including Spark. While this article refers to HDInsight-Spark, it applies to all Hadoop distributions.
API for Cassandra configuration in Spark2. The Spark connector for Cassandra requires that the Cassandra connection details to be initialized as part of the Spark context. When you launch a Jupyter notebook, the spark session and context are already initialized. Don't stop and reinitialize the Spark context unless it's complete with every configuration set as part of the HDInsight default Jupyter notebook start-up. One workaround is to add the Cassandra instance details to Ambari, Spark2 service configuration, directly. This approach is a one-time activity per cluster that requires a Spark2 service restart.
Go to Ambari, Spark2 service and select configs.
Go to custom spark2-defaults and add a new property with the following, and restart Spark2 service:
//1) Create table if it does not exist
val cdbConnector = CassandraConnector(sc)
cdbConnector.withSessionDo(session => session.execute("CREATE TABLE IF NOT EXISTS books_ks.books(book_id TEXT PRIMARY KEY,book_author TEXT, book_name TEXT,book_pub_year INT,book_price FLOAT) WITH cosmosdb_provisioned_throughput=4000;"))
//2) Delete data from potential prior runs
cdbConnector.withSessionDo(session => session.execute("DELETE FROM books_ks.books WHERE book_id IN ('b00300','b00001','b00023','b00501','b09999','b01001','b00999','b03999','b02999','b000009');"))
//3) Generate a few rows
val booksDF = Seq(
("b00001", "Arthur Conan Doyle", "A study in scarlet", 1887,11.33),
("b00023", "Arthur Conan Doyle", "A sign of four", 1890,22.45),
("b01001", "Arthur Conan Doyle", "The adventures of Sherlock Holmes", 1892,19.83),
("b00501", "Arthur Conan Doyle", "The memoirs of Sherlock Holmes", 1893,14.22),
("b00300", "Arthur Conan Doyle", "The hounds of Baskerville", 1901,12.25)
).toDF("book_id", "book_author", "book_name", "book_pub_year","book_price")
//4) Persist
booksDF.write.mode("append").format("org.apache.spark.sql.cassandra").options(Map( "table" -> "books", "keyspace" -> "books_ks", "output.consistency.level" -> "ALL", "ttl" -> "10000000")).save()
//5) Read the data in the table
spark.read.format("org.apache.spark.sql.cassandra").options(Map( "table" -> "books", "keyspace" -> "books_ks")).load.show
Access Azure Cosmos DB for Apache Cassandra from Jupyter notebooks
HDInsight-Spark comes with Zeppelin and Jupyter notebook services. They're both web-based notebook environments that support Scala and Python. Notebooks are great for interactive exploratory analytics and collaboration, but not meant for operational or production processes.
The following Jupyter notebooks can be uploaded into your HDInsight Spark cluster and provide ready samples for working with Azure Cosmos DB for Apache Cassandra. Be sure to review the first notebook 1.0-ReadMe.ipynb to review Spark service configuration for connecting to Azure Cosmos DB for Apache Cassandra.
When you launch Jupyter, navigate to Scala. Create a directory and then upload the notebooks to the directory. The Upload button is on the top, right-hand side.
How to run
Go through the notebooks, and each notebook cell sequentially. Select the Run button at the top of each notebook to run all cells, or Shift+Enter for each cell.
Access with Azure Cosmos DB for Apache Cassandra from your Spark Scala program
For automated processes in production, Spark programs are submitted to the cluster by using spark-submit.
Apache Spark is a core technology for large-scale data analytics. Microsoft Fabric provides support for Spark clusters, enabling you to analyze and process data at scale.