Access Azure Cosmos DB for Apache Cassandra data from Azure Databricks
APPLIES TO: Cassandra
This article details how to work with Azure Cosmos DB for Apache Cassandra from Spark on Azure Databricks.
Prerequisites
Review the basics of connecting to Azure Cosmos DB for Apache Cassandra
API for Cassandra instance configuration for the Cassandra connector:
The connector for the API for Cassandra requires the Cassandra connection details to be initialized as part of the Spark context. When you launch a Databricks notebook, the Spark context is already initialized, and it isn't advisable to stop and reinitialize it. One solution is to add the API for Cassandra instance configuration at the cluster level, in the cluster Spark configuration. It's a one-time activity per cluster. Add the following code to the Spark configuration as space-separated key-value pairs:
```
spark.cassandra.connection.host YOUR_COSMOSDB_ACCOUNT_NAME.cassandra.cosmosdb.azure.com
spark.cassandra.connection.port 10350
spark.cassandra.connection.ssl.enabled true
spark.cassandra.auth.username YOUR_COSMOSDB_ACCOUNT_NAME
spark.cassandra.auth.password YOUR_COSMOSDB_KEY
```
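If you prefer not to modify the cluster configuration, the same settings can typically be applied at session scope from a notebook cell when you use the DataFrame API with connector 3.x. The following is a minimal sketch, assuming the same placeholder account name and key as above:

```scala
// Session-scoped alternative to the cluster Spark configuration (a sketch;
// substitute your own account name and key). The DataFrame API of the
// Cassandra Spark connector 3.x reads these settings at query time, so the
// already-running Spark context doesn't need to be restarted.
spark.conf.set("spark.cassandra.connection.host", "YOUR_COSMOSDB_ACCOUNT_NAME.cassandra.cosmosdb.azure.com")
spark.conf.set("spark.cassandra.connection.port", "10350")
spark.conf.set("spark.cassandra.connection.ssl.enabled", "true")
spark.conf.set("spark.cassandra.auth.username", "YOUR_COSMOSDB_ACCOUNT_NAME")
spark.conf.set("spark.cassandra.auth.password", "YOUR_COSMOSDB_KEY")
```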
Add the required dependencies
Cassandra Spark connector: To integrate Azure Cosmos DB for Apache Cassandra with Spark, attach the Cassandra connector to the Azure Databricks cluster. To attach it:
- Review the Databricks runtime version and the Spark version. Then find the Maven coordinates that are compatible with the Cassandra Spark connector, and attach the connector to the cluster. See the "Upload a Maven package or Spark package" article to attach the connector library to the cluster. We recommend selecting Databricks runtime version 10.4 LTS, which supports Spark 3.2.1. To add the Apache Spark Cassandra Connector to your cluster, select Libraries > Install New > Maven, and then add
com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.2.0
in Maven coordinates. If you're using Spark 2.x, we recommend an environment with Spark version 2.4.5 and the Spark connector at Maven coordinates com.datastax.spark:spark-cassandra-connector_2.11:2.4.3.
Azure Cosmos DB for Apache Cassandra-specific library: If you're using Spark 2.x, a custom connection factory is required to configure the retry policy from the Cassandra Spark connector to Azure Cosmos DB for Apache Cassandra. Add the
com.microsoft.azure.cosmosdb:azure-cosmos-cassandra-spark-helper:1.2.0
maven coordinates to attach the library to the cluster.
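After the helper library is attached (Spark 2.x only), point the connector at the custom connection factory by adding one more space-separated pair to the cluster Spark configuration. This is a sketch based on the helper library's documented factory class; verify the class name against the library version you attach:

```
spark.cassandra.connection.factory com.microsoft.azure.cosmosdb.cassandra.CosmosDbConnectionFactory
```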
Note
If you are using Spark 3.x, you do not need to install the Azure Cosmos DB for Apache Cassandra-specific library mentioned above.
Warning
The Spark 3 samples shown in this article have been tested with Spark version 3.2.1 and the corresponding Cassandra Spark Connector com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.2.0. Later versions of Spark and/or the Cassandra connector may not function as expected.
Sample notebooks
A list of Azure Databricks sample notebooks is available in the GitHub repo for you to download. These samples show how to connect to Azure Cosmos DB for Apache Cassandra from Spark and perform different CRUD operations on the data. You can also import all the notebooks into your Databricks cluster workspace and run them.
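As a taste of what the notebooks cover, here's a minimal read in Scala. The keyspace and table names (books_ks, books) are hypothetical placeholders, and the cluster-level connection configuration shown earlier is assumed to be in place:

```scala
// Minimal DataFrame read against the API for Cassandra (a sketch).
// "books_ks" and "books" are hypothetical; substitute your own keyspace/table.
val booksDF = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "books_ks", "table" -> "books"))
  .load()

booksDF.show(10)
```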
Accessing Azure Cosmos DB for Apache Cassandra from Spark Scala programs
Spark programs to be run as automated processes on Azure Databricks are submitted to the cluster by using spark-submit and scheduled to run through Azure Databricks jobs.
The following links help you get started building Spark Scala programs that interact with Azure Cosmos DB for Apache Cassandra; a minimal program sketch follows the list.
- How to connect to Azure Cosmos DB for Apache Cassandra from a Spark Scala program
- How to run a Spark Scala program as an automated job on Azure Databricks
- Complete list of code samples for working with API for Cassandra
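For orientation, here's a minimal sketch of such a standalone program, assuming the same placeholder connection values used earlier and a hypothetical books_ks.books table:

```scala
import org.apache.spark.sql.SparkSession

// A minimal standalone Spark Scala program (a sketch, not a complete sample).
// Connection values mirror the cluster configuration shown earlier;
// the keyspace and table names are hypothetical.
object CosmosCassandraJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cosmos-cassandra-sample")
      .config("spark.cassandra.connection.host", "YOUR_COSMOSDB_ACCOUNT_NAME.cassandra.cosmosdb.azure.com")
      .config("spark.cassandra.connection.port", "10350")
      .config("spark.cassandra.connection.ssl.enabled", "true")
      .config("spark.cassandra.auth.username", "YOUR_COSMOSDB_ACCOUNT_NAME")
      .config("spark.cassandra.auth.password", "YOUR_COSMOSDB_KEY")
      .getOrCreate()

    // Read a table and print the row count.
    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "books_ks", "table" -> "books"))
      .load()
    println(s"Row count: ${df.count()}")

    spark.stop()
  }
}
```

Package the program as a JAR with the Cassandra connector on the classpath, then submit it with spark-submit or configure it as an Azure Databricks job.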
Next steps
Get started with creating an API for Cassandra account, database, and a table by using a Java application.