Migrate to Azure Managed Instance for Apache Cassandra using Apache Spark
Article
Where possible, we recommend using Apache Cassandra native replication to migrate data from your existing cluster into Azure Managed Instance for Apache Cassandra by configuring a hybrid cluster. This approach will use Apache Cassandra's gossip protocol to replicate data from your source data-center into your new managed instance datacenter. However, there may be some scenarios where your source database version isn't compatible, or a hybrid cluster setup is otherwise not feasible.
This tutorial describes how to migrate data to Migrate to Azure Managed Instance for Apache Cassandra in an offline fashion using the Cassandra Spark Connector, and Azure Databricks for Apache Spark.
Ensure you've already migrated the keyspace/table scheme from your source Cassandra database to your target Cassandra Managed Instance database.
Provision an Azure Databricks cluster
We recommend selecting Databricks runtime version 7.5, which supports Spark 3.0.
Add dependencies
Add the Apache Spark Cassandra Connector library to your cluster to connect to both native and Azure Cosmos DB Cassandra endpoints. In your cluster, select Libraries > Install New > Maven, and then add com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.0.0 in Maven coordinates.
Select Install, and then restart the cluster when installation is complete.
Note
Make sure that you restart the Databricks cluster after the Cassandra Connector library has been installed.
Create Scala Notebook for migration
Create a Scala Notebook in Databricks. Replace your source and target Cassandra configurations with the corresponding credentials, and source and target keyspaces and tables. Then run the following code:
Demonstrate understanding of common data engineering tasks to implement and manage data engineering workloads on Microsoft Azure, using a number of Azure services.
Learn about Azure Managed Instance for Apache Cassandra. This service manages the deployment and scaling of native open-source instances of Apache Cassandra in Azure.
Frequently asked questions about Azure Managed Instance for Apache Cassandra. This article addresses questions on when to use managed instances, benefits, throughput limits, supported regions, and other configuration details.
Learn about the management operations supported by Azure Managed Instance for Apache Cassandra. It also explains separation of responsibilities between the Azure support team and customers when maintaining standalone and hybrid clusters.
Learn about database security best practices and key features offered by Azure Managed Instance for Apache Cassandra. These help you prevent, detect, and respond to database breaches.