APPLIES TO:
NoSQL
This article focuses on data migration from Amazon DynamoDB to Azure Cosmos DB for NoSQL. Before diving in, it’s important to understand the difference between application and data migration.
The data migration phase typically involves several sub-steps: exporting data from the source system (DynamoDB in this case), applying additional processing such as transformations, and finally writing the data into Azure Cosmos DB. Application migration, on the other hand, involves refactoring your application to use Azure Cosmos DB instead of DynamoDB. This can include adapting and rewriting queries, and redesigning partitioning strategies, indexing, consistency, and other components. Depending on your requirements, data migration can run in parallel with application migration, but it's often a prerequisite to it.
Important
Refer to Migrate your application from Amazon DynamoDB to Azure Cosmos DB to dive into application migration.
There are various migration strategies; two frequently used techniques are offline and online migration. Choose based on your specific requirements. You can implement either one on its own or combine both approaches.
Tip
You could also follow an approach where data is migrated in bulk using an offline process and then switch to an online mode. This might be suitable if you have a need to (temporarily) continue using DynamoDB in parallel with Azure Cosmos DB and want the data to be synchronized in real-time.
The approach followed in this article is just one of many ways to migrate data from DynamoDB to Azure Cosmos DB, and it has its own set of pros and cons. Here is a (non-exhaustive) list of offline migration options:
Approach | Pros | Cons |
---|---|---|
Export from DynamoDB to S3, load to ADLS Gen2 (using ADF), write to Azure Cosmos DB (using Spark on Azure Databricks) | Decouples storage and processing. Spark provides scalability and flexibility (additional data transformations, processing) | Multi-stage process increases complexity, and overall latency. Requires knowledge of Spark (learning curve). |
Export from DynamoDB to S3, use ADF to read from S3 and write to Azure Cosmos DB | Low/No-code approach (Spark skillset not required). Suitable for simple data transformations. | Complex transformations may be difficult to implement. |
Use Spark on Azure Databricks to read from DynamoDB and write to Azure Cosmos DB | Fit for small datasets - direct processing avoids extra storage costs. Supports complex transformations (Spark). | Higher cost on DynamoDB side due to RCU consumption (S3 export not used). Requires knowledge of Spark (learning curve). |
Online migration generally relies on a change data capture (CDC) mechanism to stream data changes from DynamoDB. These streams tend to be real-time (or near real-time), and you need to build another component that processes the streaming data and writes it to Azure Cosmos DB (a minimal sketch of such a component follows the table below). Here is a (non-exhaustive) list of options:
Approach | Pros | Cons |
---|---|---|
DynamoDB change data capture with DynamoDB Streams, process using AWS Lambda and write to Azure Cosmos DB | DynamoDB Streams provides ordering guarantee. Event-driven processing. Suitable for simple data transformations. | DynamoDB Streams data retention for 24 hours. Need to write custom logic (Lambda function). |
DynamoDB change data capture with Kinesis Data Streams, process using Kinesis or Flink and write to Azure Cosmos DB | Supports complex data transformations (windowing/aggregation with Flink), better control over processing. Retention is flexible (from 24 hours, extendable to 365 days). | No ordering guarantee. Need to write custom logic (Flink job, Kinesis data stream consumer). Requires stream processing expertise. |
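To make the "custom logic (Lambda function)" part of the first option concrete, here is a minimal, illustrative sketch of a Python Lambda handler; it isn't part of the article's sample code. It assumes the stream is configured with new images (`NEW_IMAGE` or `NEW_AND_OLD_IMAGES`), that the `azure-cosmos` package is bundled with the function, and that the endpoint, key, and database/container names are placeholders you replace.

```python
import os
from decimal import Decimal

from boto3.dynamodb.types import TypeDeserializer  # converts DynamoDB-typed JSON into plain Python values
from azure.cosmos import CosmosClient

# Assumed environment variables and resource names - adapt to your own configuration.
COSMOS_ENDPOINT = os.environ["COSMOS_ENDPOINT"]
COSMOS_KEY = os.environ["COSMOS_KEY"]

deserializer = TypeDeserializer()
client = CosmosClient(COSMOS_ENDPOINT, credential=COSMOS_KEY)
container = client.get_database_client("migrated-db").get_container_client("migrated-container")


def _to_json_safe(value):
    """Convert types produced by TypeDeserializer (Decimal, set) into JSON-serializable ones."""
    if isinstance(value, Decimal):
        return float(value)
    if isinstance(value, set):
        return [_to_json_safe(v) for v in value]
    if isinstance(value, dict):
        return {k: _to_json_safe(v) for k, v in value.items()}
    if isinstance(value, list):
        return [_to_json_safe(v) for v in value]
    return value


def lambda_handler(event, context):
    """Upsert INSERT/MODIFY records from a DynamoDB Streams batch into Azure Cosmos DB."""
    records = event.get("Records", [])
    for record in records:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue  # DELETE events would need separate handling
        image = record["dynamodb"]["NewImage"]
        item = {k: _to_json_safe(deserializer.deserialize(v)) for k, v in image.items()}
        # Azure Cosmos DB requires a string 'id' property; here it's derived from the
        # DynamoDB key attributes - choose a mapping that fits your partitioning design.
        keys = {k: deserializer.deserialize(v) for k, v in record["dynamodb"]["Keys"].items()}
        item["id"] = "_".join(str(v) for v in keys.values())
        container.upsert_item(item)
    return {"processed": len(records)}
```

The same pattern applies to the Kinesis Data Streams option; only the event shape and the hosting component (for example, a Flink job) change.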
Important
It's recommended that you thoroughly evaluate and test the options in a proof-of-concept phase before the actual migration. This helps you assess complexity and feasibility, and fine-tune your migration plan.
This section covers how to use Azure Data Factory, Azure Data Lake Storage, and Spark on Azure Databricks for data migration.
Before you proceed, make sure you complete the following:
Tip
If you're looking to try this out in a new DynamoDB table, you can use this data loader utility to populate your table with sample data.
DynamoDB S3 export is a built-in solution for exporting DynamoDB data to an Amazon S3 bucket. Follow the DynamoDB documentation for steps on how to execute this process, including setting up necessary S3 permissions. DynamoDB supports DynamoDB JSON and Amazon Ion as the file formats for exported data.
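If you prefer to script the export instead of using the AWS console, the following is a minimal sketch using boto3 (the AWS SDK for Python). The table ARN and bucket name are placeholders, and point-in-time recovery must be enabled on the table for this API to work.

```python
import boto3

# Placeholders - replace with your own table ARN and S3 bucket name.
TABLE_ARN = "arn:aws:dynamodb:<region>:<account-id>:table/<table-name>"
EXPORT_BUCKET = "<s3-bucket-name>"

dynamodb = boto3.client("dynamodb")

# Start a full export of the table to S3 in DynamoDB JSON format.
# Point-in-time recovery (PITR) must be enabled on the source table.
response = dynamodb.export_table_to_point_in_time(
    TableArn=TABLE_ARN,
    S3Bucket=EXPORT_BUCKET,
    ExportFormat="DYNAMODB_JSON",
)

export = response["ExportDescription"]
print("Export ARN:", export["ExportArn"])
print("Status:", export["ExportStatus"])  # the export runs asynchronously; poll until COMPLETED
```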
Note
This sample uses data exported in DynamoDB JSON format.
Once the data is exported to an S3 bucket, proceed to the next step.
Clone the GitHub repository to your local machine. It contains the Azure Data Factory pipeline template and the Spark notebook used later on in this article.
git clone https://github.com/AzureCosmosDB/migration-dynamodb-to-cosmosdb-nosql
In the Azure portal, navigate to the Azure Data Factory instance created earlier, select Launch Studio to open Azure Data Factory Studio, and complete the following steps:
In Azure Data Factory Studio, create a new pipeline from the pipeline template in the cloned repository. A Data Factory pipeline is a logical grouping of activities that together perform a task.
In the configuration, select the linked services that you created for Amazon S3 and ADLS Gen2, and choose Use this template to create the pipeline.
Select the pipeline, navigate to Source, and edit the source (Amazon S3) dataset.
In File path, enter the path to the exported files in your Amazon S3 bucket.
Important
You can also edit the sink dataset to update the name of the storage container in which the data from S3 will be stored. If not, the container is named `s3datacopy` by default.
Once the changes are complete, choose Publish all to publish the pipeline, and then trigger the pipeline manually from Azure Data Factory Studio.
As the pipeline continues to execute, you can monitor it. Once it completes successfully, check the list of containers in the Azure Storage account created earlier.
Verify that a new container was created and that it contains the data copied from the S3 bucket.
This section covers how to use the Azure Cosmos DB Spark connector to write data to Azure Cosmos DB. The Azure Cosmos DB OLTP Spark connector provides Apache Spark support for the Azure Cosmos DB for NoSQL API. It allows you to read from and write to Azure Cosmos DB via Apache Spark DataFrames in Python and Scala.
Start by creating an Azure Databricks workspace. Make sure to review the compatibility matrix in terms of versions of various components including the Azure Cosmos DB Spark connector, Apache Spark, JVM, Scala, and Databricks Runtime. Refer to this documentation for an exhaustive list.
Once the Databricks workspace is created, follow the documentation to install the appropriate connector version. The rest of the steps in this article work with connector version `4.36.0`, Spark `3.5.0`, and Databricks Runtime `15.4` (with Scala `2.12`). Here are the Maven coordinates of the connector: `com.azure.cosmos.spark:azure-cosmos-spark_3-5_2-12:4.36.0`.
The GitHub repository contains a notebook (`migration.ipynb`) with the Spark code to read data from ADLS Gen2 and write it to Azure Cosmos DB. Import the notebook into your Databricks workspace.
Use OAuth 2.0 credentials with Microsoft Entra ID service principals to connect to Azure Storage from Azure Databricks. Follow the steps in the documentation to register a Microsoft Entra ID application (service principal) and assign the `Storage Blob Data Reader` role on the storage account to the application you created.

Follow these steps to configure Microsoft Entra ID authentication for Azure Cosmos DB:
In the Azure Cosmos DB account, under Access control (IAM), assign the `Cosmos DB Operator` role to the Microsoft Entra ID application you created.
Use the Azure CLI to create the Azure Cosmos DB role definition and get the role definition ID (if you don't have the Azure CLI set up, you can use Azure Cloud Shell directly from the Azure portal instead).
az cosmosdb sql role definition create --resource-group "<resource-group-name>" --account-name "<account-name>" --body '{
"RoleName": "<role-definition-name>",
"Type": "CustomRole",
"AssignableScopes": ["/"],
"Permissions": [{
"DataActions": [
"Microsoft.DocumentDB/databaseAccounts/readMetadata",
"Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers/items/*",
"Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers/*"
]
}]
}'
# List the role definitions to fetch the unique identifier of the one you created. Record the id value from the JSON output.
az cosmosdb sql role definition list --resource-group "<resource-group-name>" --account-name "<account-name>"
Once you have created the role definition and obtained the role definition ID, use the following command to get the service principal ID associated with the Microsoft Entra ID application. Replace `AppId` with the application (client) ID of the Microsoft Entra ID application:
SP_ID=$(az ad sp list --filter "appId eq '{AppId}'" | jq -r '.[0].id')
Now, create the role assignment using the following command. Make sure to replace the resource group name, Azure Cosmos DB account name, and role definition ID.
az cosmosdb sql role assignment create --resource-group <enter resource group name> --account-name <enter cosmosdb account name> --scope "/" --principal-id $SP_ID --role-definition-id <enter role definition ID>
Run the first two steps to install required dependencies:
pip install azure-cosmos azure-mgmt-cosmosdb azure.mgmt.authorization
dbutils.library.restartPython()
The third step reads the DynamoDB data from ADLS Gen2 and stores it in a DataFrame. Before running it, replace the following information with the corresponding values in your setup (a sketch of how these values are typically used follows the table):
Variable | Description |
---|---|
`storage_account_name` | Azure storage account name |
`container_name` | Azure storage container name, for example, `s3datacopy` |
`file_path` | Path to the folder containing the exported JSON file(s) in the Azure storage container, for example, `AWSDynamoDB/01738047791106-7ba095a9/data/*` |
`client_id` | The application (client) ID of the Microsoft Entra ID application (found on the Overview page) |
`tenant_id` | The directory (tenant) ID of the Microsoft Entra ID application (found on the Overview page) |
`client_secret` | Value of the client secret associated with the Microsoft Entra ID application (found in Certificates & secrets) |
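For reference, a minimal sketch of how values like these are typically wired into a Spark session for OAuth access to ADLS Gen2 is shown below; the exact cell in `migration.ipynb` may differ, and all values are placeholders.

```python
# Placeholders - replace with the values described in the table above.
storage_account_name = "<storage-account-name>"
container_name = "s3datacopy"
file_path = "AWSDynamoDB/<export-id>/data/*"
client_id = "<entra-application-client-id>"
tenant_id = "<entra-tenant-id>"
client_secret = "<entra-client-secret>"

# Configure OAuth (service principal) access to ADLS Gen2 for this Spark session.
spark.conf.set(f"fs.azure.account.auth.type.{storage_account_name}.dfs.core.windows.net", "OAuth")
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{storage_account_name}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account_name}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account_name}.dfs.core.windows.net", client_secret)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{storage_account_name}.dfs.core.windows.net",
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
)

# Read the exported DynamoDB JSON files from ADLS Gen2 into a DataFrame.
abfss_path = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/{file_path}"
df = spark.read.json(abfss_path)
df.printSchema()
```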
Note
If necessary, you can run the next cell (step 4) to execute any data transformations or implement custom logic. For example, this could be adding an `id` field to your data before writing it to Azure Cosmos DB.
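As an illustration only (not the notebook's code), a transformation cell along these lines could derive an `id` field from an existing attribute; the attribute name used here is an assumption, so adjust it to your schema.

```python
from pyspark.sql import functions as F

# Assumption: the DataFrame exposes a partition key attribute named 'pk'.
# Depending on how the exported DynamoDB JSON was read, the column may be
# nested (for example, Item.pk.S) - adjust the column reference accordingly.
# Azure Cosmos DB requires every item to have a string 'id' property.
df = df.withColumn("id", F.col("pk").cast("string"))

# Alternatively, generate a random unique id when no natural key is suitable:
# df = df.withColumn("id", F.expr("uuid()"))
```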
Run step 5 to create the Azure Cosmos DB database and container. This is done using the Catalog API of the Azure Cosmos DB Spark connector. Replace the following information with the corresponding values in your setup (a sketch of the Catalog API call follows the table):
Variable | Description |
---|---|
`cosmosEndpoint` | URI of the Cosmos DB account |
`cosmosDatabaseName` | Name of the Cosmos DB database you want to create |
`cosmosContainerName` | Name of the Cosmos DB container you want to create |
`subscriptionId` | Azure Subscription ID |
`resourceGroupName` | Cosmos DB resource group name |
`partitionKeyPath` | Partition key for the container, for example, `/id` |
`throughput` | Container throughput, for example, `1000`. Be mindful of the throughput you associate with the container; you may need to adjust it depending on the volume of data to be migrated. |
`client_id` | The application (client) ID of the Microsoft Entra ID application (found on the Overview page) |
`tenant_id` | The directory (tenant) ID of the Microsoft Entra ID application (found on the Overview page) |
`client_secret` | Value of the client secret associated with the Microsoft Entra ID application (found in Certificates & secrets) |
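As a rough sketch of what such a Catalog API cell can look like with service principal authentication (the notebook's actual cell may differ; all values are placeholders, and `client_id`, `tenant_id`, and `client_secret` are reused from the earlier sketch):

```python
# Placeholders - replace with the values described in the table above.
cosmosEndpoint = "https://<cosmos-account-name>.documents.azure.com:443/"
cosmosDatabaseName = "<database-name>"
cosmosContainerName = "<container-name>"
subscriptionId = "<azure-subscription-id>"
resourceGroupName = "<cosmos-resource-group-name>"
partitionKeyPath = "/id"
throughput = "1000"

# Register the Azure Cosmos DB catalog with service principal (Microsoft Entra ID) authentication.
spark.conf.set("spark.sql.catalog.cosmosCatalog", "com.azure.cosmos.spark.CosmosCatalog")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountEndpoint", cosmosEndpoint)
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.auth.type", "ServicePrincipal")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.account.subscriptionId", subscriptionId)
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.account.tenantId", tenant_id)
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.account.resourceGroupName", resourceGroupName)
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.auth.aad.clientId", client_id)
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.auth.aad.clientSecret", client_secret)

# Create the database and the container (with partition key and manual throughput) through the catalog.
spark.sql(f"CREATE DATABASE IF NOT EXISTS cosmosCatalog.{cosmosDatabaseName};")
spark.sql(
    f"CREATE TABLE IF NOT EXISTS cosmosCatalog.{cosmosDatabaseName}.{cosmosContainerName} "
    f"USING cosmos.oltp "
    f"TBLPROPERTIES(partitionKeyPath = '{partitionKeyPath}', manualThroughput = '{throughput}')"
)
```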
Finally, run the last step (step 6) to write data to Azure Cosmos DB. Replace the following information with the corresponding values in your setup (a sketch of the write call follows the table):
Variable | Description |
---|---|
`cosmosEndpoint` | URI of the Cosmos DB account |
`cosmosDatabaseName` | Name of the Cosmos DB database you want to create |
`cosmosContainerName` | Name of the Cosmos DB container you want to create |
`subscriptionId` | Azure Subscription ID |
`resourceGroupName` | Cosmos DB resource group name |
`client_id` | The application (client) ID of the Microsoft Entra ID application (found on the Overview page) |
`tenant_id` | The directory (tenant) ID of the Microsoft Entra ID application (found on the Overview page) |
`client_secret` | Value of the client secret associated with the Microsoft Entra ID application (found in Certificates & secrets) |
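For orientation, a write along these lines is roughly what step 6 performs with the Azure Cosmos DB Spark connector (a sketch, not the notebook verbatim; it reuses the placeholder variables from the earlier sketches):

```python
# Connector configuration for the write, using service principal (Microsoft Entra ID) authentication.
write_config = {
    "spark.cosmos.accountEndpoint": cosmosEndpoint,
    "spark.cosmos.auth.type": "ServicePrincipal",
    "spark.cosmos.account.subscriptionId": subscriptionId,
    "spark.cosmos.account.tenantId": tenant_id,
    "spark.cosmos.account.resourceGroupName": resourceGroupName,
    "spark.cosmos.auth.aad.clientId": client_id,
    "spark.cosmos.auth.aad.clientSecret": client_secret,
    "spark.cosmos.database": cosmosDatabaseName,
    "spark.cosmos.container": cosmosContainerName,
}

# Append the DataFrame prepared in the earlier steps to the Azure Cosmos DB container.
(
    df.write
    .format("cosmos.oltp")
    .options(**write_config)
    .mode("APPEND")
    .save()
)
```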
After the cell execution completes, check the Azure Cosmos DB container to verify that the data has been migrated successfully.