Pre-migration steps for data migrations from MongoDB to Azure Cosmos DB's API for MongoDB

APPLIES TO: MongoDB

Important

Please read this entire guide before carrying out your pre-migration steps.

This MongoDB pre-migration guide is part of a series on MongoDB migration. The critical MongoDB migration steps are pre-migration, migration, and post-migration, as shown below.

Diagram of migration steps.

Overview of pre-migration

It's critical to carry out certain up-front planning and decision-making about your migration before you actually move any data. This initial decision-making process is the "pre-migration".

Your goal in pre-migration is to:

  1. Ensure that you set up Azure Cosmos DB to fulfill your application's requirements, and
  2. Plan out how you will execute the migration.

Follow these steps to perform a thorough pre-migration:

  1. Discover your existing MongoDB resources and create a data estate spreadsheet to track them
  2. Assess the readiness of your existing MongoDB resources for data migration
  3. Map your existing MongoDB resources to new Azure Cosmos DB resources
  4. Plan the logistics of the migration process end-to-end, before you kick off the full-scale data migration

Then, execute your migration in accordance with your pre-migration plan.

Finally, perform the critical post-migration steps of cut-over and optimization.

All of the above steps are critical for ensuring a successful migration.

When you plan a migration, we recommend that whenever possible you plan at the per-resource level.

The Database Migration Assistant (DMA) assists you with the discovery and assessment stages of planning.

Pre-migration discovery

The first pre-migration step is resource discovery. In this step, you need to create a data estate migration spreadsheet.

  • This sheet contains a comprehensive list of the existing resources (databases or collections) in your MongoDB data estate.
  • The purpose of this spreadsheet is to enhance your productivity and help you to plan migration from end-to-end.
  • We recommend that you extend this document and use it as a tracking document throughout the migration process.

Programmatic discovery using the Database Migration Assistant

You may use the Database Migration Assistant (DMA) to assist you with the discovery stage and create the data estate migration sheet programmatically.

It's easy to set up and run DMA through an Azure Data Studio client, and it can be run from any machine connected to your source MongoDB environment.

You can use either one of the following DMA output files as the data estate migration spreadsheet:

  • workload_database_details.csv - Gives a database-level view of the source workload. Columns in the file are: Database Name, Collection count, Document Count, Average Document Size, Data Size, Index Count, and Index Size.
  • workload_collection_details.csv - Gives a collection-level view of the source workload. Columns in the file are: Database Name, Collection Name, Doc Count, Average Document Size, Data size, Index Count, Index Size, and Index definitions.

Here's a sample database-level migration spreadsheet created by DMA:

Data estate spreadsheet example.

Manual discovery

Alternatively, you can refer to the sample spreadsheet above and create a similar document yourself.

  • The spreadsheet should be structured as a record of your data estate resources, in list form.
  • Each row corresponds to a resource (database or collection).
  • Each column corresponds to a property of the resource; start with at least name and data size (GB) as columns.
  • As you progress through this guide, you'll build this spreadsheet into a tracking document for your end-to-end migration planning, adding columns as needed.
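As a sketch, the initial name and data-size columns can be assembled into CSV text with a few lines of Python; the resource names and sizes below are hypothetical examples, not values from any real deployment:

```python
import csv
import io

# Hypothetical discovery results; in practice, gather these numbers from
# your source MongoDB deployment (for example, via db.stats() per database).
resources = [
    {"database": "sales", "collection": "orders", "data_size_gb": 12.4},
    {"database": "sales", "collection": "customers", "data_size_gb": 1.8},
    {"database": "telemetry", "collection": "events", "data_size_gb": 94.0},
]

def build_spreadsheet(rows):
    """Render the resource list as CSV text, one row per resource."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["database", "collection", "data_size_gb"]
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

print(build_spreadsheet(resources))
```

You can add columns to this structure as your planning progresses, exactly as the tracking-document guidance above suggests.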


Pre-migration assessment

Second, as a prelude to planning your migration, assess the readiness of resources in your data estate for migration.

Assessment involves finding out whether you're using the features and syntax that are supported. It also includes making sure you're adhering to the limits and quotas. The aim of this stage is to create a list of incompatibilities and warnings, if any. After you have the assessment results, you can try to address the findings during the rest of the migration planning.
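To make the idea of an assessment finding concrete, here's a minimal, hypothetical rule (not the DMA's actual rule set) that flags collections whose data size exceeds the 20-GB unsharded collection limit discussed later in this guide:

```python
# Unsharded Azure Cosmos DB collections are limited to 20 GB, so larger
# source collections must become sharded collections after migration.
UNSHARDED_LIMIT_GB = 20.0

def assess(resources):
    """Return a list of warning strings for the given resource rows."""
    warnings = []
    for r in resources:
        if r["data_size_gb"] > UNSHARDED_LIMIT_GB:
            warnings.append(
                f"{r['database']}.{r['collection']}: {r['data_size_gb']} GB "
                "exceeds the unsharded limit; plan a sharded collection "
                "and choose a shard key."
            )
    return warnings

# Hypothetical rows from the data estate spreadsheet.
findings = assess([
    {"database": "sales", "collection": "orders", "data_size_gb": 12.4},
    {"database": "telemetry", "collection": "events", "data_size_gb": 94.0},
])
```

A real assessment (DMA plus a manual review of supported features and quotas) covers far more rules; this only illustrates the shape of the output.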

Programmatic assessment using the Database Migration Assistant

Database Migration Assistant (DMA) also assists you with the assessment stage of pre-migration planning.

Refer to the section Programmatic discovery using the Database Migration Assistant to learn how to set up and run DMA.

The DMA notebook runs a few assessment rules against the resource list it gathers from source MongoDB. The assessment result lists the required and recommended changes needed to proceed with the migration.

The results are printed as output in the DMA notebook and saved to a CSV file, assessment_result.csv.

Note

The Database Migration Assistant is a preliminary utility meant to assist you with the pre-migration steps; it does not perform an end-to-end assessment. In addition to running the DMA, we also recommend that you review the supported features and syntax and the Azure Cosmos DB limits and quotas in detail, and that you perform a proof of concept before the actual migration.

Pre-migration mapping

With the discovery and assessment steps complete, you're done with the MongoDB side of the equation. Now it's time to plan the Azure Cosmos DB side. How will you set up and configure your production Azure Cosmos DB resources? Do your planning at a per-resource level, which means you should add the following columns to your planning spreadsheet:

  • Azure Cosmos DB mapping
  • Shard key
  • Data model
  • Dedicated vs shared throughput

More detail is provided in the following sections.

Capacity planning

Trying to do capacity planning for a migration to Azure Cosmos DB?

Considerations when using Azure Cosmos DB's API for MongoDB

Before you plan your Azure Cosmos DB data estate, make sure you understand the following Azure Cosmos DB concepts:

  • Capacity model: Database capacity in Azure Cosmos DB is based on a throughput model measured in Request Units per second, a unit that represents the number of database operations that can be executed against a collection per second. This capacity can be allocated at the database or collection level, and it can be provisioned either as standard (manual) provisioned throughput or as autoscale provisioned throughput.

  • Request Units: Every database operation has an associated Request Unit (RU) cost in Azure Cosmos DB. When an operation executes, its cost is subtracted from the RUs available in that second. If a request requires more RUs than are currently available, there are two ways to resolve the issue: increase the provisioned RU/s, or wait until the next second starts and then retry the operation.

  • Elastic capacity: The capacity for a given collection or database can change at any time. This allows the database to elastically adapt to the throughput requirements of your workload.

  • Automatic sharding: Azure Cosmos DB provides an automatic partitioning system that only requires a shard key (or partition key). The automatic partitioning mechanism is shared across all the Azure Cosmos DB APIs, and it allows for seamless data and throughput scaling through horizontal distribution.
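The per-second RU budget described above can be illustrated with a toy simulation; the provisioned throughput and per-request charges below are made-up numbers, not real service charges:

```python
def simulate_second(provisioned_rus, request_charges):
    """Process requests against one second's RU budget.

    Returns (succeeded, throttled): charges that fit within the budget,
    and charges that must wait for a later second (or more RU/s).
    """
    remaining = provisioned_rus
    succeeded, throttled = [], []
    for charge in request_charges:
        if charge <= remaining:
            remaining -= charge
            succeeded.append(charge)
        else:
            # Not enough budget left this second: retry next second,
            # or raise the provisioned RU/s.
            throttled.append(charge)
    return succeeded, throttled

# 400 RU/s provisioned; hypothetical per-request charges.
ok, retry = simulate_second(400, [10.1, 250.0, 150.0, 5.0])
```

Here the 150-RU request exceeds what's left of the 400-RU budget after the first two requests, so it would be throttled and retried, while the cheap 5-RU request still fits.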

Plan the Azure Cosmos DB data estate

Figure out what Azure Cosmos DB resources you'll create. This means stepping through your data estate migration spreadsheet and mapping each existing MongoDB resource to a new Azure Cosmos DB resource.

  • Anticipate that each MongoDB database will become an Azure Cosmos DB database.
  • Anticipate that each MongoDB collection will become an Azure Cosmos DB collection.
  • Choose a naming convention for your Azure Cosmos DB resources. Barring any change in the structure of databases and collections, keeping the same resource names is usually a fine choice.
  • Determine whether you'll be using sharded or unsharded collections in Azure Cosmos DB. The unsharded collection limit is 20 GB. Sharding, on the other hand, helps achieve horizontal scale that is critical to the performance of many workloads.
  • If using sharded collections, do not assume that your MongoDB collection shard key becomes your Azure Cosmos DB collection shard key. Do not assume that your existing MongoDB data model/document structure is what you'll employ on Azure Cosmos DB.
    • The shard key is the single most important setting for optimizing the scalability and performance of Azure Cosmos DB, and data modeling is the second most important. Both of these settings are immutable; they cannot be changed once set, so it's important to optimize them in the planning phase. Follow the guidance in the Immutable decisions section for more information.
  • Azure Cosmos DB does not recognize certain MongoDB collection types such as capped collections. For these resources, just create normal Azure Cosmos DB collections.
  • Azure Cosmos DB has two collection types of its own: shared and dedicated throughput. Shared vs dedicated throughput is another critical, immutable decision that is vital to make in the planning phase. Follow the guidance in the Immutable decisions section for more information.
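A minimal sketch of how these mapping decisions might be recorded per spreadsheet row, assuming a size-based rule for the sharded/unsharded decision (collections over 20 GB must be sharded; smaller ones may still benefit from sharding):

```python
def plan_mapping(resource):
    """Sketch the new planning columns for one spreadsheet row."""
    # Over 20 GB, an Azure Cosmos DB collection must be sharded; smaller
    # collections may still be sharded for scale-out, but that's a choice.
    must_shard = resource["data_size_gb"] > 20.0
    return {
        # Keep the same database.collection names, per the naming guidance.
        "cosmos_mapping": f"{resource['database']}.{resource['collection']}",
        "sharded": must_shard,
        # Shard key and throughput mode are immutable decisions; record
        # them as open items until they're deliberately chosen.
        "shard_key": "TBD (immutable)" if must_shard else None,
        "throughput": "TBD: shared vs dedicated (immutable)",
    }

row = plan_mapping(
    {"database": "telemetry", "collection": "events", "data_size_gb": 94.0}
)
```

The point of the "TBD (immutable)" markers is to keep the open, irreversible decisions visible in the tracking spreadsheet until they're settled.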

Immutable decisions

The following Azure Cosmos DB configuration choices cannot be modified or undone once you've created an Azure Cosmos DB resource, so it's important to get them right during pre-migration planning, before you kick off any migrations. These include your choice of shard key, your data model, and your choice of shared vs dedicated throughput.

Cost of ownership

Estimating throughput

  • In Azure Cosmos DB, the throughput is provisioned in advance and is measured in Request Units (RUs) per second. Unlike VMs or on-premises servers, RUs are easy to scale up and down at any time. You can change the number of provisioned RUs instantly. For more information, see Request units in Azure Cosmos DB.

  • You can use the Azure Cosmos DB Capacity Calculator to determine the amount of Request Units based on your database account configuration, amount of data, document size, and required reads and writes per second.

  • The following are key factors that affect the number of required RUs:

    • Document size: As the size of an item/document increases, the number of RUs consumed to read or write the item/document also increases.

    • Document property count: The number of RUs consumed to create or update a document is related to the number, complexity, and length of its properties. You can reduce the request unit consumption of write operations by limiting the number of indexed properties.

    • Query patterns: The complexity of a query affects how many request units are consumed by the query.

  • The best way to understand the cost of queries is to load sample data into Azure Cosmos DB and run sample queries from the MongoDB shell. You can then use the getLastRequestStatistics command to get the request charge, which reports the number of RUs consumed:

    db.runCommand({getLastRequestStatistics: 1})

    This command outputs a JSON document similar to the following:

    {
        "_t": "GetRequestStatisticsResponse",
        "ok": 1,
        "CommandName": "find",
        "RequestCharge": 10.1,
        "RequestDurationInMilliSeconds": 7.2
    }

  • You can also use diagnostic settings to understand the frequency and patterns of the queries executed against Azure Cosmos DB. The results from the diagnostic logs can be sent to a storage account, an Azure Event Hubs instance, or Azure Log Analytics.
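From application code, you can read the request charge out of that response document. The sketch below parses the sample response shown above; against a live account you would obtain the document with a driver call such as pymongo's db.command (connection details are omitted here and assumed):

```python
# The sample getLastRequestStatistics response, as a Python dict. With
# pymongo you would obtain it from a live Cosmos DB account, for example:
#   response = db.command({"getLastRequestStatistics": 1})
sample_response = {
    "_t": "GetRequestStatisticsResponse",
    "ok": 1,
    "CommandName": "find",
    "RequestCharge": 10.1,
    "RequestDurationInMilliSeconds": 7.2,
}

def request_charge(response):
    """Return the RU charge of the last operation, or None if absent."""
    return response.get("RequestCharge")

charge = request_charge(sample_response)
```

Recording these charges for your representative queries feeds directly into the RU estimation described above.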

Pre-migration logistics planning

Finally, now that you have a view of your existing data estate and a design for your new Azure Cosmos DB data estate, you're ready to plan how to execute your migration process end-to-end. Once again, do your planning at a per-resource level, adding columns to your spreadsheet to capture the logistical dimensions below.

Execution logistics

  • Assign responsibility for migrating each existing resource from MongoDB to Azure Cosmos DB. How you use your team to shepherd the migration to completion is up to you. For small migrations, one team can kick off the entire migration and monitor its progress. For larger migrations, you could assign responsibility to team members on a per-resource basis for migrating and monitoring each resource.

  • Once you've assigned responsibility for migrating your resources, choose the right migration tool or tools. For small migrations, you might be able to use a single migration tool, such as a MongoDB native tool or Azure DMS, to migrate all of your resources in one shot. For larger migrations or migrations with special requirements, you may want to choose migration tooling at per-resource granularity.

    • Before you plan which migration tools to use, we recommend acquainting yourself with the options that are available. The Azure Database Migration Service for Azure Cosmos DB's API for MongoDB simplifies data migration by providing a fully managed hosting platform, migration monitoring options, and automatic throttling handling. The full list of options is the following:

      • Online - Azure Database Migration Service
        • Makes use of the Azure Cosmos DB bulk executor library
        • Suitable for large datasets and takes care of replicating live changes
        • Works only with other MongoDB sources
      • Offline - Azure Database Migration Service
        • Makes use of the Azure Cosmos DB bulk executor library
        • Suitable for large datasets and takes care of replicating live changes
        • Works only with other MongoDB sources
      • Offline - Azure Data Factory
        • Easy to set up and supports multiple sources
        • Makes use of the Azure Cosmos DB bulk executor library
        • Suitable for large datasets
        • Lack of checkpointing means that any issue during the migration requires a restart of the whole migration process
        • Lack of a dead-letter queue means that a few erroneous files can stop the entire migration process
        • Needs custom code to increase read throughput for certain data sources
      • Offline - existing MongoDB tools (mongodump, mongorestore, Studio 3T)
        • Easy to set up and integrate
        • Needs custom handling for throttling
      • Offline/online - Azure Databricks and Spark
        • Full control of migration rate and data transformation
        • Requires custom coding

    • If your resource can tolerate an offline migration, use the diagram below to choose the appropriate migration tool:

    Offline migration tools.

    • If your resource requires an online migration, use the diagram below to choose the appropriate migration tool:

    Online migration tools.

    Watch this video for an overview and demo of the migration solutions mentioned above.

  • Once you've chosen migration tools for each resource, the next step is to prioritize the resources you'll migrate. Good prioritization helps keep your migration on schedule. A good practice is to prioritize the resources that need the most time to move; migrating these resources first brings the greatest progress toward completion. Furthermore, because these time-consuming migrations typically involve more data, they're usually more resource-intensive for the migration tool and are therefore more likely to expose any problems with your migration pipeline early on. This minimizes the chance that your schedule slips due to difficulties with your migration pipeline.

  • Plan how you will monitor the progress of migration once it has started. If you are coordinating your data migration effort among a team, plan a regular cadence of team syncs too, so that you have a comprehensive view of how the high-priority migrations are going.
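The prioritization heuristic above (migrate the most time-consuming resources first, using data size as a rough proxy for migration time) can be sketched as:

```python
def prioritize(resources):
    """Order resources for migration: largest (slowest to move) first."""
    return sorted(resources, key=lambda r: r["data_size_gb"], reverse=True)

# Hypothetical rows from the data estate spreadsheet.
order = prioritize([
    {"name": "sales.customers", "data_size_gb": 1.8},
    {"name": "telemetry.events", "data_size_gb": 94.0},
    {"name": "sales.orders", "data_size_gb": 12.4},
])
```

Data size is only a proxy; if you have better estimates of per-resource migration time (for example, from a proof-of-concept run), sort on those instead.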

Supported migration scenarios

The best choice of MongoDB migration tool depends on your migration scenario.

Types of migrations

The compatible tools for each migration scenario are shown below:

Supported migration scenarios.

Tooling support for MongoDB versions

Given that you are migrating from a particular MongoDB version, the supported tools are shown below:

MongoDB versions supported by migration tools.

Post-migration

In the pre-migration phase, spend some time planning the steps you'll take toward app migration and optimization after migration.

  • In the post-migration phase, you will execute a cutover of your application to use Azure Cosmos DB instead of your existing MongoDB data estate.
  • Make your best effort to plan indexing, global distribution, consistency, and other mutable Azure Cosmos DB properties at a per-resource level. However, these Azure Cosmos DB configuration settings can be modified later, so expect to make adjustments to them down the road. Don't let these aspects cause analysis paralysis; you'll apply these mutable configurations post-migration.
  • For a post-migration guide, see Post-migration optimization steps when using Azure Cosmos DB's API for MongoDB.

Next steps