This article shows you how to connect to Azure DocumentDB from Azure Databricks to perform common data operations using Python and Spark. You configure the necessary dependencies, establish a connection, and execute read, write, filter, and aggregation operations with the MongoDB Spark connector.
Prerequisites
An Azure subscription
- If you don't have an Azure subscription, create a free account
An existing Azure DocumentDB cluster
- If you don't have a cluster, create a new cluster
A Spark environment in Azure Databricks
- MongoDB Spark connector compatible with Spark 3.2.1 or higher (available at Maven coordinates org.mongodb.spark:mongo-spark-connector_2.12:3.0.1)
Configure Azure Databricks workspace
Configure your Azure Databricks workspace to connect to Azure DocumentDB. Add the MongoDB Connector for Spark library to your compute to enable connectivity to Azure DocumentDB.
Navigate to your Azure Databricks workspace.
Configure the default compute available or create a new compute resource to run your notebook.
Select a Databricks runtime that supports at least Spark 3.0.
In your compute resource, select Libraries > Install New > Maven.
Add the Maven coordinates org.mongodb.spark:mongo-spark-connector_2.12:3.0.1 and select Install.
Restart the compute when installation is complete.
Configure connection settings
Configure Spark to use your Azure DocumentDB connection string for all read and write operations.
In the Azure portal, navigate to your Azure DocumentDB resource.
Under Settings > Connection strings, copy the connection string. It has the form:
mongodb+srv://<user>:<password>@<database_name>.mongocluster.cosmos.azure.com
In Azure Databricks, navigate to your compute configuration and select Advanced Options (at the bottom of the page).
Add the following Spark configuration variables:
- spark.mongodb.output.uri - Paste your connection string
- spark.mongodb.input.uri - Paste your connection string
Save the configuration.
Alternatively, you can set the connection string directly in your code by using the .option() method when reading or writing data.
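For example, the following sketch passes the connection string to a single read by using .option() instead of relying on the compute's Spark configuration. It assumes the connectionString, database, and collection variables defined in the next section.

```python
# Sketch: supply the connection string per operation instead of in the compute's Spark config.
# Assumes connectionString, database, and collection are defined as shown in the next section.
df = (
    spark.read.format("mongo")
    .option("spark.mongodb.input.uri", connectionString)
    .option("database", database)
    .option("collection", collection)
    .load()
)
```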
Create Python notebook
Create a new Python notebook to run your data operations.
In your Azure Databricks workspace, create a new Python notebook.
Define your connection variables at the beginning of the notebook:
```python
connectionString = "mongodb+srv://<user>:<password>@<database_name>.mongocluster.cosmos.azure.com/?tls=true&authMechanism=SCRAM-SHA-256&retrywrites=false&maxIdleTimeMS=120000"
database = "<database_name>"
collection = "<collection_name>"
```

Replace the placeholder values with your actual credentials, database name, and collection name.
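To avoid hard-coding credentials in notebook code, one option is to read the connection string from a Databricks secret scope. The following is a minimal sketch; the scope name documentdb-scope and key name connection-string are hypothetical placeholders you would create yourself.

```python
# Sketch: read the connection string from a Databricks secret scope instead of hard-coding it.
# "documentdb-scope" and "connection-string" are hypothetical names; create your own scope and key.
connectionString = dbutils.secrets.get(scope="documentdb-scope", key="connection-string")
```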
Read data from collection
Read data from your Azure DocumentDB collection into a Spark DataFrame.
Use the following code to load data from your collection:
```python
df = (
    spark.read.format("mongo")
    .option("database", database)
    .option("spark.mongodb.input.uri", connectionString)
    .option("collection", collection)
    .load()
)
```

Verify the data loaded successfully:

```python
df.printSchema()
display(df)
```

Observe the result. This code creates a DataFrame containing all documents from the specified collection and displays the schema and data.
Filter data
Apply filters to retrieve specific subsets of data from your collection.
Use the DataFrame filter() method to apply conditions:

```python
df_filtered = df.filter(df["birth_year"] == 1970)
display(df_filtered)
```

You can also filter by column index:

```python
df_filtered = df.filter(df[2] == 1970)
display(df_filtered)
```

Observe the result. This approach returns only the documents that match your filter criteria.
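Conditions can also be combined. The following sketch filters on both birth_year and gender (the gender field is used in the SQL example in the next section); note the parentheses around each condition.

```python
from pyspark.sql.functions import col

# Sketch: combine multiple filter conditions with &; wrap each condition in parentheses.
df_filtered = df.filter((col("birth_year") == 1970) & (col("gender") == 2))
display(df_filtered)
```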
Query data with SQL
Create temporary views and run SQL queries against your data for familiar SQL-based analysis.
Create a temporary view from your DataFrame:
```python
df.createOrReplaceTempView("T")
```

Execute SQL queries against the view:

```python
df_result = spark.sql("SELECT * FROM T WHERE birth_year == 1970 AND gender == 2")
display(df_result)
```

Observe the result. This approach allows you to use standard SQL syntax for complex queries and joins.
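As an example of a more involved query, the following sketch groups the same view by birth_year; the column name comes from this article's sample data.

```python
# Sketch: an aggregate query over the temporary view, grouping by birth_year.
df_by_year = spark.sql(
    "SELECT birth_year, COUNT(*) AS total_docs FROM T GROUP BY birth_year ORDER BY birth_year"
)
display(df_by_year)
```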
Write data to collection
Save new or modified data by writing DataFrames back to Azure DocumentDB collections.
Use the following code to write data to a collection:
```python
(
    df.write.format("mongo")
    .option("spark.mongodb.output.uri", connectionString)
    .option("database", database)
    .option("collection", "CitiBike2019")
    .mode("append")
    .save()
)
```

The write operation completes without output. Verify that the write operation completed successfully by reading the data from the collection:

```python
df_verify = (
    spark.read.format("mongo")
    .option("database", database)
    .option("spark.mongodb.input.uri", connectionString)
    .option("collection", "CitiBike2019")
    .load()
)
display(df_verify)
```

Tip

Use different write modes such as append, overwrite, or ignore depending on your requirements.
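For example, the following sketch repeats the earlier write with overwrite mode, which replaces the existing contents of the target collection, so use it with care.

```python
# Sketch: overwrite replaces the existing documents in the target collection.
(
    df.write.format("mongo")
    .option("spark.mongodb.output.uri", connectionString)
    .option("database", database)
    .option("collection", "CitiBike2019")
    .mode("overwrite")
    .save()
)
```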
Run aggregation pipelines
Execute aggregation pipelines to perform server-side data processing and analytics directly within Azure DocumentDB. Aggregation pipelines enable powerful data transformations, grouping, and calculations without moving data out of the database. They're ideal for real-time analytics, dashboards, and report generation.
Define your aggregation pipeline as a JSON string:
```python
pipeline = "[{ $group : { _id : '$birth_year', totaldocs : { $count : 1 }, totalduration: {$sum: '$tripduration'}} }]"
```

Execute the pipeline and load the results:

```python
df_aggregated = (
    spark.read.format("mongo")
    .option("database", database)
    .option("spark.mongodb.input.uri", connectionString)
    .option("collection", collection)
    .option("pipeline", pipeline)
    .load()
)
display(df_aggregated)
```
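You can also add stages that filter documents before grouping. The following sketch prepends a $match stage to the $group stage from the previous example; the 1970 cutoff is an arbitrary illustration.

```python
# Sketch: filter server-side with $match before grouping, so less data leaves the database.
pipeline_matched = (
    "[{ $match: { birth_year: { $gte: 1970 } } },"
    " { $group: { _id: '$birth_year', totalduration: { $sum: '$tripduration' } } }]"
)
df_matched = (
    spark.read.format("mongo")
    .option("database", database)
    .option("spark.mongodb.input.uri", connectionString)
    .option("collection", collection)
    .option("pipeline", pipeline_matched)
    .load()
)
display(df_matched)
```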
Related content
- Maven central - MongoDB Spark connector versions
- Practical MongoDB Aggregations - Guide to aggregation pipelines
- Configure firewall settings