Connect to Azure DocumentDB from Azure Databricks

This article shows you how to connect to Azure DocumentDB from Azure Databricks to perform common data operations using Python and Spark. You configure the necessary dependencies, establish a connection, and execute read, write, filter, and aggregation operations with the MongoDB Spark connector.

Prerequisites

  • An Azure subscription

    • If you don't have an Azure subscription, create a free account
  • An existing Azure DocumentDB cluster

Configure Azure Databricks workspace

Configure your Azure Databricks workspace to connect to Azure DocumentDB. Add the MongoDB Connector for Spark library to your compute to enable connectivity to Azure DocumentDB.

  1. Navigate to your Azure Databricks workspace.

  2. Use the available default compute, or create a new compute resource to run your notebook.

  3. Select a Databricks runtime that supports at least Spark 3.0.

  4. In your compute resource, select Libraries > Install New > Maven.

  5. Add the Maven coordinates: org.mongodb.spark:mongo-spark-connector_2.12:3.0.1

  6. Select Install.

  7. Restart the compute when installation is complete.

Configure connection settings

Configure Spark to use your Azure DocumentDB connection string for all read and write operations.

  1. In the Azure portal, navigate to your Azure DocumentDB resource.

  2. Under Settings > Connection strings, copy the connection string. It has the form: mongodb+srv://<user>:<password>@<database_name>.mongocluster.cosmos.azure.com

  3. In Azure Databricks, navigate to your compute configuration and select Advanced Options (at the bottom of the page).

  4. Add the following Spark configuration variables:

    • spark.mongodb.output.uri - Paste your connection string
    • spark.mongodb.input.uri - Paste your connection string
  5. Save the configuration.

Alternatively, you can set the connection string directly in your code by using the .option() method when reading or writing data.
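
If the connection string is set in the cluster's Spark configuration as described above, the URI option can be omitted from individual reads and writes. The following is a minimal sketch of that pattern; it assumes the database and collection variables defined in the next section.

    # Sketch: with spark.mongodb.input.uri set at the cluster level,
    # the connection string doesn't need to be passed per operation.
    df = (
        spark.read.format("mongo")
        .option("database", database)
        .option("collection", collection)
        .load()
    )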

Create Python notebook

Run your data operations by creating a new Python notebook.

  1. In your Azure Databricks workspace, create a new Python notebook.

  2. Define your connection variables at the beginning of the notebook:

    connectionString = "mongodb+srv://<user>:<password>@<database_name>.mongocluster.cosmos.azure.com/?tls=true&authMechanism=SCRAM-SHA-256&retrywrites=false&maxIdleTimeMS=120000"
    database = "<database_name>"
    collection = "<collection_name>"
    
  3. Replace the placeholder values with your actual database name and collection name.
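
To avoid storing credentials in the notebook, you can read the connection string from a Databricks secret scope instead. The following sketch assumes a secret scope named documentdb-scope containing a secret named connection-string; both names are hypothetical.

    # Sketch: read the connection string from a Databricks secret scope
    # instead of hard-coding it (scope and key names are hypothetical).
    connectionString = dbutils.secrets.get(scope="documentdb-scope", key="connection-string")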

Read data from collection

Read data from your Azure DocumentDB collection into a Spark DataFrame.

  1. Use the following code to load data from your collection:

    df = (
        spark.read.format("mongo")
        .option("database", database)
        .option("spark.mongodb.input.uri", connectionString)
        .option("collection", collection)
        .load()
    )
    
  2. Verify the data loaded successfully:

    df.printSchema()
    display(df)
    
  3. Observe the result. This code creates a DataFrame containing all documents from the specified collection and displays the schema and data.

Filter data

Apply filters to retrieve specific subsets of data from your collection.

  1. Use the DataFrame filter() method to apply conditions:

    df_filtered = df.filter(df["birth_year"] == 1970)
    display(df_filtered)
    
  2. Alternatively, refer to the column by its index:

    df_filtered = df.filter(df[2] == 1970)
    display(df_filtered)
    
  3. Observe the result. This approach returns only the documents that match your filter criteria.
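
Multiple conditions can be combined with the & (and) and | (or) operators. The following is a minimal sketch, assuming the same birth_year and gender columns used in the SQL example later in this article.

    # Sketch: combine filter conditions; each condition needs its own parentheses.
    df_filtered = df.filter((df["birth_year"] == 1970) & (df["gender"] == 2))
    display(df_filtered)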

Query data with SQL

Create temporary views and run SQL queries against your data for familiar SQL-based analysis.

  1. Create a temporary view from your DataFrame:

    df.createOrReplaceTempView("T")
    
  2. Execute SQL queries against the view:

    df_result = spark.sql("SELECT * FROM T WHERE birth_year == 1970 AND gender == 2")
    display(df_result)
    
  3. Observe the result. This approach allows you to use standard SQL syntax for complex queries and joins.
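
For example, here's a minimal sketch of a grouped query over the same view; it assumes the collection includes the tripduration field used in the aggregation pipeline later in this article.

    # Sketch: a grouped SQL query over the temporary view.
    df_grouped = spark.sql(
        "SELECT birth_year, COUNT(*) AS totaldocs, SUM(tripduration) AS totalduration "
        "FROM T GROUP BY birth_year"
    )
    display(df_grouped)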

Write data to collection

Save new or modified data by writing DataFrames back to Azure DocumentDB collections.

  1. Use the following code to write data to a collection:

    (
        df.write.format("mongo")
        .option("spark.mongodb.output.uri", connectionString)
        .option("database", database)
        .option("collection", "CitiBike2019")
        .mode("append")
        .save()
    )
    
  2. The write operation completes without output. To verify that it succeeded, read the data back from the collection:

    df_verify = (
        spark.read.format("mongo")
        .option("database", database)
        .option("spark.mongodb.input.uri", connectionString)
        .option("collection", "CitiBike2019")
        .load()
    )
    display(df_verify)
    

    Tip

    Use different write modes such as append, overwrite, or ignore depending on your requirements.
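
For example, here's a minimal sketch that replaces the contents of the target collection instead of appending to it:

    # Sketch: mode("overwrite") replaces the existing contents of the target collection.
    (
        df.write.format("mongo")
        .option("spark.mongodb.output.uri", connectionString)
        .option("database", database)
        .option("collection", "CitiBike2019")
        .mode("overwrite")
        .save()
    )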

Run aggregation pipelines

Execute aggregation pipelines to perform server-side data processing and analytics directly within Azure DocumentDB. Aggregation pipelines enable powerful data transformations, grouping, and calculations without moving data out of the database. They're ideal for real-time analytics, dashboards, and report generation.

  1. Define your aggregation pipeline as a JSON string:

    pipeline = "[{ $group: { _id: '$birth_year', totaldocs: { $count: {} }, totalduration: { $sum: '$tripduration' } } }]"
    
  2. Execute the pipeline and load the results:

    df_aggregated = (
        spark.read.format("mongo")
        .option("database", database)
        .option("spark.mongodb.input.uri", connectionString)
        .option("collection", collection)
        .option("pipeline", pipeline)
        .load()
    )
    display(df_aggregated)
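
To filter documents server-side before grouping, a $match stage can be added at the start of the pipeline. The following is a minimal sketch, assuming the same birth_year and tripduration fields.

    # Sketch: a $match stage filters documents in the database before they're grouped.
    pipeline_filtered = "[{ $match: { birth_year: { $gte: 1970 } } }, { $group: { _id: '$birth_year', totaldocs: { $sum: 1 }, totalduration: { $sum: '$tripduration' } } }]"
    df_filtered_agg = (
        spark.read.format("mongo")
        .option("database", database)
        .option("spark.mongodb.input.uri", connectionString)
        .option("collection", collection)
        .option("pipeline", pipeline_filtered)
        .load()
    )
    display(df_filtered_agg)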