Quickstart: Vector search with Java in Azure DocumentDB

Learn to use vector search in Azure DocumentDB with the Java MongoDB driver to store and query vector data efficiently.

This quickstart provides a guided tour of key vector search techniques using a Java sample app on GitHub.

The app uses a sample hotel dataset in a JSON file with pre-calculated vectors from the text-embedding-3-small model, though you can also generate the vectors yourself. The hotel data includes hotel names, locations, descriptions, and vector embeddings.

Prerequisites

  • An Azure subscription

    • If you don't have an Azure subscription, create a free account

Create data file with vectors

  1. Create a new data directory for the hotels data file:

    mkdir data
    
  2. Copy the Hotels_Vector.json raw data file with vectors to your data directory.

Create a Java project

  1. Create a new sibling directory for your project, at the same level as the data directory, and open it in Visual Studio Code:

    mkdir vector-search-quickstart
    mkdir vector-search-quickstart/src
    code vector-search-quickstart
    
  2. Create a pom.xml file in the project root with the following content:

    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
    
        <groupId>com.azure.documentdb.samples</groupId>
        <artifactId>vector-search-quickstart</artifactId>
        <version>1.0-SNAPSHOT</version>
            
        <properties>
            <maven.compiler.release>21</maven.compiler.release>
            <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        </properties>
            
        <dependencies>
            <dependency>
                <groupId>org.mongodb</groupId>
                <artifactId>mongodb-driver-sync</artifactId>
                <version>5.6.2</version>
            </dependency>
            <dependency>
                <groupId>com.azure</groupId>
                <artifactId>azure-identity</artifactId>
                <version>1.18.1</version>
            </dependency>
            <dependency>
                <groupId>com.azure</groupId>
                <artifactId>azure-ai-openai</artifactId>
                <version>1.0.0-beta.16</version>
            </dependency>
            <dependency>
                <groupId>tools.jackson.core</groupId>
                <artifactId>jackson-databind</artifactId>
                <version>3.0.3</version>
            </dependency>
            <dependency>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-nop</artifactId>
                <version>2.0.17</version>
                <scope>runtime</scope>
            </dependency>
        </dependencies>
    </project>
    

    The app uses the following Maven dependencies specified in the pom.xml:

    • mongodb-driver-sync: Official MongoDB Java driver for database connectivity and operations
    • azure-identity: Azure Identity library for passwordless authentication with Microsoft Entra ID
    • azure-ai-openai: Azure OpenAI client library to communicate with AI models and create vector embeddings
    • jackson-databind: JSON serialization and deserialization library
    • slf4j-nop: No-operation SLF4J binding to suppress logging output from the MongoDB driver
  3. Create a .env file in your project root for environment variables:

    # Azure OpenAI Embedding Settings
    AZURE_OPENAI_EMBEDDING_MODEL=text-embedding-3-small
    AZURE_OPENAI_EMBEDDING_API_VERSION=2023-05-15
    AZURE_OPENAI_EMBEDDING_ENDPOINT=
    EMBEDDING_SIZE_BATCH=16
    
    # Azure DocumentDB configuration
    MONGO_CLUSTER_NAME=
    AZURE_MANAGED_IDENTITY_PRINCIPAL_ID=
    
    # Data file
    DATA_FILE_WITH_VECTORS=../data/Hotels_Vector.json
    EMBEDDED_FIELD=DescriptionVector
    EMBEDDING_DIMENSIONS=1536
    LOAD_SIZE_BATCH=50
    

    Replace the placeholder values in the .env file with your own information:

    • AZURE_OPENAI_EMBEDDING_ENDPOINT: Your Azure OpenAI resource endpoint URL.
    • MONGO_CLUSTER_NAME: Your Azure DocumentDB resource name.
    • AZURE_MANAGED_IDENTITY_PRINCIPAL_ID: The identity the app uses as the username when it builds the connection string.
  4. Load the environment variables:

    set -a && source .env && set +a
    
  5. The project structure should look like this:

    data
    └── Hotels_Vector.json
    vector-search-quickstart
    ├── .env
    ├── pom.xml
    └── src
    

Create a DiskAnn.java file in the src directory and paste in the following code:

package com.azure.documentdb.samples;

import com.azure.ai.openai.OpenAIClient;
import com.azure.ai.openai.OpenAIClientBuilder;
import com.azure.ai.openai.models.EmbeddingsOptions;
import com.azure.identity.DefaultAzureCredentialBuilder;
import com.mongodb.ConnectionString;
import com.mongodb.MongoClientSettings;
import com.mongodb.MongoCredential;
import com.mongodb.client.AggregateIterable;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Indexes;
import org.bson.Document;
import tools.jackson.core.type.TypeReference;
import tools.jackson.databind.json.JsonMapper;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * Vector search sample using DiskANN index.
 */
public class DiskAnn {
    private static final String SAMPLE_QUERY = "quintessential lodging near running trails, eateries, retail";
    private static final String DATABASE_NAME = "Hotels";
    private static final String COLLECTION_NAME = "hotels_diskann";
    private static final String VECTOR_INDEX_NAME = "vectorIndex_diskann";

    private final JsonMapper jsonMapper = JsonMapper.builder().build();

    public static void main(String[] args) {
        new DiskAnn().run();
        System.exit(0);
    }

    public void run() {
        try (var mongoClient = createMongoClient()) {
            var openAIClient = createOpenAIClient();

            var database = mongoClient.getDatabase(DATABASE_NAME);
            var collection = database.getCollection(COLLECTION_NAME, Document.class);

            // Drop and recreate collection
            collection.drop();
            database.createCollection(COLLECTION_NAME);
            System.out.println("Created collection: " + COLLECTION_NAME);

            // Load and insert data
            var hotelData = loadHotelData();
            insertDataInBatches(collection, hotelData);

            // Create standard indexes
            createStandardIndexes(collection);

            // Create vector index
            createVectorIndex(database);

            // Perform vector search
            var queryEmbedding = createEmbedding(openAIClient, SAMPLE_QUERY);
            performVectorSearch(collection, queryEmbedding);

        } catch (Exception e) {
            System.err.println("Error: " + e.getMessage());
            e.printStackTrace();
        }
    }

    private MongoClient createMongoClient() {
        var clusterName = System.getenv("MONGO_CLUSTER_NAME");
        var managedIdentityPrincipalId = System.getenv("AZURE_MANAGED_IDENTITY_PRINCIPAL_ID");
        var azureCredential = new DefaultAzureCredentialBuilder().build();

        MongoCredential.OidcCallback callback = (MongoCredential.OidcCallbackContext context) -> {
            var token = azureCredential.getToken(
                new com.azure.core.credential.TokenRequestContext()
                    .addScopes("https://ossrdbms-aad.database.windows.net/.default")
            ).block();

            if (token == null) {
                throw new RuntimeException("Failed to obtain Azure AD token");
            }

            return new MongoCredential.OidcCallbackResult(token.getToken());
        };

        var credential = MongoCredential.createOidcCredential(null)
            .withMechanismProperty("OIDC_CALLBACK", callback);

        var connectionString = new ConnectionString(
            String.format("mongodb+srv://%s@%s.mongocluster.cosmos.azure.com/?authMechanism=MONGODB-OIDC&tls=true&retrywrites=false&maxIdleTimeMS=120000",
                managedIdentityPrincipalId, clusterName)
        );

        var settings = MongoClientSettings.builder()
            .applyConnectionString(connectionString)
            .credential(credential)
            .build();

        return MongoClients.create(settings);
    }

    private OpenAIClient createOpenAIClient() {
        var endpoint = System.getenv("AZURE_OPENAI_EMBEDDING_ENDPOINT");
        var credential = new DefaultAzureCredentialBuilder().build();

        return new OpenAIClientBuilder()
            .endpoint(endpoint)
            .credential(credential)
            .buildClient();
    }

    private List<Map<String, Object>> loadHotelData() throws IOException {
        var dataFile = System.getenv("DATA_FILE_WITH_VECTORS");
        var filePath = Path.of(dataFile);

        System.out.println("Reading JSON file from " + filePath.toAbsolutePath());
        var jsonContent = Files.readString(filePath);

        return jsonMapper.readValue(jsonContent, new TypeReference<List<Map<String, Object>>>() {});
    }

    private void insertDataInBatches(MongoCollection<Document> collection, List<Map<String, Object>> hotelData) {
        var batchSizeStr = System.getenv("LOAD_SIZE_BATCH");
        var batchSize = batchSizeStr != null ? Integer.parseInt(batchSizeStr) : 100;
        var batches = partitionList(hotelData, batchSize);

        System.out.println("Processing in batches of " + batchSize + "...");

        for (int i = 0; i < batches.size(); i++) {
            var batch = batches.get(i);
            var documents = batch.stream()
                .map(Document::new)
                .toList();

            collection.insertMany(documents);
            System.out.println("Batch " + (i + 1) + " complete: " + documents.size() + " inserted");
        }
    }

    private void createStandardIndexes(MongoCollection<Document> collection) {
        collection.createIndex(Indexes.ascending("HotelId"));
        collection.createIndex(Indexes.ascending("Category"));
        collection.createIndex(Indexes.ascending("Description"));
        collection.createIndex(Indexes.ascending("Description_fr"));
    }

    private void createVectorIndex(MongoDatabase database) {
        var embeddedField = System.getenv("EMBEDDED_FIELD");
        var dimensionsStr = System.getenv("EMBEDDING_DIMENSIONS");
        var dimensions = dimensionsStr != null ? Integer.parseInt(dimensionsStr) : 1536;

        var indexDefinition = new Document()
            .append("createIndexes", COLLECTION_NAME)
            .append("indexes", List.of(
                new Document()
                    .append("name", VECTOR_INDEX_NAME)
                    .append("key", new Document(embeddedField, "cosmosSearch"))
                    .append("cosmosSearchOptions", new Document()
                        .append("kind", "vector-diskann")
                        .append("dimensions", dimensions)
                        .append("similarity", "COS")
                        .append("maxDegree", 20)
                        .append("lBuild", 10)
                    )
            ));

        database.runCommand(indexDefinition);
        System.out.println("Created vector index: " + VECTOR_INDEX_NAME);
    }

    private List<Double> createEmbedding(OpenAIClient openAIClient, String text) {
        var model = System.getenv("AZURE_OPENAI_EMBEDDING_MODEL");
        var options = new EmbeddingsOptions(List.of(text));

        var response = openAIClient.getEmbeddings(model, options);
        return response.getData().get(0).getEmbedding().stream()
                .map(Float::doubleValue)
                .toList();
    }

    private void performVectorSearch(MongoCollection<Document> collection, List<Double> queryEmbedding) {
        var embeddedField = System.getenv("EMBEDDED_FIELD");

        var searchStage = new Document("$search", new Document()
            .append("cosmosSearch", new Document()
                .append("vector", queryEmbedding)
                .append("path", embeddedField)
                .append("k", 5)
            )
        );

        var projectStage = new Document("$project", new Document()
            .append("score", new Document("$meta", "searchScore"))
            .append("document", "$$ROOT")
        );

        var pipeline = List.of(searchStage, projectStage);

        System.out.println("\nVector search results for: \"" + SAMPLE_QUERY + "\"");

        AggregateIterable<Document> results = collection.aggregate(pipeline);
        var rank = 1;

        for (var result : results) {
            var document = result.get("document", Document.class);
            var hotelName = document.getString("HotelName");
            var score = result.getDouble("score");
            System.out.printf("%d. HotelName: %s, Score: %.4f%n", rank++, hotelName, score);
        }
    }

    private static <T> List<List<T>> partitionList(List<T> list, int batchSize) {
        var partitions = new ArrayList<List<T>>();
        for (int i = 0; i < list.size(); i += batchSize) {
            partitions.add(list.subList(i, Math.min(i + batchSize, list.size())));
        }
        return partitions;
    }
}

This code performs the following tasks:

  • Creates a passwordless connection to Azure DocumentDB using DefaultAzureCredential and the MongoDB OIDC mechanism
  • Creates an Azure OpenAI client for generating embeddings
  • Drops and recreates the collection, then loads hotel data from the JSON file in batches
  • Creates standard indexes and a vector index with algorithm-specific options
  • Generates an embedding for a sample query and runs an aggregation search pipeline
  • Prints the top five matching hotels with similarity scores
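The batch loading step splits the hotel list into fixed-size chunks before calling insertMany. The partitioning logic mirrors the partitionList helper in DiskAnn.java and can be exercised on its own (BatchDemo is a hypothetical class name used only for this sketch):

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of the batch partitioning used by insertDataInBatches:
// split a list into chunks of at most batchSize elements.
public class BatchDemo {
    static <T> List<List<T>> partitionList(List<T> list, int batchSize) {
        var partitions = new ArrayList<List<T>>();
        for (int i = 0; i < list.size(); i += batchSize) {
            // subList creates a view over [i, i + batchSize), clamped to the list end
            partitions.add(list.subList(i, Math.min(i + batchSize, list.size())));
        }
        return partitions;
    }

    public static void main(String[] args) {
        var items = List.of(1, 2, 3, 4, 5, 6, 7);
        var batches = partitionList(items, 3);
        System.out.println(batches); // [[1, 2, 3], [4, 5, 6], [7]]
    }
}
```

A final partial batch (here, the single trailing element) is inserted as-is, so no documents are dropped when the data size isn't a multiple of the batch size.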

Authenticate to Azure

Sign in to Azure before you run the application so it can access Azure resources securely.

Note

Ensure your signed-in identity has the required data plane roles on both the Azure DocumentDB account and the Azure OpenAI resource.

az login

Build the application

Compile the application:

mvn clean compile

Run DiskANN (Disk-based Approximate Nearest Neighbor) search:

mvn exec:java -Dexec.mainClass="com.azure.documentdb.samples.DiskAnn"

DiskANN is optimized for large datasets that don't fit in memory: it uses efficient disk-based storage while maintaining a good balance of speed and accuracy.
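The scores in the search results are cosine similarities, because the vector index was created with similarity set to COS. The database computes this server-side; the following is a purely illustrative sketch of the metric (CosineSimilarity is a hypothetical helper name, not part of the sample app):

```java
import java.util.List;

// Illustrative only: cosine similarity, the metric behind "similarity": "COS".
// Scores near 1.0 mean the vectors point in nearly the same direction.
public class CosineSimilarity {
    static double cosine(List<Double> a, List<Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.size(); i++) {
            dot += a.get(i) * b.get(i);      // dot product
            normA += a.get(i) * a.get(i);    // squared magnitude of a
            normB += b.get(i) * b.get(i);    // squared magnitude of b
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Parallel vectors score 1.0; orthogonal vectors score 0.0.
        System.out.println(cosine(List.of(1.0, 0.0), List.of(2.0, 0.0))); // 1.0
        System.out.println(cosine(List.of(1.0, 0.0), List.of(0.0, 1.0))); // 0.0
    }
}
```

Because the metric depends only on direction, not magnitude, embeddings of semantically similar text rank close together even if their raw values differ in scale.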

Example output:

Created collection: hotels_diskann
Reading JSON file from /workspaces/documentdb-samples/ai/vector-search-java/../data/Hotels_Vector.json
Processing in batches of 50...
Batch 1 complete: 50 inserted
Created vector index: vectorIndex_diskann

Vector search results for: "quintessential lodging near running trails, eateries, retail"
1. HotelName: Royal Cottage Resort, Score: 0.4991
2. HotelName: Country Comfort Inn, Score: 0.4786
3. HotelName: Nordick's Valley Motel, Score: 0.4635
4. HotelName: Economy Universe Motel, Score: 0.4462
5. HotelName: Roach Motel, Score: 0.4389

View and manage data in Visual Studio Code

  1. Install the DocumentDB extension and Extension Pack for Java in Visual Studio Code.

  2. Connect to your Azure DocumentDB account using the DocumentDB extension.

  3. View the data and indexes in the Hotels database.

    Screenshot of DocumentDB extension showing the DocumentDB collection.

Clean up resources

Delete the resource group, Azure DocumentDB cluster, and Azure OpenAI resource when you no longer need them to avoid unnecessary costs.