Learn to use vector search in Azure DocumentDB with the Java MongoDB driver to store and query vector data efficiently.
This quickstart provides a guided tour of key vector search techniques using a Java sample app on GitHub.
The app uses a sample hotel dataset in a JSON file with pre-calculated vectors from the text-embedding-3-small model, though you can also generate the vectors yourself. The hotel data includes hotel names, locations, descriptions, and vector embeddings.
Prerequisites
- An Azure subscription
  - If you don't have an Azure subscription, create a free account
- An existing Azure DocumentDB cluster
  - If you don't have a cluster, create a new cluster
  - Firewall configured to allow access to your client IP address
- An Azure OpenAI resource
  - Custom domain configured
  - `text-embedding-3-small` model deployed
Use the Bash environment in Azure Cloud Shell. For more information, see Get started with Azure Cloud Shell.
If you prefer to run CLI reference commands locally, install the Azure CLI. If you're running on Windows or macOS, consider running Azure CLI in a Docker container. For more information, see How to run the Azure CLI in a Docker container.
If you're using a local installation, sign in to the Azure CLI by using the az login command. To finish the authentication process, follow the steps displayed in your terminal. For other sign-in options, see Authenticate to Azure using Azure CLI.
When you're prompted, install the Azure CLI extension on first use. For more information about extensions, see Use and manage extensions with the Azure CLI.
Run az version to find the version and dependent libraries that are installed. To upgrade to the latest version, run az upgrade.
Create data file with vectors
Create a new data directory for the hotels data file:
```bash
mkdir data
```

Copy the `Hotels_Vector.json` raw data file with vectors to your `data` directory.
Create a Java project
Create a new sibling directory for your project, at the same level as the data directory, and open it in Visual Studio Code:
```bash
mkdir vector-search-quickstart
mkdir vector-search-quickstart/src
code vector-search-quickstart
```

Create a `pom.xml` file in the project root with the following content:

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.azure.documentdb.samples</groupId>
    <artifactId>vector-search-quickstart</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <maven.compiler.release>21</maven.compiler.release>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.mongodb</groupId>
            <artifactId>mongodb-driver-sync</artifactId>
            <version>5.6.2</version>
        </dependency>
        <dependency>
            <groupId>com.azure</groupId>
            <artifactId>azure-identity</artifactId>
            <version>1.18.1</version>
        </dependency>
        <dependency>
            <groupId>com.azure</groupId>
            <artifactId>azure-ai-openai</artifactId>
            <version>1.0.0-beta.16</version>
        </dependency>
        <dependency>
            <groupId>tools.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>3.0.3</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-nop</artifactId>
            <version>2.0.17</version>
            <scope>runtime</scope>
        </dependency>
    </dependencies>
</project>
```

The app uses the following Maven dependencies specified in the `pom.xml`:

- `mongodb-driver-sync`: Official MongoDB Java driver for database connectivity and operations
- `azure-identity`: Azure Identity library for passwordless authentication with Microsoft Entra ID
- `azure-ai-openai`: Azure OpenAI client library to communicate with AI models and create vector embeddings
- `jackson-databind`: JSON serialization and deserialization library
- `slf4j-nop`: No-operation SLF4J binding to suppress logging output from the MongoDB driver

Create a `.env` file in your project root for environment variables:

```bash
# Azure OpenAI Embedding Settings
AZURE_OPENAI_EMBEDDING_MODEL=text-embedding-3-small
AZURE_OPENAI_EMBEDDING_API_VERSION=2023-05-15
AZURE_OPENAI_EMBEDDING_ENDPOINT=
EMBEDDING_SIZE_BATCH=16

# Azure DocumentDB configuration
MONGO_CLUSTER_NAME=

# Data file
DATA_FILE_WITH_VECTORS=../data/Hotels_Vector.json
EMBEDDED_FIELD=DescriptionVector
EMBEDDING_DIMENSIONS=1536
LOAD_SIZE_BATCH=50
```

Replace the placeholder values in the `.env` file with your own information:

- `AZURE_OPENAI_EMBEDDING_ENDPOINT`: Your Azure OpenAI resource endpoint URL.
- `MONGO_CLUSTER_NAME`: Your Azure DocumentDB resource name.

Load the environment variables:

```bash
set -a && source .env && set +a
```

The project structure should look like this:

```
data
└── Hotels_Vector.json
vector-search-quickstart
├── .env
├── pom.xml
└── src
```
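After sourcing the `.env` file, it can help to confirm that the variables are actually visible to the JVM before running the sample. The following is a minimal, hypothetical helper (`EnvCheck` is not part of the sample app); the variable names match the `.env` file above:

```java
import java.util.List;
import java.util.Map;

public class EnvCheck {

    // Variables the quickstart relies on, from the .env file.
    static final List<String> REQUIRED = List.of(
        "AZURE_OPENAI_EMBEDDING_MODEL",
        "AZURE_OPENAI_EMBEDDING_ENDPOINT",
        "MONGO_CLUSTER_NAME",
        "DATA_FILE_WITH_VECTORS",
        "EMBEDDED_FIELD",
        "EMBEDDING_DIMENSIONS"
    );

    // Returns the names of required variables that are missing or blank.
    public static List<String> missing(Map<String, String> env) {
        return REQUIRED.stream()
            .filter(name -> env.get(name) == null || env.get(name).isBlank())
            .toList();
    }

    public static void main(String[] args) {
        var missing = missing(System.getenv());
        if (missing.isEmpty()) {
            System.out.println("All required variables are set.");
        } else {
            System.out.println("Missing: " + missing);
        }
    }
}
```

If `source .env` ran in the same shell session, the check prints that all required variables are set; otherwise it lists the names still unset.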
Add code for vector search
Create a DiskAnn.java file in the src directory and paste in the following code:
```java
package com.azure.documentdb.samples;

import com.azure.ai.openai.OpenAIClient;
import com.azure.ai.openai.OpenAIClientBuilder;
import com.azure.ai.openai.models.EmbeddingsOptions;
import com.azure.identity.DefaultAzureCredentialBuilder;
import com.mongodb.ConnectionString;
import com.mongodb.MongoClientSettings;
import com.mongodb.MongoCredential;
import com.mongodb.client.AggregateIterable;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Indexes;
import org.bson.Document;
import tools.jackson.core.type.TypeReference;
import tools.jackson.databind.json.JsonMapper;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * Vector search sample using DiskANN index.
 */
public class DiskAnn {

    private static final String SAMPLE_QUERY = "quintessential lodging near running trails, eateries, retail";
    private static final String DATABASE_NAME = "Hotels";
    private static final String COLLECTION_NAME = "hotels_diskann";
    private static final String VECTOR_INDEX_NAME = "vectorIndex_diskann";

    private final JsonMapper jsonMapper = JsonMapper.builder().build();

    public static void main(String[] args) {
        new DiskAnn().run();
        System.exit(0);
    }

    public void run() {
        try (var mongoClient = createMongoClient()) {
            var openAIClient = createOpenAIClient();
            var database = mongoClient.getDatabase(DATABASE_NAME);
            var collection = database.getCollection(COLLECTION_NAME, Document.class);

            // Drop and recreate collection
            collection.drop();
            database.createCollection(COLLECTION_NAME);
            System.out.println("Created collection: " + COLLECTION_NAME);

            // Load and insert data
            var hotelData = loadHotelData();
            insertDataInBatches(collection, hotelData);

            // Create standard indexes
            createStandardIndexes(collection);

            // Create vector index
            createVectorIndex(database);

            // Perform vector search
            var queryEmbedding = createEmbedding(openAIClient, SAMPLE_QUERY);
            performVectorSearch(collection, queryEmbedding);
        } catch (Exception e) {
            System.err.println("Error: " + e.getMessage());
            e.printStackTrace();
        }
    }

    private MongoClient createMongoClient() {
        var clusterName = System.getenv("MONGO_CLUSTER_NAME");
        var managedIdentityPrincipalId = System.getenv("AZURE_MANAGED_IDENTITY_PRINCIPAL_ID");
        var azureCredential = new DefaultAzureCredentialBuilder().build();

        MongoCredential.OidcCallback callback = (MongoCredential.OidcCallbackContext context) -> {
            var token = azureCredential.getToken(
                new com.azure.core.credential.TokenRequestContext()
                    .addScopes("https://ossrdbms-aad.database.windows.net/.default")
            ).block();
            if (token == null) {
                throw new RuntimeException("Failed to obtain Azure AD token");
            }
            return new MongoCredential.OidcCallbackResult(token.getToken());
        };

        var credential = MongoCredential.createOidcCredential(null)
            .withMechanismProperty("OIDC_CALLBACK", callback);

        var connectionString = new ConnectionString(
            String.format("mongodb+srv://%s@%s.mongocluster.cosmos.azure.com/?authMechanism=MONGODB-OIDC&tls=true&retrywrites=false&maxIdleTimeMS=120000",
                managedIdentityPrincipalId, clusterName)
        );

        var settings = MongoClientSettings.builder()
            .applyConnectionString(connectionString)
            .credential(credential)
            .build();

        return MongoClients.create(settings);
    }

    private OpenAIClient createOpenAIClient() {
        var endpoint = System.getenv("AZURE_OPENAI_EMBEDDING_ENDPOINT");
        var credential = new DefaultAzureCredentialBuilder().build();
        return new OpenAIClientBuilder()
            .endpoint(endpoint)
            .credential(credential)
            .buildClient();
    }

    private List<Map<String, Object>> loadHotelData() throws IOException {
        var dataFile = System.getenv("DATA_FILE_WITH_VECTORS");
        var filePath = Path.of(dataFile);
        System.out.println("Reading JSON file from " + filePath.toAbsolutePath());
        var jsonContent = Files.readString(filePath);
        return jsonMapper.readValue(jsonContent, new TypeReference<List<Map<String, Object>>>() {});
    }

    private void insertDataInBatches(MongoCollection<Document> collection, List<Map<String, Object>> hotelData) {
        var batchSizeStr = System.getenv("LOAD_SIZE_BATCH");
        var batchSize = batchSizeStr != null ? Integer.parseInt(batchSizeStr) : 100;
        var batches = partitionList(hotelData, batchSize);
        System.out.println("Processing in batches of " + batchSize + "...");
        for (int i = 0; i < batches.size(); i++) {
            var batch = batches.get(i);
            var documents = batch.stream()
                .map(Document::new)
                .toList();
            collection.insertMany(documents);
            System.out.println("Batch " + (i + 1) + " complete: " + documents.size() + " inserted");
        }
    }

    private void createStandardIndexes(MongoCollection<Document> collection) {
        collection.createIndex(Indexes.ascending("HotelId"));
        collection.createIndex(Indexes.ascending("Category"));
        collection.createIndex(Indexes.ascending("Description"));
        collection.createIndex(Indexes.ascending("Description_fr"));
    }

    private void createVectorIndex(MongoDatabase database) {
        var embeddedField = System.getenv("EMBEDDED_FIELD");
        var dimensionsStr = System.getenv("EMBEDDING_DIMENSIONS");
        var dimensions = dimensionsStr != null ? Integer.parseInt(dimensionsStr) : 1536;

        var indexDefinition = new Document()
            .append("createIndexes", COLLECTION_NAME)
            .append("indexes", List.of(
                new Document()
                    .append("name", VECTOR_INDEX_NAME)
                    .append("key", new Document(embeddedField, "cosmosSearch"))
                    .append("cosmosSearchOptions", new Document()
                        .append("kind", "vector-diskann")
                        .append("dimensions", dimensions)
                        .append("similarity", "COS")
                        .append("maxDegree", 20)
                        .append("lBuild", 10)
                    )
            ));

        database.runCommand(indexDefinition);
        System.out.println("Created vector index: " + VECTOR_INDEX_NAME);
    }

    private List<Double> createEmbedding(OpenAIClient openAIClient, String text) {
        var model = System.getenv("AZURE_OPENAI_EMBEDDING_MODEL");
        var options = new EmbeddingsOptions(List.of(text));
        var response = openAIClient.getEmbeddings(model, options);
        return response.getData().get(0).getEmbedding().stream()
            .map(Float::doubleValue)
            .toList();
    }

    private void performVectorSearch(MongoCollection<Document> collection, List<Double> queryEmbedding) {
        var embeddedField = System.getenv("EMBEDDED_FIELD");

        var searchStage = new Document("$search", new Document()
            .append("cosmosSearch", new Document()
                .append("vector", queryEmbedding)
                .append("path", embeddedField)
                .append("k", 5)
            )
        );

        var projectStage = new Document("$project", new Document()
            .append("score", new Document("$meta", "searchScore"))
            .append("document", "$$ROOT")
        );

        var pipeline = List.of(searchStage, projectStage);

        System.out.println("\nVector search results for: \"" + SAMPLE_QUERY + "\"");

        AggregateIterable<Document> results = collection.aggregate(pipeline);
        var rank = 1;
        for (var result : results) {
            var document = result.get("document", Document.class);
            var hotelName = document.getString("HotelName");
            var score = result.getDouble("score");
            System.out.printf("%d. HotelName: %s, Score: %.4f%n", rank++, hotelName, score);
        }
    }

    private static <T> List<List<T>> partitionList(List<T> list, int batchSize) {
        var partitions = new ArrayList<List<T>>();
        for (int i = 0; i < list.size(); i += batchSize) {
            partitions.add(list.subList(i, Math.min(i + batchSize, list.size())));
        }
        return partitions;
    }
}
```
This code performs the following tasks:
- Creates a passwordless connection to Azure DocumentDB using `DefaultAzureCredential` and the MongoDB OIDC mechanism
- Creates an Azure OpenAI client for generating embeddings
- Drops and recreates the collection, then loads hotel data from the JSON file in batches
- Creates standard indexes and a vector index with algorithm-specific options
- Generates an embedding for a sample query and runs an aggregation search pipeline
- Prints the top five matching hotels with similarity scores
Authenticate to Azure
Sign in to Azure before you run the application so it can access Azure resources securely.
Note
Ensure your signed-in identity has the required data plane roles on both the Azure DocumentDB cluster and the Azure OpenAI resource.
```bash
az login
```
Build the application
Compile the application:
```bash
mvn clean compile
```
Run DiskANN (Disk-based Approximate Nearest Neighbor) search:
```bash
mvn exec:java -Dexec.mainClass="com.azure.documentdb.samples.DiskAnn"
```
DiskANN is optimized for large datasets that don't fit in memory, efficient disk-based storage, and a good balance of speed and accuracy.
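Because the index is created with `similarity: "COS"`, the scores the search reports are cosine similarities between the query embedding and each hotel's `DescriptionVector`. A minimal sketch of the underlying math, for illustration only (not part of the sample app):

```java
public class CosineDemo {

    // Cosine similarity: dot(a, b) / (|a| * |b|), ranging over [-1, 1];
    // higher means the vectors point in more similar directions.
    public static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] query = {1, 0, 1};
        double[] doc = {1, 1, 0};
        System.out.printf("%.4f%n", cosine(query, doc)); // prints 0.5000
    }
}
```

The real embeddings have 1,536 dimensions rather than 3, but the score in the output below is computed the same way, which is why values closer to 1 indicate better matches.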
Example output:
```
Created collection: hotels_diskann
Reading JSON file from /workspaces/documentdb-samples/ai/vector-search-java/../data/Hotels_Vector.json
Processing in batches of 50...
Batch 1 complete: 50 inserted
Created vector index: vectorIndex_diskann

Vector search results for: "quintessential lodging near running trails, eateries, retail"
1. HotelName: Royal Cottage Resort, Score: 0.4991
2. HotelName: Country Comfort Inn, Score: 0.4786
3. HotelName: Nordick's Valley Motel, Score: 0.4635
4. HotelName: Economy Universe Motel, Score: 0.4462
5. HotelName: Roach Motel, Score: 0.4389
```
View and manage data in Visual Studio Code
Install the DocumentDB extension and Extension Pack for Java in Visual Studio Code.
Connect to your Azure DocumentDB account using the DocumentDB extension.
View the data and indexes in the Hotels database.
Clean up resources
Delete the resource group, Azure DocumentDB cluster, and Azure OpenAI resource when you no longer need them to avoid unnecessary costs.