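Quickstart: Vector search with Python in Azure DocumentDB

Use vector search in Azure DocumentDB with the Python client library to store and query vector data efficiently.

This quickstart uses a sample hotel dataset in a JSON file that contains precomputed vectors from the text-embedding-3-small model. The dataset includes hotel names, locations, descriptions, and vector embeddings.

Find the sample code on GitHub.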

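Prerequisites

An Azure subscription
- If you don't have an Azure subscription, create a free account

An existing Azure DocumentDB cluster
- If you don't have a cluster, create a new cluster
- Role-based access control (RBAC) enabled
- Firewall configured to allow access from your client IP address

An Azure OpenAI resource
- A custom domain configured
- Role-based access control (RBAC) enabled
- The text-embedding-3-small model deployed

Visual Studio Code
- The DocumentDB extension

Use the Bash environment in Azure Cloud Shell. For more information, see Get started with Azure Cloud Shell.
If you prefer to run CLI reference commands locally, install the Azure CLI. If you're running on Windows or macOS, consider running the Azure CLI in a Docker container. For more information, see How to run the Azure CLI in a Docker container.
- If you're using a local installation, sign in to the Azure CLI by using the az login command. To finish the authentication process, follow the steps displayed in your terminal. For other sign-in options, see Authenticate to Azure using Azure CLI.
- When you're prompted, install the Azure CLI extension on first use. For more information about extensions, see Use and manage extensions with the Azure CLI.
- Run az version to find the installed version and dependent libraries. To upgrade to the latest version, run az upgrade.

Python 3.9 or later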

Create a data file with vectors

Create a new data directory for the hotel data file:
```
mkdir data
```
Copy the Hotels_Vector.json raw data file, which contains the vectors, to the data directory.
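
Before loading the file into the cluster, it can help to confirm that every record carries a vector of the expected length. The following is a minimal sketch and not part of the sample repository: `summarize_vectors` is a hypothetical helper, and the field name `DescriptionVector` and the dimension 1536 are assumptions taken from the .env defaults used later in this quickstart.

```python
import json
import os
from typing import Any, Dict, List


def summarize_vectors(records: List[Dict[str, Any]],
                      field: str = "DescriptionVector",
                      dimensions: int = 1536) -> Dict[str, int]:
    """Count records whose vector field is present and has the expected length."""
    summary = {"total": len(records), "ok": 0, "missing": 0, "wrong_size": 0}
    for record in records:
        vector = record.get(field)
        if vector is None:
            summary["missing"] += 1
        elif len(vector) != dimensions:
            summary["wrong_size"] += 1
        else:
            summary["ok"] += 1
    return summary


if __name__ == "__main__":
    # Assumed location: the data directory created in the step above.
    path = os.path.join("data", "Hotels_Vector.json")
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as handle:
            print(summarize_vectors(json.load(handle)))
    else:
        print(f"{path} not found; copy the data file first")
```

If the `missing` or `wrong_size` counts are nonzero, the index dimension configured later would probably not match the data.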

Create a Python project

Create a new directory for the project and open it in Visual Studio Code:
```
mkdir vector-search-quickstart
code vector-search-quickstart
```

In the terminal, create and activate a virtual environment:

For Windows:

python -m venv venv
venv\Scripts\activate

For macOS/Linux:

python -m venv venv
source venv/bin/activate

Install the required packages:
```
pip install pymongo azure-identity openai python-dotenv
```
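- pymongo: MongoDB driver for Python
- azure-identity: Azure Identity library for passwordless authentication
- openai: OpenAI client library for creating vectors
- python-dotenv: Environment variable management from .env files

Create a .env file in vector-search-quickstart for the environment variables: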
```
# Identity for local developer authentication with Azure CLI
AZURE_TOKEN_CREDENTIALS=AzureCliCredential

# Azure OpenAI configuration
AZURE_OPENAI_EMBEDDING_ENDPOINT= 
AZURE_OPENAI_EMBEDDING_MODEL=text-embedding-3-small
AZURE_OPENAI_EMBEDDING_API_VERSION=2023-05-15

# Azure DocumentDB configuration
MONGO_CLUSTER_NAME=

# Data Configuration (defaults should work)
DATA_FILE_WITH_VECTORS=../data/Hotels_Vector.json
EMBEDDED_FIELD=DescriptionVector
EMBEDDING_DIMENSIONS=1536
EMBEDDING_SIZE_BATCH=16
LOAD_SIZE_BATCH=50
```
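For the passwordless authentication used in this article, replace the placeholder values in the .env file with your own information:
- AZURE_OPENAI_EMBEDDING_ENDPOINT: The URL of your Azure OpenAI resource endpoint
- MONGO_CLUSTER_NAME: The name of your Azure DocumentDB resource
Always prefer passwordless authentication, although it requires additional setup. For more information about setting up managed identity and the full range of authentication options, see Authenticate Python apps to Azure services by using the Azure SDK for Python.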
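
A quick check that the two placeholder values were actually filled in can save a confusing connection error later. The sketch below is illustrative only: `find_missing_settings` is a hypothetical helper, and it inspects whatever mapping you pass it, so call `load_dotenv()` from python-dotenv first if you want the .env values loaded into `os.environ`.

```python
import os
from typing import List, Mapping, Sequence

# The two settings this quickstart asks you to fill in.
REQUIRED_SETTINGS = ("AZURE_OPENAI_EMBEDDING_ENDPOINT", "MONGO_CLUSTER_NAME")


def find_missing_settings(env: Mapping[str, str],
                          required: Sequence[str] = REQUIRED_SETTINGS) -> List[str]:
    """Return the names of required settings that are unset or blank."""
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = find_missing_settings(os.environ)
    if missing:
        print("Missing settings: " + ", ".join(missing))
    else:
        print("All required settings are present")
```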

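Create code files for vector search

Continue the project by creating the code files for vector search. When you're done, the project structure should look like this: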

├── data/
│   ├── Hotels.json              # Source hotel data (without vectors)
│   └── Hotels_Vector.json       # Hotel data with vector embeddings
└── vector-search-quickstart/
    ├── src/
    │   ├── diskann.py           # DiskANN vector search implementation
    │   ├── hnsw.py              # HNSW vector search implementation
    │   ├── ivf.py               # IVF vector search implementation
    │   └── utils.py             # Shared utility functions
    ├── requirements.txt         # Python dependencies
    └── .env                     # Environment variables template

To use a DiskANN index, create a src directory for the Python files and add two files, diskann.py and utils.py:

mkdir src    
touch src/diskann.py
touch src/utils.py

To use an IVF index, create a src directory for the Python files and add two files, ivf.py and utils.py:

mkdir src
touch src/ivf.py
touch src/utils.py

To use an HNSW index, create a src directory for the Python files and add two files, hnsw.py and utils.py:

mkdir src
touch src/hnsw.py
touch src/utils.py

Create code for vector search

Paste the following code into the diskann.py file.

import os
from typing import List, Dict, Any
from utils import get_clients, get_clients_passwordless, read_file_return_json, insert_data, print_search_results, drop_vector_indexes
from dotenv import load_dotenv

# Load environment variables
load_dotenv()


def create_diskann_vector_index(collection, vector_field: str, dimensions: int) -> None:

    print(f"Creating DiskANN vector index on field '{vector_field}'...")

    # Drop any existing vector indexes on this field first
    drop_vector_indexes(collection, vector_field)

    # Use the native MongoDB command for DocumentDB vector indexes
    index_command = {
        "createIndexes": collection.name,
        "indexes": [
            {
                "name": f"diskann_index_{vector_field}",
                "key": {
                    vector_field: "cosmosSearch"  # DocumentDB vector search index type
                },
                "cosmosSearchOptions": {
                    # DiskANN algorithm configuration
                    "kind": "vector-diskann",

                    # Vector dimensions must match the embedding model
                    "dimensions": dimensions,

                    # Vector similarity metric - cosine is good for text embeddings
                    "similarity": "COS",

                    # Maximum degree: number of edges per node in the graph
                    # Higher values improve accuracy but increase memory usage
                    "maxDegree": 20,

                    # Build parameter: candidates evaluated during index construction
                    # Higher values improve index quality but increase build time
                    "lBuild": 10
                }
            }
        ]
    }

    try:
        # Execute the createIndexes command directly
        result = collection.database.command(index_command)
        print("DiskANN vector index created successfully")

    except Exception as e:
        print(f"Error creating DiskANN vector index: {e}")

        # Check if it's a tier limitation and suggest alternatives
        if "not enabled for this cluster tier" in str(e):
            print("\nDiskANN indexes require a higher cluster tier.")
            print("Try one of these alternatives:")
            print("  • Upgrade your DocumentDB cluster to a higher tier")
            print("  • Use HNSW instead: python src/hnsw.py")
            print("  • Use IVF instead: python src/ivf.py")
        raise


def perform_diskann_vector_search(collection,
                                 azure_openai_client,
                                 query_text: str,
                                 vector_field: str,
                                 model_name: str,
                                 top_k: int = 5) -> List[Dict[str, Any]]:

    print(f"Performing DiskANN vector search for: '{query_text}'")

    try:
        # Generate embedding for the query text
        embedding_response = azure_openai_client.embeddings.create(
            input=[query_text],
            model=model_name
        )

        query_embedding = embedding_response.data[0].embedding

        # Construct the aggregation pipeline for vector search
        # DocumentDB uses $search with cosmosSearch
        pipeline = [
            {
                "$search": {
                    # Use cosmosSearch for vector operations in DocumentDB
                    "cosmosSearch": {
                        # The query vector to search for
                        "vector": query_embedding,

                        # Field containing the document vectors to compare against
                        "path": vector_field,

                        # Number of final results to return
                        "k": top_k
                    }
                }
            },
            {
                # Add similarity score to the results
                "$project": {
                    "document": "$$ROOT",
                    # Add search score from metadata
                    "score": {"$meta": "searchScore"}
                }
            }
        ]

        # Execute the aggregation pipeline
        results = list(collection.aggregate(pipeline))

        return results

    except Exception as e:
        print(f"Error performing DiskANN vector search: {e}")
        raise


def main():

    # Load configuration from environment variables
    config = {
        'cluster_name': os.getenv('MONGO_CLUSTER_NAME'),
        'database_name': 'Hotels',
        'collection_name': 'hotels_diskann',
        'data_file': os.getenv('DATA_FILE_WITH_VECTORS', '../data/Hotels_Vector.json'),
        'vector_field': os.getenv('EMBEDDED_FIELD', 'DescriptionVector'),
        'model_name': os.getenv('AZURE_OPENAI_EMBEDDING_MODEL', 'text-embedding-3-small'),
        'dimensions': int(os.getenv('EMBEDDING_DIMENSIONS', '1536')),
        'batch_size': int(os.getenv('LOAD_SIZE_BATCH', '100'))
    }

    try:
        # Initialize clients
        print("\nInitializing MongoDB and Azure OpenAI clients...")
        mongo_client, azure_openai_client = get_clients_passwordless()

        # Get database and collection
        database = mongo_client[config['database_name']]
        collection = database[config['collection_name']]

        # Load data with embeddings
        print(f"\nLoading data from {config['data_file']}...")
        data = read_file_return_json(config['data_file'])
        print(f"Loaded {len(data)} documents")

        # Verify embeddings are present
        documents_with_embeddings = [doc for doc in data if config['vector_field'] in doc]
        if not documents_with_embeddings:
            raise ValueError(f"No documents found with embeddings in field '{config['vector_field']}'. "
                           "Please run create_embeddings.py first.")

        # Insert data into collection
        print(f"\nInserting data into collection '{config['collection_name']}'...")

        # Insert the hotel data
        stats = insert_data(
            collection,
            documents_with_embeddings,
            batch_size=config['batch_size']
        )

        if stats['inserted'] == 0 and not stats.get('skipped'):
            raise ValueError("No documents were inserted successfully")

        # Create DiskANN vector index (skip if data was already present)
        if not stats.get('skipped'):
            create_diskann_vector_index(
                collection,
                config['vector_field'],
                config['dimensions']
            )

            # Wait briefly for index to be ready
            import time
            print("Waiting for index to be ready...")
            time.sleep(2)

        # Perform sample vector search
        query = "quintessential lodging near running trails, eateries, retail"

        results = perform_diskann_vector_search(
            collection,
            azure_openai_client,
            query,
            config['vector_field'],
            config['model_name'],
            top_k=5
        )

        # Display results
        print_search_results(results, max_results=5, show_score=True)


    except Exception as e:
        print(f"\nError during DiskANN demonstration: {e}")
        raise

    finally:
        # Close the MongoDB client
        if 'mongo_client' in locals():
            mongo_client.close()


if __name__ == "__main__":
    main()

Paste the following code into the ivf.py file.

import os
from typing import List, Dict, Any
from utils import get_clients, get_clients_passwordless, read_file_return_json, insert_data, print_search_results, drop_vector_indexes
from dotenv import load_dotenv

# Load environment variables
load_dotenv()


def create_ivf_vector_index(collection, vector_field: str, dimensions: int) -> None:

    print(f"Creating IVF vector index on field '{vector_field}'...")

    # Drop any existing vector indexes on this field first
    drop_vector_indexes(collection, vector_field)

    # Use the native MongoDB command for DocumentDB vector indexes
    index_command = {
        "createIndexes": collection.name,
        "indexes": [
            {
                "name": f"ivf_index_{vector_field}",
                "key": {
                    vector_field: "cosmosSearch"  # DocumentDB vector search index type
                },
                "cosmosSearchOptions": {
                    # IVF algorithm configuration
                    "kind": "vector-ivf",

                    # Vector dimensions must match the embedding model
                    "dimensions": dimensions,

                    # Cosine similarity is effective for text embeddings
                    "similarity": "COS",

                    # Number of clusters (centroids) to partition vectors into
                    # More clusters = faster search but potentially lower recall
                    # For small datasets like this, use fewer clusters
                    "numLists": 10
                }
            }
        ]
    }

    try:
        # Execute the createIndexes command directly
        result = collection.database.command(index_command)
        print("IVF vector index created successfully")

    except Exception as e:
        print(f"Error creating IVF vector index: {e}")
        raise


def perform_ivf_vector_search(collection,
                             azure_openai_client,
                             query_text: str,
                             vector_field: str,
                             model_name: str,
                             top_k: int = 5,
                             num_probes: int = 1) -> List[Dict[str, Any]]:

    print(f"Performing IVF vector search for: '{query_text}'")

    try:
        # Generate embedding vector for the search query
        embedding_response = azure_openai_client.embeddings.create(
            input=[query_text],
            model=model_name
        )

        query_embedding = embedding_response.data[0].embedding

        # Construct aggregation pipeline for IVF vector search
        pipeline = [
            {
                "$search": {
                    # Use cosmosSearch for vector operations in DocumentDB
                    "cosmosSearch": {
                        # Query vector to find similar documents
                        "vector": query_embedding,

                        # Document field containing vectors to search against
                        "path": vector_field,

                        # Final number of results to return
                        "k": top_k
                    }
                }
            },
            {
                # Return the full document and add the similarity score
                "$project": {
                    "document": "$$ROOT",
                    # Add search score from metadata
                    "score": {"$meta": "searchScore"}
                }
            }
        ]

        # Run the search aggregation pipeline
        results = list(collection.aggregate(pipeline))

        return results

    except Exception as e:
        print(f"Error performing IVF vector search: {e}")
        raise


def main():

    print("Starting IVF vector search demonstration...")

    # Load configuration from environment variables
    config = {
        'cluster_name': os.getenv('MONGO_CLUSTER_NAME'),
        'database_name': 'Hotels',
        'collection_name': 'hotels_ivf',
        'data_file': os.getenv('DATA_FILE_WITH_VECTORS', '../data/Hotels_Vector.json'),
        'vector_field': os.getenv('EMBEDDED_FIELD', 'DescriptionVector'),
        'model_name': os.getenv('AZURE_OPENAI_EMBEDDING_MODEL', 'text-embedding-3-small'),
        'dimensions': int(os.getenv('EMBEDDING_DIMENSIONS', '1536')),
        'batch_size': int(os.getenv('LOAD_SIZE_BATCH', '100'))
    }

    try:
        # Initialize database and AI service clients
        print("\nInitializing clients...")
        mongo_client, azure_openai_client = get_clients_passwordless()

        # Connect to database and collection
        database = mongo_client[config['database_name']]
        collection = database[config['collection_name']]

        # Load hotel data with embeddings
        print(f"\nLoading data from {config['data_file']}...")
        data = read_file_return_json(config['data_file'])
        print(f"Loaded {len(data)} documents")

        # Verify embeddings exist in the data
        documents_with_embeddings = [doc for doc in data if config['vector_field'] in doc]
        if not documents_with_embeddings:
            raise ValueError(f"No documents found with embeddings in field '{config['vector_field']}'. "
                           "Please run create_embeddings.py first.")

        # Prepare collection with fresh data
        print(f"\nPreparing collection '{config['collection_name']}'...")

        # Insert hotel data with embeddings
        stats = insert_data(
            collection,
            documents_with_embeddings,
            batch_size=config['batch_size']
        )

        if stats['inserted'] == 0 and not stats.get('skipped'):
            raise ValueError("No documents were inserted successfully")

        # Create IVF vector index (skip if data was already present)
        if not stats.get('skipped'):
            print("\nCreating IVF vector index...")
            create_ivf_vector_index(
                collection,
                config['vector_field'],
                config['dimensions']
            )

            # Wait for index to be built and ready
            import time
            print("Waiting for index clustering to complete...")
            time.sleep(3)  # IVF may need more time for clustering

        # Demonstrate IVF search 
        query = "quintessential lodging near running trails, eateries, retail"

        results = perform_ivf_vector_search(
            collection,
            azure_openai_client,
            query,
            config['vector_field'],
            config['model_name'],
            top_k=5
        )

        # Display the search results
        print_search_results(results)

    except Exception as e:
        print(f"\nError during IVF demonstration: {e}")
        raise

    finally:
        # Ensure MongoDB connection is properly closed
        if 'mongo_client' in locals():
            mongo_client.close()


if __name__ == "__main__":
    main()

Paste the following code into the hnsw.py file.

import os
from typing import List, Dict, Any
from utils import get_clients, get_clients_passwordless, read_file_return_json, insert_data, print_search_results, drop_vector_indexes
from dotenv import load_dotenv

# Load environment variables
load_dotenv()


def create_hnsw_vector_index(collection, vector_field: str, dimensions: int) -> None:

    print(f"Creating HNSW vector index on field '{vector_field}'...")

    # Drop any existing vector indexes on this field first
    drop_vector_indexes(collection, vector_field)

    # Use the native MongoDB command for DocumentDB vector indexes
    index_command = {
        "createIndexes": collection.name,
        "indexes": [
            {
                "name": f"hnsw_index_{vector_field}",
                "key": {
                    vector_field: "cosmosSearch"  # DocumentDB vector search index type
                },
                "cosmosSearchOptions": {
                    # HNSW algorithm configuration
                    "kind": "vector-hnsw",

                    # Vector dimensions must match the embedding model
                    "dimensions": dimensions,

                    # Cosine similarity works well with text embeddings
                    "similarity": "COS",

                    # Maximum connections per node in the graph (parameter 'm')
                    # Higher values improve recall but increase memory usage and build time
                    "m": 16,

                    # Size of the candidate list during construction
                    # Higher values improve index quality but slow down building
                    "efConstruction": 64
                }
            }
        ]
    }

    try:
        # Execute the createIndexes command directly
        result = collection.database.command(index_command)
        print("HNSW vector index created successfully")

    except Exception as e:
        print(f"Error creating HNSW vector index: {e}")
        raise


def perform_hnsw_vector_search(collection,
                              azure_openai_client,
                              query_text: str,
                              vector_field: str,
                              model_name: str,
                              top_k: int = 5,
                              ef_search: int = 16) -> List[Dict[str, Any]]:

    print(f"Performing HNSW vector search for: '{query_text}'")

    try:
        # Convert query text to embedding vector
        embedding_response = azure_openai_client.embeddings.create(
            input=[query_text],
            model=model_name
        )

        query_embedding = embedding_response.data[0].embedding

        # Build aggregation pipeline for HNSW vector search
        pipeline = [
            {
                "$search": {
                    # Use cosmosSearch for vector operations in DocumentDB
                    "cosmosSearch": {
                        # Query vector to find similar documents for
                        "vector": query_embedding,

                        # Field in documents containing vectors to compare against
                        "path": vector_field,

                        # Maximum number of results to return
                        "k": top_k
                    }
                }
            },
            {
                # Return the full document and add the similarity score
                "$project": {
                    "document": "$$ROOT",
                    # Add search score from metadata
                    "score": {"$meta": "searchScore"}
                }
            }
        ]

        # Execute the search pipeline
        results = list(collection.aggregate(pipeline))

        return results

    except Exception as e:
        print(f"Error performing HNSW vector search: {e}")
        raise


def main():

    print("Starting HNSW vector search demonstration...")

    # Load configuration from environment variables
    config = {
        'cluster_name': os.getenv('MONGO_CLUSTER_NAME'),
        'database_name': 'Hotels',
        'collection_name': 'hotels_hnsw',
        'data_file': os.getenv('DATA_FILE_WITH_VECTORS', '../data/Hotels_Vector.json'),
        'vector_field': os.getenv('EMBEDDED_FIELD', 'DescriptionVector'),
        'model_name': os.getenv('AZURE_OPENAI_EMBEDDING_MODEL', 'text-embedding-3-small'),
        'dimensions': int(os.getenv('EMBEDDING_DIMENSIONS', '1536')),
        'batch_size': int(os.getenv('LOAD_SIZE_BATCH', '100'))
    }

    try:
        # Initialize MongoDB and Azure OpenAI clients
        print("\nInitializing clients...")
        mongo_client, azure_openai_client = get_clients_passwordless()

        # Access database and collection
        database = mongo_client[config['database_name']]
        collection = database[config['collection_name']]

        # Load hotel data with embeddings
        print(f"\nLoading data from {config['data_file']}...")
        data = read_file_return_json(config['data_file'])
        print(f"Loaded {len(data)} documents")

        # Verify that embeddings are present in the data
        documents_with_embeddings = [doc for doc in data if config['vector_field'] in doc]
        if not documents_with_embeddings:
            raise ValueError(f"No documents found with embeddings in field '{config['vector_field']}'. "
                           "Please run create_embeddings.py first.")

        # Insert data into MongoDB collection
        print(f"\nPreparing collection '{config['collection_name']}'...")

        # Insert hotel data with embeddings
        stats = insert_data(
            collection,
            documents_with_embeddings,
            batch_size=config['batch_size']
        )

        if stats['inserted'] == 0 and not stats.get('skipped'):
            raise ValueError("No documents were inserted successfully")

        # Create HNSW vector index (skip if data was already present)
        if not stats.get('skipped'):
            print("\nCreating HNSW vector index...")
            create_hnsw_vector_index(
                collection,
                config['vector_field'],
                config['dimensions']
            )

            # Allow time for index to become ready
            import time
            print("Waiting for index to be ready...")
            time.sleep(2)

        # Demonstrate HNSW search with a sample query
        query = "quintessential lodging near running trails, eateries, retail"

        results = perform_hnsw_vector_search(
            collection,
            azure_openai_client,
            query,
            config['vector_field'],
            config['model_name'],
            top_k=5,
            ef_search=16
        )

        # Display the search results
        print_search_results(results, max_results=5, show_score=True)


    except Exception as e:
        print(f"\nError during HNSW demonstration: {e}")
        raise

    finally:
        # Clean up MongoDB connection
        if 'mongo_client' in locals():
            mongo_client.close()


if __name__ == "__main__":
    main()
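
The three listings above build the same two-stage aggregation pipeline, and the `num_probes` and `ef_search` parameters accepted by the IVF and HNSW search functions are not passed into it. As a sketch of how the shared logic could be factored out, the hypothetical helper below (not part of the sample) builds that pipeline once. The `extra_options` argument is an assumption added for illustration; check the service documentation for the search options your index type supports before passing anything through it.

```python
from typing import Any, Dict, List, Optional


def build_vector_search_pipeline(query_embedding: List[float],
                                 vector_field: str,
                                 top_k: int = 5,
                                 extra_options: Optional[Dict[str, Any]] = None
                                 ) -> List[Dict[str, Any]]:
    """Build the $search + $project pipeline shared by the three listings."""
    cosmos_search: Dict[str, Any] = {
        "vector": query_embedding,  # query vector to compare against
        "path": vector_field,       # document field that holds the vectors
        "k": top_k,                 # number of results to return
    }
    # Hypothetical hook for index-specific search settings;
    # the listings above do not pass any.
    if extra_options:
        cosmos_search.update(extra_options)

    return [
        {"$search": {"cosmosSearch": cosmos_search}},
        {"$project": {"document": "$$ROOT", "score": {"$meta": "searchScore"}}},
    ]
```

With a helper like this, each `perform_*_vector_search` function would reduce to creating the query embedding and calling `collection.aggregate(build_vector_search_pipeline(...))`.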

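Each main module provides the following functionality:

- Imports the utility functions
- Creates a configuration object from environment variables
- Creates clients for Azure OpenAI and Azure DocumentDB
- Connects to MongoDB, creates a database and collection, inserts data, and creates standard indexes
- Creates a vector index by using IVF, HNSW, or DiskANN
- Creates an embedding for the sample query text by using the OpenAI client. You can change the query in the main function
- Runs a vector search with the embedding and prints the results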
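
IVF, HNSW, and DiskANN are approximate nearest-neighbor indexes, and the listings configure them with cosine similarity (`"similarity": "COS"`). For intuition about what the search ranks, the exact brute-force equivalent can be sketched in plain Python. The helper names and the toy two-dimensional vectors are made up for illustration and are not part of the sample.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_top_k(query: Sequence[float],
                      vectors: Dict[str, Sequence[float]],
                      k: int = 5) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    for name, score in brute_force_top_k([1.0, 0.0], toy_vectors, k=2):
        print(name, round(score, 4))
```

A brute-force scan compares the query with every document, which is why an index is used once the collection grows; the index trades a small amount of recall for much less work per query.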

Create utility functions

Paste the following code into utils.py:

import json
import os
import time
import warnings
from typing import Dict, List, Any, Optional, Tuple

# Suppress the PyMongo CosmosDB cluster detection warning
# Must be set before importing pymongo
warnings.filterwarnings(
    "ignore",
    message="You appear to be connected to a CosmosDB cluster.*",
)

from pymongo import MongoClient, InsertOne
from pymongo.collection import Collection
from pymongo.errors import BulkWriteError
from azure.identity import DefaultAzureCredential
from pymongo.auth_oidc import OIDCCallback, OIDCCallbackContext, OIDCCallbackResult
from openai import AzureOpenAI
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

class AzureIdentityTokenCallback(OIDCCallback):
    def __init__(self, credential):
        self.credential = credential

    def fetch(self, context: OIDCCallbackContext) -> OIDCCallbackResult:
        token = self.credential.get_token(
            "https://ossrdbms-aad.database.windows.net/.default").token
        return OIDCCallbackResult(access_token=token)

def get_clients() -> Tuple[MongoClient, AzureOpenAI]:

    # Get MongoDB connection string - required for DocumentDB access
    mongo_connection_string = os.getenv("MONGO_CONNECTION_STRING")
    if not mongo_connection_string:
        raise ValueError("MONGO_CONNECTION_STRING environment variable is required")

    # Create MongoDB client with optimized settings for DocumentDB
    mongo_client = MongoClient(
        mongo_connection_string,
        maxPoolSize=50,  # Allow up to 50 connections for better performance
        minPoolSize=5,   # Keep minimum 5 connections open
        maxIdleTimeMS=30000,  # Close idle connections after 30 seconds
        serverSelectionTimeoutMS=5000,  # 5 second timeout for server selection
        socketTimeoutMS=20000  # 20 second socket timeout
    )

    # Get Azure OpenAI configuration
    azure_openai_endpoint = os.getenv("AZURE_OPENAI_EMBEDDING_ENDPOINT")
    azure_openai_key = os.getenv("AZURE_OPENAI_EMBEDDING_KEY")

    if not azure_openai_endpoint or not azure_openai_key:
        raise ValueError("Azure OpenAI endpoint and key are required")

    # Create Azure OpenAI client for generating embeddings
    azure_openai_client = AzureOpenAI(
        azure_endpoint=azure_openai_endpoint,
        api_key=azure_openai_key,
        api_version=os.getenv("AZURE_OPENAI_EMBEDDING_API_VERSION", "2023-05-15")
    )

    return mongo_client, azure_openai_client


def get_clients_passwordless() -> Tuple[MongoClient, AzureOpenAI]:

    # Get MongoDB cluster name for passwordless authentication
    cluster_name = os.getenv("MONGO_CLUSTER_NAME")
    if not cluster_name:
        raise ValueError("MONGO_CLUSTER_NAME environment variable is required")

    # Create credential object for Azure authentication
    credential = DefaultAzureCredential()

    authProperties = {"OIDC_CALLBACK": AzureIdentityTokenCallback(credential)}

    # Create MongoDB client with Azure AD token callback
    mongo_client = MongoClient(
        f"mongodb+srv://{cluster_name}.global.mongocluster.cosmos.azure.com/",
        connectTimeoutMS=120000,
        tls=True,
        retryWrites=True,
        authMechanism="MONGODB-OIDC",
        authMechanismProperties=authProperties
    )

    # Get Azure OpenAI endpoint
    azure_openai_endpoint = os.getenv("AZURE_OPENAI_EMBEDDING_ENDPOINT")
    if not azure_openai_endpoint:
        raise ValueError("AZURE_OPENAI_EMBEDDING_ENDPOINT environment variable is required")

    # Create Azure OpenAI client with credential-based authentication
    azure_openai_client = AzureOpenAI(
        azure_endpoint=azure_openai_endpoint,
        azure_ad_token_provider=lambda: credential.get_token("https://cognitiveservices.azure.com/.default").token,
        api_version=os.getenv("AZURE_OPENAI_EMBEDDING_API_VERSION", "2023-05-15")
    )

    return mongo_client, azure_openai_client


def azure_identity_token_callback(credential: DefaultAzureCredential) -> str:

    # DocumentDB requires this specific scope
    token_scope = "https://cosmos.azure.com/.default"

    # Get token from Azure AD
    token = credential.get_token(token_scope)

    return token.token


def read_file_return_json(file_path: str) -> List[Dict[str, Any]]:

    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            return json.load(file)
    except FileNotFoundError:
        print(f"Error: File '{file_path}' not found")
        raise
    except json.JSONDecodeError as e:
        print(f"Error: Invalid JSON in file '{file_path}': {e}")
        raise


def write_file_json(data: List[Dict[str, Any]], file_path: str) -> None:

    try:
        with open(file_path, 'w', encoding='utf-8') as file:
            json.dump(data, file, indent=2, ensure_ascii=False)
        print(f"Data successfully written to '{file_path}'")
    except IOError as e:
        print(f"Error writing to file '{file_path}': {e}")
        raise


def insert_data(collection: Collection, data: List[Dict[str, Any]],
                batch_size: int = 100, index_fields: Optional[List[str]] = None) -> Dict[str, int]:

    total_documents = len(data)

    # Check if data already exists in the collection
    existing_count = collection.count_documents({})
    if existing_count >= total_documents:
        print(f"Collection already has {existing_count} documents, skipping insert and index creation")
        return {'total': total_documents, 'inserted': 0, 'failed': 0, 'skipped': True}

    # Clear existing data if counts don't match to ensure clean state
    if existing_count > 0:
        print(f"Collection has {existing_count} documents but expected {total_documents}, clearing and re-inserting...")
        collection.delete_many({})

    inserted_count = 0
    failed_count = 0

    print(f"Starting batch insertion of {total_documents} documents...")

    # Create indexes if specified
    if index_fields:
        for field in index_fields:
            try:
                collection.create_index(field)
                print(f"Created index on field: {field}")
            except Exception as e:
                print(f"Warning: Could not create index on {field}: {e}")

    # Process data in batches to manage memory and error recovery
    for i in range(0, total_documents, batch_size):
        batch = data[i:i + batch_size]
        batch_num = (i // batch_size) + 1
        total_batches = (total_documents + batch_size - 1) // batch_size

        try:
            # Prepare bulk insert operations
            operations = [InsertOne(document) for document in batch]

            # Execute bulk insert
            result = collection.bulk_write(operations, ordered=False)
            inserted_count += result.inserted_count

            print(f"Batch {batch_num} completed: {result.inserted_count} documents inserted")

        except BulkWriteError as e:
            # Handle partial failures in bulk operations
            batch_inserted = e.details.get('nInserted', 0)
            batch_failed = len(batch) - batch_inserted
            inserted_count += batch_inserted
            failed_count += batch_failed

            # Report per-batch counts rather than the running failure total
            print(f"Batch {batch_num} had errors: {batch_inserted} inserted, "
                  f"{batch_failed} failed")

            # Print specific error details for debugging
            for error in e.details.get('writeErrors', []):
                print(f"  Error: {error.get('errmsg', 'Unknown error')}")

        except Exception as e:
            # Handle unexpected errors
            failed_count += len(batch)
            print(f"Batch {batch_num} failed completely: {e}")

        # Small delay between batches to avoid overwhelming the database
        time.sleep(0.1)

    # Return summary statistics
    stats = {
        'total': total_documents,
        'inserted': inserted_count,
        'failed': failed_count
    }

    return stats


def drop_vector_indexes(collection, vector_field: str) -> None:
    """Drop every index whose key on vector_field has the value 'cosmosSearch'."""

    try:
        # Get all indexes for the collection
        indexes = list(collection.list_indexes())

        # Find vector indexes on the specified field
        vector_indexes = []
        for index in indexes:
            if 'key' in index and vector_field in index['key']:
                if index['key'][vector_field] == 'cosmosSearch':
                    vector_indexes.append(index['name'])

        # Drop each vector index found
        for index_name in vector_indexes:
            print(f"Dropping existing vector index: {index_name}")
            collection.drop_index(index_name)

        if vector_indexes:
            print(f"Dropped {len(vector_indexes)} existing vector index(es)")
        else:
            print("No existing vector indexes found to drop")

    except Exception as e:
        print(f"Warning: Could not drop existing vector indexes: {e}")
        # Continue anyway - the error might be that no indexes exist


def print_search_resultsx(results: List[Dict[str, Any]],
                        max_results: int = 5,
                        show_score: bool = True) -> None:

    if not results:
        print("No search results found.")
        return

    print(f"\nSearch Results (showing top {min(len(results), max_results)}):")
    print("=" * 80)

    for i, result in enumerate(results[:max_results], 1):

        # Display hotel name and ID
        print(f"HotelName: {result['HotelName']}, Score: {result['score']:.4f}")

def print_search_results(results: List[Dict[str, Any]],
                        max_results: int = 5,
                        show_score: bool = True) -> None:
    """Print up to max_results vector search results."""

    if not results:
        print("No search results found.")
        return

    print(f"\nSearch Results (showing top {min(len(results), max_results)}):")
    print("=" * 80)

    for i, result in enumerate(results[:max_results], 1):

        # Check if results are nested under 'document' (when using $$ROOT)
        if 'document' in result:
            doc = result['document']
        else:
            doc = result

        # Display the hotel name and, when show_score is set, the similarity score
        line = f"HotelName: {doc['HotelName']}"
        if show_score:
            line += f", Score: {result['score']:.4f}"
        print(line)


    if len(results) > max_results:
        print(f"\n... and {len(results) - max_results} more results")

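This utility module provides the following functions:

- get_clients: Creates and returns clients for Azure OpenAI and Azure DocumentDB
- get_clients_passwordless: Creates and returns clients for Azure OpenAI and Azure DocumentDB by using passwordless authentication
- azure_identity_token_callback: Gets the Azure AD token used for MongoDB OIDC authentication
- read_file_return_json: Reads a JSON file and returns its contents as an array of objects
- write_file_json: Writes an array of objects to a JSON file
- insert_data: Inserts data into a MongoDB collection in batches and creates standard indexes on the specified fields
- drop_vector_indexes: Drops existing vector indexes on the target vector field
- print_search_results: Prints vector search results, including the score and hotel name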
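
The batching that insert_data performs can be illustrated without a database connection. The following is a minimal sketch, assuming a hypothetical helper named iter_batches that is not part of the sample code. It reproduces only the slicing and ceiling-division arithmetic from the batch loop, and the documents are placeholders rather than the real hotel data.

```python
from typing import Any, Dict, Iterator, List, Tuple


def iter_batches(data: List[Dict[str, Any]],
                 batch_size: int = 100) -> Iterator[Tuple[int, int, List[Dict[str, Any]]]]:
    """Yield (batch_num, total_batches, batch) tuples.

    Hypothetical helper for illustration; it mirrors the loop in insert_data.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")

    # Ceiling division: 250 documents with batch_size=100 gives 3 batches
    total_batches = (len(data) + batch_size - 1) // batch_size

    for start in range(0, len(data), batch_size):
        batch_num = (start // batch_size) + 1
        yield batch_num, total_batches, data[start:start + batch_size]


if __name__ == "__main__":
    # Placeholder documents, not the real hotel data
    documents = [{"HotelId": str(n)} for n in range(250)]
    for batch_num, total_batches, batch in iter_batches(documents):
        print(f"Batch {batch_num}/{total_batches}: {len(batch)} documents")
```

Because each batch is written independently, a failure in one batch leaves the other batches unaffected, which is why the function tracks inserted and failed counts separately.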

Authenticate with Azure CLI

Sign in to Azure CLI before you run the application so that it can securely access your Azure resources.

az login

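The code uses local developer authentication to access Azure DocumentDB and Azure OpenAI. Setting AZURE_TOKEN_CREDENTIALS=AzureCliCredential tells the function to authenticate deterministically with Azure CLI credentials. Authentication relies on DefaultAzureCredential from azure-identity to find Azure credentials in the environment. Learn more about how to authenticate Python apps to Azure services by using the Azure Identity library.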
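
Before running the scripts, it can help to confirm that the local environment matches this setup. The following is a minimal sketch that uses only the Python standard library. The helper name preflight_problems is hypothetical and not part of the sample code. It checks the environment variable and whether the az executable is on PATH; it does not verify that az login succeeded.

```python
import os
import shutil
from typing import Callable, List, Mapping, Optional


def preflight_problems(env: Mapping[str, str],
                       which: Callable[[str], Optional[str]] = shutil.which) -> List[str]:
    """Return a list of human-readable problems; an empty list means the checks passed.

    Hypothetical helper for illustration; it is not part of the sample code.
    """
    problems = []

    # The quickstart's .env file sets this variable to AzureCliCredential
    if env.get("AZURE_TOKEN_CREDENTIALS") != "AzureCliCredential":
        problems.append("AZURE_TOKEN_CREDENTIALS is not set to AzureCliCredential")

    # Azure CLI credentials need the az executable; sign-in state is not checked here
    if which("az") is None:
        problems.append("Azure CLI (az) was not found on PATH")

    return problems


if __name__ == "__main__":
    for problem in preflight_problems(os.environ):
        print(f"Warning: {problem}")
```

If the variable is defined only in the .env file, load that file first (for example with python-dotenv) so that the value is present in os.environ when the check runs.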

Run the application

To run the Python scripts, use the following commands:

python src/diskann.py

python src/ivf.py

python src/hnsw.py

The output shows the top five hotels that match the vector search query, along with their similarity scores.
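
Each line of that output follows the format used by print_search_results. The following is a minimal sketch, assuming a hypothetical helper named format_result_line, that shows how one result is turned into a line whether or not the hotel is nested under a document key. The hotel names and scores are placeholders for illustration, not real query output.

```python
from typing import Any, Dict


def format_result_line(result: Dict[str, Any], show_score: bool = True) -> str:
    """Build one output line for a search result.

    Hypothetical helper for illustration; it is not part of the sample code.
    """
    # Results projected with $$ROOT nest the hotel under 'document'
    doc = result.get("document", result)
    line = f"HotelName: {doc['HotelName']}"
    if show_score:
        # The score is stored at the top level of the result
        line += f", Score: {result['score']:.4f}"
    return line


if __name__ == "__main__":
    # Placeholder values only; real names and scores come from the query
    sample_results = [
        {"document": {"HotelName": "Example Hotel A"}, "score": 0.123456},
        {"HotelName": "Example Hotel B", "score": 0.1},
    ]
    for result in sample_results:
        print(format_result_line(result))
```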

View and manage data in Visual Studio Code

In Visual Studio Code, select the DocumentDB extension to connect to your Azure DocumentDB account.
View the data and indexes in the Hotels database.
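
When you inspect the indexes, a vector index can be told apart from a standard index by its key value, which is the same check that drop_vector_indexes applies. The following is a minimal sketch, assuming a hypothetical helper named vector_index_names. The index documents are hand-written for illustration, and the field name DescriptionVector and the index names are assumptions.

```python
from typing import Any, Dict, Iterable, List


def vector_index_names(indexes: Iterable[Dict[str, Any]], vector_field: str) -> List[str]:
    """Return the names of indexes whose key on vector_field is 'cosmosSearch'.

    Hypothetical helper for illustration; it is not part of the sample code.
    """
    return [
        index["name"]
        for index in indexes
        if index.get("key", {}).get(vector_field) == "cosmosSearch"
    ]


if __name__ == "__main__":
    # Hand-written index documents; field and index names are assumptions
    sample_indexes = [
        {"name": "_id_", "key": {"_id": 1}},
        {"name": "HotelId_1", "key": {"HotelId": 1}},
        {"name": "vector_index", "key": {"DescriptionVector": "cosmosSearch"}},
    ]
    print(vector_index_names(sample_indexes, "DescriptionVector"))
```

With a live connection, the result of collection.list_indexes() could be passed as the first argument instead of the sample list.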

Clean up resources

When you no longer need the resource group, Azure DocumentDB account, and Azure OpenAI resource, delete them to avoid additional charges.


Last updated on 2026-02-21