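Index and query vector data in Azure Cosmos DB for NoSQL using Java

Article
05/21/2024

Applies to: NoSQL

Vector search in Azure Cosmos DB for NoSQL is currently a preview feature. You must register for the preview before you can use it. This article covers the following steps: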

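Register for the vector search preview in Azure Cosmos DB for NoSQL
Set up the Azure Cosmos DB container for vector search
Author a vector embedding policy
Add vector indexes to the container indexing policy
Create a container with vector indexes and a vector embedding policy
Perform a vector search on the stored data
This guide walks through creating vector data, indexing it, and then querying it in a container.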

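Prerequisites

An existing Azure Cosmos DB for NoSQL account.
- If you don't have an Azure subscription, you can try Azure Cosmos DB for NoSQL for free.
- If you have an existing Azure subscription, create a new Azure Cosmos DB for NoSQL account.
The latest version of the Azure Cosmos DB Java SDK.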

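Register for the preview

Vector search in Azure Cosmos DB for NoSQL requires preview feature registration. Follow these steps to register:

Navigate to your Azure Cosmos DB for NoSQL resource page.
Select the [Features] pane under the [Settings] menu item.
Select [Vector Search in Azure Cosmos DB for NoSQL].
Read the description of the feature to confirm that you want to register for the preview.
Select [Enable] to register for the preview.

Note

The registration request is approved automatically, but it can take several minutes to take effect.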

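Understand the steps involved in vector search

The following steps assume that you know how to set up an Azure Cosmos DB for NoSQL account and create a database. Vector search isn't currently supported on existing containers, so you must create a new container and specify both the container-level vector embedding policy and the vector indexing policy when you create it.

As an example, consider a database for an internet-based bookstore in which you store the title, author, ISBN, and description of each book. We also define two properties that hold vector embeddings. The first is the "contentVector" property, which contains text embeddings generated from the book's text content (for example, by concatenating the "title", "author", "isbn", and "description" properties before creating the embedding). The second is "coverImageVector", which is generated from images of the book's cover.

Create and store vector embeddings for the fields on which you want to perform vector search.
Specify the vector embedding paths in the vector embedding policy.
Include any desired vector indexes in the indexing policy for the container.

For the following sections of this article, we consider the following structure for the items stored in the container: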

{
"title": "book-title", 
"author": "book-author", 
"isbn": "book-isbn", 
"description": "book-description", 
"contentVector": [2, -1, 4, 3, 5, -2, 5, -7, 3, 1], 
"coverImageVector": [0.33, -0.52, 0.45, -0.67, 0.89, -0.34, 0.86, -0.78] 
}
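
As a rough illustration of the structure above, the following Java sketch assembles an item with the same property names and builds the text from which a content embedding could be generated. The class and method names (BookItemSketch, embeddingText, toItem) are hypothetical and not part of the Azure Cosmos DB SDK, and the newline separator is an arbitrary choice. Generating the actual embeddings requires an embedding model, which isn't shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the Azure Cosmos DB SDK.
class BookItemSketch {

    // Concatenates the text fields that the contentVector embedding is generated from.
    static String embeddingText(String title, String author, String isbn, String description) {
        return String.join("\n", title, author, isbn, description);
    }

    // Builds a map that uses the same property names as the sample item.
    static Map<String, Object> toItem(String title, String author, String isbn, String description,
                                      float[] contentVector, float[] coverImageVector) {
        Map<String, Object> item = new LinkedHashMap<>();
        item.put("title", title);
        item.put("author", author);
        item.put("isbn", isbn);
        item.put("description", description);
        item.put("contentVector", contentVector);
        item.put("coverImageVector", coverImageVector);
        return item;
    }

    public static void main(String[] args) {
        // Text that would be sent to an embedding model (the model call is not shown).
        System.out.println(embeddingText("book-title", "book-author", "book-isbn", "book-description"));

        // Vectors copied from the sample item in this article.
        Map<String, Object> item = toItem(
            "book-title", "book-author", "book-isbn", "book-description",
            new float[] {2, -1, 4, 3, 5, -2, 5, -7, 3, 1},
            new float[] {0.33f, -0.52f, 0.45f, -0.67f, 0.89f, -0.34f, 0.86f, -0.78f});
        System.out.println(item.keySet());
    }
}
```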

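First, create a CosmosContainerProperties object.

// "Partition_Key_Def" is a placeholder; replace it with your partition key path, which must start with "/".
CosmosContainerProperties collectionDefinition = new CosmosContainerProperties(UUID.randomUUID().toString(), "Partition_Key_Def");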

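Create a vector embedding policy for the container

Next, define the container vector policy. This policy tells the Azure Cosmos DB query engine how to handle vector properties in the VectorDistance system function. It also supplies the information that the vector indexing policy needs, if you choose to specify one. The container vector policy includes the following information:

"path": the property path that contains the vectors
"datatype": the type of the vector's elements (default Float32)
"dimensions": the length of each vector in the path (default 1536)
"distanceFunction": the metric used to compute distance/similarity (default Cosine)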
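
To make the "distanceFunction" setting more concrete, here is a minimal Java sketch of three common metrics: dot product, cosine similarity, and Euclidean distance. The class and method names are hypothetical and not part of the SDK. The service computes these scores itself through VectorDistance, so this code is only meant to build intuition.

```java
// Hypothetical helper for illustration only; not part of the Azure Cosmos DB SDK.
class VectorMetricsSketch {

    // Sum of element-wise products; larger values mean more similar vectors.
    static double dotProduct(float[] a, float[] b) {
        requireSameLength(a, b);
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (double) a[i] * b[i];
        }
        return sum;
    }

    // Dot product divided by the product of the vector lengths; ranges from -1 to 1.
    static double cosineSimilarity(float[] a, float[] b) {
        requireSameLength(a, b);
        double norms = Math.sqrt(dotProduct(a, a)) * Math.sqrt(dotProduct(b, b));
        if (norms == 0.0) {
            throw new IllegalArgumentException("Cosine similarity is undefined for a zero vector");
        }
        return dotProduct(a, b) / norms;
    }

    // Straight-line distance; smaller values mean more similar vectors.
    static double euclideanDistance(float[] a, float[] b) {
        requireSameLength(a, b);
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = (double) a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    private static void requireSameLength(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same number of dimensions");
        }
    }

    public static void main(String[] args) {
        float[] stored = {2, -1, 4, 3, 5, -2, 5, -7, 3, 1};   // sample contentVector from this article
        float[] query = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};      // stand-in query embedding
        System.out.println("dot product: " + dotProduct(stored, query));
        System.out.println("cosine similarity: " + cosineSimilarity(stored, query));
        System.out.println("euclidean distance: " + euclideanDistance(stored, query));
    }
}
```

Note the difference in direction: for dot product and cosine similarity a higher score means a closer match, whereas for Euclidean distance a lower value does.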

For the example with book details, the vector policy might look like the following Java code:

// Creating vector embedding policy
CosmosVectorEmbeddingPolicy cosmosVectorEmbeddingPolicy = new CosmosVectorEmbeddingPolicy();

CosmosVectorEmbedding embedding1 = new CosmosVectorEmbedding();
embedding1.setPath("/coverImageVector");
embedding1.setDataType(CosmosVectorDataType.FLOAT32);
embedding1.setDimensions(8L);
embedding1.setDistanceFunction(CosmosVectorDistanceFunction.COSINE);

CosmosVectorEmbedding embedding2 = new CosmosVectorEmbedding();
embedding2.setPath("/contentVector");
embedding2.setDataType(CosmosVectorDataType.FLOAT32);
embedding2.setDimensions(10L);
embedding2.setDistanceFunction(CosmosVectorDistanceFunction.DOT_PRODUCT);

cosmosVectorEmbeddingPolicy.setCosmosVectorEmbeddings(Arrays.asList(embedding1, embedding2));

collectionDefinition.setVectorEmbeddingPolicy(cosmosVectorEmbeddingPolicy);
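
A mismatch between a stored vector's length and the dimensions declared in the policy is an easy mistake to make. As a hedged sketch, a check like the following could run before items are written. It assumes the dimensions from the policy above (8 for /coverImageVector, 10 for /contentVector); the class and method names are hypothetical and not part of the SDK.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical pre-write check for illustration only; not part of the Azure Cosmos DB SDK.
class VectorDimensionCheck {

    // Dimensions declared in the vector embedding policy of this article.
    static final Map<String, Integer> EXPECTED_DIMENSIONS = new LinkedHashMap<>();
    static {
        EXPECTED_DIMENSIONS.put("/coverImageVector", 8);
        EXPECTED_DIMENSIONS.put("/contentVector", 10);
    }

    // Returns one message per path whose vector is missing or has the wrong length.
    static List<String> findMismatches(Map<String, float[]> vectorsByPath) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, Integer> expected : EXPECTED_DIMENSIONS.entrySet()) {
            float[] vector = vectorsByPath.get(expected.getKey());
            if (vector == null) {
                problems.add(expected.getKey() + ": vector is missing");
            } else if (vector.length != expected.getValue()) {
                problems.add(expected.getKey() + ": expected " + expected.getValue()
                    + " dimensions but found " + vector.length);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Map<String, float[]> vectors = new LinkedHashMap<>();
        vectors.put("/contentVector", new float[] {2, -1, 4, 3, 5, -2, 5, -7, 3, 1});
        vectors.put("/coverImageVector", new float[] {0.33f, -0.52f, 0.45f}); // deliberately too short
        for (String problem : findMismatches(vectors)) {
            System.out.println(problem);
        }
    }
}
```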

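Create a vector index in the indexing policy

After you decide on the vector embedding paths, add vector indexes to the indexing policy. Currently, vector search in Azure Cosmos DB for NoSQL is supported only on new containers, so you must apply the vector policy when you create the container, and you can't modify it afterward. For this example, the indexing policy looks similar to the following: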

IndexingPolicy indexingPolicy = new IndexingPolicy();
indexingPolicy.setIndexingMode(IndexingMode.CONSISTENT);
ExcludedPath excludedPath = new ExcludedPath("/*");
indexingPolicy.setExcludedPaths(Collections.singletonList(excludedPath));

IncludedPath includedPath1 = new IncludedPath("/name/?");
IncludedPath includedPath2 = new IncludedPath("/description/?");
indexingPolicy.setIncludedPaths(Arrays.asList(includedPath1, includedPath2));

// Creating vector indexes
CosmosVectorIndexSpec cosmosVectorIndexSpec1 = new CosmosVectorIndexSpec();
cosmosVectorIndexSpec1.setPath("/coverImageVector");
cosmosVectorIndexSpec1.setType(CosmosVectorIndexType.QUANTIZED_FLAT.toString());

CosmosVectorIndexSpec cosmosVectorIndexSpec2 = new CosmosVectorIndexSpec();
cosmosVectorIndexSpec2.setPath("/contentVector");
cosmosVectorIndexSpec2.setType(CosmosVectorIndexType.DISK_ANN.toString());

indexingPolicy.setVectorIndexes(Arrays.asList(cosmosVectorIndexSpec1, cosmosVectorIndexSpec2));

collectionDefinition.setIndexingPolicy(indexingPolicy);

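Finally, create the container with the vector embedding policy and the indexing policy (including its vector indexes) that you defined.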

database.createContainer(collectionDefinition).block();

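Important

Currently, vector search in Azure Cosmos DB for NoSQL is supported only on new containers. You must set both the container vector policy and any vector indexing policy when you create the container, because they can't be modified later. Both policies will be modifiable in a future improvement to the preview feature.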

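Run a vector similarity search query

After you create a container with the desired vector policy and insert vector data into it, you can run a vector search by using the VectorDistance system function in a query. Suppose you want to find books about food recipes by looking at their descriptions. You first need an embedding for your query text, in this case the text "food recipe". Once you have the embedding for your search query, you can use it in the VectorDistance function of the vector search query to get the items that are similar to your query, as shown here: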

SELECT c.title, VectorDistance(c.contentVector, [1,2,3,4,5,6,7,8,9,10]) AS SimilarityScore   
FROM c  
ORDER BY VectorDistance(c.contentVector, [1,2,3,4,5,6,7,8,9,10])

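This query retrieves the book titles along with their similarity scores with respect to your query. Here's the equivalent in Java:

// Requires: import com.fasterxml.jackson.databind.JsonNode; import java.util.Iterator;
// Stand-in query embedding; in practice, generate it from the query text.
float[] embedding = new float[10];
for (int i = 0; i < 10; i++) {
    embedding[i] = i + 1;
}
ArrayList<SqlParameter> paramList = new ArrayList<SqlParameter>();
paramList.add(new SqlParameter("@embedding", embedding));
SqlQuerySpec querySpec = new SqlQuerySpec("SELECT c.title, VectorDistance(c.contentVector,@embedding) AS SimilarityScore  FROM c ORDER BY VectorDistance(c.contentVector,@embedding)", paramList);
// The query projects only title and SimilarityScore, so read the results as generic JSON.
CosmosPagedIterable<JsonNode> results = container.queryItems(querySpec, new CosmosQueryRequestOptions(), JsonNode.class);

// Obtain the iterator once; each call to iterator() would start the query again.
Iterator<JsonNode> iterator = results.iterator();
if (iterator.hasNext()) {
    JsonNode first = iterator.next();
    logger.info(String.format("First query result: title = %s, similarity score = %s", first.path("title").asText(), first.path("SimilarityScore").asDouble()));
}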
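
Conceptually, the query scores every item against the query embedding and returns the items ordered by that score. The following brute-force, in-memory Java sketch shows that idea only; it is not how Azure Cosmos DB executes the query, since a vector index exists precisely to avoid scanning every item. It uses the dot product because that is the distance function configured for /contentVector in this article, and it assumes that a higher dot product means a closer match, so the best matches come first. All class, method, and title names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory ranking for illustration only; not part of the Azure Cosmos DB SDK.
class BruteForceVectorSearch {

    // A title paired with its similarity score.
    static final class ScoredTitle {
        final String title;
        final double score;

        ScoredTitle(String title, double score) {
            this.title = title;
            this.score = score;
        }
    }

    static double dotProduct(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same number of dimensions");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (double) a[i] * b[i];
        }
        return sum;
    }

    // Scores every stored vector against the query and sorts by score, highest first.
    static List<ScoredTitle> rank(Map<String, float[]> contentVectorsByTitle, float[] queryEmbedding) {
        List<ScoredTitle> results = new ArrayList<>();
        for (Map.Entry<String, float[]> entry : contentVectorsByTitle.entrySet()) {
            results.add(new ScoredTitle(entry.getKey(), dotProduct(entry.getValue(), queryEmbedding)));
        }
        results.sort(Comparator.comparingDouble((ScoredTitle s) -> s.score).reversed());
        return results;
    }

    public static void main(String[] args) {
        Map<String, float[]> books = new LinkedHashMap<>();
        books.put("book-a", new float[] {2, -1, 4, 3, 5, -2, 5, -7, 3, 1});
        books.put("book-b", new float[] {1, 1, 1, 1, 1, 1, 1, 1, 1, 1});
        float[] query = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}; // stand-in query embedding

        for (ScoredTitle result : rank(books, query)) {
            System.out.println(result.title + " -> " + result.score);
        }
    }
}
```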
