Semantic Kernel Python vector store migration guide

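Overview

This guide covers the major vector store updates introduced in Semantic Kernel version 1.34. The release is a significant overhaul of the vector store implementation that aligns it with the .NET SDK and provides a more unified, intuitive API. The changes consolidate everything under semantic_kernel.data.vector and improve the connector architecture.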

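Key improvements summary

  • Unified field model: a single VectorStoreField class replaces multiple field types
  • Integrated embeddings: embeddings are generated directly from the vector field specification
  • Simplified search: search functions are easy to create directly on collections
  • Consolidated structure: everything lives under semantic_kernel.data.vector and semantic_kernel.connectors
  • Enhanced text search: improved text search capabilities with streamlined connectors
  • Deprecation: the old memory_stores are deprecated in favor of the new vector store architecture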

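1. Integrated embeddings and vector store model/field updates

There are several changes to how you define a vector store model. The biggest is that integrated embeddings are now supported directly in the vector store field definitions. When you mark a field as a vector, its content is embedded automatically with the specified embedding generator, such as OpenAI's text embedding model. This simplifies creating and managing vector fields.

When you define such a field, make sure of two things, especially when using a Pydantic model:

  1. Typing: the field will likely have three types: list[float], str (or whatever other type the embedding generator takes as input), and None for when the field is unset.
  2. Default value: the field must default to None or another value, so that no error occurs when a record is returned by get or search with include_vectors=False, which is now the default.

There are two concerns here. First, when a class is decorated with vectorstoremodel, the first type annotation of the field fills the type parameter of the VectorStoreField class, so the first annotation must be the right type for creating the vector store collection, usually list[float]. Second, the get and search methods do not include vectors in their results by default (include_vectors=False), so the field needs a default value and the typing must allow it; typically None is allowed and is also used as the default. When a record is created, the value to embed lives in this same field and is usually a string, so str must be included as well. The reason for this change is to allow more flexibility in what is embedded versus what is actually stored in the data fields. A common setup looks like this: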

from dataclasses import dataclass
from typing import Annotated

from semantic_kernel.connectors.ai.open_ai import OpenAITextEmbedding
from semantic_kernel.data.vector import VectorStoreField, vectorstoremodel

@vectorstoremodel
@dataclass
class MyRecord:
    content: Annotated[str, VectorStoreField('data', is_indexed=True, is_full_text_indexed=True)]
    title: Annotated[str, VectorStoreField('data', is_indexed=True, is_full_text_indexed=True)]
    id: Annotated[str, VectorStoreField('key')]
    vector: Annotated[list[float] | str | None, VectorStoreField(
        'vector', 
        dimensions=1536, 
        distance_function="cosine",
        embedding_generator=OpenAITextEmbedding(ai_model_id="text-embedding-3-small"),
    )] = None

    def __post_init__(self):
        if self.vector is None:
            self.vector = f"Title: {self.title}, Content: {self.content}"

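Note the __post_init__ method: it builds the value that gets embedded by combining multiple fields. The three types (list[float], str, and None) are all present in the annotation of the vector field.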

Before: separate field classes

from semantic_kernel.data import (
    VectorStoreRecordKeyField,
    VectorStoreRecordDataField, 
    VectorStoreRecordVectorField
)

# Old approach with separate field classes
fields = [
    VectorStoreRecordKeyField(name="id"),
    VectorStoreRecordDataField(name="text", is_filterable=True, is_full_text_searchable=True),
    VectorStoreRecordVectorField(name="vector", dimensions=1536, distance_function="cosine")
]

After: unified VectorStoreField with integrated embeddings

from semantic_kernel.data.vector import VectorStoreField
from semantic_kernel.connectors.ai.open_ai import OpenAITextEmbedding

# New unified approach with integrated embeddings
embedding_service = OpenAITextEmbedding(
    ai_model_id="text-embedding-3-small"
)

fields = [
    VectorStoreField(
        "key",
        name="id",
    ),
    VectorStoreField(
        "data",
        name="text",
        is_indexed=True,  # Previously is_filterable
        is_full_text_indexed=True  # Previously is_full_text_searchable
    ),
    VectorStoreField(
        "vector",
        name="vector",
        dimensions=1536,
        distance_function="cosine",
        embedding_generator=embedding_service  # Integrated embedding generation
    )
]

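Key changes in field definitions

  1. Single field class: VectorStoreField replaces all previous field types.
  2. Field type specification: use the field_type: Literal["key", "data", "vector"] parameter. It can be passed positionally, so VectorStoreField("key") is valid.
  3. Enhanced properties:
    • storage_name has been added. When set, it is used as the field name in the vector store; otherwise the name parameter is used.
    • dimensions is now a required parameter for vector fields.
    • distance_function and index_kind are both optional. If omitted, they default to DistanceFunction.DEFAULT and IndexKind.DEFAULT respectively (vector fields only). Each vector store implementation has logic that picks the default for that store.
  4. Property renames:
    • property_type → type_ as an attribute and type in constructors
    • is_filterable → is_indexed
    • is_full_text_searchable → is_full_text_indexed
  5. Integrated embeddings: add embedding_generator directly to a vector field, or set embedding_generator on the vector store collection itself, in which case it is used for all vector fields in that store. A field-level value takes precedence over the collection-level embedding generator.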
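The renames above are mechanical, so they can be captured in a small lookup table. The sketch below is a plain-Python illustration that does not import Semantic Kernel; migrate_field and both dictionaries are hypothetical helper names, and the mapping only covers the renames listed in this section.

```python
# Old keyword argument -> new keyword argument, taken from the renames above.
RENAMED_KWARGS = {
    "property_type": "type",
    "is_filterable": "is_indexed",
    "is_full_text_searchable": "is_full_text_indexed",
}

# Old field class name -> new field_type literal.
FIELD_TYPES = {
    "VectorStoreRecordKeyField": "key",
    "VectorStoreRecordDataField": "data",
    "VectorStoreRecordVectorField": "vector",
}


def migrate_field(old_class_name: str, old_kwargs: dict) -> tuple[str, dict]:
    """Translate an old field definition into (field_type, new keyword arguments)."""
    field_type = FIELD_TYPES[old_class_name]
    new_kwargs = {RENAMED_KWARGS.get(key, key): value for key, value in old_kwargs.items()}
    # dimensions is now required for vector fields, so flag its absence early.
    if field_type == "vector" and "dimensions" not in new_kwargs:
        raise ValueError("dimensions is now required for vector fields")
    return field_type, new_kwargs


field_type, kwargs = migrate_field(
    "VectorStoreRecordDataField",
    {"name": "text", "is_filterable": True, "is_full_text_searchable": True},
)
print(field_type, kwargs)
```

The returned pair is shaped so that it could be passed on as VectorStoreField(field_type, **kwargs), matching the positional field type shown in the example above.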

2. New methods on stores and collections

Enhanced store interface

from semantic_kernel.connectors.in_memory import InMemoryStore

# Create a store instance; the calls below use it
store = InMemoryStore()

# Before: Limited collection methods
collection = store.get_collection("my_collection", record_type=MyRecord)

# After: Slimmer collection interface with new methods
collection = store.get_collection(MyRecord)
# if the record type has the `vectorstoremodel` decorator it can contain both the collection_name and the definition for the collection.

# New methods for collection management
await store.collection_exists("my_collection")
await store.ensure_collection_deleted("my_collection")
# both of these methods, create a simple model to streamline doing collection management tasks.
# they both call the underlying `VectorStoreCollection` methods, see below.

Enhanced collection interface

from semantic_kernel.connectors.in_memory import InMemoryCollection

collection = InMemoryCollection(
    record_type=MyRecord,
    embedding_generator=OpenAITextEmbedding(ai_model_id="text-embedding-3-small")  # Optional, if there is no embedding generator set on the record type
)
# If both the collection and the record type have an embedding generator set, the record type's embedding generator will be used for the collection. If neither is set, it is assumed the vector store itself can create embeddings, or that vectors are included in the records already, if that is not the case, it will likely raise.

# Enhanced collection operations
await collection.collection_exists()
await collection.ensure_collection_exists()
await collection.ensure_collection_deleted()

# CRUD methods
# Removed batch operations, all CRUD operations can now take both a single record or a list of records
records = [
    MyRecord(id="1", title="First", content="First record"),
    MyRecord(id="2", title="Second", content="Second record")
]
ids = ["1", "2"]
# this method adds vectors automatically
await collection.upsert(records)

# You can do get with one or more ids, and it will return a list of records
await collection.get(ids)  # Returns a list of records
# you can also do a get without ids, with top, skip and order_by parameters
await collection.get(top=10, skip=0, order_by='id')
# the order_by parameter can be a string or a dict, with the key being the field name and the value being True for ascending or False for descending order.
# At this time, not all vector stores support this method.

# Delete also allows for single or multiple ids
await collection.delete(ids)

query = "search term"
# New search methods, these use the built-in embedding generator to take the value and create a vector
results = await collection.search(query, top=10)
results = await collection.hybrid_search(query, top=10)

# You can also supply a vector directly
query_vector = [0.1, 0.2, 0.3]  # Example vector
results = await collection.search(vector=query_vector, top=10)
results = await collection.hybrid_search(query, vector=query_vector, top=10)
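The comments in the example above state that a generator set on the record type wins over one set on the collection, and that with neither set the store is assumed to handle embedding itself. The following sketch restates that precedence in plain Python. It is not Semantic Kernel code: resolve_vector and the two toy embedders are hypothetical names used only for illustration.

```python
from typing import Callable, Optional, Union

Embedder = Callable[[str], list[float]]
VectorValue = Union[list[float], str, None]


def resolve_vector(
    value: VectorValue,
    field_embedder: Optional[Embedder] = None,
    collection_embedder: Optional[Embedder] = None,
) -> VectorValue:
    """Apply the documented precedence: field-level generator first, then collection-level."""
    # None (unset) and ready-made vectors pass through untouched.
    if value is None or isinstance(value, list):
        return value
    embedder = field_embedder if field_embedder is not None else collection_embedder
    if embedder is None:
        # No generator configured: leave the text for the store to handle.
        return value
    return embedder(value)


# Toy embedders so the example runs without any external service.
def field_level(text: str) -> list[float]:
    return [float(len(text))]


def collection_level(text: str) -> list[float]:
    return [0.0]


print(resolve_vector("hello", field_level, collection_level))
print(resolve_vector("hello", None, collection_level))
print(resolve_vector([0.1, 0.2], field_level))
```

The three calls cover the three value types a vector field can hold at upsert time: text to embed with the field-level generator, text to embed with the collection-level fallback, and a vector that is already computed.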

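3. Enhanced filtering with lambda expressions

The new vector store implementation moves from string-based FilterClause objects to more powerful and type-safe lambda expressions or callable filters.

Before: FilterClause objects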

from semantic_kernel.data.text_search import SearchFilter, EqualTo, AnyTagsEqualTo
from semantic_kernel.data.vector_search import VectorSearchFilter

# Creating filters using FilterClause objects
text_filter = SearchFilter()
text_filter.equal_to("category", "AI")
text_filter.equal_to("status", "active")

# Vector search filters
vector_filter = VectorSearchFilter()
vector_filter.equal_to("category", "AI")
vector_filter.any_tag_equal_to("tags", "important")

# Using in search
results = await collection.search(
    "query text",
    options=VectorSearchOptions(filter=vector_filter)
)

After: lambda expression filters

from datetime import datetime

# When defining the collection with the generic type hints, most IDEs will be able to infer the type of the record, so you can use the record type directly in the lambda expressions.
collection = InMemoryCollection[str, MyRecord](MyRecord)

# Using lambda expressions for more powerful and type-safe filtering
# The code snippets below work on a data model with more fields than defined earlier.

# Direct lambda expressions
results = await collection.search(
    "query text", 
    filter=lambda record: record.category == "AI" and record.status == "active"
)

# Complex filtering with multiple conditions
results = await collection.search(
    "query text",
    filter=lambda record: (
        record.category == "AI" and 
        record.score > 0.8 and
        "important" in record.tags
    )
)

# Combining conditions with boolean operators
results = await collection.search(
    "query text",
    filter=lambda record: (
        record.category == "AI" or record.category == "ML"
    ) and record.published_date >= datetime(2024, 1, 1)
)

# Range filtering (now possible with lambda expressions)
results = await collection.search(
    "query text",
    filter=lambda record: 0.5 <= record.confidence_score <= 0.9
)

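Filter migration tips

  1. Simple equality: filter.equal_to("field", "value") becomes lambda r: r.field == "value"
  2. Multiple conditions: chain them with the and/or operators instead of making multiple filter calls
  3. Tag/array containment: filter.any_tag_equal_to("tags", "value") becomes lambda r: "value" in r.tags
  4. Enhanced capabilities: range queries, complex boolean logic, and custom predicates are supported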
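When translating filters by hand, it can help to check locally that the new lambda selects the same records as the old clauses. The sketch below does this against an in-memory list in plain Python. It does not call Semantic Kernel, and Article, matches_old, and the sample data are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Article:  # hypothetical record type used only for this check
    category: str
    status: str
    tags: list[str] = field(default_factory=list)


def matches_old(record: Article, clauses: list[tuple[str, str]]) -> bool:
    """Emulate a chain of equal_to clauses: every (field, value) pair must match."""
    return all(getattr(record, name) == value for name, value in clauses)


records = [
    Article("AI", "active", ["important"]),
    Article("AI", "archived"),
    Article("ML", "active", ["draft"]),
]

# Old style: filter.equal_to("category", "AI") followed by filter.equal_to("status", "active")
old_clauses = [("category", "AI"), ("status", "active")]
# New style: one lambda that joins both conditions with `and`
new_filter = lambda r: r.category == "AI" and r.status == "active"
# Old style: filter.any_tag_equal_to("tags", "important")
tag_filter = lambda r: "important" in r.tags

old_hits = [r for r in records if matches_old(r, old_clauses)]
new_hits = [r for r in records if new_filter(r)]
print(old_hits == new_hits, len(new_hits))
```

This only compares the two formulations on sample data; how a given connector evaluates the lambda against its own store is outside the scope of the sketch.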

4. Improved ease of creating search functions

Before: creating search functions with VectorStoreTextSearch

from semantic_kernel.connectors.in_memory import InMemoryCollection
from semantic_kernel.data import VectorStoreTextSearch

collection = InMemoryCollection(collection_name='collection', record_type=MyRecord)
search = VectorStoreTextSearch.from_vectorized_search(vectorized_search=collection, embedding_generator=OpenAITextEmbedding(ai_model_id="text-embedding-3-small"))

search_function = search.create_search(
    function_name='search',
    ...
)

After: direct search function creation

collection = InMemoryCollection(MyRecord)
# Create search function directly on collection
search_function = collection.create_search_function(
    function_name="search",
    search_type="vector",  # or "keyword_hybrid"
    top=10,
    vector_property_name="vector",  # Name of the vector field
)

# Add to kernel directly
kernel.add_function(plugin_name="memory", function=search_function)

5. Connector renames and import changes

Import path consolidation

# Before: Scattered imports
from semantic_kernel.connectors.memory.azure_cognitive_search import AzureCognitiveSearchMemoryStore
from semantic_kernel.connectors.memory.chroma import ChromaMemoryStore
from semantic_kernel.connectors.memory.pinecone import PineconeMemoryStore
from semantic_kernel.connectors.memory.qdrant import QdrantMemoryStore

# After: Consolidated under connectors
from semantic_kernel.connectors.azure_ai_search import AzureAISearchStore
from semantic_kernel.connectors.chroma import ChromaVectorStore
from semantic_kernel.connectors.pinecone import PineconeVectorStore
from semantic_kernel.connectors.qdrant import QdrantVectorStore

# Alternative after: Consolidated with lazy loading:
from semantic_kernel.connectors.memory import (
    AzureAISearchStore,
    ChromaVectorStore,
    PineconeVectorStore,
    QdrantVectorStore,
    WeaviateVectorStore,
    RedisVectorStore
)

Connector class renames

Old name                    New name
AzureCosmosDBforMongoDB*    CosmosMongo*
AzureCosmosDBForNoSQL*      CosmosNoSql*

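6. Text search improvements and removed Bing connector

Bing connector removed and text search interface enhanced

The Bing text search connector has been removed. Migrate to an alternative search provider: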

# Before: Bing Connector (REMOVED)
from semantic_kernel.connectors.search.bing import BingConnector

bing_search = BingConnector(api_key="your-bing-key")

# After: Use Brave Search or other providers
from semantic_kernel.connectors.brave import BraveSearch
# or
from semantic_kernel.connectors.search import BraveSearch

brave_search = BraveSearch()

# Create text search function
text_search_function = brave_search.create_search_function(
    function_name="web_search",
    query_parameter_name="query",
    description="Search the web for information"
)

kernel.add_function(plugin_name="search", function=text_search_function)

Improved search methods

Before: three separate search methods with different return types

from semantic_kernel.connectors.brave import BraveSearch
brave_search = BraveSearch()
# Before: Separate search methods
search_results: KernelSearchResult[str] = await brave_search.search(
    query="semantic kernel python",
    top=5,
)

search_results: KernelSearchResult[TextSearchResult] = await brave_search.get_text_search_results(
    query="semantic kernel python",
    top=5,
)

search_results: KernelSearchResult[BraveWebPage] = await brave_search.get_search_results(
    query="semantic kernel python",
    top=5,
)

After: unified search method with an output type parameter

from semantic_kernel.data.text_search import SearchOptions
# Enhanced search results with metadata
search_results: KernelSearchResult[str] = await brave_search.search(
    query="semantic kernel python",
    output_type=str, # can also be TextSearchResult or anything else for search engine specific results, default is `str`
    top=5,
    filter=lambda result: result.country == "NL",  # Example filter
)

async for result in search_results.results:
    assert isinstance(result, str)  # or TextSearchResult if using that type
    print(f"Result: {result}")
    print(f"Metadata: {search_results.metadata}")

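7. Deprecation of old memory stores

All the old memory stores based on MemoryStoreBase have been moved into semantic_kernel.connectors.memory_stores and are now marked as deprecated. Most of them have an equivalent new implementation based on VectorStore and VectorStoreCollection, which can be found in semantic_kernel.connectors.memory.

These connectors will be removed completely: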

  • AstraDB
  • Milvus
  • Usearch

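If you need any of these, make sure to take over the code from the deprecated module and the semantic_kernel.memory folder, or implement your own vector store collection based on the new VectorStoreCollection class.

If GitHub feedback shows a large demand, we will consider bringing them back, but for now they are not maintained and will be removed in the future.

Migrating from SemanticTextMemory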

# Before: SemanticTextMemory (DEPRECATED)
from semantic_kernel.memory import SemanticTextMemory
from semantic_kernel.connectors.ai.open_ai import OpenAITextEmbeddingGenerationService

embedding_service = OpenAITextEmbeddingGenerationService(ai_model_id="text-embedding-3-small")
memory = SemanticTextMemory(storage=vector_store, embeddings_generator=embedding_service)

# Store memory
await memory.save_information(collection="docs", text="Important information", id="doc1")

# Search memory  
results = await memory.search(collection="docs", query="important", limit=5)
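
# After: Direct Vector Store Usage
from dataclasses import dataclass
from typing import Annotated

from semantic_kernel.connectors.ai.open_ai import OpenAITextEmbedding
from semantic_kernel.connectors.in_memory import InMemoryCollection
from semantic_kernel.data.vector import VectorStoreField, vectorstoremodel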

# Define data model
@vectorstoremodel
@dataclass
class MemoryRecord:
    id: Annotated[str, VectorStoreField('key')]
    text: Annotated[str, VectorStoreField('data', is_full_text_indexed=True)]
    embedding: Annotated[list[float] | str | None, VectorStoreField('vector', dimensions=1536, distance_function="cosine", embedding_generator=OpenAITextEmbedding(ai_model_id="text-embedding-3-small"))] = None

# Create vector store with integrated embeddings
collection = InMemoryCollection(
    record_type=MemoryRecord,
    embedding_generator=OpenAITextEmbedding(ai_model_id="text-embedding-3-small")  # Optional, if not set on the record type
)

# Store with automatic embedding generation
record = MemoryRecord(id="doc1", text="Important information", embedding='Important information')
await collection.upsert(record)

# Search with built-in function
search_function = collection.create_search_function(
    function_name="search_docs",
    search_type="vector"
)

Memory plugin migration

When you want a plugin that can also save information, you can easily create one like this:

# Before: TextMemoryPlugin (DEPRECATED)
from semantic_kernel.core_plugins import TextMemoryPlugin

memory_plugin = TextMemoryPlugin(memory)
kernel.add_plugin(memory_plugin, "memory")

# After: Custom plugin using vector store search functions
from semantic_kernel.data.vector import VectorStoreCollection
from semantic_kernel.functions import kernel_function

class VectorMemoryPlugin:
    def __init__(self, collection: VectorStoreCollection):
        self.collection = collection
    
    @kernel_function(name="save")
    async def save_memory(self, text: str, key: str) -> str:
        record = MemoryRecord(id=key, text=text, embedding=text)
        await self.collection.upsert(record)
        return f"Saved to {self.collection.collection_name}"
    
    @kernel_function(name="search") 
    async def search_memory(self, query: str, limit: int = 5) -> str:
        results = await self.collection.search(
            query, top=limit, vector_property_name="embedding"
        )        
        return "\n".join([r.record.text async for r in results.results])

# Register the new plugin
memory_plugin = VectorMemoryPlugin(collection)
kernel.add_plugin(memory_plugin, "memory")

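Step 1: Update imports

  • [ ] Replace memory store imports with their vector store equivalents
  • [ ] Update field imports to use VectorStoreField
  • [ ] Remove Bing connector imports

Step 2: Update field definitions

  • [ ] Convert to the unified VectorStoreField class
  • [ ] Update property names (is_filterable → is_indexed)
  • [ ] Add integrated embedding generators to vector fields

Step 3: Update collection usage

  • [ ] Replace memory operations with vector store methods
  • [ ] Pass lists of records or ids to upsert, get, and delete where batching applies
  • [ ] Implement the new search functions

Step 4: Update search implementation

  • [ ] Replace manual search functions with create_search_function
  • [ ] Update text search to use the new providers
  • [ ] Implement hybrid search where beneficial
  • [ ] Migrate from FilterClause to lambda expressions for filtering

Step 5: Remove deprecated code

  • [ ] Remove SemanticTextMemory usage
  • [ ] Remove TextMemoryPlugin dependencies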
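Parts of this checklist can be supported by a simple text scan. The sketch below searches source text for identifiers that this guide marks as removed, deprecated, or renamed. It is a plain-Python aid, not a Semantic Kernel tool: find_deprecated and the hint strings are hypothetical, and the name list covers only identifiers mentioned in this guide.

```python
import re

# Deprecated or renamed identifiers mentioned in this guide, with a hint for each.
DEPRECATED = {
    "SemanticTextMemory": "use a vector store collection directly",
    "TextMemoryPlugin": "write a plugin around a vector store collection",
    "BingConnector": "removed; switch to another search provider",
    "VectorStoreRecordKeyField": 'use VectorStoreField("key")',
    "VectorStoreRecordDataField": 'use VectorStoreField("data")',
    "VectorStoreRecordVectorField": 'use VectorStoreField("vector")',
    "is_filterable": "renamed to is_indexed",
    "is_full_text_searchable": "renamed to is_full_text_indexed",
}

# One alternation over all names; \b keeps matches on whole identifiers.
PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, DEPRECATED)) + r")\b")


def find_deprecated(source: str) -> list[tuple[int, str, str]]:
    """Return (line number, identifier, hint) for each deprecated name found."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        for match in PATTERN.finditer(line):
            name = match.group(1)
            hits.append((number, name, DEPRECATED[name]))
    return hits


sample = '''from semantic_kernel.memory import SemanticTextMemory
field = VectorStoreRecordDataField(name="text", is_filterable=True)
'''
for hit in find_deprecated(sample):
    print(hit)
```

To check a real project, read each .py file and pass its text to find_deprecated; a textual match is only a prompt to review that line, since the scan does not understand imports or scopes.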

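Performance and feature benefits

Performance improvements

  • Batch operations: upsert, get, and delete accept lists, which can improve throughput
  • Integrated embeddings: removes the separate embedding generation step
  • Optimized search: built-in search functions are optimized for each store type

Feature enhancements

  • Hybrid search: combines vector and text search for better results
  • Advanced filtering: enhanced filter expressions and indexing

Developer experience

  • Simplified API: fewer classes and methods to learn
  • Consistent interface: a unified approach across all vector stores
  • Better documentation: clear examples and migration paths
  • Future-proof: aligned with the .NET SDK for consistent cross-platform development

Conclusion

The vector store updates described above are a significant improvement to the Semantic Kernel Python SDK. The new unified architecture provides better performance, enhanced features, and a more intuitive developer experience. Although migration requires updating imports and refactoring existing code, the gains in maintainability and functionality make this upgrade highly recommended.

For additional help with the migration, see the updated samples in the samples/concepts/memory/ directory and the complete API documentation.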