Tutorial: Build a RAG application

Build a retrieval-augmented generation (RAG) application that runs entirely on-device to answer questions about a collection of documents. RAG combines embedding-based search with a chat model so that answers are grounded in your own data instead of relying only on the model's training knowledge.

In this tutorial, you learn how to:

  • Set up a project and install the Foundry Local SDK
  • Create a knowledge base of text documents
  • Generate embeddings for the documents
  • Search for relevant documents by similarity
  • Generate answers grounded in the retrieved context
  • Clean up resources

Prerequisites

  • A Windows, macOS, or Linux computer with at least 8 GB of RAM.
  • Python 3.11 or later installed.

Install packages

If you're developing or publishing on Windows, use the Windows package shown here. It integrates with the Windows ML runtime, exposes the same API surface, and offers broader hardware acceleration.

pip install foundry-local-sdk-winml openai

Create a knowledge base

A RAG application needs a set of documents to search. Define a list of text strings to serve as the knowledge base. In a production application, the list could be paragraphs from files, database records, or any other text source (a sketch of loading files from disk follows the code below).

Create a file named main.py and add the following code:

import math
from foundry_local_sdk import Configuration, FoundryLocalManager

# Knowledge base — each string represents a document
documents = [
    "Foundry Local runs AI models directly on your device without cloud connectivity.",
    "The Foundry Local SDK supports Python, C#, JavaScript, and Rust.",
    "Embedding models convert text into numerical vectors for similarity search.",
    "Foundry Local uses ONNX Runtime for efficient model inference on CPUs and GPUs.",
    "The model catalog provides pre-optimized models that you can download and run locally.",
    "Retrieval-augmented generation grounds model responses in your own data.",
    "Vector similarity search finds documents that are semantically close to a query.",
    "Chat completions generate natural language responses from a prompt and context.",
]
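
If your documents live on disk instead of a hardcoded list, a minimal sketch for building the list from a folder of plain-text files might look like the following (the docs folder name and the blank-line paragraph split are assumptions for illustration, not part of the tutorial):

from pathlib import Path

def load_documents(folder="docs"):
    """Illustrative helper: load every paragraph of each .txt file in the folder as a separate document."""
    docs = []
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        # Split on blank lines so each paragraph becomes its own searchable unit
        docs.extend(p.strip() for p in text.split("\n\n") if p.strip())
    return docs

# documents = load_documents()  # use this instead of the hardcoded list if you have local files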

Generate document embeddings

Initialize the SDK, load the embedding model, and convert each document into a numeric vector. These vectors represent the semantic meaning of each document and enable similarity search.

Add the following code to main.py:

def main():
    # Initialize the SDK
    config = Configuration(app_name="foundry_local_rag")
    FoundryLocalManager.initialize(config)
    manager = FoundryLocalManager.instance

    # Load the embedding model
    embedding_model = manager.catalog.get_model("qwen3-embedding-0.6b")
    embedding_model.download(
        lambda p: print(f"\rDownloading embedding model: {p:.1f}%", end="", flush=True)
    )
    print()
    embedding_model.load()
    embedding_client = embedding_model.get_embedding_client()

    # Embed all documents in a single batch call
    response = embedding_client.generate_embeddings(documents)
    doc_embeddings = [item.embedding for item in response.data]
    print(f"Indexed {len(doc_embeddings)} documents.")

The generate_embeddings method accepts a list of strings and returns one vector per input. Each vector captures the semantic meaning of the text, so similar documents produce vectors that are close together in embedding space.
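
As an optional sanity check, you can print the dimensionality of the returned vectors inside main() (a sketch; the exact length depends on the embedding model you loaded):

    # Optional: every document embedding has the same length, determined by the model
    print(f"Each document embedding has {len(doc_embeddings[0])} dimensions.")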

Search for relevant documents

To find the documents relevant to a query, compare the query's embedding to each document embedding using cosine similarity. Cosine similarity measures how closely two vectors point in the same direction, regardless of magnitude. Values close to 1.0 indicate high similarity.

In main.py, add the following helper functions above main():

def cosine_similarity(a, b):
    """Compute cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_relevant(query_embedding, doc_embeddings, top_k=2):
    """Return the indices and scores of the top-k most similar documents."""
    scores = []
    for i, doc_emb in enumerate(doc_embeddings):
        score = cosine_similarity(query_embedding, doc_emb)
        scores.append((i, score))
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:top_k]

The find_relevant function ranks all documents by similarity and returns the best matches. This approach works well for small collections. For larger datasets, consider using a dedicated vector database.
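
To get a feel for the score range, here's a quick standalone check of cosine_similarity with toy two-dimensional vectors (illustrative only; real embeddings have many more dimensions):

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (unrelated)
print(cosine_similarity([1.0, 0.0], [0.6, 0.8]))  # 0.6 (partial overlap)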

Generate grounded answers

Load the chat model, and combine the retrieved documents with the user's question in a system prompt. The chat model uses the provided context to generate an answer grounded in your documents.

Add the following code to the main() function, after the embedding section:

    # Load the chat model
    chat_model = manager.catalog.get_model("qwen2.5-0.5b")
    chat_model.download(
        lambda p: print(f"\rDownloading chat model: {p:.1f}%", end="", flush=True)
    )
    print()
    chat_model.load()
    chat_client = chat_model.get_chat_client()

    print("\nModels loaded. Ready for questions.")
    print("\nThe knowledge base contains information about:")
    print("  - Foundry Local features and architecture")
    print("  - Supported programming languages")
    print("  - Embedding models and vector search")
    print("  - ONNX Runtime inference")
    print("  - The model catalog")
    print("  - RAG and chat completions")
    print("\nExample questions:")
    print('  "What programming languages does the SDK support?"')
    print('  "How does Foundry Local run models?"')
    print('  "What is retrieval-augmented generation?"')
    print('\nType "quit" to exit.\n')

    # Interactive query loop
    while True:
        query = input("Question: ").strip()
        if not query or query.lower() == "quit":
            break

        # Embed the query
        query_response = embedding_client.generate_embedding(query)
        query_embedding = query_response.data[0].embedding

        # Retrieve the most relevant documents
        results = find_relevant(query_embedding, doc_embeddings, top_k=2)
        context = "\n".join(f"- {documents[i]}" for i, _ in results)

        # Build the prompt with retrieved context
        messages = [
            {
                "role": "system",
                "content": (
                    "Answer the user's question using only the provided context. "
                    "If the context doesn't contain enough information, say so.\n\n"
                    f"Context:\n{context}"
                ),
            },
            {"role": "user", "content": query},
        ]

        # Stream the response
        print("Answer: ", end="", flush=True)
        for chunk in chat_client.complete_streaming_chat(messages):
            content = chunk.choices[0].delta.content
            if content:
                print(content, end="", flush=True)
        print("\n")

    # Clean up
    embedding_model.unload()
    chat_model.unload()
    print("Models unloaded. Done!")


if __name__ == "__main__":
    main()

The system prompt instructs the model to answer using only the retrieved context. This keeps responses grounded in your documents and reduces incorrect answers. Streaming output prints each token as it's generated, which makes the response feel more interactive.
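
For example, for the question "What programming languages does the SDK support?", the assembled system message looks roughly like this (illustrative; which two documents are retrieved depends on the embedding model's scores):

Answer the user's question using only the provided context. If the context doesn't contain enough information, say so.

Context:
- The Foundry Local SDK supports Python, C#, JavaScript, and Rust.
- Foundry Local runs AI models directly on your device without cloud connectivity.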

Complete code

Here's the complete application with all the steps combined:

import math
from foundry_local_sdk import Configuration, FoundryLocalManager

# Knowledge base
documents = [
    "Foundry Local runs AI models directly on your device without cloud connectivity.",
    "The Foundry Local SDK supports Python, C#, JavaScript, and Rust.",
    "Embedding models convert text into numerical vectors for similarity search.",
    "Foundry Local uses ONNX Runtime for efficient model inference on CPUs and GPUs.",
    "The model catalog provides pre-optimized models that you can download and run locally.",
    "Retrieval-augmented generation grounds model responses in your own data.",
    "Vector similarity search finds documents that are semantically close to a query.",
    "Chat completions generate natural language responses from a prompt and context.",
]


def cosine_similarity(a, b):
    """Compute cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_relevant(query_embedding, doc_embeddings, top_k=2):
    """Return the indices and scores of the top-k most similar documents."""
    scores = []
    for i, doc_emb in enumerate(doc_embeddings):
        score = cosine_similarity(query_embedding, doc_emb)
        scores.append((i, score))
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:top_k]


def main():
    # Initialize the SDK
    config = Configuration(app_name="foundry_local_rag")
    FoundryLocalManager.initialize(config)
    manager = FoundryLocalManager.instance

    # Load the embedding model
    embedding_model = manager.catalog.get_model("qwen3-embedding-0.6b")
    embedding_model.download(
        lambda p: print(f"\rDownloading embedding model: {p:.1f}%", end="", flush=True)
    )
    print()
    embedding_model.load()
    embedding_client = embedding_model.get_embedding_client()

    # Embed all documents
    response = embedding_client.generate_embeddings(documents)
    doc_embeddings = [item.embedding for item in response.data]
    print(f"Indexed {len(doc_embeddings)} documents.")

    # Load the chat model
    chat_model = manager.catalog.get_model("qwen2.5-0.5b")
    chat_model.download(
        lambda p: print(f"\rDownloading chat model: {p:.1f}%", end="", flush=True)
    )
    print()
    chat_model.load()
    chat_client = chat_model.get_chat_client()

    print("\nModels loaded. Ready for questions.")
    print("\nThe knowledge base contains information about:")
    print("  - Foundry Local features and architecture")
    print("  - Supported programming languages")
    print("  - Embedding models and vector search")
    print("  - ONNX Runtime inference")
    print("  - The model catalog")
    print("  - RAG and chat completions")
    print("\nExample questions:")
    print('  "What programming languages does the SDK support?"')
    print('  "How does Foundry Local run models?"')
    print('  "What is retrieval-augmented generation?"')
    print('\nType "quit" to exit.\n')

    # Interactive query loop
    while True:
        query = input("Question: ").strip()
        if not query or query.lower() == "quit":
            break

        # Embed the query
        query_response = embedding_client.generate_embedding(query)
        query_embedding = query_response.data[0].embedding

        # Retrieve the most relevant documents
        results = find_relevant(query_embedding, doc_embeddings, top_k=2)
        context = "\n".join(f"- {documents[i]}" for i, _ in results)

        # Build the prompt with retrieved context
        messages = [
            {
                "role": "system",
                "content": (
                    "Answer the user's question using only the provided context. "
                    "If the context doesn't contain enough information, say so.\n\n"
                    f"Context:\n{context}"
                ),
            },
            {"role": "user", "content": query},
        ]

        # Stream the response
        print("Answer: ", end="", flush=True)
        for chunk in chat_client.complete_streaming_chat(messages):
            content = chunk.choices[0].delta.content
            if content:
                print(content, end="", flush=True)
        print("\n")

    # Clean up
    embedding_model.unload()
    chat_model.unload()
    print("Models unloaded. Done!")


if __name__ == "__main__":
    main()

Run the application:

python main.py

You see output similar to the following:

Downloading embedding model: 100.0%
Indexed 8 documents.
Downloading chat model: 100.0%

Models loaded. Ready for questions.

The knowledge base contains information about:
  - Foundry Local features and architecture
  - Supported programming languages
  - Embedding models and vector search
  - ONNX Runtime inference
  - The model catalog
  - RAG and chat completions

Example questions:
  "What programming languages does the SDK support?"
  "How does Foundry Local run models?"
  "What is retrieval-augmented generation?"

Type "quit" to exit.

Question: What programming languages does the SDK support?
Answer: The Foundry Local SDK supports Python, C#, JavaScript, and Rust.

Question: quit
Models unloaded. Done!

Clean up resources

After the models are unloaded, the model weights remain in the local cache. The next time you run the application, the download step is skipped and the models load faster. No additional cleanup is needed unless you want to reclaim disk space.