Build a retrieval-augmented generation (RAG) application that runs entirely on-device and answers questions about a collection of documents. RAG combines embedding-based search with a chat model so that answers are grounded in your own data instead of relying only on the model's training knowledge.
In this tutorial, you learn how to:
- Set up the project and install the Foundry Local SDK
- Create a knowledge base of text documents
- Generate embeddings for the documents
- Search for relevant documents by similarity
- Generate grounded answers from the retrieved context
- Clean up resources
Prerequisites
- A Windows, macOS, or Linux computer with at least 8 GB of RAM.
- Python 3.11 or later installed.
Install packages
If you're developing or publishing on Windows, select the Windows tab. The Windows package integrates with the Windows ML runtime, which exposes the same API surface and provides broader hardware acceleration support.
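The exact install command depends on the platform tab you choose. As a hedged example, the package name below is inferred from the module imported later in this tutorial and might not match the Windows-specific package; check the install instructions for your platform:

pip install foundry-local-sdk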
Create a knowledge base
A RAG application needs a set of documents to search. Define a list of text strings to use as the knowledge base. In a production application, this list could contain paragraphs loaded from files, database records, or any other text source; a sketch of file-based loading follows the code below.
Create a file named main.py and add the following code:
import math
from foundry_local_sdk import Configuration, FoundryLocalManager
# Knowledge base — each string represents a document
documents = [
"Foundry Local runs AI models directly on your device without cloud connectivity.",
"The Foundry Local SDK supports Python, C#, JavaScript, and Rust.",
"Embedding models convert text into numerical vectors for similarity search.",
"Foundry Local uses ONNX Runtime for efficient model inference on CPUs and GPUs.",
"The model catalog provides pre-optimized models that you can download and run locally.",
"Retrieval-augmented generation grounds model responses in your own data.",
"Vector similarity search finds documents that are semantically close to a query.",
"Chat completions generate natural language responses from a prompt and context.",
]
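In a real project, you might load the knowledge base from disk instead of hard-coding it. The following is a minimal sketch that uses only the standard library and assumes a folder of .txt files; the folder name docs/ is illustrative:

from pathlib import Path

# Load each .txt file in the docs/ folder as one document (illustrative folder name).
documents = [p.read_text(encoding="utf-8") for p in sorted(Path("docs").glob("*.txt"))]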
Generate document embeddings
Initialize the SDK, load the embedding model, and convert each document into a numerical vector. These vectors represent the semantic meaning of each document and make similarity search possible.
Add the following code to main.py:
def main():
    # Initialize the SDK
    config = Configuration(app_name="foundry_local_rag")
    FoundryLocalManager.initialize(config)
    manager = FoundryLocalManager.instance

    # Load the embedding model
    embedding_model = manager.catalog.get_model("qwen3-embedding-0.6b")
    embedding_model.download(
        lambda p: print(f"\rDownloading embedding model: {p:.1f}%", end="", flush=True)
    )
    print()
    embedding_model.load()
    embedding_client = embedding_model.get_embedding_client()

    # Embed all documents in a single batch call
    response = embedding_client.generate_embeddings(documents)
    doc_embeddings = [item.embedding for item in response.data]
    print(f"Indexed {len(doc_embeddings)} documents.")
The generate_embeddings method accepts a list of strings and returns one vector per input. Each vector captures the semantic meaning of its text, so similar documents produce vectors that are close together in the embedding space.
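To get a feel for what comes back, you can optionally print the shape of a vector inside main(); the exact dimensionality depends on the embedding model, so this sketch doesn't assume a specific number:

    # Optional: inspect the first embedding (dimensionality depends on the model).
    print(f"Each embedding has {len(doc_embeddings[0])} dimensions.")
    print(f"First values: {doc_embeddings[0][:5]}")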
Search for relevant documents
To find the documents most relevant to a query, compare the query's embedding with each document embedding using cosine similarity. Cosine similarity measures how closely two vectors point in the same direction, regardless of their magnitude. Values close to 1.0 indicate high similarity.
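For two vectors $a$ and $b$, cosine similarity is the dot product divided by the product of their magnitudes:

$$\text{similarity}(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$

The cosine_similarity helper below implements this formula directly.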
In main.py, add the following helper functions above main():
def cosine_similarity(a, b):
    """Compute cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def find_relevant(query_embedding, doc_embeddings, top_k=2):
    """Return the indices and scores of the top-k most similar documents."""
    scores = []
    for i, doc_emb in enumerate(doc_embeddings):
        score = cosine_similarity(query_embedding, doc_emb)
        scores.append((i, score))
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:top_k]
The find_relevant function ranks every document by similarity and returns the best matches. This approach works well for small collections. For larger datasets, consider a dedicated vector database.
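As an intermediate step before adopting a vector database, the same top-k search can be vectorized. This sketch assumes NumPy is installed as an extra dependency and is an optional drop-in alternative to find_relevant:

import numpy as np

def find_relevant_np(query_embedding, doc_embeddings, top_k=2):
    """Vectorized top-k cosine similarity (optional NumPy alternative)."""
    docs = np.asarray(doc_embeddings, dtype=np.float32)    # shape: (n_docs, dim)
    query = np.asarray(query_embedding, dtype=np.float32)  # shape: (dim,)
    # Normalize the rows and the query, then one matrix-vector product yields all scores.
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    scores = docs @ query
    top = np.argsort(scores)[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in top]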
Generate grounded answers
Load the chat model, then combine the retrieved documents with the user's question in a system prompt. The chat model uses the provided context to generate an answer that's grounded in your documents.
Add the following code to the main() function, after the embedding section:
    # Load the chat model
    chat_model = manager.catalog.get_model("qwen2.5-0.5b")
    chat_model.download(
        lambda p: print(f"\rDownloading chat model: {p:.1f}%", end="", flush=True)
    )
    print()
    chat_model.load()
    chat_client = chat_model.get_chat_client()

    print("\nModels loaded. Ready for questions.")
    print("\nThe knowledge base contains information about:")
    print(" - Foundry Local features and architecture")
    print(" - Supported programming languages")
    print(" - Embedding models and vector search")
    print(" - ONNX Runtime inference")
    print(" - The model catalog")
    print(" - RAG and chat completions")
    print("\nExample questions:")
    print(' "What programming languages does the SDK support?"')
    print(' "How does Foundry Local run models?"')
    print(' "What is retrieval-augmented generation?"')
    print('\nType "quit" to exit.\n')

    # Interactive query loop
    while True:
        query = input("Question: ").strip()
        if not query or query.lower() == "quit":
            break

        # Embed the query
        query_response = embedding_client.generate_embedding(query)
        query_embedding = query_response.data[0].embedding

        # Retrieve the most relevant documents
        results = find_relevant(query_embedding, doc_embeddings, top_k=2)
        context = "\n".join(f"- {documents[i]}" for i, _ in results)

        # Build the prompt with retrieved context
        messages = [
            {
                "role": "system",
                "content": (
                    "Answer the user's question using only the provided context. "
                    "If the context doesn't contain enough information, say so.\n\n"
                    f"Context:\n{context}"
                ),
            },
            {"role": "user", "content": query},
        ]

        # Stream the response
        print("Answer: ", end="", flush=True)
        for chunk in chat_client.complete_streaming_chat(messages):
            content = chunk.choices[0].delta.content
            if content:
                print(content, end="", flush=True)
        print("\n")

    # Clean up
    embedding_model.unload()
    chat_model.unload()
    print("Models unloaded. Done!")

if __name__ == "__main__":
    main()
The system prompt instructs the model to answer using only the retrieved context. This keeps responses grounded in your documents and reduces incorrect answers. Streaming the output prints each token as it's generated, which makes the response feel more interactive.
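If you also want the complete answer as a single string, for example to log it or test it, you can collect the streamed chunks. This optional sketch reuses the same chat client and message format shown above:

def answer_question(chat_client, messages):
    """Collect the streamed answer into one string (optional helper, not part of the tutorial)."""
    parts = []
    for chunk in chat_client.complete_streaming_chat(messages):
        content = chunk.choices[0].delta.content
        if content:
            parts.append(content)
    return "".join(parts)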
Complete code
Here's the complete application with all the steps combined:
import math

from foundry_local_sdk import Configuration, FoundryLocalManager

# Knowledge base
documents = [
    "Foundry Local runs AI models directly on your device without cloud connectivity.",
    "The Foundry Local SDK supports Python, C#, JavaScript, and Rust.",
    "Embedding models convert text into numerical vectors for similarity search.",
    "Foundry Local uses ONNX Runtime for efficient model inference on CPUs and GPUs.",
    "The model catalog provides pre-optimized models that you can download and run locally.",
    "Retrieval-augmented generation grounds model responses in your own data.",
    "Vector similarity search finds documents that are semantically close to a query.",
    "Chat completions generate natural language responses from a prompt and context.",
]

def cosine_similarity(a, b):
    """Compute cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def find_relevant(query_embedding, doc_embeddings, top_k=2):
    """Return the indices and scores of the top-k most similar documents."""
    scores = []
    for i, doc_emb in enumerate(doc_embeddings):
        score = cosine_similarity(query_embedding, doc_emb)
        scores.append((i, score))
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:top_k]

def main():
    # Initialize the SDK
    config = Configuration(app_name="foundry_local_rag")
    FoundryLocalManager.initialize(config)
    manager = FoundryLocalManager.instance

    # Load the embedding model
    embedding_model = manager.catalog.get_model("qwen3-embedding-0.6b")
    embedding_model.download(
        lambda p: print(f"\rDownloading embedding model: {p:.1f}%", end="", flush=True)
    )
    print()
    embedding_model.load()
    embedding_client = embedding_model.get_embedding_client()

    # Embed all documents
    response = embedding_client.generate_embeddings(documents)
    doc_embeddings = [item.embedding for item in response.data]
    print(f"Indexed {len(doc_embeddings)} documents.")

    # Load the chat model
    chat_model = manager.catalog.get_model("qwen2.5-0.5b")
    chat_model.download(
        lambda p: print(f"\rDownloading chat model: {p:.1f}%", end="", flush=True)
    )
    print()
    chat_model.load()
    chat_client = chat_model.get_chat_client()

    print("\nModels loaded. Ready for questions.")
    print("\nThe knowledge base contains information about:")
    print(" - Foundry Local features and architecture")
    print(" - Supported programming languages")
    print(" - Embedding models and vector search")
    print(" - ONNX Runtime inference")
    print(" - The model catalog")
    print(" - RAG and chat completions")
    print("\nExample questions:")
    print(' "What programming languages does the SDK support?"')
    print(' "How does Foundry Local run models?"')
    print(' "What is retrieval-augmented generation?"')
    print('\nType "quit" to exit.\n')

    # Interactive query loop
    while True:
        query = input("Question: ").strip()
        if not query or query.lower() == "quit":
            break

        # Embed the query
        query_response = embedding_client.generate_embedding(query)
        query_embedding = query_response.data[0].embedding

        # Retrieve the most relevant documents
        results = find_relevant(query_embedding, doc_embeddings, top_k=2)
        context = "\n".join(f"- {documents[i]}" for i, _ in results)

        # Build the prompt with retrieved context
        messages = [
            {
                "role": "system",
                "content": (
                    "Answer the user's question using only the provided context. "
                    "If the context doesn't contain enough information, say so.\n\n"
                    f"Context:\n{context}"
                ),
            },
            {"role": "user", "content": query},
        ]

        # Stream the response
        print("Answer: ", end="", flush=True)
        for chunk in chat_client.complete_streaming_chat(messages):
            content = chunk.choices[0].delta.content
            if content:
                print(content, end="", flush=True)
        print("\n")

    # Clean up
    embedding_model.unload()
    chat_model.unload()
    print("Models unloaded. Done!")

if __name__ == "__main__":
    main()
Run the application:
python main.py
You see output similar to the following:
Downloading embedding model: 100.0%
Indexed 8 documents.
Downloading chat model: 100.0%
Models loaded. Ready for questions.
The knowledge base contains information about:
- Foundry Local features and architecture
- Supported programming languages
- Embedding models and vector search
- ONNX Runtime inference
- The model catalog
- RAG and chat completions
Example questions:
"What programming languages does the SDK support?"
"How does Foundry Local run models?"
"What is retrieval-augmented generation?"
Type "quit" to exit.
Question: What programming languages does the SDK support?
Answer: The Foundry Local SDK supports Python, C#, JavaScript, and Rust.
Question: quit
Models unloaded. Done!
Clean up resources
When you unload the models, the model weights remain in the local cache. The next time you run the application, the download step is skipped and the models load faster. No additional cleanup is needed unless you want to reclaim disk space.
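If you want to check how much disk space the cached models use, a standard-library sketch like the following works; the cache path shown is hypothetical, so replace it with the cache location of your Foundry Local installation:

from pathlib import Path

def cache_size_gb(cache_dir):
    """Total size of all files under a directory, in gigabytes."""
    total = sum(f.stat().st_size for f in Path(cache_dir).rglob("*") if f.is_file())
    return total / (1024 ** 3)

# Hypothetical path; check your Foundry Local installation for the real cache location.
print(f"Model cache uses {cache_size_gb('/path/to/foundry/cache'):.2f} GB")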
Related content
- Generate text embeddings with Foundry Local
- Tutorial: Build a multi-turn chat assistant
- Use the local chat completions API with Foundry Local
- Foundry Local SDK reference