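Build a retrieval-augmented generation (RAG) application that answers questions about a collection of documents entirely on your device. RAG combines embedding-based search with a chat model so that answers are grounded in your own data rather than relying only on the model's training knowledge.
In this tutorial, you learn how to:
- Set up a project and install the Foundry Local SDK
- Create a knowledge base of text documents
- Generate embeddings for the documents
- Search for relevant documents by similarity
- Generate answers grounded in the retrieved context
- Clean up resources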
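Prerequisites
- A Windows, macOS, or Linux computer with at least 8 GB of RAM.
- Python 3.11 or later installed.
Install packages
If you're developing or deploying on Windows, select the Windows tab. The Windows package integrates with the Windows ML runtime. It provides the same API surface with broader hardware acceleration.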
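Create the knowledge base
A RAG application needs a set of documents to search. Define a list of text strings that serves as the knowledge base. In a production application, the list could hold paragraphs from files, database records, or any other text source.
Create a file named main.py and add the following code: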
import math

from foundry_local_sdk import Configuration, FoundryLocalManager

# Knowledge base: each string represents a document
documents = [
    "Foundry Local runs AI models directly on your device without cloud connectivity.",
    "The Foundry Local SDK supports Python, C#, JavaScript, and Rust.",
    "Embedding models convert text into numerical vectors for similarity search.",
    "Foundry Local uses ONNX Runtime for efficient model inference on CPUs and GPUs.",
    "The model catalog provides pre-optimized models that you can download and run locally.",
    "Retrieval-augmented generation grounds model responses in your own data.",
    "Vector similarity search finds documents that are semantically close to a query.",
    "Chat completions generate natural language responses from a prompt and context.",
]
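For a knowledge base that comes from files rather than a hard-coded list, the paragraphs could be collected along these lines. This is a minimal sketch, not part of the tutorial's application: the helper names split_paragraphs and load_documents, and the assumption that the source is a folder of UTF-8 .txt files with paragraphs separated by blank lines, are hypothetical.

```python
from pathlib import Path


def split_paragraphs(text):
    """Split text on blank lines, collapse whitespace, and drop empty chunks."""
    chunks = [" ".join(block.split()) for block in text.split("\n\n")]
    return [chunk for chunk in chunks if chunk]


def load_documents(folder):
    """Collect paragraphs from every .txt file in a folder (assumed layout)."""
    documents = []
    for path in sorted(Path(folder).glob("*.txt")):
        documents.extend(split_paragraphs(path.read_text(encoding="utf-8")))
    return documents
```

The resulting list of strings can stand in for the documents list above, because the rest of the application only needs a list of text passages.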
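Generate document embeddings
Initialize the SDK, load the embedding model, and convert each document into a numerical vector. These vectors represent the semantic meaning of each document and make similarity search possible.
Add the following code to main.py: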
def main():
    # Initialize the SDK
    config = Configuration(app_name="foundry_local_rag")
    FoundryLocalManager.initialize(config)
    manager = FoundryLocalManager.instance

    # Load the embedding model
    embedding_model = manager.catalog.get_model("qwen3-embedding-0.6b")
    embedding_model.download(
        lambda p: print(f"\rDownloading embedding model: {p:.1f}%", end="", flush=True)
    )
    print()
    embedding_model.load()
    embedding_client = embedding_model.get_embedding_client()

    # Embed all documents in a single batch call
    response = embedding_client.generate_embeddings(documents)
    doc_embeddings = [item.embedding for item in response.data]
    print(f"Indexed {len(doc_embeddings)} documents.")
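The generate_embeddings method accepts a list of strings and returns one vector per input. Each vector captures the semantic meaning of its text, so similar documents produce vectors that lie close together in the embedding space.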
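Because the search step relies on having exactly one vector per document, all of the same length, a small sanity check after indexing can catch mismatches early. The following is a sketch only; check_embeddings is a hypothetical helper that works on plain Python lists and is not part of the Foundry Local SDK.

```python
def check_embeddings(texts, vectors):
    """Return the shared vector size, or raise ValueError on a mismatch."""
    if len(texts) != len(vectors):
        raise ValueError(f"expected {len(texts)} vectors, got {len(vectors)}")
    # Collect the distinct vector lengths; more than one means mixed sizes.
    dimensions = {len(vector) for vector in vectors}
    if len(dimensions) > 1:
        raise ValueError(f"inconsistent vector sizes: {sorted(dimensions)}")
    return dimensions.pop() if dimensions else 0
```

Inside main(), it could be called as check_embeddings(documents, doc_embeddings) right after the embeddings are extracted from the response.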
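Search for relevant documents
To find documents relevant to a query, compare the query's embedding with each document embedding by using cosine similarity. Cosine similarity measures how closely two vectors point in the same direction, regardless of their magnitude. Values close to 1.0 indicate high similarity.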
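Written out, cosine similarity is the dot product of two vectors divided by the product of their lengths. The small worked case uses illustrative vectors (not real embeddings) to show why magnitude doesn't matter: b is twice a, and the score is still 1.

```latex
\[
\cos(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2} \, \sqrt{\sum_i b_i^2}}
\]

\[
\mathbf{a} = (1, 2), \quad \mathbf{b} = (2, 4): \qquad
\cos(\mathbf{a}, \mathbf{b})
  = \frac{1 \cdot 2 + 2 \cdot 4}{\sqrt{5} \, \sqrt{20}}
  = \frac{10}{10} = 1
\]
```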
Add the following helper functions above main() in main.py:
def cosine_similarity(a, b):
    """Compute cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_relevant(query_embedding, doc_embeddings, top_k=2):
    """Return the indices and scores of the top-k most similar documents."""
    scores = []
    for i, doc_emb in enumerate(doc_embeddings):
        score = cosine_similarity(query_embedding, doc_emb)
        scores.append((i, score))
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:top_k]
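The find_relevant function ranks all documents by similarity and returns the top_k best matches. This approach works well for small collections because it compares the query against every document. For larger datasets, consider a dedicated vector database.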
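Before reaching for a vector database, a mid-sized collection can shave some work off the linear scan. One possible variation, sketched below under the assumption that embeddings are plain lists of floats, scales every vector to unit length once so that cosine similarity reduces to a dot product, and uses heapq.nlargest to avoid sorting the full score list. The names normalize and find_relevant_normalized are hypothetical and not part of the tutorial's code.

```python
import heapq
import math


def normalize(vector):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def find_relevant_normalized(query_unit, doc_units, top_k=2):
    """Return (index, score) pairs for the top-k documents by dot product.

    Both arguments must already be unit-length, so the dot product
    equals the cosine similarity.
    """
    scored = (
        (i, sum(q * d for q, d in zip(query_unit, doc)))
        for i, doc in enumerate(doc_units)
    )
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[1])
```

The document vectors would be normalized once after indexing, and each query vector once per question. This is still a full scan of the collection, so it only postpones the point at which a dedicated index becomes worthwhile.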
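Generate grounded answers
Load the chat model, then combine the retrieved documents in the system prompt with the user's question. The chat model uses the supplied context to produce an answer grounded in your documents.
Add the following code to the main() function, after the embedding section: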
    # Load the chat model
    chat_model = manager.catalog.get_model("qwen2.5-0.5b")
    chat_model.download(
        lambda p: print(f"\rDownloading chat model: {p:.1f}%", end="", flush=True)
    )
    print()
    chat_model.load()
    chat_client = chat_model.get_chat_client()

    print("\nModels loaded. Ready for questions.")
    print("\nThe knowledge base contains information about:")
    print(" - Foundry Local features and architecture")
    print(" - Supported programming languages")
    print(" - Embedding models and vector search")
    print(" - ONNX Runtime inference")
    print(" - The model catalog")
    print(" - RAG and chat completions")
    print("\nExample questions:")
    print(' "What programming languages does the SDK support?"')
    print(' "How does Foundry Local run models?"')
    print(' "What is retrieval-augmented generation?"')
    print('\nType "quit" to exit.\n')

    # Interactive query loop
    while True:
        query = input("Question: ").strip()
        if not query or query.lower() == "quit":
            break

        # Embed the query
        query_response = embedding_client.generate_embedding(query)
        query_embedding = query_response.data[0].embedding

        # Retrieve the most relevant documents
        results = find_relevant(query_embedding, doc_embeddings, top_k=2)
        context = "\n".join(f"- {documents[i]}" for i, _ in results)

        # Build the prompt with retrieved context
        messages = [
            {
                "role": "system",
                "content": (
                    "Answer the user's question using only the provided context. "
                    "If the context doesn't contain enough information, say so.\n\n"
                    f"Context:\n{context}"
                ),
            },
            {"role": "user", "content": query},
        ]

        # Stream the response
        print("Answer: ", end="", flush=True)
        for chunk in chat_client.complete_streaming_chat(messages):
            content = chunk.choices[0].delta.content
            if content:
                print(content, end="", flush=True)
        print("\n")

    # Clean up
    embedding_model.unload()
    chat_model.unload()
    print("Models unloaded. Done!")


if __name__ == "__main__":
    main()
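The system prompt instructs the model to answer using only the retrieved context. This keeps answers tied to your documents and reduces incorrect answers. Streaming output prints each token as it's generated, which makes the response feel interactive.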
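To inspect or unit-test the grounding prompt without loading a model, the message construction can be pulled into its own function. The sketch below mirrors the prompt text used in the loop above; build_messages is a hypothetical helper name, not something the SDK provides.

```python
def build_messages(query, retrieved_docs):
    """Build chat messages that restrict the answer to the retrieved context."""
    # One bullet per retrieved document, matching the loop's context format.
    context = "\n".join(f"- {doc}" for doc in retrieved_docs)
    system_prompt = (
        "Answer the user's question using only the provided context. "
        "If the context doesn't contain enough information, say so.\n\n"
        f"Context:\n{context}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query},
    ]
```

In the query loop, it could be called as build_messages(query, [documents[i] for i, _ in results]), which keeps the prompt wording in one place if you later adjust it.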
Complete code
Here's the complete application, combining all the steps:
import math

from foundry_local_sdk import Configuration, FoundryLocalManager

# Knowledge base
documents = [
    "Foundry Local runs AI models directly on your device without cloud connectivity.",
    "The Foundry Local SDK supports Python, C#, JavaScript, and Rust.",
    "Embedding models convert text into numerical vectors for similarity search.",
    "Foundry Local uses ONNX Runtime for efficient model inference on CPUs and GPUs.",
    "The model catalog provides pre-optimized models that you can download and run locally.",
    "Retrieval-augmented generation grounds model responses in your own data.",
    "Vector similarity search finds documents that are semantically close to a query.",
    "Chat completions generate natural language responses from a prompt and context.",
]


def cosine_similarity(a, b):
    """Compute cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_relevant(query_embedding, doc_embeddings, top_k=2):
    """Return the indices and scores of the top-k most similar documents."""
    scores = []
    for i, doc_emb in enumerate(doc_embeddings):
        score = cosine_similarity(query_embedding, doc_emb)
        scores.append((i, score))
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:top_k]


def main():
    # Initialize the SDK
    config = Configuration(app_name="foundry_local_rag")
    FoundryLocalManager.initialize(config)
    manager = FoundryLocalManager.instance

    # Load the embedding model
    embedding_model = manager.catalog.get_model("qwen3-embedding-0.6b")
    embedding_model.download(
        lambda p: print(f"\rDownloading embedding model: {p:.1f}%", end="", flush=True)
    )
    print()
    embedding_model.load()
    embedding_client = embedding_model.get_embedding_client()

    # Embed all documents
    response = embedding_client.generate_embeddings(documents)
    doc_embeddings = [item.embedding for item in response.data]
    print(f"Indexed {len(doc_embeddings)} documents.")

    # Load the chat model
    chat_model = manager.catalog.get_model("qwen2.5-0.5b")
    chat_model.download(
        lambda p: print(f"\rDownloading chat model: {p:.1f}%", end="", flush=True)
    )
    print()
    chat_model.load()
    chat_client = chat_model.get_chat_client()

    print("\nModels loaded. Ready for questions.")
    print("\nThe knowledge base contains information about:")
    print(" - Foundry Local features and architecture")
    print(" - Supported programming languages")
    print(" - Embedding models and vector search")
    print(" - ONNX Runtime inference")
    print(" - The model catalog")
    print(" - RAG and chat completions")
    print("\nExample questions:")
    print(' "What programming languages does the SDK support?"')
    print(' "How does Foundry Local run models?"')
    print(' "What is retrieval-augmented generation?"')
    print('\nType "quit" to exit.\n')

    # Interactive query loop
    while True:
        query = input("Question: ").strip()
        if not query or query.lower() == "quit":
            break

        # Embed the query
        query_response = embedding_client.generate_embedding(query)
        query_embedding = query_response.data[0].embedding

        # Retrieve the most relevant documents
        results = find_relevant(query_embedding, doc_embeddings, top_k=2)
        context = "\n".join(f"- {documents[i]}" for i, _ in results)

        # Build the prompt with retrieved context
        messages = [
            {
                "role": "system",
                "content": (
                    "Answer the user's question using only the provided context. "
                    "If the context doesn't contain enough information, say so.\n\n"
                    f"Context:\n{context}"
                ),
            },
            {"role": "user", "content": query},
        ]

        # Stream the response
        print("Answer: ", end="", flush=True)
        for chunk in chat_client.complete_streaming_chat(messages):
            content = chunk.choices[0].delta.content
            if content:
                print(content, end="", flush=True)
        print("\n")

    # Clean up
    embedding_model.unload()
    chat_model.unload()
    print("Models unloaded. Done!")


if __name__ == "__main__":
    main()
Run the application:
python main.py
You see output similar to the following:
Downloading embedding model: 100.0%
Indexed 8 documents.
Downloading chat model: 100.0%
Models loaded. Ready for questions.
The knowledge base contains information about:
- Foundry Local features and architecture
- Supported programming languages
- Embedding models and vector search
- ONNX Runtime inference
- The model catalog
- RAG and chat completions
Example questions:
"What programming languages does the SDK support?"
"How does Foundry Local run models?"
"What is retrieval-augmented generation?"
Type "quit" to exit.
Question: What programming languages does the SDK support?
Answer: The Foundry Local SDK supports Python, C#, JavaScript, and Rust.
Question: quit
Models unloaded. Done!
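Clean up resources
After you unload a model, its weights remain in the local cache. The next time you run the application, the download step is skipped and the model loads faster. No extra cleanup is needed unless you want to reclaim disk space.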