教學：建立一個 RAG 應用程式

建立一個檢索增強生成（RAG）應用程式，能完全在你的裝置上回答關於文件集合的問題。 RAG 結合了基於嵌入的搜尋與聊天模型，使答案基於您自己的數據，而非僅依賴模型的訓練知識。

在這個教學中，你會學到如何：

建立專案並安裝 Foundry 本地 SDK
建立文字文件知識庫
為文件產生內嵌
依相似度搜尋相關文件
根據檢索的上下文產生答案
清理資源

先決條件

一台至少有 8GB 記憶體的 Windows、macOS 或 Linux 電腦。
Python 3.11 或更新版本安裝。

安裝套件

如果你正在開發或部署在 Windows 上，請選擇Windows標籤。Windows 套件整合於 Windows ML 執行環境，提供相同的 API 接口，並提供更廣泛的硬體加速。

Windows
跨平台

pip install foundry-local-sdk-winml openai

pip install foundry-local-sdk openai

建立知識庫

RAG 應用程式需要一組文件來搜尋。定義一個作為知識庫的文字串清單。在生產應用程式中，清單可以是檔案、資料庫記錄或任何其他文字來源的段落。

建立一個名為 main.py S 的檔案，並加入以下程式碼：

import math
from foundry_local_sdk import Configuration, FoundryLocalManager

# Knowledge base — each string represents a document
documents = [
    "Foundry Local runs AI models directly on your device without cloud connectivity.",
    "The Foundry Local SDK supports Python, C#, JavaScript, and Rust.",
    "Embedding models convert text into numerical vectors for similarity search.",
    "Foundry Local uses ONNX Runtime for efficient model inference on CPUs and GPUs.",
    "The model catalog provides pre-optimized models that you can download and run locally.",
    "Retrieval-augmented generation grounds model responses in your own data.",
    "Vector similarity search finds documents that are semantically close to a query.",
    "Chat completions generate natural language responses from a prompt and context.",
]

產生文件嵌入

初始化 SDK，載入嵌入模型，並將每份文件轉換為數值向量。這些向量代表每份文件的語意意義，並能進行相似性搜尋。

將以下程式碼加入 main.py：

def main():
    # Initialize the SDK
    config = Configuration(app_name="foundry_local_rag")
    FoundryLocalManager.initialize(config)
    manager = FoundryLocalManager.instance

    # Load the embedding model
    embedding_model = manager.catalog.get_model("qwen3-embedding-0.6b")
    embedding_model.download(
        lambda p: print(f"\rDownloading embedding model: {p:.1f}%", end="", flush=True)
    )
    print()
    embedding_model.load()
    embedding_client = embedding_model.get_embedding_client()

    # Embed all documents in a single batch call
    response = embedding_client.generate_embeddings(documents)
    doc_embeddings = [item.embedding for item in response.data]
    print(f"Indexed {len(doc_embeddings)} documents.")

此 generate_embeddings 方法接受一串字串，並每個輸入回傳一個向量。每個向量都捕捉文本的語意意義，因此相似文件會產生在嵌入空間中彼此接近的向量。

搜尋相關文件

要找到與查詢相關的文件，請使用餘弦相似度將查詢的嵌入與每個文件嵌入進行比較。餘弦相似度衡量兩個向量在方向上的接近程度，無論大小。接近1.0的數值表示高度相似。

在上方main()新增以下輔助函式：main.py

def cosine_similarity(a, b):
    """Compute cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_relevant(query_embedding, doc_embeddings, top_k=2):
    """Return the indices and scores of the top-k most similar documents."""
    scores = []
    for i, doc_emb in enumerate(doc_embeddings):
        score = cosine_similarity(query_embedding, doc_emb)
        scores.append((i, score))
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:top_k]

該 find_relevant 函數會依相似度排名所有文件，並回傳最匹配的。這種方法對於小型收藏非常有效。對於較大的資料集，可以考慮專用的向量資料庫。

產生有根據的答案

載入聊天模型，並將檢索的文件與使用者在系統提示中的問題合併。聊天模式會利用所提供的情境，產生一個以文件為基礎的答案。

在嵌入段落後，將以下程式碼加入 main() 函式：

    # Load the chat model
    chat_model = manager.catalog.get_model("qwen2.5-0.5b")
    chat_model.download(
        lambda p: print(f"\rDownloading chat model: {p:.1f}%", end="", flush=True)
    )
    print()
    chat_model.load()
    chat_client = chat_model.get_chat_client()

    print("\nModels loaded. Ready for questions.")
    print("\nThe knowledge base contains information about:")
    print("  - Foundry Local features and architecture")
    print("  - Supported programming languages")
    print("  - Embedding models and vector search")
    print("  - ONNX Runtime inference")
    print("  - The model catalog")
    print("  - RAG and chat completions")
    print("\nExample questions:")
    print('  "What programming languages does the SDK support?"')
    print('  "How does Foundry Local run models?"')
    print('  "What is retrieval-augmented generation?"')
    print('\nType "quit" to exit.\n')

    # Interactive query loop
    while True:
        query = input("Question: ").strip()
        if not query or query.lower() == "quit":
            break

        # Embed the query
        query_response = embedding_client.generate_embedding(query)
        query_embedding = query_response.data[0].embedding

        # Retrieve the most relevant documents
        results = find_relevant(query_embedding, doc_embeddings, top_k=2)
        context = "\n".join(f"- {documents[i]}" for i, _ in results)

        # Build the prompt with retrieved context
        messages = [
            {
                "role": "system",
                "content": (
                    "Answer the user's question using only the provided context. "
                    "If the context doesn't contain enough information, say so.\n\n"
                    f"Context:\n{context}"
                ),
            },
            {"role": "user", "content": query},
        ]

        # Stream the response
        print("Answer: ", end="", flush=True)
        for chunk in chat_client.complete_streaming_chat(messages):
            content = chunk.choices[0].delta.content
            if content:
                print(content, end="", flush=True)
        print("\n")

    # Clean up
    embedding_model.unload()
    chat_model.unload()
    print("Models unloaded. Done!")


if __name__ == "__main__":
    main()

系統提示詞指示模型僅使用擷取的上下文來回答。系統提示能讓回答與你的文件保持連結，並減少錯誤答案。串流輸出會在產生每個詞元時列印，使回應感覺像互動式。

完整程式碼

以下是完整的申請表，結合了所有步驟：

import math
from foundry_local_sdk import Configuration, FoundryLocalManager

# Knowledge base
documents = [
    "Foundry Local runs AI models directly on your device without cloud connectivity.",
    "The Foundry Local SDK supports Python, C#, JavaScript, and Rust.",
    "Embedding models convert text into numerical vectors for similarity search.",
    "Foundry Local uses ONNX Runtime for efficient model inference on CPUs and GPUs.",
    "The model catalog provides pre-optimized models that you can download and run locally.",
    "Retrieval-augmented generation grounds model responses in your own data.",
    "Vector similarity search finds documents that are semantically close to a query.",
    "Chat completions generate natural language responses from a prompt and context.",
]


def cosine_similarity(a, b):
    """Compute cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_relevant(query_embedding, doc_embeddings, top_k=2):
    """Return the indices and scores of the top-k most similar documents."""
    scores = []
    for i, doc_emb in enumerate(doc_embeddings):
        score = cosine_similarity(query_embedding, doc_emb)
        scores.append((i, score))
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:top_k]


def main():
    # Initialize the SDK
    config = Configuration(app_name="foundry_local_rag")
    FoundryLocalManager.initialize(config)
    manager = FoundryLocalManager.instance

    # Load the embedding model
    embedding_model = manager.catalog.get_model("qwen3-embedding-0.6b")
    embedding_model.download(
        lambda p: print(f"\rDownloading embedding model: {p:.1f}%", end="", flush=True)
    )
    print()
    embedding_model.load()
    embedding_client = embedding_model.get_embedding_client()

    # Embed all documents
    response = embedding_client.generate_embeddings(documents)
    doc_embeddings = [item.embedding for item in response.data]
    print(f"Indexed {len(doc_embeddings)} documents.")

    # Load the chat model
    chat_model = manager.catalog.get_model("qwen2.5-0.5b")
    chat_model.download(
        lambda p: print(f"\rDownloading chat model: {p:.1f}%", end="", flush=True)
    )
    print()
    chat_model.load()
    chat_client = chat_model.get_chat_client()

    print("\nModels loaded. Ready for questions.")
    print("\nThe knowledge base contains information about:")
    print("  - Foundry Local features and architecture")
    print("  - Supported programming languages")
    print("  - Embedding models and vector search")
    print("  - ONNX Runtime inference")
    print("  - The model catalog")
    print("  - RAG and chat completions")
    print("\nExample questions:")
    print('  "What programming languages does the SDK support?"')
    print('  "How does Foundry Local run models?"')
    print('  "What is retrieval-augmented generation?"')
    print('\nType "quit" to exit.\n')

    # Interactive query loop
    while True:
        query = input("Question: ").strip()
        if not query or query.lower() == "quit":
            break

        # Embed the query
        query_response = embedding_client.generate_embedding(query)
        query_embedding = query_response.data[0].embedding

        # Retrieve the most relevant documents
        results = find_relevant(query_embedding, doc_embeddings, top_k=2)
        context = "\n".join(f"- {documents[i]}" for i, _ in results)

        # Build the prompt with retrieved context
        messages = [
            {
                "role": "system",
                "content": (
                    "Answer the user's question using only the provided context. "
                    "If the context doesn't contain enough information, say so.\n\n"
                    f"Context:\n{context}"
                ),
            },
            {"role": "user", "content": query},
        ]

        # Stream the response
        print("Answer: ", end="", flush=True)
        for chunk in chat_client.complete_streaming_chat(messages):
            content = chunk.choices[0].delta.content
            if content:
                print(content, end="", flush=True)
        print("\n")

    # Clean up
    embedding_model.unload()
    chat_model.unload()
    print("Models unloaded. Done!")


if __name__ == "__main__":
    main()

執行應用程式：

python main.py

你會看到類似的輸出：

Downloading embedding model: 100.0%
Indexed 8 documents.
Downloading chat model: 100.0%

Models loaded. Ready for questions.

The knowledge base contains information about:
  - Foundry Local features and architecture
  - Supported programming languages
  - Embedding models and vector search
  - ONNX Runtime inference
  - The model catalog
  - RAG and chat completions

Example questions:
  "What programming languages does the SDK support?"
  "How does Foundry Local run models?"
  "What is retrieval-augmented generation?"

Type "quit" to exit.

Question: What programming languages does the SDK support?
Answer: The Foundry Local SDK supports Python, C#, JavaScript, and Rust.

Question: quit
Models unloaded. Done!

清理資源

卸載模型後，模型權重仍保留在本地快取中。下次執行應用程式時，下載步驟會被跳過，模型載入速度變快。除非你想搶回磁碟空間，否則不需要額外清理。

意見反應

此頁面對您有幫助嗎？

Last updated on 2026-05-12