教學:建立一個 RAG 應用程式

建立一個檢索增強生成(RAG)應用程式,能完全在你的裝置上回答關於文件集合的問題。 RAG 結合了基於嵌入的搜尋與聊天模型,使答案基於您自己的數據,而非僅依賴模型的訓練知識。

在這個教學中,你會學到如何:

  • 建立專案並安裝 Foundry 本地 SDK
  • 建立文字文件知識庫
  • 為文件產生內嵌
  • 依相似度搜尋相關文件
  • 根據檢索的上下文產生答案
  • 清理資源

先決條件

  • 一台至少有 8GB 記憶體的 Windows、macOS 或 Linux 電腦。
  • Python 3.11 或更新版本安裝。

安裝套件

如果你正在開發或部署在 Windows 上,請選擇Windows標籤。Windows 套件整合於 Windows ML 執行環境,提供相同的 API 接口,並提供更廣泛的硬體加速。

pip install foundry-local-sdk-winml openai

建立知識庫

RAG 應用程式需要一組文件來搜尋。 定義一個作為知識庫的文字串清單。 在生產應用程式中,清單可以是檔案、資料庫記錄或任何其他文字來源的段落。

建立一個名為 main.py S 的檔案,並加入以下程式碼:

import math
from foundry_local_sdk import Configuration, FoundryLocalManager

# Knowledge base — each string represents a document
documents = [
    "Foundry Local runs AI models directly on your device without cloud connectivity.",
    "The Foundry Local SDK supports Python, C#, JavaScript, and Rust.",
    "Embedding models convert text into numerical vectors for similarity search.",
    "Foundry Local uses ONNX Runtime for efficient model inference on CPUs and GPUs.",
    "The model catalog provides pre-optimized models that you can download and run locally.",
    "Retrieval-augmented generation grounds model responses in your own data.",
    "Vector similarity search finds documents that are semantically close to a query.",
    "Chat completions generate natural language responses from a prompt and context.",
]

產生文件嵌入

初始化 SDK,載入嵌入模型,並將每份文件轉換為數值向量。 這些向量代表每份文件的語意意義,並能進行相似性搜尋。

將以下程式碼加入 main.py

def main():
    # Initialize the SDK
    config = Configuration(app_name="foundry_local_rag")
    FoundryLocalManager.initialize(config)
    manager = FoundryLocalManager.instance

    # Load the embedding model
    embedding_model = manager.catalog.get_model("qwen3-embedding-0.6b")
    embedding_model.download(
        lambda p: print(f"\rDownloading embedding model: {p:.1f}%", end="", flush=True)
    )
    print()
    embedding_model.load()
    embedding_client = embedding_model.get_embedding_client()

    # Embed all documents in a single batch call
    response = embedding_client.generate_embeddings(documents)
    doc_embeddings = [item.embedding for item in response.data]
    print(f"Indexed {len(doc_embeddings)} documents.")

generate_embeddings 方法接受一串字串,並每個輸入回傳一個向量。 每個向量都捕捉文本的語意意義,因此相似文件會產生在嵌入空間中彼此接近的向量。

搜尋相關文件

要找到與查詢相關的文件,請使用餘弦相似度將查詢的嵌入與每個文件嵌入進行比較。 餘弦相似度衡量兩個向量在方向上的接近程度,無論大小。 接近1.0的數值表示高度相似。

在上方main()新增以下輔助函式:main.py

def cosine_similarity(a, b):
    """Compute cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_relevant(query_embedding, doc_embeddings, top_k=2):
    """Return the indices and scores of the top-k most similar documents."""
    scores = []
    for i, doc_emb in enumerate(doc_embeddings):
        score = cosine_similarity(query_embedding, doc_emb)
        scores.append((i, score))
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:top_k]

find_relevant 函數會依相似度排名所有文件,並回傳最匹配的。 這種方法對於小型收藏非常有效。 對於較大的資料集,可以考慮專用的向量資料庫。

產生有根據的答案

載入聊天模型,並將檢索的文件與使用者在系統提示中的問題合併。 聊天模式會利用所提供的情境,產生一個以文件為基礎的答案。

在嵌入段落後,將以下程式碼加入 main() 函式:

    # Load the chat model
    chat_model = manager.catalog.get_model("qwen2.5-0.5b")
    chat_model.download(
        lambda p: print(f"\rDownloading chat model: {p:.1f}%", end="", flush=True)
    )
    print()
    chat_model.load()
    chat_client = chat_model.get_chat_client()

    print("\nModels loaded. Ready for questions.")
    print("\nThe knowledge base contains information about:")
    print("  - Foundry Local features and architecture")
    print("  - Supported programming languages")
    print("  - Embedding models and vector search")
    print("  - ONNX Runtime inference")
    print("  - The model catalog")
    print("  - RAG and chat completions")
    print("\nExample questions:")
    print('  "What programming languages does the SDK support?"')
    print('  "How does Foundry Local run models?"')
    print('  "What is retrieval-augmented generation?"')
    print('\nType "quit" to exit.\n')

    # Interactive query loop
    while True:
        query = input("Question: ").strip()
        if not query or query.lower() == "quit":
            break

        # Embed the query
        query_response = embedding_client.generate_embedding(query)
        query_embedding = query_response.data[0].embedding

        # Retrieve the most relevant documents
        results = find_relevant(query_embedding, doc_embeddings, top_k=2)
        context = "\n".join(f"- {documents[i]}" for i, _ in results)

        # Build the prompt with retrieved context
        messages = [
            {
                "role": "system",
                "content": (
                    "Answer the user's question using only the provided context. "
                    "If the context doesn't contain enough information, say so.\n\n"
                    f"Context:\n{context}"
                ),
            },
            {"role": "user", "content": query},
        ]

        # Stream the response
        print("Answer: ", end="", flush=True)
        for chunk in chat_client.complete_streaming_chat(messages):
            content = chunk.choices[0].delta.content
            if content:
                print(content, end="", flush=True)
        print("\n")

    # Clean up
    embedding_model.unload()
    chat_model.unload()
    print("Models unloaded. Done!")


if __name__ == "__main__":
    main()

系統提示詞指示模型僅使用擷取的上下文來回答。 系統提示能讓回答與你的文件保持連結,並減少錯誤答案。 串流輸出會在產生每個詞元時列印,使回應感覺像互動式。

完整程式碼

以下是完整的申請表,結合了所有步驟:

import math
from foundry_local_sdk import Configuration, FoundryLocalManager

# Knowledge base
documents = [
    "Foundry Local runs AI models directly on your device without cloud connectivity.",
    "The Foundry Local SDK supports Python, C#, JavaScript, and Rust.",
    "Embedding models convert text into numerical vectors for similarity search.",
    "Foundry Local uses ONNX Runtime for efficient model inference on CPUs and GPUs.",
    "The model catalog provides pre-optimized models that you can download and run locally.",
    "Retrieval-augmented generation grounds model responses in your own data.",
    "Vector similarity search finds documents that are semantically close to a query.",
    "Chat completions generate natural language responses from a prompt and context.",
]


def cosine_similarity(a, b):
    """Compute cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_relevant(query_embedding, doc_embeddings, top_k=2):
    """Return the indices and scores of the top-k most similar documents."""
    scores = []
    for i, doc_emb in enumerate(doc_embeddings):
        score = cosine_similarity(query_embedding, doc_emb)
        scores.append((i, score))
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:top_k]


def main():
    # Initialize the SDK
    config = Configuration(app_name="foundry_local_rag")
    FoundryLocalManager.initialize(config)
    manager = FoundryLocalManager.instance

    # Load the embedding model
    embedding_model = manager.catalog.get_model("qwen3-embedding-0.6b")
    embedding_model.download(
        lambda p: print(f"\rDownloading embedding model: {p:.1f}%", end="", flush=True)
    )
    print()
    embedding_model.load()
    embedding_client = embedding_model.get_embedding_client()

    # Embed all documents
    response = embedding_client.generate_embeddings(documents)
    doc_embeddings = [item.embedding for item in response.data]
    print(f"Indexed {len(doc_embeddings)} documents.")

    # Load the chat model
    chat_model = manager.catalog.get_model("qwen2.5-0.5b")
    chat_model.download(
        lambda p: print(f"\rDownloading chat model: {p:.1f}%", end="", flush=True)
    )
    print()
    chat_model.load()
    chat_client = chat_model.get_chat_client()

    print("\nModels loaded. Ready for questions.")
    print("\nThe knowledge base contains information about:")
    print("  - Foundry Local features and architecture")
    print("  - Supported programming languages")
    print("  - Embedding models and vector search")
    print("  - ONNX Runtime inference")
    print("  - The model catalog")
    print("  - RAG and chat completions")
    print("\nExample questions:")
    print('  "What programming languages does the SDK support?"')
    print('  "How does Foundry Local run models?"')
    print('  "What is retrieval-augmented generation?"')
    print('\nType "quit" to exit.\n')

    # Interactive query loop
    while True:
        query = input("Question: ").strip()
        if not query or query.lower() == "quit":
            break

        # Embed the query
        query_response = embedding_client.generate_embedding(query)
        query_embedding = query_response.data[0].embedding

        # Retrieve the most relevant documents
        results = find_relevant(query_embedding, doc_embeddings, top_k=2)
        context = "\n".join(f"- {documents[i]}" for i, _ in results)

        # Build the prompt with retrieved context
        messages = [
            {
                "role": "system",
                "content": (
                    "Answer the user's question using only the provided context. "
                    "If the context doesn't contain enough information, say so.\n\n"
                    f"Context:\n{context}"
                ),
            },
            {"role": "user", "content": query},
        ]

        # Stream the response
        print("Answer: ", end="", flush=True)
        for chunk in chat_client.complete_streaming_chat(messages):
            content = chunk.choices[0].delta.content
            if content:
                print(content, end="", flush=True)
        print("\n")

    # Clean up
    embedding_model.unload()
    chat_model.unload()
    print("Models unloaded. Done!")


if __name__ == "__main__":
    main()

執行應用程式:

python main.py

你會看到類似的輸出:

Downloading embedding model: 100.0%
Indexed 8 documents.
Downloading chat model: 100.0%

Models loaded. Ready for questions.

The knowledge base contains information about:
  - Foundry Local features and architecture
  - Supported programming languages
  - Embedding models and vector search
  - ONNX Runtime inference
  - The model catalog
  - RAG and chat completions

Example questions:
  "What programming languages does the SDK support?"
  "How does Foundry Local run models?"
  "What is retrieval-augmented generation?"

Type "quit" to exit.

Question: What programming languages does the SDK support?
Answer: The Foundry Local SDK supports Python, C#, JavaScript, and Rust.

Question: quit
Models unloaded. Done!

清理資源

卸載模型後,模型權重仍保留在本地快取中。 下次執行應用程式時,下載步驟會被跳過,模型載入速度變快。 除非你想搶回磁碟空間,否則不需要額外清理。