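Tutorial: Part 3 - Evaluate a custom chat application with the Microsoft Foundry SDK

In this tutorial, you evaluate the chat app you built in Part 2 of the tutorial series. You assess the app's quality across multiple metrics and then iterate on improvements. In this part, you:

Create an evaluation dataset
Evaluate the chat app with Azure AI evaluators
Iterate and improve your app

This tutorial builds on Part 2: Build a custom chat app with the Microsoft Foundry SDK.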

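Prerequisites

Note

This tutorial uses a hub-based project. The steps and code shown here don't work for a Foundry project. For more information, see Types of projects.

Complete Part 2 of the tutorial series to build the chat application.
Use the same Microsoft Foundry project you created in Part 1.
Azure AI permissions: the Owner or Contributor role, to modify model endpoint rate limits and run evaluation jobs.
Make sure you complete the steps in Part 2 that add telemetry logging.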

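Create an evaluation dataset

Use the following evaluation dataset, which contains example questions and expected answers. Use this dataset with an evaluator and the evaluate_chat_with_products() target function to assess how your chat app performs on relevance, groundedness, and coherence metrics.

Create a file named chat_eval_data.jsonl in your assets folder.

Paste this dataset into the file: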

{"query": "Which tent is the most waterproof?", "truth": "The Alpine Explorer Tent has the highest rainfly waterproof rating at 3000m"}
{"query": "Which camping table holds the most weight?", "truth": "The Adventure Dining Table has a higher weight capacity than all of the other camping tables mentioned"}
{"query": "How much do the TrailWalker Hiking Shoes cost? ", "truth": "The Trailewalker Hiking Shoes are priced at $110"}
{"query": "What is the proper care for trailwalker hiking shoes? ", "truth": "After each use, remove any dirt or debris by brushing or wiping the shoes with a damp cloth."}
{"query": "What brand is TrailMaster tent? ", "truth": "OutdoorLiving"}
{"query": "How do I carry the TrailMaster tent around? ", "truth": " Carry bag included for convenient storage and transportation"}
{"query": "What is the floor area for Floor Area? ", "truth": "80 square feet"}
{"query": "What is the material for TrailBlaze Hiking Pants?", "truth": "Made of high-quality nylon fabric"}
{"query": "What color does TrailBlaze Hiking Pants come in?", "truth": "Khaki"}
{"query": "Can the warrenty for TrailBlaze pants be transfered? ", "truth": "The warranty is non-transferable and applies only to the original purchaser of the TrailBlaze Hiking Pants. It is valid only when the product is purchased from an authorized retailer."}
{"query": "How long are the TrailBlaze pants under warranty for? ", "truth": " The TrailBlaze Hiking Pants are backed by a 1-year limited warranty from the date of purchase."}
{"query": "What is the material for PowerBurner Camping Stove? ", "truth": "Stainless Steel"}
{"query": "Is France in Europe?", "truth": "Sorry, I can only queries related to outdoor/camping gear and equipment"}

Reference: JSONL format for evaluation datasets.
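Before you run an evaluation, it can help to confirm that every line of the file parses as JSON and carries both fields. The following is a minimal sketch and not part of the tutorial's scripts. The helper name `find_bad_lines` is hypothetical, and the required keys are taken from the dataset above.

```python
import json

REQUIRED_KEYS = {"query", "truth"}  # field names used by the dataset above


def find_bad_lines(lines):
    """Return (line_number, reason) pairs for lines that are not valid records."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append((number, f"invalid JSON: {error.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            problems.append((number, f"missing keys: {sorted(missing)}"))
    return problems


# Small in-memory sample: one good record, one missing "truth", one not JSON
sample = [
    '{"query": "What brand is TrailMaster tent? ", "truth": "OutdoorLiving"}',
    '{"query": "Is France in Europe?"}',
    "not json",
]
print(find_bad_lines(sample))
```

To check the real file, pass an open file handle instead of the sample list, for example `find_bad_lines(open("assets/chat_eval_data.jsonl", encoding="utf-8"))`.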

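Evaluate with Azure AI evaluators

Create an evaluation script that defines a target function wrapper, loads the dataset, runs the evaluation, and logs the results to your Foundry project.

Create a file named evaluate.py in your main folder.

Add the following code to import the required libraries, create a project client, and configure some settings: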

import os
import pandas as pd
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import ConnectionType
from azure.ai.evaluation import evaluate, GroundednessEvaluator
from azure.identity import DefaultAzureCredential

from chat_with_products import chat_with_products

# load environment variables from the .env file at the root of this repo
from dotenv import load_dotenv

load_dotenv()

# create a project client using environment variables loaded from the .env file
project = AIProjectClient.from_connection_string(
    conn_str=os.environ["AIPROJECT_CONNECTION_STRING"], credential=DefaultAzureCredential()
)

connection = project.connections.get_default(connection_type=ConnectionType.AZURE_OPEN_AI, include_credentials=True)

evaluator_model = {
    "azure_endpoint": connection.endpoint_url,
    "azure_deployment": os.environ["EVALUATION_MODEL"],
    "api_version": "2024-06-01",
    "api_key": connection.key,
}

groundedness = GroundednessEvaluator(evaluator_model)

References: AIProjectClient, DefaultAzureCredential, azure-ai-evaluation.

Add code to create a wrapper function that implements the evaluation interface for query and response evaluation:

def evaluate_chat_with_products(query):
    response = chat_with_products(messages=[{"role": "user", "content": query}])
    return {"response": response["message"].content, "context": response["context"]["grounding_data"]}

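References: azure-ai-evaluation, evaluation target functions.

Finally, add code to run the evaluation, view the results locally, and get a link to the evaluation results in the Foundry portal: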
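The wrapper above takes a dataset column as input and returns a dictionary whose keys the evaluators can read. The sketch below exercises the same logic offline with a stand-in chat function, so you can check the return shape without calling a model. `fake_chat_with_products` and `evaluate_target` are hypothetical names, and the stand-in's return structure is an assumption modeled on the fields the wrapper reads.

```python
from types import SimpleNamespace


def fake_chat_with_products(messages):
    """Hypothetical stand-in that mimics the return shape of chat_with_products."""
    return {
        "message": SimpleNamespace(content=f"Echo: {messages[-1]['content']}"),
        "context": {"grounding_data": ["product doc 1", "product doc 2"]},
    }


def evaluate_target(query, chat=fake_chat_with_products):
    """Same logic as the wrapper above, with the chat function injectable."""
    response = chat(messages=[{"role": "user", "content": query}])
    return {
        "response": response["message"].content,
        "context": response["context"]["grounding_data"],
    }


# The parameter name matches the "query" field of a dataset row
output = evaluate_target(query="What brand is TrailMaster tent? ")
print(sorted(output))
```

Keeping the chat function injectable is only a testing convenience. The tutorial's wrapper calls chat_with_products directly.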

# Evaluate must be called inside of __main__, not on import
if __name__ == "__main__":
    from config import ASSET_PATH

    # workaround for multiprocessing issue on linux
    from pprint import pprint
    from pathlib import Path
    import multiprocessing
    import contextlib

    with contextlib.suppress(RuntimeError):
        multiprocessing.set_start_method("spawn", force=True)

    # run evaluation with a dataset and target function, log to the project
    result = evaluate(
        data=Path(ASSET_PATH) / "chat_eval_data.jsonl",
        target=evaluate_chat_with_products,
        evaluation_name="evaluate_chat_with_products",
        evaluators={
            "groundedness": groundedness,
        },
        evaluator_config={
            "default": {
                "column_mapping": {
                    "query": "${data.query}",
                    "response": "${target.response}",
                    "context": "${target.context}",
                }
            }
        },
        azure_ai_project=project.scope,
        output_path="./myevalresults.json",
    )

    tabular_result = pd.DataFrame(result.get("rows"))

    pprint("-----Summarized Metrics-----")
    pprint(result["metrics"])
    pprint("-----Tabular Result-----")
    pprint(tabular_result)
    pprint(f"View evaluation results in Foundry portal: {result['studio_url']}")

References: azure-ai-evaluation, AIProjectClient.
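The `${data.…}` and `${target.…}` placeholders in evaluator_config tell each evaluator where its inputs come from: a column of the dataset or a key returned by the target function. The toy resolver below only illustrates that idea. It is not how the SDK implements the lookup, and `resolve_mapping` is a hypothetical name.

```python
import re

# Matches placeholders such as ${data.query} or ${target.response}
PLACEHOLDER = re.compile(r"^\$\{(data|target)\.(\w+)\}$")


def resolve_mapping(mapping, data_row, target_output):
    """Replace each placeholder with the value it points to."""
    sources = {"data": data_row, "target": target_output}
    resolved = {}
    for name, placeholder in mapping.items():
        match = PLACEHOLDER.match(placeholder)
        if match is None:
            raise ValueError(f"unsupported placeholder: {placeholder!r}")
        source, key = match.groups()
        resolved[name] = sources[source][key]
    return resolved


mapping = {
    "query": "${data.query}",
    "response": "${target.response}",
    "context": "${target.context}",
}
data_row = {"query": "What brand is TrailMaster tent? ", "truth": "OutdoorLiving"}
target_output = {"response": "OutdoorLiving", "context": ["product doc"]}
print(resolve_mapping(mapping, data_row, target_output))
```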

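Configure the evaluation model

The evaluation script calls the model many times. Consider increasing the number of tokens per minute for the evaluation model.

In Part 1 of this tutorial series, you created an .env file that specifies the name of the evaluation model, gpt-4o-mini. If you have available quota, try increasing the tokens per minute limit for this model. If you don't have enough quota to increase the value, don't worry. The script is designed to handle limit errors.

In your project in the Foundry portal, select Models + endpoints.
Select gpt-4o-mini.
Select Edit.
If you have quota, increase the Tokens per Minute Rate Limit to 30 or higher.
Select Save and close.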
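When a model deployment is throttled, a common way for a client to cope is to retry with exponential backoff. The sketch below shows that general technique under stated assumptions. It is not the evaluation SDK's actual retry logic, and `RateLimitError` and `call_with_backoff` are hypothetical names.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 'too many requests' error."""


def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func, retrying on RateLimitError with exponentially growing waits."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            # Double the wait on each retry and add jitter to spread out clients
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)


# Demo: a call that is throttled twice before it succeeds
calls = {"count": 0}


def flaky_call():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise RateLimitError("throttled")
    return "ok"


delays = []
result = call_with_backoff(flaky_call, sleep=delays.append)  # record waits instead of sleeping
print(result, len(delays))
```

A higher tokens per minute limit reduces how often such retries are needed, which shortens the evaluation run.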

Run the evaluation script

From your console, sign in to your Azure account with the Azure CLI:
```
az login
```
Install the required package:
```
pip install "azure-ai-evaluation[remote]"
```
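References: azure-ai-evaluation SDK, Evaluation SDK documentation.

Verify your evaluation setup

Before you run the full evaluation, which takes 5 to 10 minutes, use the following quick test to confirm that the SDK and the project connection work: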

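import os
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential
from dotenv import load_dotenv

load_dotenv()  # read AIPROJECT_CONNECTION_STRING from .env, as evaluate.py does

# Verify connection to project
client = AIProjectClient.from_connection_string(
    conn_str=os.environ["AIPROJECT_CONNECTION_STRING"], credential=DefaultAzureCredential()
)
print("Evaluation SDK is ready! You can now run evaluate.py")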

If you see "Evaluation SDK is ready!", your setup is complete and you can continue.

References: AIProjectClient, DefaultAzureCredential.

Start the evaluation

Run the evaluation script:
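The evaluate.py script reads two environment variables, AIPROJECT_CONNECTION_STRING and EVALUATION_MODEL, and raises a KeyError if either is absent. As a complementary check, the sketch below reports which of them are missing before you start a run. The helper name `missing_settings` is hypothetical.

```python
import os

# Environment variables that evaluate.py reads through os.environ
REQUIRED_SETTINGS = ("AIPROJECT_CONNECTION_STRING", "EVALUATION_MODEL")


def missing_settings(env=None):
    """Return the required variable names that are unset or empty."""
    if env is None:
        env = os.environ
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


# Demo with an explicit mapping instead of the real environment
example_env = {"EVALUATION_MODEL": "gpt-4o-mini"}
print(missing_settings(example_env))
```

Call `missing_settings()` with no arguments after `load_dotenv()` to check the real environment. An empty list means both values are set.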
```
python evaluate.py
```

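The evaluation takes 5 to 10 minutes. You might see timeout warnings and rate limit errors. The script handles these errors automatically and continues processing.

Interpret evaluation output

In the console output, you see an answer for each question, followed by a table of summarized metrics that shows relevance, groundedness, and coherence scores. Scores for GPT-assisted metrics range from 0 (worst) to 4 (best). Look for low groundedness scores to identify responses that aren't well supported by the reference documents, and low relevance scores to identify off-topic responses.

You might see many WARNING:opentelemetry.attributes: messages and timeout errors. You can safely ignore them; they don't affect the evaluation results. The evaluation script is designed to handle rate limit errors and continue processing.

The evaluation output also includes a link to view detailed results in the Foundry portal, where you can compare evaluation runs side by side and track improvements over time.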

====================================================
'-----Summarized Metrics-----'
{'groundedness.gpt_groundedness': 1.6666666666666667,
 'groundedness.groundedness': 1.6666666666666667}
'-----Tabular Result-----'
                                     outputs.response  ... line_number
0   Could you specify which tent you are referring...  ...           0
1   Could you please specify which camping table y...  ...           1
2   Sorry, I only can answer queries related to ou...  ...           2
3   Could you please clarify which aspects of care...  ...           3
4   Sorry, I only can answer queries related to ou...  ...           4
5   The TrailMaster X4 Tent comes with an included...  ...           5
6                                            (Failed)  ...           6
7   The TrailBlaze Hiking Pants are crafted from h...  ...           7
8   Sorry, I only can answer queries related to ou...  ...           8
9   Sorry, I only can answer queries related to ou...  ...           9
10  Sorry, I only can answer queries related to ou...  ...          10
11  The PowerBurner Camping Stove is designed with...  ...          11
12  Sorry, I only can answer queries related to ou...  ...          12

[13 rows x 8 columns]
('View evaluation results in Foundry portal: '
 'https://xxxxxxxxxxxxxxxxxxxxxxx')
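To find the responses that drag the average down, you can filter the per-row results by score. The sketch below assumes that each row stores its groundedness score under the key `outputs.groundedness.groundedness` and that failed rows have no numeric score; verify both against your own results. The threshold of 3 is an arbitrary choice, `low_scoring_rows` is a hypothetical name, and the sample rows are made up for illustration.

```python
import math


def low_scoring_rows(rows, score_key="outputs.groundedness.groundedness", threshold=3):
    """Return the rows whose score is a number below the threshold."""
    flagged = []
    for row in rows:
        score = row.get(score_key)
        if not isinstance(score, (int, float)) or math.isnan(score):
            continue  # skip rows without a usable score, such as failed rows
        if score < threshold:
            flagged.append(row)
    return flagged


# Made-up rows for illustration only; real rows come from result["rows"]
sample_rows = [
    {"inputs.query": "q1", "outputs.groundedness.groundedness": 1},
    {"inputs.query": "q2", "outputs.groundedness.groundedness": 5},
    {"inputs.query": "q3", "outputs.groundedness.groundedness": None},
]
for row in low_scoring_rows(sample_rows):
    print(row["inputs.query"])
```

In evaluate.py, you could pass `result["rows"]` to this function after the evaluation finishes.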

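Iterate and improve

The evaluation results show that responses often aren't well grounded in the reference documents. To improve groundedness, modify the system prompt in the assets/grounded_chat.prompty file to encourage the model to use the reference documents more effectively.

Current prompt (problematic):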

If the question is not related to outdoor/camping gear and clothing, just say 'Sorry, I only can answer queries related to outdoor/camping gear and clothing. So, how can I help?'
If the question is related to outdoor/camping gear and clothing but vague, ask clarifying questions.

Improved prompt:

If the question is related to outdoor/camping gear and clothing, answer based on the reference documents provided.
If you cannot find information in the reference documents, say: 'I don't have information about that specific topic. Let me help with related products or try a different question.'
For vague questions, ask clarifying questions to better assist.

After you update the prompt:

Save the file.
Run the evaluation script again:
```
python evaluate.py
```
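Compare the new evaluation results with the previous run. You should see an improvement in the groundedness score.

Try other adjustments, such as:

Change the system prompt to focus on accuracy rather than completeness
Test with a different model (for example, gpt-4-turbo, if available)
Adjust context retrieval to return more relevant documents

Each iteration helps you understand which changes improve specific metrics.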
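One simple way to compare two runs is to subtract the summarized metrics of the earlier run from those of the later one. The sketch below assumes you saved the metrics dictionary from each run. `metric_deltas` is a hypothetical helper, and the "after" value is made up purely for illustration.

```python
def metric_deltas(before, after):
    """Return after minus before for every metric present in both runs."""
    shared = sorted(before.keys() & after.keys())
    return {name: after[name] - before[name] for name in shared}


# "before" matches the sample output above; "after" is a made-up value
before = {"groundedness.groundedness": 1.6666666666666667}
after = {"groundedness.groundedness": 3.0}

for name, delta in metric_deltas(before, after).items():
    print(f"{name}: {delta:+.2f}")
```

A positive delta means the metric improved between runs. Metrics that appear in only one run are ignored.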

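Clean up resources

To avoid incurring unnecessary Azure costs, delete the resources you created in this tutorial if you no longer need them. To manage resources, you can use the Azure portal.

Learn more about the Microsoft Foundry SDK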

Last updated on 2025-12-18