Important

Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
AI agents are powerful productivity assistants that can create workflows for business needs. However, their complex interaction patterns make them challenging to observe. In this article, you learn how to run built-in evaluators locally on simple agent data or agent messages.
To build production-ready agentic applications and enable observability and transparency, developers need tools to assess not only the final output of an agent's workflow, but also the quality and efficiency of the workflow itself. For example, consider a typical agentic workflow:
An event such as the user query "weather tomorrow" triggers the agentic workflow. It starts executing multiple steps, such as reasoning through the user's intent, calling tools, and using retrieval-augmented generation to produce a final response. In this process, evaluating each step of the workflow, along with the quality and safety of the final output, is crucial. Specifically, we formulate these evaluation aspects into the following evaluators for agents:
- Intent resolution: Measures whether the agent correctly identifies the user's intent.
- Tool call accuracy: Measures whether the agent makes the correct function tool calls for the user's request.
- Task adherence: Measures whether the agent's final response adheres to its assigned tasks, according to its system message and the preceding steps.
You can also assess other quality and safety aspects of your agentic workflows with our comprehensive suite of built-in evaluators. In general, agents emit agent messages. Transforming agent messages into the right evaluation data for our evaluators can be a nontrivial task. If you build your agent using Azure AI Agent Service, you can evaluate it seamlessly through our converter support. If you build your agent outside of Azure AI Agent Service, you can still use our evaluators as appropriate by parsing your agent messages into the required data formats. See Evaluating other agents for examples.
Getting started

First, install the evaluators package from the Azure AI evaluation SDK:
pip install azure-ai-evaluation
Evaluate Azure AI agents

If you use Azure AI Agent Service, you can seamlessly evaluate your agents through our converter support for Azure AI agent threads and runs. We support this list of evaluators for Azure AI agent messages from the converter:
- Quality: `IntentResolution`, `TaskAdherence`, `ToolCallAccuracy`, `Coherence`, `Relevance`, `Fluency`
- Safety: `CodeVulnerabilities`, `Violence`, `Self-harm`, `Sexual`, `HateUnfairness`, `IndirectAttack`, `ProtectedMaterials`
Note

`ToolCallAccuracyEvaluator` only supports evaluation of Azure AI agents' function tools; built-in tools aren't supported. The agent messages must have at least one function tool call to be evaluated.
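If you want to skip runs that contain no function tool calls before invoking `ToolCallAccuracyEvaluator`, a small guard like the sketch below can help. This is a minimal illustration, not part of the SDK: the `has_function_tool_call` helper and the exact message structure are assumptions based on the OpenAI-style message shape used later in this article.

```python
def has_function_tool_call(messages: list[dict]) -> bool:
    """Return True if any assistant message contains a tool_call entry.

    Assumes OpenAI-style messages whose "content" is a list of typed parts;
    adjust the check for your own message schema.
    """
    for message in messages:
        if message.get("role") != "assistant":
            continue
        for part in message.get("content", []):
            if isinstance(part, dict) and part.get("type") == "tool_call":
                return True
    return False

# Example run history: one user question, one assistant tool call
history = [
    {"role": "user", "content": [{"type": "text", "text": "Weather in Seattle?"}]},
    {"role": "assistant", "content": [
        {"type": "tool_call", "tool_call_id": "call_1", "name": "fetch_weather",
         "arguments": {"location": "Seattle"}}
    ]},
]
print(has_function_tool_call(history))  # True
```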
Here's an example of seamlessly building and evaluating an Azure AI agent. Separately from evaluation, Azure AI Foundry Agent Service requires `pip install azure-ai-projects azure-identity`, an Azure AI project connection string, and a supported model.

Create agent threads and runs
import os, json
import pandas as pd
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential
from typing import Set, Callable, Any
from azure.ai.projects.models import FunctionTool, ToolSet
from dotenv import load_dotenv

load_dotenv()

# Define some custom python function
def fetch_weather(location: str) -> str:
    """
    Fetches the weather information for the specified location.
    :param location (str): The location to fetch weather for.
    :return: Weather information as a JSON string.
    :rtype: str
    """
    # In a real-world scenario, you'd integrate with a weather API.
    # Here, we'll mock the response.
    mock_weather_data = {"Seattle": "Sunny, 25°C", "London": "Cloudy, 18°C", "Tokyo": "Rainy, 22°C"}
    weather = mock_weather_data.get(location, "Weather data not available for this location.")
    weather_json = json.dumps({"weather": weather})
    return weather_json

user_functions: Set[Callable[..., Any]] = {
    fetch_weather,
}

# Adding Tools to be used by Agent
functions = FunctionTool(user_functions)
toolset = ToolSet()
toolset.add(functions)

# Create the agent
AGENT_NAME = "Seattle Tourist Assistant"
project_client = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str=os.environ["PROJECT_CONNECTION_STRING"],
)
agent = project_client.agents.create_agent(
    model=os.environ["MODEL_DEPLOYMENT_NAME"],
    name=AGENT_NAME,
    instructions="You are a helpful assistant",
    toolset=toolset,
)
print(f"Created agent, ID: {agent.id}")

thread = project_client.agents.create_thread()
print(f"Created thread, ID: {thread.id}")

# Create message to thread
MESSAGE = "Can you fetch me the weather in Seattle?"
message = project_client.agents.create_message(
    thread_id=thread.id,
    role="user",
    content=MESSAGE,
)
print(f"Created message, ID: {message.id}")

run = project_client.agents.create_and_process_run(thread_id=thread.id, agent_id=agent.id)
print(f"Run finished with status: {run.status}")

if run.status == "failed":
    print(f"Run failed: {run.last_error}")

print(f"Run ID: {run.id}")

# display messages
for message in project_client.agents.list_messages(thread.id, order="asc").data:
    print(f"Role: {message.role}")
    print(f"Content: {message.content[0].text.value}")
    print("-" * 40)
Evaluate a single agent run

After creating an agent run, you can easily use our converter to transform the Azure AI agent thread data into the required evaluation data that the evaluators can understand.
import json, os
from azure.ai.evaluation import AIAgentConverter, IntentResolutionEvaluator
# Initialize the converter for Azure AI agents
converter = AIAgentConverter(project_client)
# Specify the thread and run id
thread_id = thread.id
run_id = run.id
converted_data = converter.convert(thread_id, run_id)
That's it! You don't need to read the input requirements of each evaluator or do any work to parse them. You only need to select your evaluator and call it on this single run. For model choice, we recommend a strong reasoning model, such as `o3-mini` and models released afterwards. We set up lists of quality and safety evaluators in `quality_evaluators` and `safety_evaluators`, and refer to them when evaluating multiple agent runs or threads.
# specific to agentic workflows
from azure.ai.evaluation import IntentResolutionEvaluator, TaskAdherenceEvaluator, ToolCallAccuracyEvaluator
# other quality as well as risk and safety metrics
from azure.ai.evaluation import RelevanceEvaluator, CoherenceEvaluator, CodeVulnerabilityEvaluator, ContentSafetyEvaluator, IndirectAttackEvaluator, FluencyEvaluator
from azure.ai.projects.models import ConnectionType
from azure.identity import DefaultAzureCredential
import os
from dotenv import load_dotenv
load_dotenv()
model_config = project_client.connections.get_default(
    connection_type=ConnectionType.AZURE_OPEN_AI,
    include_credentials=True) \
    .to_evaluator_model_config(
        deployment_name="o3-mini",
        api_version="2023-05-15",
        include_credentials=True
    )
quality_evaluators = {evaluator.__name__: evaluator(model_config=model_config) for evaluator in [IntentResolutionEvaluator, TaskAdherenceEvaluator, ToolCallAccuracyEvaluator, CoherenceEvaluator, FluencyEvaluator, RelevanceEvaluator]}
## Using Azure AI Foundry Hub
azure_ai_project = {
    "subscription_id": os.environ.get("AZURE_SUBSCRIPTION_ID"),
    "resource_group_name": os.environ.get("AZURE_RESOURCE_GROUP"),
    "project_name": os.environ.get("AZURE_PROJECT_NAME"),
}

## Using Azure AI Foundry Development Platform, example: AZURE_AI_PROJECT=https://your-account.services.ai.azure.com/api/projects/your-project
azure_ai_project = os.environ.get("AZURE_AI_PROJECT")

safety_evaluators = {evaluator.__name__: evaluator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential()) for evaluator in [ContentSafetyEvaluator, IndirectAttackEvaluator, CodeVulnerabilityEvaluator]}

# reference the quality and safety evaluator list above
quality_and_safety_evaluators = {**quality_evaluators, **safety_evaluators}
for name, evaluator in quality_and_safety_evaluators.items():
    try:
        result = evaluator(**converted_data)
        print(name)
        print(json.dumps(result, indent=4))
    except Exception:
        print("Note: if there is no tool call to evaluate in the run history, ToolCallAccuracyEvaluator will raise an error")
Output format

The result of the AI-assisted quality evaluators for a query-and-response pair is a dictionary containing:

- `{metric_name}`: Provides a numerical score, on a likert scale (integer 1 to 5) or a float between 0 and 1.
- `{metric_name}_label`: Provides a binary label (if the metric naturally outputs a binary score).
- `{metric_name}_reason`: Explains why a certain score or label was given for each data point.
To further improve intelligibility, all evaluators accept a binary threshold (unless their output is already binary) and output two new keys. For the binarization threshold, a default is set, and the user can override it. The two new keys are:

- `{metric_name}_result`: A "pass" or "fail" string based on the binarization threshold.
- `{metric_name}_threshold`: A numerical binarization threshold, set by default or by the user.
- `additional_details`: Contains debugging information about the quality of a single agent run.
Sample output for some evaluators:
{
    "intent_resolution": 5.0, # likert scale: 1-5 integer
    "intent_resolution_result": "pass", # pass because 5 > 3 the threshold
    "intent_resolution_threshold": 3,
    "intent_resolution_reason": "The assistant correctly understood the user's request to fetch the weather in Seattle. It used the appropriate tool to get the weather information and provided a clear and accurate response with the current weather conditions in Seattle. The response fully resolves the user's query with all necessary information.",
    "additional_details": {
        "conversation_has_intent": true,
        "agent_perceived_intent": "fetch the weather in Seattle",
        "actual_user_intent": "fetch the weather in Seattle",
        "correct_intent_detected": true,
        "intent_resolved": true
    }
}
{
    "task_adherence": 5.0, # likert scale: 1-5 integer
    "task_adherence_result": "pass", # pass because 5 > 3 the threshold
    "task_adherence_threshold": 3,
    "task_adherence_reason": "The response accurately follows the instructions, fetches the correct weather information, and relays it back to the user without any errors or omissions."
}
{
    "tool_call_accuracy": 1.0, # this is the average of all correct tool calls (or passing rate)
    "tool_call_accuracy_result": "pass", # pass because 1.0 > 0.8 the threshold
    "tool_call_accuracy_threshold": 0.8,
    "per_tool_call_details": [
        {
            "tool_call_accurate": true,
            "tool_call_accurate_reason": "The tool call is directly relevant to the user's query, uses the correct parameter, and the parameter value is correctly extracted from the conversation.",
            "tool_call_id": "call_2svVc9rNxMT9F50DuEf1XExx"
        }
    ]
}
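Because every evaluator follows this key convention, you can summarize a batch of result dictionaries generically, without knowing each metric name in advance. Here's a minimal sketch; the `summarize` helper and the sample `results` dictionary are illustrative, not real evaluator output or SDK code:

```python
def summarize(results: dict[str, dict]) -> dict[str, str]:
    """Map each metric to its "pass"/"fail" verdict using the shared key convention."""
    summary = {}
    for result in results.values():
        for key, value in result.items():
            if key.endswith("_result"):
                # "{metric_name}_result" -> metric_name
                summary[key[: -len("_result")]] = value
    return summary

# Illustrative evaluator outputs following the documented format
results = {
    "IntentResolutionEvaluator": {
        "intent_resolution": 5.0,
        "intent_resolution_result": "pass",
        "intent_resolution_threshold": 3,
    },
    "ToolCallAccuracyEvaluator": {
        "tool_call_accuracy": 0.5,
        "tool_call_accuracy_result": "fail",
        "tool_call_accuracy_threshold": 0.8,
    },
}
print(summarize(results))  # {'intent_resolution': 'pass', 'tool_call_accuracy': 'fail'}
```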
Evaluate multiple agent runs or threads

To evaluate multiple agent runs or threads, we recommend using the batch `evaluate()` API for asynchronous evaluation. First, convert your agent thread data into a file through our converter support:
import os
import json
from azure.ai.evaluation import AIAgentConverter
# Initialize the converter
converter = AIAgentConverter(project_client)
# Specify a file path to save agent output (which is evaluation input data)
filename = os.path.join(os.getcwd(), "evaluation_input_data.jsonl")
evaluation_data = converter.prepare_evaluation_data(thread_ids=thread_id, filename=filename)
print(f"Evaluation data saved to {filename}")
With the evaluation data prepared in one line of code, you can select the evaluators to assess the agent quality and submit a batch evaluation run. Here we refer to the same list of quality and safety evaluators, `quality_and_safety_evaluators`, from the Evaluate a single agent run section:
import os
from dotenv import load_dotenv
load_dotenv()
# Batch evaluation API (local)
from azure.ai.evaluation import evaluate
response = evaluate(
    data=filename,
    evaluation_name="agent demo - batch run",
    evaluators=quality_and_safety_evaluators,
    # optionally, log your results to your Azure AI Foundry project for rich visualization
    azure_ai_project={
        "subscription_id": os.environ["AZURE_SUBSCRIPTION_ID"],
        "project_name": os.environ["PROJECT_NAME"],
        "resource_group_name": os.environ["RESOURCE_GROUP_NAME"],
    }
)

# Inspect the average scores at a high-level
print(response["metrics"])

# Use the URL to inspect the results on the UI
print(f'AI Foundry URL: {response.get("studio_url")}')
Following the URI, you'll be redirected to Foundry to view your evaluation results in your Azure AI project and debug your application. Using the reason fields and pass/fail results, you can easily assess the quality and safety performance of your applications. You can run and compare multiple runs to test for regressions or improvements.

With the Azure AI Evaluation SDK client library, you can seamlessly evaluate your Azure AI agents through our converter support, which enables observability and transparency in agentic workflows.
Evaluate other agents

For agents outside of Azure AI Foundry Agent Service, you can still evaluate them by preparing the right data for the evaluators of your choice.
Agents typically emit messages to interact with a user or another agent. Our built-in evaluators can accept simple data types such as strings in `query`, `response`, and `ground_truth`. However, extracting these simple data types from agent messages can be a challenge, due to the complex interaction patterns of agents and differences across frameworks. For example, as mentioned, a single user query can trigger a long list of agent messages, typically with multiple tool calls invoked.
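As a rough illustration of that kind of parsing, the sketch below flattens an OpenAI-style message list into a plain string. The helper and the message schema are assumptions for illustration only; real agent frameworks vary, and the sample notebooks show the paths supported by the evaluators:

```python
def last_user_query(messages: list[dict]) -> str:
    """Extract the text of the last user message from an OpenAI-style message list.

    Handles both string content and a list of typed content parts.
    """
    for message in reversed(messages):
        if message.get("role") != "user":
            continue
        parts = message.get("content")
        if isinstance(parts, str):
            return parts
        # join the text parts of a structured content list
        return " ".join(p.get("text", "") for p in parts if p.get("type") == "text")
    return ""

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [{"type": "text", "text": "Weather tomorrow in Seattle?"}]},
]
print(last_user_query(messages))  # Weather tomorrow in Seattle?
```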
As illustrated in the example, we enabled agent message support specifically for the built-in evaluators `IntentResolution`, `ToolCallAccuracy`, and `TaskAdherence` to evaluate these aspects of an agentic workflow. These evaluators take `tool_calls` or `tool_definitions` as parameters unique to agents.
| Evaluator | `query` | `response` | `tool_calls` | `tool_definitions` |
|---|---|---|---|---|
| `IntentResolutionEvaluator` | Required: `Union[str, list[Message]]` | Required: `Union[str, list[Message]]` | N/A | Optional: `list[ToolCall]` |
| `ToolCallAccuracyEvaluator` | Required: `Union[str, list[Message]]` | Optional: `Union[str, list[Message]]` | Optional: `Union[dict, list[ToolCall]]` | Required: `list[ToolDefinition]` |
| `TaskAdherenceEvaluator` | Required: `Union[str, list[Message]]` | Required: `Union[str, list[Message]]` | N/A | Optional: `list[ToolCall]` |
- `Message`: A `dict` of OpenAI-style messages describing agent interactions with a user, where `query` must include a system message as the first message.
- `ToolCall`: A `dict` specifying tool calls invoked during agent interactions with a user.
- `ToolDefinition`: A `dict` describing the tools available to an agent.

For `ToolCallAccuracyEvaluator`, either `response` or `tool_calls` must be provided.
We'll demonstrate some examples of the two data formats: simple agent data and agent messages. However, due to the unique requirements of these evaluators, we recommend referring to the sample notebooks, which illustrate the possible input paths for each evaluator.
As with other built-in AI-assisted quality evaluators, `IntentResolutionEvaluator` and `TaskAdherenceEvaluator` output a likert score (integer 1 to 5; a higher score is better). `ToolCallAccuracyEvaluator` outputs the passing rate of all tool calls made (a float between 0 and 1) based on the user query. To further improve intelligibility, all evaluators accept a binary threshold and output two new keys. For the binarization threshold, a default is set, and the user can override it. The two new keys are:

- `{metric_name}_result`: A "pass" or "fail" string based on the binarization threshold.
- `{metric_name}_threshold`: A numerical binarization threshold, set by default or by the user.
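The binarization itself is straightforward to reason about. For example, `ToolCallAccuracyEvaluator`'s float pass rate is compared against its threshold (0.8 by default), and the likert-scored evaluators compare against an integer threshold (3 by default). The sketch below only mirrors the documented convention; it isn't SDK code:

```python
def binarize(score: float, threshold: float) -> str:
    """Turn a numeric score into the "pass"/"fail" string the evaluators emit.

    Assumes a higher score is better, as with the likert and pass-rate metrics.
    """
    return "pass" if score >= threshold else "fail"

# 1 of 2 tool calls accurate -> pass rate 0.5, below the default 0.8 threshold
print(binarize(0.5, 0.8))  # fail
# likert score of 5 against the default threshold of 3
print(binarize(5, 3))      # pass
```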
Simple agent data

In the simple agent data format, `query` and `response` are simple Python strings. For example:
import os
import json
from azure.ai.evaluation import AzureOpenAIModelConfiguration
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import IntentResolutionEvaluator, ResponseCompletenessEvaluator
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    azure_deployment=os.environ["MODEL_DEPLOYMENT_NAME"],
)

intent_resolution_evaluator = IntentResolutionEvaluator(model_config)

# Evaluating query and response as strings
# A positive example. Intent is identified and understood and the response correctly resolves user intent
result = intent_resolution_evaluator(
    query="What are the opening hours of the Eiffel Tower?",
    response="Opening hours of the Eiffel Tower are 9:00 AM to 11:00 PM.",
)
print(json.dumps(result, indent=4))
Output (see Output format for more details):
{
    "intent_resolution": 5.0,
    "intent_resolution_result": "pass",
    "intent_resolution_threshold": 3,
    "intent_resolution_reason": "The response provides the opening hours of the Eiffel Tower, which directly addresses the user's query. The information is clear, accurate, and complete, fully resolving the user's intent.",
    "additional_details": {
        "conversation_has_intent": true,
        "agent_perceived_intent": "inquire about the opening hours of the Eiffel Tower",
        "actual_user_intent": "inquire about the opening hours of the Eiffel Tower",
        "correct_intent_detected": true,
        "intent_resolved": true
    }
}
An example of `ToolCallAccuracyEvaluator` with `tool_calls` and `tool_definitions`:
import json
from azure.ai.evaluation import ToolCallAccuracyEvaluator

# model_config is defined as in the previous example
tool_call_accuracy = ToolCallAccuracyEvaluator(model_config)

query = "How is the weather in Seattle?"
tool_calls = [{
    "type": "tool_call",
    "tool_call_id": "call_CUdbkBfvVBla2YP3p24uhElJ",
    "name": "fetch_weather",
    "arguments": {
        "location": "Seattle"
    }
},
{
    "type": "tool_call",
    "tool_call_id": "call_CUdbkBfvVBla2YP3p24uhElJ",
    "name": "fetch_weather",
    "arguments": {
        "location": "London"
    }
}]

tool_definitions = [{
    "name": "fetch_weather",
    "description": "Fetches the weather information for the specified location.",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "The location to fetch weather for."
            }
        }
    }
}]

response = tool_call_accuracy(query=query, tool_calls=tool_calls, tool_definitions=tool_definitions)
print(json.dumps(response, indent=4))
Output (see Output format for more details):
{
    "tool_call_accuracy": 0.5,
    "tool_call_accuracy_result": "fail",
    "tool_call_accuracy_threshold": 0.8,
    "per_tool_call_details": [
        {
            "tool_call_accurate": true,
            "tool_call_accurate_reason": "The TOOL CALL is directly relevant to the user's query, uses appropriate parameters, and the parameter values are correctly extracted from the conversation. It is likely to provide useful information to advance the conversation.",
            "tool_call_id": "call_CUdbkBfvVBla2YP3p24uhElJ"
        },
        {
            "tool_call_accurate": false,
            "tool_call_accurate_reason": "The TOOL CALL is not relevant to the user's query about the weather in Seattle and uses a parameter value that is not present or inferred from the conversation.",
            "tool_call_id": "call_CUdbkBfvVBla2YP3p24uhElJ"
        }
    ]
}
Agent messages

In the agent message format, `query` and `response` are lists of OpenAI-style messages. Specifically, `query` carries the past agent-user interactions leading up to the last user query, and requires the system message (of the agent) at the top of the list, while `response` carries the last message of the agent in response to the last user query. For example:
import json

# user asked a question
query = [
    {
        "role": "system",
        "content": "You are a friendly and helpful customer service agent."
    },
    # past interactions omitted
    # ...
    {
        "createdAt": "2025-03-14T06:14:20Z",
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Hi, I need help with the last 2 orders on my account #888. Could you please update me on their status?"
            }
        ]
    }
]

# the agent emits multiple messages to fulfill the request
response = [
    {
        "createdAt": "2025-03-14T06:14:30Z",
        "run_id": "0",
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "Hello! Let me quickly look up your account details."
            }
        ]
    },
    {
        "createdAt": "2025-03-14T06:14:35Z",
        "run_id": "0",
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call_id": "tool_call_20250310_001",
                "name": "get_orders",
                "arguments": {
                    "account_number": "888"
                }
            }
        ]
    },
    # many more messages omitted
    # ...
    # here is the agent's final response
    {
        "createdAt": "2025-03-14T06:15:05Z",
        "run_id": "0",
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "The order with ID 123 has been shipped and is expected to be delivered on March 15, 2025. However, the order with ID 124 is delayed and should now arrive by March 20, 2025. Is there anything else I can help you with?"
            }
        ]
    }
]

# An example of tool definitions available to the agent
tool_definitions = [
    {
        "name": "get_orders",
        "description": "Get the list of orders for a given account number.",
        "parameters": {
            "type": "object",
            "properties": {
                "account_number": {
                    "type": "string",
                    "description": "The account number to get the orders for."
                }
            }
        }
    },
    # other tool definitions omitted
    # ...
]

result = intent_resolution_evaluator(
    query=query,
    response=response,
    # optionally provide the tool definitions
    tool_definitions=tool_definitions
)
print(json.dumps(result, indent=4))
Output (see Output format for more details):

{
    "intent_resolution": 5.0,
    "intent_resolution_result": "pass",
    "intent_resolution_threshold": 3,
    "intent_resolution_reason": "The assistant correctly understood the user's request for the status of the last two orders on account #888. It invoked the appropriate tool to retrieve the orders and provided a clear and accurate status update for both orders, fully resolving the user's query.",
    "additional_details": {
        "conversation_has_intent": true,
        "agent_perceived_intent": "get the status of the last 2 orders on account #888",
        "actual_user_intent": "get the status of the last 2 orders on account #888",
        "correct_intent_detected": true,
        "intent_resolved": true
    }
}
This evaluation schema helps you parse your agent data outside of Azure AI Foundry Agent Service, so that you can use our evaluators to support observability into your agentic workflows.

Sample notebooks

Now you're ready to try a sample for each of these evaluators: