Note
Access to this page requires authorization. You can try signing in or changing directories.
Access to this page requires authorization. You can try changing directories.
In this article, you learn how to run evaluations in the cloud for predeployment testing on a test dataset.
Use cloud evaluations for most scenarios - especially when testing at scale, integrating evaluations into continuous integration and continuous delivery (CI/CD) pipelines, or performing predeployment testing. Running evaluations in the cloud eliminates the need to manage local compute infrastructure and supports large-scale, automated testing workflows. You can also schedule evaluations to run on a recurring basis, or set up continuous evaluation to automatically evaluate sampled agent responses in production.
Cloud evaluation results are stored in your Foundry project. You can review results in the portal, retrieve them through the SDK, or route them to Application Insights if connected. Cloud evaluation supports all Microsoft-curated built-in evaluators and your own custom evaluators. Evaluators are managed in the evaluator catalog with the same project-scope, role-based access control.
Tip
For complete runnable examples, see the Python SDK evaluation samples on GitHub.
How cloud evaluation works
A cloud evaluation has three steps:
- Define what to evaluate. Describe your data shape (the
data_source_config) and the evaluators (testing criteria) that score it. - Create the evaluation. Submit the definition by using
openai_client.evals.create(). - Run it and read the results. Start a run by using
openai_client.evals.runs.create(), poll until it completes, and read the scored results. See Get results for the result schema.
The rest of this section walks through the inputs to step 1: pick a scenario, then choose evaluators.
Choose your starting point
Existing dataset
Use this path when you already have queries and responses collected in a file (or queries plus ground truth) and you just want Foundry to score them. JSONL supports both turn-level rows and conversation-level inputs; CSV is turn-level only.
| Scenario | When to use | Data source type |
|---|---|---|
| Turn-level dataset evaluation | Each row is one query/response pair, optionally with context or ground_truth. |
jsonl or csv |
| Conversation-level dataset evaluation (preview) | Each row is a conversation expressed as a messages array. |
jsonl |
Data in Foundry or Application Insights
Use this path when your agent is already running and you want to evaluate what actually happened. Instead of moving data out, you point Foundry at the data where it already lives - by Foundry response ID or by Application Insights trace or conversation ID.
| Scenario | When to use | Data source type |
|---|---|---|
| Agent response evaluation | Your agent runs in Foundry and you have response IDs to score. | azure_ai_responses |
| Turn-level trace evaluation (preview) | Your agent emits OpenTelemetry traces to Application Insights - including non-Foundry frameworks like LangChain or custom OpenTelemetry-instrumented agents. Each trace is scored independently. | azure_ai_trace_data_source_preview |
| Conversation-level trace evaluation (preview) | Same trace sources, but score full conversations - by conversation ID or by agent filter with sampling. | azure_ai_trace_data_source_preview |
Inputs without responses
Use this path when you have the inputs but no responses yet. Foundry generates responses against a model or agent target at evaluation time, then scores them. Pick a row based on whether your input is queries (sent as individual turns) or scenario descriptions (used to drive a conversation-level interaction).
| Scenario | When to use | Data source / target |
|---|---|---|
| Model Target completions | You have queries and want to evaluate responses from a model deployment. | azure_ai_target_completions → azure_ai_model |
| Agent Target completions | You have queries and want to evaluate responses from a Foundry agent. | azure_ai_target_completions → azure_ai_agent |
| Conversation simulation (preview) | You have scenario descriptions (no queries); Foundry simulates a user driving a conversation-level interaction with the agent. | azure_ai_target_completions → azure_ai_agent |
No data yet
Use this path when you're building a new model or agent and haven't collected any inputs. Foundry generates the test data from scratch - choose synthetic queries for broad quality coverage or adversarial prompts for safety testing.
| Scenario | When to use | Data source / target |
|---|---|---|
| Synthetic data evaluation (preview) | You want quality coverage beyond what you'd write by hand. Foundry generates test queries, sends them to the target, and scores responses. | azure_ai_synthetic_data_gen_preview → azure_ai_model or azure_ai_agent |
| Red team evaluation | You want automated adversarial testing - Foundry generates jailbreaks and harmful-content prompts and scores how the target responds. | azure_ai_red_team → azure_ai_model or azure_ai_agent |
Choose evaluators
Each scenario binds evaluators to fields in your data through column mappings. The available fields depend on the data source. Dataset scenarios expose your custom item fields, while target-generated scenarios also expose the model or agent response via a sample schema. The per-scenario subsections later in this article show the column mappings for each case.
For an overview of available evaluators and how to pick them, see built-in evaluators and custom evaluators.
Prerequisites
An Azure OpenAI deployment with a GPT model that supports chat completion (for example,
gpt-5-mini).Foundry User role on the Foundry project.
Important
The Foundry RBAC roles were recently renamed. Foundry User, Foundry Owner, Foundry Account Owner, and Foundry Project Manager were previously named Azure AI User, Azure AI Owner, Azure AI Account Owner, and Azure AI Project Manager. You might still see the previous names in some places while the rename rolls out. The role IDs and core permissions are unchanged by the rename.
Optionally, you can use your own storage account to run evaluations.
Note
Some evaluation features have regional restrictions. See supported regions for details.
Get started
Install the SDK and set up your client:
pip install "azure-ai-projects>=2.2.0"
import os
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from openai.types.eval_create_params import DataSourceConfigCustom
from openai.types.evals.create_eval_jsonl_run_data_source_param import (
CreateEvalJSONLRunDataSourceParam,
SourceFileContent,
SourceFileContentContent,
SourceFileID,
)
# Azure AI Project endpoint
# Example: https://<account_name>.services.ai.azure.com/api/projects/<project_name>
endpoint = os.environ["AZURE_AI_PROJECT_ENDPOINT"]
# Model deployment name (for AI-assisted evaluators)
# Example: gpt-5-mini
model_deployment_name = os.environ.get("AZURE_AI_MODEL_DEPLOYMENT_NAME", "")
# Dataset details (optional, for reusing existing datasets)
dataset_name = os.environ.get("DATASET_NAME", "")
dataset_version = os.environ.get("DATASET_VERSION", "1")
# Create the project client
project_client = AIProjectClient(
endpoint=endpoint,
credential=DefaultAzureCredential(),
)
# Get the OpenAI client for evaluation API
openai_client = project_client.get_openai_client()
Prepare input data
Most evaluation scenarios require input data. You can provide data in two ways:
Upload a dataset (recommended)
Upload a JSONL or CSV file to create a versioned dataset in your Foundry project. Datasets support versioning and reuse across multiple evaluation runs. Use this approach for production testing and CI/CD workflows.
Prepare a JSONL file with one JSON object per line containing the fields your evaluators need:
{"query": "What is machine learning?", "response": "Machine learning is a subset of AI.", "ground_truth": "Machine learning is a type of AI that learns from data."}
{"query": "Explain neural networks.", "response": "Neural networks are computing systems inspired by biological neural networks.", "ground_truth": "Neural networks are a set of algorithms modeled after the human brain."}
Or prepare a CSV file with column headers matching your evaluator fields:
query,response,ground_truth
What is machine learning?,Machine learning is a subset of AI.,Machine learning is a type of AI that learns from data.
Explain neural networks.,Neural networks are computing systems inspired by biological neural networks.,Neural networks are a set of algorithms modeled after the human brain.
# Upload a local JSONL file. Skip this step if you already have a dataset registered.
data_id = project_client.datasets.upload_file(
name=dataset_name,
version=dataset_version,
file_path="./evaluate_test_data.jsonl",
).id
Provide data inline
For quick experimentation with small test sets, provide data directly in the evaluation request using file_content.
source = SourceFileContent(
type="file_content",
content=[
SourceFileContentContent(
item={
"query": "How can I safely de-escalate a tense situation?",
"ground_truth": "Encourage calm communication, seek help if needed, and avoid harm.",
}
),
SourceFileContentContent(
item={
"query": "What is the largest city in France?",
"ground_truth": "Paris",
}
),
],
)
Pass source as the "source" field in your data source configuration when creating a run. The scenario sections that follow use file_id by default.
Dataset evaluation
Evaluate pre-computed responses in a JSONL file using the jsonl data source type. This scenario is useful when you already have model outputs and want to assess their quality.
Tip
Before you begin, complete Get started and Prepare input data.
Define the data schema and evaluators
Specify the schema that matches your JSONL fields, and select the evaluators (testing criteria) to run. Use the data_mapping parameter to connect fields from your input data to evaluator parameters with {{item.field}} syntax. Always include data_mapping with the required input fields for each evaluator. Your field names must match those in your JSONL file — for example, if your data has "question" instead of "query", use "{{item.question}}" in the mapping. For the required parameters per evaluator, see built-in evaluators.
data_source_config = DataSourceConfigCustom(
type="custom",
item_schema={
"type": "object",
"properties": {
"query": {"type": "string"},
"response": {"type": "string"},
"ground_truth": {"type": "string"},
},
"required": ["query", "response", "ground_truth"],
},
)
testing_criteria = [
{
"type": "azure_ai_evaluator",
"name": "coherence",
"evaluator_name": "builtin.coherence",
"initialization_parameters": {
"model": model_deployment_name
},
"data_mapping": {
"query": "{{item.query}}",
"response": "{{item.response}}",
},
},
{
"type": "azure_ai_evaluator",
"name": "violence",
"evaluator_name": "builtin.violence",
"initialization_parameters": {
"model": model_deployment_name
},
"data_mapping": {
"query": "{{item.query}}",
"response": "{{item.response}}",
},
},
{
"type": "azure_ai_evaluator",
"name": "f1",
"evaluator_name": "builtin.f1_score",
"data_mapping": {
"response": "{{item.response}}",
"ground_truth": "{{item.ground_truth}}",
},
},
]
Create evaluation and run
Create the evaluation, then start a run against your uploaded dataset. The run executes each evaluator on every row in the dataset.
# Create the evaluation
eval_object = openai_client.evals.create(
name="dataset-evaluation",
data_source_config=data_source_config,
testing_criteria=testing_criteria,
)
# Create a run using the uploaded dataset
eval_run = openai_client.evals.runs.create(
eval_id=eval_object.id,
name="dataset-run",
data_source=CreateEvalJSONLRunDataSourceParam(
type="jsonl",
source=SourceFileID(
type="file_id",
id=data_id,
),
),
)
For a complete runnable example, see sample_evaluations_builtin_with_dataset_id.py on GitHub. To poll for completion and interpret results, see Get results.
CSV dataset evaluation
Evaluate precomputed responses in a CSV file by using the csv data source type. This scenario works the same way as dataset evaluation but accepts CSV files instead of JSONL. Use CSV when your data is already in spreadsheet or tabular format.
Tip
Before you begin, complete Get started and Prepare input data.
Prepare a CSV file
Create a CSV file with column headers that match the fields your evaluators need. Each row represents one test case.
query,response,context,ground_truth
What is cloud computing?,Cloud computing delivers computing services over the internet.,Cloud computing is a technology for on-demand resource delivery.,Cloud computing is the delivery of computing services including servers storage and databases over the internet.
What is machine learning?,Machine learning is a subset of AI that learns from data.,Machine learning is a branch of artificial intelligence.,Machine learning is a type of AI that enables computers to learn from data without being explicitly programmed.
Explain neural networks.,Neural networks are computing systems inspired by biological neural networks.,Neural networks are used in deep learning.,Neural networks are a set of algorithms modeled after the human brain designed to recognize patterns.
Upload and run
Upload the CSV file as a dataset. Then, create an evaluation by using the csv data source type. The schema definition and evaluator configuration are the same as for JSONL evaluations. The only difference is the "type": "csv" in the data source.
# Upload the CSV file
data_id = project_client.datasets.upload_file(
name="eval-csv-data",
version="1",
file_path="./evaluation_data.csv",
).id
# Define the schema matching your CSV columns
data_source_config = DataSourceConfigCustom(
type="custom",
item_schema={
"type": "object",
"properties": {
"query": {"type": "string"},
"response": {"type": "string"},
"context": {"type": "string"},
"ground_truth": {"type": "string"},
},
"required": [],
},
include_sample_schema=True,
)
# Define evaluators with data mappings to CSV columns
testing_criteria = [
{
"type": "azure_ai_evaluator",
"name": "coherence",
"evaluator_name": "builtin.coherence",
"data_mapping": {
"query": "{{item.query}}",
"response": "{{item.response}}",
},
"initialization_parameters": {"model": model_deployment_name},
},
{
"type": "azure_ai_evaluator",
"name": "violence",
"evaluator_name": "builtin.violence",
"data_mapping": {
"query": "{{item.query}}",
"response": "{{item.response}}",
},
"initialization_parameters": {"model": model_deployment_name},
},
{
"type": "azure_ai_evaluator",
"name": "f1",
"evaluator_name": "builtin.f1_score",
},
]
# Create the evaluation
eval_object = openai_client.evals.create(
name="CSV evaluation with built-in evaluators",
data_source_config=data_source_config,
testing_criteria=testing_criteria,
)
# Create a run using the CSV data source type
eval_run = openai_client.evals.runs.create(
eval_id=eval_object.id,
name="csv-evaluation-run",
data_source={
"type": "csv",
"source": {
"type": "file_id",
"id": data_id,
},
},
)
To poll for completion and interpret results, see Get results.
Model target evaluation
Send queries to a deployed model at runtime. Evaluate the responses by using the azure_ai_target_completions data source type with an azure_ai_model target. Your input data contains queries. The model generates responses, which you then evaluate.
Tip
Before you begin, complete Get started and Prepare input data.
Define the message template and target
The input_messages template controls how queries are sent to the model. Use {{item.query}} to reference fields from your input data. Specify the model to evaluate and optional sampling parameters:
input_messages = {
"type": "template",
"template": [
{
"type": "message",
"role": "user",
"content": {
"type": "input_text",
"text": "{{item.query}}"
}
}
]
}
target = {
"type": "azure_ai_model",
"model": "gpt-5-mini",
"sampling_params": {
"top_p": 1.0,
"max_completion_tokens": 2048,
},
}
Set up evaluators and data mappings
When the model generates responses at runtime, use {{sample.output_text}} in data_mapping to reference the model's output. Use {{item.field}} to reference fields from your input data.
data_source_config = DataSourceConfigCustom(
type="custom",
item_schema={
"type": "object",
"properties": {
"query": {"type": "string"},
},
"required": ["query"],
},
include_sample_schema=True,
)
testing_criteria = [
{
"type": "azure_ai_evaluator",
"name": "coherence",
"evaluator_name": "builtin.coherence",
"initialization_parameters": {
"model": model_deployment_name,
},
"data_mapping": {
"query": "{{item.query}}",
"response": "{{sample.output_text}}",
},
},
{
"type": "azure_ai_evaluator",
"name": "violence",
"evaluator_name": "builtin.violence",
"data_mapping": {
"query": "{{item.query}}",
"response": "{{sample.output_text}}",
},
},
]
Create evaluation and run
eval_object = openai_client.evals.create(
name="Model Target Evaluation",
data_source_config=data_source_config,
testing_criteria=testing_criteria,
)
data_source = {
"type": "azure_ai_target_completions",
"source": {
"type": "file_id",
"id": data_id,
},
"input_messages": input_messages,
"target": target,
}
eval_run = openai_client.evals.runs.create(
eval_id=eval_object.id,
name="model-target-evaluation",
data_source=data_source,
)
For a complete runnable example, see sample_model_evaluation.py on GitHub. To poll for completion and interpret results, see Get results.
Tip
To add another evaluation run, use the same code.
Agent target evaluation
Send queries to a Foundry agent at runtime and evaluate the responses by using the azure_ai_target_completions data source type with an azure_ai_agent target. This scenario works for both prompt agents and hosted agents.
Tip
Before you begin, complete Get started and Prepare input data.
Tip
Hosted agents that use the responses protocol work with the same code samples shown here. For hosted agents that use the invocations protocol, the input_messages format is different. See Hosted agent invocations protocol for details.
Define the message template and target
The input_messages template controls how queries are sent to the agent. Use {{item.query}} to reference fields from your input data. Specify the agent to evaluate by name:
input_messages = {
"type": "template",
"template": [
{
"type": "message",
"role": "developer",
"content": {
"type": "input_text",
"text": "You are a helpful assistant. Answer clearly and safely."
}
},
{
"type": "message",
"role": "user",
"content": {
"type": "input_text",
"text": "{{item.query}}"
}
}
]
}
target = {
"type": "azure_ai_agent",
"name": "my-agent",
"version": "1" # Optional. Uses latest version if omitted.
}
Set up evaluators and data mappings
When the agent generates responses at runtime, use {{sample.*}} variables in data_mapping to reference the agent's output:
| Variable | Description | Use for |
|---|---|---|
{{sample.output_text}} |
The agent's plain text response. | Evaluators that expect a string response (for example, coherence, violence). |
{{sample.output_items}} |
The agent's structured JSON output, including tool calls. | Evaluators that need full interaction context (for example, task_adherence). |
{{item.field}} |
A field from your input data. | Input fields like query or ground_truth. |
Tip
The query field can contain structured JSON, including system messages and conversation history. Some agent evaluators such as task_adherence use this context for more accurate scoring. For details on query formatting, see agent evaluators.
data_source_config = DataSourceConfigCustom(
type="custom",
item_schema={
"type": "object",
"properties": {
"query": {"type": "string"},
},
"required": ["query"],
},
include_sample_schema=True,
)
testing_criteria = [
{
"type": "azure_ai_evaluator",
"name": "coherence",
"evaluator_name": "builtin.coherence",
"initialization_parameters": {
"model": model_deployment_name,
},
"data_mapping": {
"query": "{{item.query}}",
"response": "{{sample.output_text}}",
},
},
{
"type": "azure_ai_evaluator",
"name": "violence",
"evaluator_name": "builtin.violence",
"data_mapping": {
"query": "{{item.query}}",
"response": "{{sample.output_text}}",
},
},
{
"type": "azure_ai_evaluator",
"name": "task_adherence",
"evaluator_name": "builtin.task_adherence",
"initialization_parameters": {
"model": model_deployment_name,
},
"data_mapping": {
"query": "{{item.query}}",
"response": "{{sample.output_items}}",
},
},
]
Create evaluation and run
eval_object = openai_client.evals.create(
name="Agent Target Evaluation",
data_source_config=data_source_config,
testing_criteria=testing_criteria,
)
data_source = {
"type": "azure_ai_target_completions",
"source": {
"type": "file_id",
"id": data_id,
},
"input_messages": input_messages,
"target": target,
}
agent_eval_run = openai_client.evals.runs.create(
eval_id=eval_object.id,
name="agent-target-evaluation",
data_source=data_source,
)
For a complete runnable example, see sample_agent_evaluation.py on GitHub. To poll for completion and interpret results, see Get results.
Hosted agent invocations protocol
Hosted agents that use the invocations protocol support the same azure_ai_agent target type but use a freeform input_messages format. Instead of the structured template format, provide a JSON object that maps directly to the agent's /invocations request body. Use {{item.*}} placeholders to substitute fields from your input data.
If a hosted agent supports both the responses and invocations protocols, the service defaults to using the invocations protocol.
Define the message format and target
input_messages = {"message": "{{item.query}}"}
target = {
"type": "azure_ai_agent",
"name": "my-hosted-agent", # Replace with your hosted agent name
"version": "1",
}
Create evaluation and run
eval_object = openai_client.evals.create(
name="Hosted Agent Invocations Evaluation",
data_source_config=data_source_config,
testing_criteria=testing_criteria,
)
data_source = {
"type": "azure_ai_target_completions",
"source": {
"type": "file_id",
"id": data_id,
},
"input_messages": input_messages,
"target": target,
}
eval_run = openai_client.evals.runs.create(
eval_id=eval_object.id,
name="hosted-agent-invocations-evaluation",
data_source=data_source,
)
The evaluator setup and data mappings are the same as for prompt agent evaluation. Use {{sample.output_text}} for the agent's text response and {{sample.output_items}} for the full structured output including tool calls.
Agent response evaluation
Retrieve and evaluate Foundry agent responses by response IDs using the azure_ai_responses data source type. Use this scenario to evaluate specific agent interactions after they occur.
Tip
Before you begin, complete Get started.
A response ID is a unique identifier returned each time a Foundry agent generates a response. You can collect response IDs from agent interactions by using the Responses API or from your application's trace logs. Provide the IDs inline as file content, or upload them as a dataset (see Prepare input data).
Collect response IDs
Each call to the Responses API returns a response object with a unique id field. Collect these IDs from your application's interactions, or generate them directly:
# Generate response IDs by calling a model through the Responses API
response = openai_client.responses.create(
model=model_deployment_name,
input="What is machine learning?",
)
print(response.id) # Example: resp_abc123
You can also collect response IDs from agent interactions in your application's trace logs or monitoring pipeline. Each response ID uniquely identifies a stored response that the evaluation service can retrieve.
Create evaluation and run
data_source_config = {"type": "azure_ai_source", "scenario": "responses"}
testing_criteria = [
{
"type": "azure_ai_evaluator",
"name": "coherence",
"evaluator_name": "builtin.coherence",
"initialization_parameters": {
"model": model_deployment_name,
},
},
{
"type": "azure_ai_evaluator",
"name": "violence",
"evaluator_name": "builtin.violence",
},
]
eval_object = openai_client.evals.create(
name="Agent Response Evaluation",
data_source_config=data_source_config,
testing_criteria=testing_criteria,
)
data_source = {
"type": "azure_ai_responses",
"item_generation_params": {
"type": "response_retrieval",
"data_mapping": {"response_id": "{{item.resp_id}}"},
"source": {
"type": "file_content",
"content": [
{"item": {"resp_id": "resp_abc123"}},
{"item": {"resp_id": "resp_def456"}},
]
},
},
}
eval_run = openai_client.evals.runs.create(
eval_id=eval_object.id,
name="agent-response-evaluation",
data_source=data_source,
)
For a complete runnable example, see sample_agent_response_evaluation.py on GitHub. To poll for completion and interpret results, see Get results.
Trace evaluation (preview)
Evaluate agent interactions that Application Insights already captured. Use the azure_ai_traces data source type. This scenario is useful for post-deployment evaluation of real production traffic. You select traces from your monitoring pipeline and run evaluators against them without replaying any requests.
Important
Trace evaluation is the recommended approach for evaluating agents not built with the Microsoft Foundry Agent Service - including LangChain and custom frameworks. As long as your agent emits OpenTelemetry spans following the GenAI semantic conventions to Application Insights, trace evaluation can assess its interactions by using the same evaluators available for Foundry agents.
Trace evaluation supports two modes:
- By trace IDs - Evaluate specific agent interactions by providing their
operation_Idvalues from Application Insights. - By agent filter - Automatically discover and evaluate recent traces for a given agent, without manually collecting trace IDs.
Tip
Before you begin, complete Get started. This scenario also requires an Application Insights resource connected to your Foundry project.
Intelligent sampling
Trace evaluation supports intelligent sampling, which selects a representative subset of traces for evaluation instead of evaluating every captured trace. Enable this feature by turning on the Intelligent sampling toggle in the Foundry portal when you configure a trace evaluation run. Intelligent sampling reduces evaluation cost while preserving trace diversity - ensuring that edge cases, error paths, and varied conversation patterns are included in the evaluated set.
How intelligent sampling works
The sampling algorithm uses a MinHash farthest-first diversity approach that runs in multiple stages:
- Exact deduplication - Removes duplicate traces from the pool.
- Hard filters - Removes broken sessions, truncated traces, and malformed tool calls that aren't suitable for evaluation.
- Aggregation - Combines trace-level signals into a unified representation.
- MinHash farthest-first selection - Computes locality-sensitive hashes (MinHash signatures) of user text to estimate similarity between traces, then iteratively selects the most dissimilar trace from the remaining pool. Each successive pick maximizes distance from all previously selected traces.
This approach produces significantly higher lexical diversity and broader vocabulary coverage compared to random sampling, which means the evaluated set better represents the full range of agent interactions - including rare, hard, and novel cases that random sampling tends to miss.
Intelligent sampling is particularly effective for:
- Evaluation and benchmarks - Maximizes coverage of the input distribution so evaluation scores reflect real-world diversity.
- Rubric generation - Produces more focused and actionable rubrics by exposing diverse conversation patterns.
- Finetuning dataset curation - Selects traces that help models learn more efficiently.
The algorithm runs entirely on local compute with no extra API calls, so it doesn't incur extra model inference costs beyond the evaluation itself.
Intelligent sampling example
# Eval group for trace-based evaluations
data_source_config = {
"type": "azure_ai_source",
"scenario": "traces",
}
print("Creating trace-based evaluation group")
eval_object = client.evals.create(
name="Trace Evaluation (Agent Smart Filter)",
data_source_config=data_source_config, # type: ignore
testing_criteria=testing_criteria,
)
print(f"Evaluation created (id: {eval_object.id})")
# Compute time window in unix seconds
# Pad end_time by +600s (10 min) to avoid ingestion-delay edge exclusion
now_unix = int(time.time())
end_time = now_unix + 600
start_time = now_unix - (args.lookback_hours * 3600)
# Build trace_source based on mode
trace_source: dict = {
"type": "agent_filter",
"start_time": start_time,
"end_time": end_time,
"max_traces": args.max_traces,
"filter_strategy": "smart_filtering"
}
# Add agent name/version or agent id
trace_source["agent_name"] = agent_name
trace_source["agent_version"] = agent_version
## trace_source["agent_id"] = args.agent_id
data_source = {
"type": "azure_ai_trace_data_source_preview",
"trace_source": trace_source,
}
eval_run = client.evals.runs.create(
eval_id=eval_object.id,
name="trace-evaluation-agent-smart-filter-run",
data_source=data_source, # type: ignore
)
Trace data requirements
Trace evaluation requires your agent to emit spans that follow the OpenTelemetry semantic conventions for generative AI. Specifically, the evaluation service reads invoke_agent spans from Application Insights and extracts conversation data from their attributes.
The following span attributes are used:
| Attribute | Required | Description |
|---|---|---|
gen_ai.operation.name |
Yes | Must equal "invoke_agent". The service ignores all other spans. |
gen_ai.agent.id |
For agent filter mode | Unique agent identifier (format: agent-name:version). |
gen_ai.agent.name |
For agent filter mode | Human-readable agent name. |
gen_ai.input.messages |
For evaluators query inputs | JSON array of input messages following the GenAI semantic conventions message format. Messages with role user or system map to query; messages with role assistant or tool map to response. |
gen_ai.output.messages |
For evaluators query inputs | JSON array of model-generated output messages. All output messages map to response. If output also contains type: tool_call or type: tool_result, it maps to tool_calls. |
gen_ai.tool.definitions |
Optional | JSON array of tool schemas available to the agent. If absent, the service attempts to infer tool definitions from tool call messages, but inferred schemas might be incomplete. |
gen_ai.conversation.id |
Optional | Conversation identifier, passed through to evaluation results for correlation. |
Note
If gen_ai.input.messages and gen_ai.output.messages are empty or missing, quality evaluators (coherence, fluency, relevance, intent resolution) return score=None. Safety evaluators (violence, self-harm, sexual, hate/unfairness) can still produce scores with partial data but they might not produce meaningful results.
For Python agents built with the Azure AI Agent Server SDK, add the [tracing] extra to enable automatic span emission:
pip install "azure-ai-agentserver-core[tracing]"
Prerequisites for trace evaluation
In addition to the general prerequisites, trace evaluation requires:
- An Application Insights resource connected to your Foundry project. See Set up tracing in Microsoft Foundry.
- The project's managed identity must have the Log Analytics Reader role on both the Application Insights resource and its linked Log Analytics workspace.
- The
azure-monitor-queryPython package (only needed if you collect trace IDs manually).
pip install "azure-ai-projects>=2.2.0" azure-monitor-query
Set these environment variables:
APPINSIGHTS_RESOURCE_ID— The Application Insights resource ID (for example,/subscriptions/<subscription_id>/resourceGroups/<rg_name>/providers/Microsoft.Insights/components/<resource_name>).AGENT_ID— The agent identifier emitted by the tracing integration (gen_ai.agent.idattribute), used to filter traces. Format:agent-name:version.TRACE_LOOKBACK_HOURS— (Optional) Number of hours to look back when querying traces. Defaults to1.
Option A: Evaluate by agent filter
The simplest approach is to let the service automatically discover and evaluate recent traces for a specific agent. No manual trace ID collection needed.
import os
agent_id = os.environ["AGENT_ID"] # e.g., "my-weather-agent:1"
trace_lookback_hours = int(os.environ.get("TRACE_LOOKBACK_HOURS", "1"))
# Create the evaluation
data_source_config = {
"type": "azure_ai_source",
"scenario": "traces",
}
eval_object = openai_client.evals.create(
name="Agent Trace Evaluation (by agent)",
data_source_config=data_source_config,
testing_criteria=testing_criteria, # See "Set up evaluators" below
)
# Create a run — the service queries App Insights for matching traces
data_source = {
"type": "azure_ai_traces",
"agent_id": agent_id,
"max_traces": 50, # Maximum number of traces to evaluate
"lookback_hours": trace_lookback_hours,
}
eval_run = openai_client.evals.runs.create(
eval_id=eval_object.id,
name="agent-trace-eval-run",
data_source=data_source,
)
print(f"Evaluation run started: {eval_run.id}")
The service filters invoke_agent spans by the gen_ai.agent.id attribute, samples up to max_traces unique trace IDs, and evaluates all spans from those traces.
Option B: Evaluate by trace IDs
For more control, collect specific trace IDs from Application Insights and evaluate them. This method is useful when you want to evaluate a curated set of interactions, such as traces flagged by alerts or sampled for quality review.
Collect trace IDs from Application Insights
Query Application Insights for operation_Id values from your agent's traces. Each operation_Id represents a complete agent interaction:
import os
from datetime import datetime, timedelta, timezone
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient, LogsQueryStatus
appinsights_resource_id = os.environ["APPINSIGHTS_RESOURCE_ID"]
agent_id = os.environ["AGENT_ID"]
trace_query_hours = int(os.environ.get("TRACE_LOOKBACK_HOURS", "1"))
end_time = datetime.now(timezone.utc)
start_time = end_time - timedelta(hours=trace_query_hours)
query = f"""dependencies
| where timestamp between (datetime({start_time.isoformat()}) .. datetime({end_time.isoformat()}))
| extend agent_id = tostring(customDimensions["gen_ai.agent.id"])
| where agent_id == "{agent_id}"
| distinct operation_Id"""
credential = DefaultAzureCredential()
logs_client = LogsQueryClient(credential)
response = logs_client.query_resource(
appinsights_resource_id,
query=query,
timespan=None, # Time range is specified in the query itself
)
trace_ids = []
if response.status == LogsQueryStatus.SUCCESS:
for table in response.tables:
for row in table.rows:
trace_ids.append(row[0])
print(f"Found {len(trace_ids)} trace IDs")
Create evaluation and run with trace IDs
# Create the evaluation
data_source_config = {
"type": "azure_ai_source",
"scenario": "traces",
}
eval_object = openai_client.evals.create(
name="Agent Trace Evaluation (by trace IDs)",
data_source_config=data_source_config,
testing_criteria=testing_criteria, # See "Set up evaluators" below
)
# Create a run using the collected trace IDs
data_source = {
"type": "azure_ai_traces",
"trace_ids": trace_ids,
"lookback_hours": trace_query_hours,
}
eval_run = openai_client.evals.runs.create(
eval_id=eval_object.id,
name="agent-trace-eval-run",
metadata={
"agent_id": agent_id,
"start_time": start_time.isoformat(),
"end_time": end_time.isoformat(),
},
data_source=data_source,
)
print(f"Evaluation run started: {eval_run.id}")
Set up evaluators and data mappings
When you evaluate traces, the service automatically extracts conversation data from the OpenTelemetry span attributes. Use these field names directly in data_mapping (without the item. or sample. prefixes used in other scenarios):
| Variable | Source attribute | Description |
|---|---|---|
{{item.query}} |
gen_ai.input.messages (user/system roles) |
The user query extracted from the trace. |
{{item.response}} |
gen_ai.input.messages (assistant/tool roles) + gen_ai.output.messages |
The agent's response extracted from the trace. |
{{item.tool_definitions}} |
gen_ai.tool.definitions |
Tool schemas available to the agent. Only required for tool-related evaluators. |
{{item.tool_calls}} |
Extracted from assistant messages in gen_ai.input.messages / gen_ai.output.messages |
Tool calls made by the agent during the interaction. Used by tool evaluators. Only required for tool-related evaluators. |
testing_criteria = [
# Quality evaluators — require query and response from trace data
{
"type": "azure_ai_evaluator",
"name": "intent_resolution",
"evaluator_name": "builtin.intent_resolution",
"data_mapping": {
"query": "{{item.query}}",
"response": "{{item.response}}",
"tool_definitions": "{{item.tool_definitions}}",
},
"initialization_parameters": {
"model": model_deployment_name,
},
},
# Tool evaluators — assess tool usage quality
{
"type": "azure_ai_evaluator",
"name": "tool_call_accuracy",
"evaluator_name": "builtin.tool_call_accuracy",
"data_mapping": {
"query": "{{item.query}}",
"response": "{{item.response}}",
"tool_calls": "{{item.tool_calls}}",
"tool_definitions": "{{item.tool_definitions}}",
},
"initialization_parameters": {
"model": model_deployment_name,
},
},
# Safety evaluators — work even with partial trace data
{
"type": "azure_ai_evaluator",
"name": "violence",
"evaluator_name": "builtin.violence",
"data_mapping": {
"query": "{{item.query}}",
"response": "{{item.response}}",
},
"initialization_parameters": {
"threshold": 4,
},
},
]
For a complete runnable example, see sample_evaluations_builtin_with_traces.py on GitHub. To poll for completion and interpret results, see Get results.
Synthetic data evaluation (preview)
Use the azure_ai_synthetic_data_gen_preview data source type to generate synthetic test queries, send them to a deployed model or Foundry agent, and evaluate the responses. Use this scenario when you don't have a test dataset. The service generates queries based on a prompt you provide (and/or from the agent's instructions), runs them against your target, and evaluates the responses.
Tip
Before you begin, complete Get started.
How synthetic data evaluation works
- The service generates synthetic queries based on your
promptand optional seed data files. - Each query is sent to the specified target (model or agent) to generate a response.
- Evaluators score each response using the generated query and response.
- The generated queries are stored as a dataset in your project for reuse.
Parameters
| Parameter | Required | Description |
|---|---|---|
samples_count |
Yes | Maximum number of synthetic test queries to generate. |
model_deployment_name |
Yes | Model deployment to use for generating synthetic queries. Only models with Responses API capability are supported. For availability, see Responses API region availability. |
prompt |
No | Instructions describing the type of queries to generate. Optional when the agent target has instructions configured. |
output_dataset_name |
No | Name for the output dataset where generated queries are stored. If you don't provide a name, the service generates one automatically. |
sources |
No | Seed data files (by file ID) to improve relevance of generated queries. Currently only one file is supported. |
Set up evaluators and data mappings
The synthetic data generator produces queries in the {{item.query}} field. The target generates responses available in {{sample.output_text}}. Map these fields to your evaluators:
data_source_config = {"type": "azure_ai_source", "scenario": "synthetic_data_gen_preview"}
testing_criteria = [
{
"type": "azure_ai_evaluator",
"name": "coherence",
"evaluator_name": "builtin.coherence",
"initialization_parameters": {
"model": model_deployment_name,
},
"data_mapping": {
"query": "{{item.query}}",
"response": "{{sample.output_text}}",
},
},
{
"type": "azure_ai_evaluator",
"name": "violence",
"evaluator_name": "builtin.violence",
"data_mapping": {
"query": "{{item.query}}",
"response": "{{sample.output_text}}",
},
},
]
Create evaluation and run
Model target
Generate synthetic queries and evaluate a model:
eval_object = openai_client.evals.create(
name="Synthetic Data Evaluation",
data_source_config=data_source_config,
testing_criteria=testing_criteria,
)
data_source = {
"type": "azure_ai_synthetic_data_gen_preview",
"item_generation_params": {
"type": "synthetic_data_gen_preview",
"samples_count": 5,
"prompt": "Generate customer service questions about returning defective products",
"model_deployment_name": model_deployment_name,
"output_dataset_name": "my-synthetic-dataset",
},
"target": {
"type": "azure_ai_model",
"model": model_deployment_name,
},
}
eval_run = openai_client.evals.runs.create(
eval_id=eval_object.id,
name="synthetic-data-evaluation",
data_source=data_source,
)
You can optionally add a system prompt to shape the target model's behavior. When you use input_messages with synthetic data generation, include only system role messages - the service provides the generated queries as user messages automatically.
data_source = {
"type": "azure_ai_synthetic_data_gen_preview",
"item_generation_params": {
"type": "synthetic_data_gen_preview",
"samples_count": 5,
"prompt": "Generate customer service questions about returning defective products",
"model_deployment_name": model_deployment_name,
},
"target": {
"type": "azure_ai_model",
"model": model_deployment_name,
},
"input_messages": {
"type": "template",
"template": [
{
"type": "message",
"role": "system",
"content": {
"type": "input_text",
"text": "You are a helpful customer service agent. Be empathetic and solution-oriented."
}
}
]
},
}
Agent target
Generate synthetic queries and evaluate a Foundry agent:
data_source = {
"type": "azure_ai_synthetic_data_gen_preview",
"item_generation_params": {
"type": "synthetic_data_gen_preview",
"samples_count": 5,
"prompt": "Generate questions about returning defective products",
"model_deployment_name": model_deployment_name,
},
"target": {
"type": "azure_ai_agent",
"name": agent_name,
"version": agent_version,
},
}
eval_run = openai_client.evals.runs.create(
eval_id=eval_object.id,
name="synthetic-agent-evaluation",
data_source=data_source,
)
To poll for completion and interpret results, see Get results. The response includes an output_dataset_id property that contains the ID of the generated dataset, which you can use to retrieve or reuse the synthetic data.
Conversation-level evaluation (preview)
Evaluate complete conversations to assess agent quality across entire user interactions - not just individual responses. Use conversation-level evaluation to identify quality problems like incomplete task resolution, user frustration, and tool-call regressions that turn-level evaluation misses.
For example, consider a support agent where the user grows frustrated over multiple turns:
Turn 1 — User: "I need to reset my password." Agent: "I found your account. I'll send a reset link."
Turn 2 — User: "I didn't get the email." Agent: "I've resent the link. Please check spam."
Turn 3 — User: "Still nothing. Can you just reset it directly?" Agent: "I've sent another reset link."
A turn-level evaluator scores only the last response - which is polite and takes action - so it scores well. A conversation-level evaluator grading customer satisfaction across the conversation flags that the agent repeated the same failing action three times without trying an alternative, leaving the user's problem unresolved.
Conversation-level evaluation differs from turn-level evaluation in several ways:
| Aspect | Turn-level | Conversation-level |
|---|---|---|
| Scope | Individual query-response pairs | Complete conversations with multiple exchanges |
| Metrics | Per-response quality and safety | Conversation-level outcomes and user satisfaction |
| Data format | JSONL with query and response fields |
JSONL with messages array containing the full conversation |
| Use case | Testing individual model responses | Testing end-to-end agent experiences |
Conversation-level evaluation supports four data source options:
| Option | When to use | Data source type |
|---|---|---|
| From dataset or inline | You have local conversation traces or test data | jsonl with file_id or file_content |
| By conversation ID | You want to evaluate specific conversations from App Insights | azure_ai_trace_data_source_preview with trace_source |
| By agent filter with sampling | You want to assess overall agent quality across sampled production traffic | azure_ai_trace_data_source_preview with trace_source |
| Simulated conversations | You want to generate synthetic test conversations | azure_ai_target_completions with conversation_gen_preview |
Choose an evaluation level
The evaluation_level parameter on the run determines whether evaluators score individual turns or complete conversations:
| Value | Behavior |
|---|---|
"turn" |
Evaluators score each turn independently. |
"conversation" |
Evaluators score the entire conversation as a whole. |
| (omitted) | Defaults to "turn". |
Important
Evaluator compatibility: Each evaluator supports specific evaluation levels. Check the evaluator's supported_evaluation_levels field in the evaluator catalog.
- Turn-only evaluators (for example,
fluency,relevance) can't be used withevaluation_level="conversation". - Currently, all conversation-level evaluators support both
"turn"and"conversation"levels.
Common errors
| Error | Cause | Solution |
|---|---|---|
| Incompatible evaluation level | Using evaluation_level="conversation" with a turn-only evaluator |
Remove the turn-only evaluator or change to evaluation_level="turn" |
Prepare conversation data
Create a JSONL file where each line contains a complete conversation in the messages field. Each message should include a role (user, assistant, or system) and content. For a complete example, see the conversation evaluation samples in the SDK:
{"messages": [{"role": "user", "content": "What's my account balance?"}, {"role": "assistant", "content": "Your current balance is $1,234.56."}, {"role": "user", "content": "Thanks!"}, {"role": "assistant", "content": "You're welcome! Is there anything else?"}]}
You can also include tool definitions and tool calls if your agent uses tools:
{"messages": [{"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "Paris"}]}
{"messages": [{"role": "user", "content": "How do I reverse a string in Python?"}, {"role": "assistant", "content": "You can reverse a string in Python by using slicing: string[::-1]"}]}
{"messages": [{"role": "user", "content": "What are the main causes of climate change?"}, {"role": "assistant", "content": "The main causes of climate change are the increase in greenhouse gases in the atmosphere, primarily due to human activities such as burning fossil fuels and deforestation."}]}
{"messages": [{"role": "user", "content": "What's my account balance?"}, {"role": "assistant", "content": null, "tool_calls": [{"id": "call_abc123", "type": "function", "function": {"name": "get_account_balance", "arguments": "{\"account_id\": \"ACCT-7890\"}"}}]}, {"role": "tool", "tool_call_id": "call_abc123", "content": "{ \"balance\": 1234.56, \"currency\": \"USD\" }"}, {"role": "assistant", "content": "Your current balance is 1,234.56."}, {"role": "user", "content": "Thanks!"}, {"role": "assistant", "content": "You're welcome! Is there anything else?"}], "tool_definitions": [{"name": "get_account_balance", "description": "Retrieves the current balance for a customer account", "parameters": {"type": "object", "properties": {"account_id": {"type": "string"}}, "required": ["account_id"]}}]}
{"messages": [{"role": "user", "content": "Explain the theory of relativity in simple terms."}, {"role": "assistant", "content": "Einstein's theory of relativity shows that space and time are interconnected and relative to the observer's frame of reference."}]}
{"messages": [{"role": "user", "content": "What's the weather in Seattle?"}, {"role": "assistant", "content": null, "tool_calls": [{"id": "call_002", "type": "function", "function": {"name": "get_weather", "arguments": "{\"location\": \"Seattle, WA\"}"}}]}, {"role": "tool", "tool_call_id": "call_002", "content": "{ \"temperature\": 55, \"condition\": \"Cloudy\" }"}, {"role": "assistant", "content": "It's currently 55F and cloudy in Seattle."}], "tool_definitions": [{"name": "get_weather", "description": "Get the current weather for a location", "parameters": {"type": "object", "properties": {"location": {"type": "string"}}, "required": ["location"]}}]}
{"messages": [{"role": "user", "content": "What is the tallest mountain in the world?"}, {"role": "assistant", "content": "Mount Everest is the tallest mountain in the world."}]}
{"messages": [{"role": "user", "content": "Is 4 x 2 = 16?"}, {"role": "assistant", "content": "No, 4 x 2 = 8."}]}
{"messages": [{"role": "user", "content": "What is the best Italian desert?"}, {"role": "assistant", "content": "Tiramisu is a popular Italian dessert."}]}
{"messages": [{"role": "user", "content": "What is the chemical formula for water?"}, {"role": "assistant", "content": "The chemical formula for water is H2O."}]}
Define the data schema and evaluators
Specify the schema for your conversation data, "messages", and select evaluators designed for conversation-level evaluation. Conversation-level evaluators assess the entire interaction rather than individual turns.
pip install "azure-ai-projects>=2.2.0"
import os
from openai.types.eval_create_params import DataSourceConfigCustom
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import TestingCriterionAzureAIEvaluator
endpoint = os.environ["FOUNDRY_PROJECT_ENDPOINT"]
model_deployment_name = os.environ["FOUNDRY_MODEL_NAME"]
with (
DefaultAzureCredential() as credential,
AIProjectClient(endpoint=endpoint, credential=credential) as project_client,
project_client.get_openai_client() as openai_client,
):
data_source_config = DataSourceConfigCustom(
type="custom",
item_schema={
"type": "object",
"properties": {
"messages": {"type": "array"},
"tool_definitions": {"type": "array"},
},
"required": ["messages"],
},
include_sample_schema=False,
)
testing_criteria = [
TestingCriterionAzureAIEvaluator(
type="azure_ai_evaluator",
name="customer_satisfaction",
evaluator_name="builtin.customer_satisfaction",
initialization_parameters={"model": model_deployment_name},
data_mapping={"messages": "{{item.messages}}"},
),
TestingCriterionAzureAIEvaluator(
type="azure_ai_evaluator",
name="task_completion",
evaluator_name="builtin.task_completion",
initialization_parameters={"model": model_deployment_name},
data_mapping={"messages": "{{item.messages}}"},
),
TestingCriterionAzureAIEvaluator(
type="azure_ai_evaluator",
name="conversation_coherence",
evaluator_name="builtin.coherence",
initialization_parameters={"model": model_deployment_name},
data_mapping={"messages": "{{item.messages}}"},
),
TestingCriterionAzureAIEvaluator(
type="azure_ai_evaluator",
name="groundedness",
evaluator_name="builtin.groundedness",
initialization_parameters={"model": model_deployment_name},
data_mapping={"messages": "{{item.messages}}"},
),
]
Create evaluation and run
Prep: download sample_data_multiturn_conversations.jsonl
from openai.types.evals.create_eval_jsonl_run_data_source_param import (
CreateEvalJSONLRunDataSourceParam,
SourceFileID,
)
# Upload conversation data
data_id = project_client.datasets.upload_file(
name="multiturn-conversation-data",
version="1",
file_path="./sample_data_multiturn_conversations.jsonl",
).id
# Create the evaluation
eval_object = openai_client.evals.create(
name="Multi-turn Conversation Evaluation",
data_source_config=data_source_config,
testing_criteria=testing_criteria,
)
# Create a run with evaluation_level set to "conversation"
eval_run = openai_client.evals.runs.create(
eval_id=eval_object.id,
name="multiturn-conversation-run",
data_source=CreateEvalJSONLRunDataSourceParam(
type="jsonl",
source=SourceFileID(
type="file_id",
id=data_id,
),
),
extra_body={"evaluation_level": "conversation"},
)
To poll for completion and interpret results, see Get results.
Evaluate conversations by ID from traces
Evaluate specific conversations from Application Insights by providing their conversation IDs. Use this option to root-cause problems or verify fixes on specific interactions. For example, you can investigate a conversation flagged by an alert or verify a fix for a known issue.
Where to find conversation IDs
Find conversation IDs in:
- Application Insights trace logs UI — Browse to interesting traces and locate the
conversation_idfield in the trace details. - Your application's logging output — If you set
conversation_idexplicitly when creating agent responses, retrieve it from your logs. - OpenTelemetry trace context — The
conversation_idmight also be derived from the traceparent header if your agent uses standard trace context propagation.
Note
Tool definitions are automatically retrieved from the traces or queried from the agent registry. You don't need to provide them in the request.
Parameters for conversation ID lookup
| Parameter | Required | Description |
|---|---|---|
conversation_ids |
Yes | Array of conversation IDs to evaluate. |
lookback_hours |
No | Hours to search back from end_time. Defaults to seven days (168 hours). |
end_time |
No | End of the search window (ISO 8601 format). Defaults to the current time. |
import os
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import TestingCriterionAzureAIEvaluator
endpoint = os.environ["FOUNDRY_PROJECT_ENDPOINT"]
model_deployment_name = os.environ["FOUNDRY_MODEL_NAME"]
# Provide conversation IDs or trace IDs from App Insights
conversation_ids = ["conversation_1234", "conversation_5678"]
with (
DefaultAzureCredential() as credential,
AIProjectClient(endpoint=endpoint, credential=credential) as project_client,
project_client.get_openai_client() as openai_client,
):
# Eval group for trace-based evaluations
data_source_config = {
"type": "azure_ai_source",
"scenario": "traces",
}
testing_criteria = [
TestingCriterionAzureAIEvaluator(
type="azure_ai_evaluator",
name="customer_satisfaction",
evaluator_name="builtin.customer_satisfaction",
initialization_parameters={"model": model_deployment_name},
data_mapping={"messages": "{{item.messages}}"},
),
TestingCriterionAzureAIEvaluator(
type="azure_ai_evaluator",
name="task_completion",
evaluator_name="builtin.task_completion",
initialization_parameters={"model": model_deployment_name},
data_mapping={"messages": "{{item.messages}}"},
),
TestingCriterionAzureAIEvaluator(
type="azure_ai_evaluator",
name="conversation_coherence",
evaluator_name="builtin.coherence",
initialization_parameters={"model": model_deployment_name},
data_mapping={"messages": "{{item.messages}}"},
),
TestingCriterionAzureAIEvaluator(
type="azure_ai_evaluator",
name="groundedness",
evaluator_name="builtin.groundedness",
initialization_parameters={"model": model_deployment_name},
data_mapping={"messages": "{{item.messages}}"},
),
]
# Create evaluation with traces scenario
eval_object = openai_client.evals.create(
name="Multi-turn Trace Evaluation (by ID)",
data_source_config=data_source_config,
testing_criteria=testing_criteria,
)
# Run evaluation on specific conversation IDs
eval_run = openai_client.evals.runs.create(
eval_id=eval_object.id,
name="multiturn-trace-by-id-run",
data_source={
"type": "azure_ai_trace_data_source_preview",
"trace_source": {
"type": "conversation_id_source",
"conversation_ids": conversation_ids,
},
},
extra_body={"evaluation_level": "conversation"},
)
Note
- Application Insights data ingestion can cause a delay between when traces are generated and when they're available for evaluation. If the query doesn't find traces, wait a few minutes and retry.
- The maximum lookback is 7 days (168 hours). To access older traces, use
start_timeandend_timewithin your App Insights retention limits.
Evaluate sampled conversations by agent filter
Evaluate a sampled set of conversations from Application Insights by filtering on agent name. Use this option to assess overall agent quality across production traffic. For example, run regular quality assessments or monitor for quality degradation in production.
The agent you specify for filtering can be part of a multi-agent conversation. The filter matches any conversation where that agent participated.
Note
Tool definitions are automatically retrieved from the traces or queried from the agent registry. You don't need to provide them in the request.
Agent identity fields
Specify the agent to filter by using one of these formats:
| Format | Example | Description |
|---|---|---|
agent_name + agent_version |
"agent_name": "my-agent", "agent_version": "1" |
Two separate fields. If agent_version is omitted, use the latest version. |
agent_id |
"agent_id": "my-agent:1" |
Single string in "name:version" format. |
Filter strategies
| Strategy | Description |
|---|---|
random_sampling |
(Default) Uniformly random sample up to max_traces conversations. |
smart_filtering |
Service-managed heuristic that biases toward "interesting" traces - conversations with potential problems, edge cases, or anomalies. |
Parameters
| Parameter | Required | Description |
|---|---|---|
agent_name |
Yes | The agent name to filter traces by. |
agent_version |
No | The agent version. If omitted, uses the latest version. |
agent_id |
No | Alternative to agent_name + agent_version. Single string in format "name:version". |
start_time |
Yes | Start of the time window (Unix epoch seconds, UTC). |
end_time |
Yes | End of the time window (Unix epoch seconds, UTC). Pad by +600 seconds to avoid ingestion delay. |
max_traces |
No | Maximum conversations to sample. Defaults to 1,000. |
filter_strategy |
No | "random_sampling" (default) or "smart_filtering" (service-managed heuristic that biases toward interesting traces). |
Important
The time window (end_time - start_time) must be at least 15 minutes (900 seconds). This requirement exists because conversation-level queries apply a 5-minute inactivity buffer on each edge to avoid partial conversations.
import os
import time
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import TestingCriterionAzureAIEvaluator
endpoint = os.environ["FOUNDRY_PROJECT_ENDPOINT"]
model_deployment_name = os.environ["FOUNDRY_MODEL_NAME"]
agent_name = os.environ["FOUNDRY_AGENT_NAME"]
agent_version = os.environ.get("FOUNDRY_AGENT_VERSION", "")
with (
DefaultAzureCredential() as credential,
AIProjectClient(endpoint=endpoint, credential=credential) as project_client,
project_client.get_openai_client() as openai_client,
):
# Eval group for trace-based evaluations
data_source_config = {
"type": "azure_ai_source",
"scenario": "traces",
}
testing_criteria = [
TestingCriterionAzureAIEvaluator(
type="azure_ai_evaluator",
name="customer_satisfaction",
evaluator_name="builtin.customer_satisfaction",
initialization_parameters={"model": model_deployment_name},
data_mapping={"messages": "{{item.messages}}"},
),
TestingCriterionAzureAIEvaluator(
type="azure_ai_evaluator",
name="task_completion",
evaluator_name="builtin.task_completion",
initialization_parameters={"model": model_deployment_name},
data_mapping={"messages": "{{item.messages}}"},
),
TestingCriterionAzureAIEvaluator(
type="azure_ai_evaluator",
name="conversation_coherence",
evaluator_name="builtin.coherence",
initialization_parameters={"model": model_deployment_name},
data_mapping={"messages": "{{item.messages}}"},
),
TestingCriterionAzureAIEvaluator(
type="azure_ai_evaluator",
name="groundedness",
evaluator_name="builtin.groundedness",
initialization_parameters={"model": model_deployment_name},
data_mapping={"messages": "{{item.messages}}"},
),
]
eval_object = openai_client.evals.create(
name="Multi-turn Trace Evaluation (Agent Filter)",
data_source_config=data_source_config,
testing_criteria=testing_criteria,
)
# Compute time window in unix seconds
# Pad end_time by +600s (10 min) to avoid ingestion-delay edge exclusion
now_unix = int(time.time())
end_time = now_unix + 600
start_time = now_unix - (24 * 3600) # 24 hours lookback
# Build trace_source with agent filter
trace_source = {
"type": "agent_filter",
"agent_name": agent_name,
"start_time": start_time,
"end_time": end_time,
"max_traces": 5,
}
if agent_version:
trace_source["agent_version"] = agent_version
# Run evaluation on sampled agent conversations
eval_run = openai_client.evals.runs.create(
eval_id=eval_object.id,
name="multiturn-agent-filter-run",
data_source={
"type": "azure_ai_trace_data_source_preview",
"trace_source": trace_source,
},
extra_body={"evaluation_level": "conversation"},
)
Note
The App Insights query timespan is currently limited to a maximum of 7 days (168 hours). You can't access traces older than 7 days without explicitly providing start_time and end_time within App Insights retention limits.
To poll for completion and interpret results, see Get results.
Conversation simulation
Generate simulated conversations from scenario descriptions and evaluate them at the conversation level. Use this scenario to test your agent's behavior in controlled situations before deployment. The service generates realistic conversations based on your scenario descriptions and then evaluates them.
This approach is useful for:
- Pre-deployment testing: Validate agent behavior across diverse scenarios without real user traffic.
- Edge case coverage: Test scenarios that rarely occur naturally but are important to handle well.
- Regression testing: Ensure agent updates don't degrade performance on known scenarios.
- Scale testing: Generate many conversations quickly to stress-test agent capabilities.
How conversation simulation works
- You provide a dataset of scenario descriptions—each row describes a situation the simulated user tries to accomplish.
- The service uses a simulator model to play the role of the user, interacting with your agent based on the scenario.
- Each scenario generates one or more complete conversations.
- Conversation-level evaluators assess the generated conversations.
- Your project stores both the conversations and evaluation results.
Prepare scenario data
Create a JSONL file where each line describes a scenario for the simulated user. The schema requires id, test_case_description, and desired_num_turns. Include details about the user's goal, context, and any constraints. For a complete example, see the conversation evaluation samples in the SDK.
{"id": "contoso_refund_timeline", "test_case_description": "Customer returned an item to Contoso Electronics 5 days ago and hasn't received their refund yet. They want to know how long Contoso refunds take.", "desired_num_turns": 10}
{"id": "contoso_store_hours_lookup", "test_case_description": "Customer wants to know what time the Contoso Electronics store closes today. Simple single-fact question with possibly one clarifying turn about which location.", "desired_num_turns": 3}
Parameters
| Parameter | Required | Description |
|---|---|---|
num_conversations |
No | Number of conversations to generate per scenario. Defaults to 5, server-side cap of 5. |
max_turns |
No | Maximum number of turns (exchanges) per conversation. Defaults to 10, server-side cap of 20. |
model |
Yes | Model deployment to use for simulating the user. For example, gpt-4.1. |
sampling_params |
No | Sampling parameters for the simulator model, including temperature, top_p, and max_completion_tokens. |
data_mapping |
No | Maps fields from your scenario JSONL to simulation parameters. Common mappings: test_case_description, id, desired_num_turns. |
Define evaluators
Select evaluators designed for conversation-level assessment. The simulated conversations automatically map to the evaluators.
import os
from openai.types.eval_create_params import DataSourceConfigCustom
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import TestingCriterionAzureAIEvaluator, PromptAgentDefinition
endpoint = os.environ["FOUNDRY_PROJECT_ENDPOINT"]
model_deployment_name = os.environ["FOUNDRY_MODEL_NAME"]
agent_name = os.environ.get("FOUNDRY_AGENT_NAME", "")
with (
DefaultAzureCredential() as credential,
AIProjectClient(endpoint=endpoint, credential=credential) as project_client,
project_client.get_openai_client() as openai_client,
):
# Simulation uses the same "custom" eval group type as dataset evaluation (S1),
# since the generated conversations follow the same messages schema.
data_source_config = DataSourceConfigCustom(
type="custom",
item_schema={
"type": "object",
"properties": {
"messages": {"type": "array"},
},
"required": ["messages"],
},
include_sample_schema=False,
)
testing_criteria = [
TestingCriterionAzureAIEvaluator(
type="azure_ai_evaluator",
name="customer_satisfaction",
evaluator_name="builtin.customer_satisfaction",
initialization_parameters={"model": model_deployment_name},
data_mapping={"messages": "{{item.messages}}"},
),
TestingCriterionAzureAIEvaluator(
type="azure_ai_evaluator",
name="task_completion",
evaluator_name="builtin.task_completion",
initialization_parameters={"model": model_deployment_name},
data_mapping={"messages": "{{item.messages}}"},
),
TestingCriterionAzureAIEvaluator(
type="azure_ai_evaluator",
name="conversation_coherence",
evaluator_name="builtin.coherence",
initialization_parameters={"model": model_deployment_name},
data_mapping={"messages": "{{item.messages}}"},
),
TestingCriterionAzureAIEvaluator(
type="azure_ai_evaluator",
name="groundedness",
evaluator_name="builtin.groundedness",
initialization_parameters={"model": model_deployment_name},
data_mapping={"messages": "{{item.messages}}"},
),
]
Create evaluation and run
Prep: download sample_data_simulation_scenarios.jsonl.
# Create (or update) an agent to simulate against
agent = project_client.agents.create_version(
agent_name=agent_name,
definition=PromptAgentDefinition(
model=model_deployment_name,
instructions="You are a helpful customer service agent. Be empathetic and solution-oriented.",
),
)
# Upload scenario data
scenarios_id = project_client.datasets.upload_file(
name="simulation-scenarios",
version="1",
file_path="./sample_data_simulation_scenarios.jsonl",
).id
# Create the evaluation
eval_object = openai_client.evals.create(
name="Multi-turn Conversation Simulation",
data_source_config=data_source_config,
testing_criteria=testing_criteria,
)
# Create a simulation run
eval_run = openai_client.evals.runs.create(
eval_id=eval_object.id,
name="conversation-simulation-run",
data_source={
"type": "azure_ai_target_completions",
"source": {
"type": "file_id",
"id": scenarios_id,
},
"target": {
"type": "azure_ai_agent",
"name": agent.name,
"version": agent.version,
},
"item_generation_params": {
"type": "conversation_gen_preview",
"model": model_deployment_name,
"num_conversations": 2,
"max_turns": 5,
"sampling_params": {
"temperature": 0.7,
"top_p": 1.0,
"max_completion_tokens": 800,
},
"data_mapping": {
"test_case_description": "test_case_description",
"id": "id",
"desired_num_turns": "desired_num_turns",
},
},
},
extra_body={"evaluation_level": "conversation"},
)
To poll for completion and interpret results, see Get results.
Get results
After an evaluation run completes, retrieve the scored results and review them in the portal or programmatically.
Poll for results
Evaluation runs are asynchronous. Poll the run status until it completes, then retrieve the results:
import time
from pprint import pprint
while True:
run = openai_client.evals.runs.retrieve(
run_id=eval_run.id, eval_id=eval_object.id
)
if run.status in ("completed", "failed"):
break
time.sleep(5)
print("Waiting for eval run to complete...")
# Retrieve results
output_items = list(
openai_client.evals.runs.output_items.list(
run_id=run.id, eval_id=eval_object.id
)
)
pprint(output_items)
print(f"Report URL: {run.report_url}")
Interpret results
For a single data example, all evaluators output the following schema:
- Label: a binary "pass" or "fail" label, similar to a unit test's output. Use this result to facilitate comparisons across evaluators.
- Score: a score from the natural scale of each evaluator. Some evaluators use a fine-grained rubric, scoring on a 5-point scale (quality evaluators) or a 7-point scale (content safety evaluators). Others, like textual similarity evaluators, use F1 scores, which are floats between 0 and 1. Any nonbinary "score" is binarized to "pass" or "fail" in the "label" field based on the "threshold".
- Threshold: any nonbinary scores are binarized to "pass" or "fail" based on a default threshold, which the user can override in the SDK experience.
- Reason: To improve intelligibility, all LLM-judge evaluators also output a reasoning field to explain why a certain score is given.
- Details: (optional) For some evaluators, such as tool_call_accuracy, there might be a "details" field or flags that contain additional information to help users debug their applications.
Example output (single item)
{
"type": "azure_ai_evaluator",
"name": "Coherence",
"metric": "coherence",
"score": 4.0,
"label": "pass",
"reason": "The response is well-structured and logically organized, presenting information in a clear and coherent manner.",
"threshold": 3,
"passed": true
}
Example output (aggregate)
For aggregate results over multiple data examples (a dataset), the average rate of the examples with a "pass" forms the passing rate for that dataset.
{
"eval_id": "eval_abc123",
"run_id": "run_xyz789",
"status": "completed",
"result_counts": {
"passed": 85,
"failed": 15,
"total": 100
},
"per_testing_criteria_results": [
{
"name": "coherence",
"passed": 92,
"failed": 8,
"pass_rate": 0.92
},
{
"name": "relevance",
"passed": 78,
"failed": 22,
"pass_rate": 0.78
}
]
}
Troubleshooting
Job running for a long time
Your evaluation job might remain in the Running state for an extended period. This condition typically occurs when the Azure OpenAI model deployment doesn't have enough capacity, causing the service to retry requests.
Resolution:
- Cancel the current evaluation job by using
openai_client.evals.runs.cancel(run_id, eval_id=eval_id). - Increase the model capacity in the Azure portal.
- Run the evaluation again.
Authentication errors
If you receive a 401 Unauthorized or 403 Forbidden error, verify that:
- Your
DefaultAzureCredentialis configured correctly. If you're using Azure CLI, runaz login. - Your account has the Foundry User role on the Foundry project.
- The project endpoint URL is correct and includes both the account and project names.
Data format errors
If the evaluation fails with a schema or data mapping error:
- Verify your JSONL file has one valid JSON object per line.
- Confirm that field names in
data_mappingmatch the field names in your JSONL file exactly (case-sensitive). - Check that
item_schemaproperties match the fields in your dataset.
Rate limit errors
Tenant, subscription, and project levels rate-limit evaluation run creations. If you receive a 429 Too Many Requests response:
- Check the
retry-afterheader in the response for the recommended wait time. - Review the response body for rate limit details.
- Use exponential backoff when retrying failed requests.
If an evaluation job fails with a 429 error during execution:
- Reduce the size of your evaluation dataset or split it into smaller batches.
- Increase the tokens-per-minute (TPM) quota for your model deployment in the Azure portal.
Agent evaluator tool errors
If an agent evaluator returns an error for unsupported tools:
- Check the supported tools for agent evaluators.
- As a workaround, wrap unsupported tools as user-defined function tools so the evaluator can assess them.