How to Evaluate Agent in AI Foundry

Question


Parul Paul 65 Reputation points

While exploring Azure AI Foundry for Agent Evaluation, I created a synthetic dataset and noticed two fields: query and test_case_description.

I would like to understand the following:

What is the difference between query and test_case_description?
When I generated the synthetic dataset through Synthetic Data Generation, the structure seemed different, so I am unsure how these two fields are intended to be used.

How does Automatic Evaluation work with an Agent? Specifically, how are the queries executed and evaluated during the process?

Since RAG functionality is already integrated in Azure AI Foundry, why are metrics such as Similarity, Response Completeness, and Retrieval evaluation not available during Agent evaluation?

Could you please help clarify these points?

Karnam Venkata Rajeswari 3,070 Reputation points Microsoft External Staff Moderator

2026-03-16T08:52:38.93+00:00

Hello Parul Paul

Did you get any chance to review the above response?

Do let me know if you have any further queries.

Thank you
Karnam Venkata Rajeswari 3,070 Reputation points Microsoft External Staff Moderator

2026-03-17T05:45:43.6933333+00:00

Hello Parul Paul,

Just checking in to see whether you have had a chance to look at my response to your question.

Looking forward to your response and appreciate your time on this.

Thank you!
Parul Paul 65 Reputation points

2026-03-17T05:47:16.8633333+00:00

Hi Karnam Venkata Rajeswari, thanks for your response.

I would also like to know about the human evaluation available in AI Foundry for agents. How does it work? Could you please help me with this too?
Parul Paul 65 Reputation points

2026-03-18T04:45:34.4966667+00:00

Hi Karnam Venkata Rajeswari,

While evaluating the agent, the status shows as partial. Upon investigation, it appears that the web search tool is not being triggered correctly, meaning the appropriate tool is not being selected. How can we properly configure tool usage for evaluation, and what is the recommended standard prompt for running automatic evaluations?

Answer 1

Hello Parul Paul,

Welcome to Microsoft Q&A, and thank you for reaching out.

The two fields, query and test_case_description, serve different purposes in Agent Evaluation:

query

This is the actual user input that is sent to the Agent during evaluation.
It is the prompt that triggers the Agent’s full execution flow (tools, RAG, reasoning, and response).
During automatic evaluation, only the query field is executed.

test_case_description

This is metadata for humans, not an executable input.
It is intended to describe the scenario, intent, or context of the test case (for example, what behavior is being validated).
It does not affect how the Agent runs or how the response is generated.

To summarize, query is the actual prompt that is sent to the agent during an evaluation run, while test_case_description is just metadata: a human-readable summary of the scenario or intent behind that query. It helps a human reviewer know what is being tested, but it is not sent to the agent during an automatic evaluation.
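As a minimal illustration of that split, the sketch below builds two dataset rows and pulls out only what would be sent to the agent. The row contents and the helper name `inputs_for_agent` are made up for illustration; only the field names query and test_case_description come from the dataset discussed here.

```python
import json

# Two illustrative dataset rows in JSONL form (the contents are invented).
jsonl_rows = """\
{"query": "What is the refund policy for annual plans?", "test_case_description": "Checks that the agent retrieves policy details"}
{"query": "Find the current weather in Pune", "test_case_description": "Checks that the agent selects the web search tool"}
"""

def inputs_for_agent(jsonl_text):
    """Return only the `query` values; descriptions stay behind as reviewer metadata."""
    rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    return [row["query"] for row in rows]

queries = inputs_for_agent(jsonl_rows)
print(queries)
```

The descriptions never reach the list of agent inputs; they remain in the dataset for whoever reviews the results.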

When using Synthetic Data Generation, the dataset structure may look different because it is optimized for coverage and variety, not execution semantics. Internally, Foundry still maps the generated content into executable query inputs and descriptive metadata.

Automatic Agent Evaluation follows this flow:

Each query is sent to the Agent exactly like a real user request.
The Agent executes its full pipeline:

Instruction following
Tool invocation
RAG (if configured)
Final response generation

The generated response is then evaluated using agent‑level metrics, such as:

Goal completion
Instruction adherence
Response quality and coherence

Under the hood , Foundry’s evaluation service reads each row’s query, fires it off to the agent (including any tool calls you’ve configured), captures the full response (and tool trace) and then runs your selected Agent Evaluators (Intent Resolution, Task Adherence, Tool Call Accuracy, etc.) against that query+response pair.
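The flow described above can be sketched in plain Python. This is only a conceptual outline, not Foundry's implementation: `stub_agent`, `AgentRun`, and the two toy evaluators are hypothetical stand-ins, and the real service uses its own agent runtime and AI-assisted evaluators.

```python
from dataclasses import dataclass, field

@dataclass
class AgentRun:
    """What this sketch captures from one agent call (hypothetical shape)."""
    output_text: str
    tool_calls: list = field(default_factory=list)

def stub_agent(query):
    """Hypothetical stand-in for a real agent; 'calls' a fake tool for weather queries."""
    if "weather" in query.lower():
        return AgentRun("It is sunny.", tool_calls=["web_search"])
    return AgentRun("I can help with that.")

def used_a_tool(query, run):
    """Toy evaluator: passes when the agent made at least one tool call."""
    return len(run.tool_calls) > 0

def non_empty_response(query, run):
    """Toy evaluator: passes when the agent returned some text."""
    return bool(run.output_text.strip())

def evaluate(rows, agent, evaluators):
    """For each row: send `query` to the agent, capture the run, apply every evaluator."""
    results = []
    for row in rows:
        run = agent(row["query"])  # only `query` is sent, never the description
        scores = {name: fn(row["query"], run) for name, fn in evaluators.items()}
        results.append({"query": row["query"], "scores": scores})
    return results

rows = [
    {"query": "What is the weather in Pune?", "test_case_description": "tool selection"},
    {"query": "Say hello", "test_case_description": "plain reply"},
]
results = evaluate(
    rows,
    stub_agent,
    {"used_a_tool": used_a_tool, "non_empty_response": non_empty_response},
)
for result in results:
    print(result)
```

Each evaluator sees only the query and the captured run, which mirrors the query+response pairing described above.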

Please note that the evaluation framework treats the Agent as a black-box system and evaluates end‑to‑end behaviour instead of individual internal steps.

Although RAG is integrated into Azure AI Foundry, agent evaluation works at a higher level. Metrics like similarity, response completeness, and retrieval quality are designed for prompt‑response evaluation and standalone RAG pipelines.

In an agent, retrieval is only one possible step. The Agent may reason, branch, or use tools beyond RAG. Because of this, Foundry does not expose internal retrieval signals during Agent Evaluation.

Please consider using

RAG Evaluation when you want retrieval‑focused metrics
Agent Evaluation when you want to validate overall task success

References

· Run evaluations from the Microsoft Foundry portal - Microsoft Foundry | Microsoft Learn

· Human Evaluation for Microsoft Foundry Agents - Microsoft Foundry | Microsoft Learn

· Agent Evaluators for Generative AI - Microsoft Foundry | Microsoft Learn


Please let me know if you have any questions.

Thank you!


Answer 2

AI answer

query vs test_case_description

In Azure AI Foundry evaluations, the important field for driving the agent or model is query. This is the input that is actually sent to the model/agent during evaluation.

When using synthetic data generation (either from the studio UI or via the azure_ai_synthetic_data_gen_preview scenario in cloud evaluation), the service generates synthetic test queries and stores them in a dataset. In that dataset:

query
- Contains the actual test input that will be sent to the target (model or agent).
- In cloud evaluation, this is the field referenced as {{item.query}} in data mappings and message templates.
- For synthetic data evaluation, the documentation explicitly states that the generator produces queries in {{item.query}}.
test_case_description
- Describes the intent or scenario of the test case in natural language (for example, what the query is supposed to test or what behavior is expected).
- It is metadata for humans or for custom evaluators, not the field that is sent to the agent by default.
- Built‑in evaluators and targets use query as the primary input; test_case_description is only used if explicitly mapped in a custom schema or custom evaluator.

So, for built‑in automatic evaluation flows, treat query as the operative input and test_case_description as descriptive context about that test.

How automatic evaluation works with an agent

Automatic evaluation with an agent in Azure AI Foundry follows the same general pattern whether using a pre‑existing dataset or synthetic data:

Define the evaluation and data schema
- An evaluation defines the test data schema and testing criteria (evaluators).
- For agent evaluation, the schema typically includes at least a query field:
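```python
# Schema for each dataset row: an object with a required string field `query`.
data_source_config = {
    "type": "custom",
    "item_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
        },
        "required": ["query"],
    },
    # Lets evaluator mappings reference the agent's runtime outputs ({{sample.*}}).
    "include_sample_schema": True,
}
```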
- include_sample_schema: True ensures that evaluator mappings can reference the agent’s runtime outputs via {{sample.*}}.
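Before uploading a dataset, it can help to check rows against that schema locally. The following is a hand-rolled sketch that only understands `required` and string-typed properties; the helper name `check_row` is hypothetical, and this is not how the service itself validates data.

```python
# Same item schema as in the configuration above.
item_schema = {
    "type": "object",
    "properties": {"query": {"type": "string"}},
    "required": ["query"],
}

def check_row(row, schema):
    """Return a list of problems; an empty list means the row fits this simple schema."""
    problems = []
    for name in schema.get("required", []):
        if name not in row:
            problems.append(f"missing required field: {name}")
    for name, spec in schema.get("properties", {}).items():
        if name in row and spec.get("type") == "string" and not isinstance(row[name], str):
            problems.append(f"field {name} should be a string")
    return problems

print(check_row({"query": "Summarize the onboarding guide"}, item_schema))
print(check_row({"test_case_description": "no query here"}, item_schema))
```

A row that carries only test_case_description fails the check, because the schema requires query.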
Choose the data source scenario
- For agent target evaluation with existing data, use azure_ai_target_completions.
- For synthetic data evaluation (preview), use azure_ai_synthetic_data_gen_preview with parameters like samples_count, model_deployment_name, and prompt. The service:
  1. Generates synthetic queries into {{item.query}}.
  2. Sends each query to the target (model or agent).
  3. Evaluates the responses.
  4. Stores the generated queries as a dataset for reuse.
Send queries to the agent at runtime
- For agent target evaluation, the input_messages template uses {{item.query}} to construct the message(s) sent to the agent.
- The target is configured as an azure_ai_agent:
  - The service sends each query from the dataset to the agent.
  - The agent executes its full logic (including tools, RAG, etc.) and returns a response.
Capture outputs for evaluators
- During the run, the service captures:
  - {{sample.output_text}}: the agent’s plain text reply.
  - {{sample.output_items}}: the full structured response, including tool calls and tool results.
- Evaluators use these fields according to their data mappings:
  - {{item.field}} for input data (for example, {{item.query}}).
  - {{sample.output_text}} for text‑based metrics.
  - {{sample.output_items}} for tool‑aware or context‑aware metrics.
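To make the placeholder idea concrete, here is a small sketch of how `{{item.*}}` and `{{sample.*}}` references in a mapping could be resolved into evaluator inputs. The `resolve` helper and the plain string substitution are assumptions for illustration; the thread does not describe how the service resolves mappings internally, and structured values such as output_items would not simply be turned into strings.

```python
import re

# Matches {{item.name}} or {{sample.name}}, allowing optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(item|sample)\.(\w+)\s*\}\}")

def resolve(template, item, sample):
    """Replace {{item.x}} / {{sample.y}} with values from the two dictionaries."""
    sources = {"item": item, "sample": sample}

    def lookup(match):
        scope, key = match.group(1), match.group(2)
        return str(sources[scope][key])

    return PLACEHOLDER.sub(lookup, template)

item = {"query": "What is the capital of France?"}   # a dataset row
sample = {"output_text": "Paris."}                    # captured agent output

data_mapping = {"query": "{{item.query}}", "response": "{{sample.output_text}}"}
evaluator_inputs = {name: resolve(tpl, item, sample) for name, tpl in data_mapping.items()}
print(evaluator_inputs)
```

The item side comes from the dataset and the sample side from the run, so an evaluator that needs both query and response draws from two different sources.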
Apply evaluators and score results
- Agent evaluators (for example, Task Completion, Task Adherence, Intent Resolution, Tool Call Accuracy, Tool Selection, etc.) are configured with:
  - Required inputs (for example, query, response, tool_definitions).
  - A deployment_name for the judge model (AI‑assisted evaluators).
- Each evaluator returns a score (often on a 1–5 Likert scale) plus label, threshold, and pass/fail status.
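A result of that shape can be modeled as follows. This is a sketch under stated assumptions: the class name `EvaluatorResult`, the rule that a score at or above the threshold passes, and the threshold value of 3 are all illustrative and may not match a given evaluator's actual behavior.

```python
from dataclasses import dataclass

@dataclass
class EvaluatorResult:
    """Illustrative result record: a 1-5 score compared against a threshold."""
    name: str
    score: int       # assumed to be on a 1-5 Likert scale
    threshold: int   # assumed cut-off; real defaults may differ

    @property
    def passed(self):
        # Assumption: a score at or above the threshold counts as a pass.
        return self.score >= self.threshold

    @property
    def label(self):
        return "pass" if self.passed else "fail"

results = [
    EvaluatorResult("Task Adherence", score=4, threshold=3),
    EvaluatorResult("Tool Call Accuracy", score=2, threshold=3),
]
pass_rate = sum(r.passed for r in results) / len(results)
for r in results:
    print(r.name, r.score, r.label)
print("pass rate:", pass_rate)
```

A low Tool Call Accuracy score alongside passing text-quality scores is the kind of pattern worth checking when a run reports only partial success.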

Why RAG metrics (Similarity, Response Completeness, Retrieval) are not shown in agent evaluation

RAG‑specific evaluators are available, but they are configured based on the data schema and mappings, not automatically surfaced just because the agent uses RAG.

RAG evaluators include:

Groundedness / Groundedness Pro
Relevance
Response Completeness (preview)
Retrieval
Document Retrieval

Each requires specific inputs and mappings, for example:

Groundedness: (response, context) or (query, response).
Response Completeness: ground_truth, response.
Retrieval: query, context.
Document Retrieval: retrieval_ground_truth, retrieved_documents.

In an agent target evaluation:

The agent’s RAG behavior (retrieval, similarity search, etc.) is internal to the agent.
The evaluation service only sees what is exposed in the agent’s response (output_text / output_items) and any fields in the dataset.
To use RAG evaluators with an agent, the evaluation must:
- Provide or expose the necessary fields (for example, retrieved context, ground truth, or retrieval ground truth) in the dataset or via the agent’s structured output.
- Map those fields explicitly in the evaluator configuration.

If those fields are not present or mapped, RAG metrics such as Similarity, Response Completeness, and Retrieval will not appear as available options in the agent evaluation configuration, even though the agent itself uses RAG internally.

In other words:

RAG functionality in the agent does not automatically provide the evaluation inputs required by RAG evaluators.
To use RAG metrics with an agent, the evaluation must be set up so that the dataset and/or agent outputs expose context, ground_truth, retrieved_documents, etc., and those are wired into the RAG evaluators via data_mapping.
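The dependency between available fields and usable RAG evaluators can be sketched as a simple lookup. The required inputs below are copied from the list earlier in this answer; the function name `usable_evaluators` and the set-based check are illustrative assumptions, not part of any Foundry API.

```python
# Required inputs per RAG evaluator, as listed earlier in this answer.
# Each evaluator has one or more acceptable input combinations.
REQUIRED_INPUTS = {
    "Groundedness": [{"response", "context"}, {"query", "response"}],
    "Response Completeness": [{"ground_truth", "response"}],
    "Retrieval": [{"query", "context"}],
    "Document Retrieval": [{"retrieval_ground_truth", "retrieved_documents"}],
}

def usable_evaluators(available_fields):
    """Return evaluators for which at least one input combination is fully available."""
    available = set(available_fields)
    return [
        name
        for name, options in REQUIRED_INPUTS.items()
        if any(option <= available for option in options)
    ]

# A dataset with only `query`, plus the agent's response:
print(usable_evaluators({"query", "response"}))
# The same run once context and ground truth are also exposed and mapped:
print(usable_evaluators({"query", "response", "context", "ground_truth"}))
```

With only query and response available, Response Completeness and Retrieval have nothing to read, which matches the behavior described above.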
