
How to Evaluate an Agent in AI Foundry

Parul Paul 65 Reputation points
2026-03-11T11:18:15.38+00:00

While exploring Azure AI Foundry for Agent Evaluation, I created a synthetic dataset and noticed two fields: query and test_case_description.

I would like to understand the following:

What is the difference between query and test_case_description ?
When I generated the synthetic dataset with Synthetic Data Generation, the structure seemed different, so I am unsure how these two fields are intended to be used.

How does Automatic Evaluation work with an Agent? Specifically, how are the queries executed and evaluated during the process?

Since RAG functionality is already integrated in Azure AI Foundry, why are metrics such as Similarity, Response Completeness, and Retrieval evaluation not available during Agent evaluation?

Could you please help clarify these points?

Foundry Tools

Foundry Tools, formerly known as Azure AI Services or Azure Cognitive Services, is a unified collection of prebuilt AI capabilities within the Microsoft Foundry platform.


2 answers

  1. Karnam Venkata Rajeswari 3,070 Reputation points Microsoft External Staff Moderator
    2026-03-13T18:18:22.3066667+00:00

    Hello Parul Paul,

    Welcome to Microsoft Q&A, and thank you for reaching out.

    The difference between query and test_case_description is described below. They serve different purposes in Agent Evaluation.

    query

    • This is the actual user input that is sent to the Agent during evaluation.
    • It is the prompt that triggers the Agent’s full execution flow (tools, RAG, reasoning, and response).
    • During automatic evaluation, only the query field is executed.

    test_case_description

    • This is metadata for humans, not an executable input.
    • It is intended to describe the scenario, intent, or context of the test case (for example, what behavior is being validated).
    • It does not affect how the Agent runs or how the response is generated.

    To summarize, query is the actual prompt that is sent to the agent during an evaluation run, while test_case_description is just metadata: a human-readable summary of the scenario or intent behind that query. It helps a human reviewer know what is being tested, but it is not sent to the agent when an automatic evaluation is run.
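
To make the distinction concrete, here is a minimal sketch of one dataset row and of the part of it that would reach the agent. The row contents and the helper `build_agent_input` are hypothetical and are not part of any Foundry SDK; only the two field names come from the dataset.

```python
import json

# One hypothetical row of an evaluation dataset, serialized as a JSONL line.
row_line = json.dumps({
    "query": "What is the refund policy for annual plans?",
    "test_case_description": "Checks that the agent cites the refund policy document.",
})


def build_agent_input(line: str) -> str:
    """Return only the text that is sent to the agent (hypothetical helper)."""
    row = json.loads(line)
    # test_case_description is deliberately ignored: it is reviewer-facing metadata.
    return row["query"]


agent_input = build_agent_input(row_line)
print(agent_input)
```

The description stays in the dataset for reviewers, but nothing in this path forwards it to the agent.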

    When using Synthetic Data Generation, the dataset structure may look different because it is optimized for coverage and variety, not execution semantics. Internally, Foundry still maps the generated content into executable query inputs and descriptive metadata.

    Automatic Agent Evaluation follows this flow:

    1. Each query is sent to the Agent exactly like a real user request.
    2. The Agent executes its full pipeline:
      • Instruction following
      • Tool invocation
      • RAG (if configured)
      • Final response generation
    3. The generated response is then evaluated using agent‑level metrics, such as:
      • Goal completion
      • Instruction adherence
      • Response quality and coherence

    Under the hood, Foundry’s evaluation service reads each row’s query, sends it to the agent (including any tool calls you’ve configured), captures the full response (and tool trace), and then runs your selected Agent Evaluators (Intent Resolution, Task Adherence, Tool Call Accuracy, etc.) against that query+response pair.

    Please note that the evaluation framework treats the Agent as a black-box system and evaluates end‑to‑end behaviour instead of individual internal steps.
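
The flow described above can be sketched as a small loop. This is a local simulation under stated assumptions, not the Foundry evaluation service: `stub_agent`, `stub_intent_resolution`, and `run_evaluation` are hypothetical stand-ins, and a real evaluator would call a judge model rather than a string check.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentResult:
    response: str
    tool_trace: list[str] = field(default_factory=list)


def stub_agent(query: str) -> AgentResult:
    """Stand-in for a deployed agent; a real run would call the agent service."""
    return AgentResult(response=f"Answer to: {query}", tool_trace=["search_docs"])


def stub_intent_resolution(query: str, result: AgentResult) -> int:
    """Placeholder scorer; the real evaluator relies on a judge model."""
    return 5 if query in result.response else 1


def run_evaluation(
    rows: list[dict],
    agent: Callable[[str], AgentResult],
    evaluators: dict[str, Callable[[str, AgentResult], int]],
) -> list[dict]:
    results = []
    for row in rows:
        # Only the query is executed; the agent is treated as a black box.
        result = agent(row["query"])
        scores = {name: fn(row["query"], result) for name, fn in evaluators.items()}
        results.append({
            "query": row["query"],
            "response": result.response,
            "tool_trace": result.tool_trace,
            "scores": scores,
        })
    return results


rows = [{"query": "Reset my password", "test_case_description": "Account recovery path"}]
report = run_evaluation(rows, stub_agent, {"intent_resolution": stub_intent_resolution})
print(report)
```

Each evaluator sees only the query and what the agent returned, which is why internal steps are not scored individually.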

    Although RAG is integrated into Azure AI Foundry, agent evaluation works at a higher level. Metrics like similarity, response completeness, and retrieval quality are designed for prompt‑response evaluation and standalone RAG pipelines.

    In an agent, retrieval is only one possible step. The Agent may reason, branch, or use tools beyond RAG. Because of this, Foundry does not expose internal retrieval signals during Agent Evaluation.

    Please consider using

    • RAG Evaluation when you want retrieval‑focused metrics
    • Agent Evaluation when you want to validate overall task success

    References

    • Run evaluations from the Microsoft Foundry portal - Microsoft Foundry | Microsoft Learn
    • Human Evaluation for Microsoft Foundry Agents - Microsoft Foundry | Microsoft Learn
    • Agent Evaluators for Generative AI - Microsoft Foundry | Microsoft Learn


    Please let me know if you have any questions.

    Thank you!

    Please 'Upvote' (thumbs-up) and 'Accept' the answer if the reply was helpful. This will benefit other community members who face the same issue.


  2. AI answer

    2026-03-11T11:18:51.5866667+00:00

    query vs test_case_description

    In Azure AI Foundry evaluations, the important field for driving the agent or model is query. This is the input that is actually sent to the model/agent during evaluation.

    When using synthetic data generation (either from the studio UI or via the azure_ai_synthetic_data_gen_preview scenario in cloud evaluation), the service generates synthetic test queries and stores them in a dataset. In that dataset:

    • query
      • Contains the actual test input that will be sent to the target (model or agent).
      • In cloud evaluation, this is the field referenced as {{item.query}} in data mappings and message templates.
      • For synthetic data evaluation, the documentation explicitly states that the generator produces queries in {{item.query}}.
    • test_case_description
      • Describes the intent or scenario of the test case in natural language (for example, what the query is supposed to test or what behavior is expected).
      • It is metadata for humans or for custom evaluators, not the field that is sent to the agent by default.
      • Built‑in evaluators and targets use query as the primary input; test_case_description is only used if explicitly mapped in a custom schema or custom evaluator.

    So, for built‑in automatic evaluation flows, treat query as the operative input and test_case_description as descriptive context about that test.
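
As an illustration of how a `{{item.query}}` reference resolves against a dataset row, here is a minimal sketch. The `render` function and its regular expression are assumptions made for this example; the service's actual templating engine is not shown in this thread and may behave differently.

```python
import re

# Matches {{item.<field>}}, tolerating spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*item\.(\w+)\s*\}\}")


def render(template: str, item: dict) -> str:
    """Replace each {{item.<field>}} with the matching value from the dataset row."""
    return PLACEHOLDER.sub(lambda m: str(item[m.group(1)]), template)


item = {
    "query": "List open invoices for customer 42",
    "test_case_description": "Verifies the invoice lookup tool is selected.",
}

# Default-style template: only the query is referenced.
print(render("{{item.query}}", item))

# A custom evaluator prompt could opt in to the description explicitly.
print(render("Scenario: {{item.test_case_description}}\nInput: {{item.query}}", item))
```

The description only appears in the second output because that template names it; nothing pulls it in implicitly.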

    How automatic evaluation works with an agent

    Automatic evaluation with an agent in Azure AI Foundry follows the same general pattern whether using a pre‑existing dataset or synthetic data:

    1. Define the evaluation and data schema
      • An evaluation defines the test data schema and testing criteria (evaluators).
      • For agent evaluation, the schema typically includes at least a query field:
             data_source_config = {
                 "type": "custom",
                 "item_schema": {
                     "type": "object",
                     "properties": {
                         "query": {"type": "string"},
                     },
                     "required": ["query"],
                 },
                 "include_sample_schema": True,
             }
        
      • include_sample_schema: True ensures that evaluator mappings can reference the agent’s runtime outputs via {{sample.*}}.
    2. Choose the data source scenario
      • For agent target evaluation with existing data, use azure_ai_target_completions.
      • For synthetic data evaluation (preview), use azure_ai_synthetic_data_gen_preview with parameters like samples_count, model_deployment_name, and prompt. The service:
        1. Generates synthetic queries into {{item.query}}.
        2. Sends each query to the target (model or agent).
        3. Evaluates the responses.
        4. Stores the generated queries as a dataset for reuse.
    3. Send queries to the agent at runtime
      • For agent target evaluation, the input_messages template uses {{item.query}} to construct the message(s) sent to the agent.
      • The target is configured as an azure_ai_agent:
        • The service sends each query from the dataset to the agent.
        • The agent executes its full logic (including tools, RAG, etc.) and returns a response.
    4. Capture outputs for evaluators
      • During the run, the service captures:
        • {{sample.output_text}}: the agent’s plain text reply.
        • {{sample.output_items}}: the full structured response, including tool calls and tool results.
      • Evaluators use these fields according to their data mappings:
        • {{item.field}} for input data (for example, {{item.query}}).
        • {{sample.output_text}} for text‑based metrics.
        • {{sample.output_items}} for tool‑aware or context‑aware metrics.
    5. Apply evaluators and score results
      • Agent evaluators (for example, Task Completion, Task Adherence, Intent Resolution, Tool Call Accuracy, Tool Selection, etc.) are configured with:
        • Required inputs (for example, query, response, tool_definitions).
        • A deployment_name for the judge model (AI‑assisted evaluators).
      • Each evaluator returns a score (often on a 1–5 Likert scale) plus label, threshold, and pass/fail status.
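
Steps 4 and 5 can be sketched locally as follows. This is an illustration, not the service implementation: the `resolve` and `to_result` helpers, the shape of the `output_items` entries, the threshold of 3, and the rule that a score at or above the threshold passes are all assumptions made for the example.

```python
def resolve(path: str, item: dict, sample: dict):
    """Resolve a '{{item.x}}' or '{{sample.x}}' reference to a concrete value."""
    scope, _, key = path.strip("{} ").partition(".")
    source = {"item": item, "sample": sample}[scope]
    return source[key]


item = {"query": "Book a table for two at 7pm"}

# Hypothetical captured outputs for one run.
sample = {
    "output_text": "Your table for two is booked for 7pm.",
    "output_items": [
        {"type": "tool_call", "name": "book_table"},
        {"type": "message", "text": "Your table for two is booked for 7pm."},
    ],
}

# Hypothetical mapping for a text-based evaluator.
data_mapping = {"query": "{{item.query}}", "response": "{{sample.output_text}}"}
inputs = {name: resolve(ref, item, sample) for name, ref in data_mapping.items()}


def to_result(score: int, threshold: int = 3) -> dict:
    """Turn a 1-5 score into a result record (threshold and pass rule are assumed)."""
    return {
        "score": score,
        "threshold": threshold,
        "label": "pass" if score >= threshold else "fail",
    }


print(inputs)
print(to_result(4))
```

A tool-aware evaluator would map `{{sample.output_items}}` instead of `{{sample.output_text}}` in the same way.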

    Why RAG metrics (Similarity, Response Completeness, Retrieval) are not shown in agent evaluation

    RAG‑specific evaluators are available, but they are configured based on the data schema and mappings, not automatically surfaced just because the agent uses RAG.

    RAG evaluators include:

    • Groundedness / Groundedness Pro
    • Relevance
    • Response Completeness (preview)
    • Retrieval
    • Document Retrieval

    Each requires specific inputs and mappings, for example:

    • Groundedness: (response, context) or (query, response).
    • Response Completeness: ground_truth, response.
    • Retrieval: query, context.
    • Document Retrieval: retrieval_ground_truth, retrieved_documents.

    In an agent target evaluation:

    • The agent’s RAG behavior (retrieval, similarity search, etc.) is internal to the agent.
    • The evaluation service only sees what is exposed in the agent’s response (output_text / output_items) and any fields in the dataset.
    • To use RAG evaluators with an agent, the evaluation must:
      • Provide or expose the necessary fields (for example, retrieved context, ground truth, or retrieval ground truth) in the dataset or via the agent’s structured output.
      • Map those fields explicitly in the evaluator configuration.

    If those fields are not present or mapped, RAG metrics such as Similarity, Response Completeness, and Retrieval will not appear as available options in the agent evaluation configuration, even though the agent itself uses RAG internally.

    In other words:

    • RAG functionality in the agent does not automatically provide the evaluation inputs required by RAG evaluators.
    • To use RAG metrics with an agent, the evaluation must be set up so that the dataset and/or agent outputs expose context, ground_truth, retrieved_documents, etc., and those are wired into the RAG evaluators via data_mapping.
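
The dependency between exposed fields and usable RAG evaluators can be sketched as a simple check. The required inputs are the ones listed above; the snake_case evaluator keys and the `configurable_evaluators` helper are hypothetical names used only for this illustration.

```python
# Required inputs per evaluator, as listed above. Groundedness accepts either pair.
REQUIRED_INPUTS = {
    "groundedness": [{"response", "context"}, {"query", "response"}],
    "response_completeness": [{"ground_truth", "response"}],
    "retrieval": [{"query", "context"}],
    "document_retrieval": [{"retrieval_ground_truth", "retrieved_documents"}],
}


def configurable_evaluators(available_fields: set[str]) -> list[str]:
    """Return evaluators whose required inputs are all present in the given fields."""
    return sorted(
        name
        for name, options in REQUIRED_INPUTS.items()
        if any(option <= available_fields for option in options)
    )


# Fields available when only the query and the agent's reply are exposed.
agent_only = {"query", "response"}

# Fields available once the dataset or agent output also exposes context and ground truth.
with_context = agent_only | {"context", "ground_truth"}

print(configurable_evaluators(agent_only))
print(configurable_evaluators(with_context))
```

With only query and response available, Retrieval and Response Completeness have nothing to bind to, which matches the behaviour described in the question.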


    AI-generated content may be incorrect. Read our transparency notes for more information.
