Tutorial: Part 3 - Evaluate a custom chat application with the Microsoft Foundry SDK

Note

This document refers to the Microsoft Foundry (classic) portal.

🔍 View the Microsoft Foundry (new) documentation to learn about the new portal.

In this tutorial, you evaluate the chat app you built in Part 2 of the tutorial series. You assess your app's quality across multiple metrics and then iterate on improvements. In this part, you:

  • Create an evaluation dataset
  • Evaluate the chat app with Azure AI evaluators
  • Iterate and improve your app

This tutorial builds on Part 2: Build a custom chat app with the Microsoft Foundry SDK.

Prerequisites

Important

This article provides legacy support for hub-based projects. It will not work for Foundry projects. See How do I know which type of project I have?

SDK compatibility note: The code examples in this tutorial depend on the SDK versions that work with hub-based projects. If you encounter compatibility issues, consider migrating from a hub-based project to a Foundry project.

  • Complete Part 2 of the tutorial series to build the chat application.
  • Use the same hub-based project you created in Part 1.
  • Azure AI permissions: Owner or Contributor role to modify model endpoint rate limits and run evaluation jobs.
  • Make sure you complete the steps to add telemetry logging from Part 2.

Create evaluation dataset

Use the following evaluation dataset, which contains example questions and expected answers. In the next section, you pass this dataset and a target function that wraps the chat_with_products() function from Part 2 to an evaluator, so that you can assess your chat app's quality on metrics such as groundedness.

  1. Create a file named chat_eval_data.jsonl in your assets folder.

  2. Paste this dataset into the file:

    {"query": "Which tent is the most waterproof?", "truth": "The Alpine Explorer Tent has the highest rainfly waterproof rating at 3000m"}
    {"query": "Which camping table holds the most weight?", "truth": "The Adventure Dining Table has a higher weight capacity than all of the other camping tables mentioned"}
    {"query": "How much do the TrailWalker Hiking Shoes cost? ", "truth": "The Trailewalker Hiking Shoes are priced at $110"}
    {"query": "What is the proper care for trailwalker hiking shoes? ", "truth": "After each use, remove any dirt or debris by brushing or wiping the shoes with a damp cloth."}
    {"query": "What brand is TrailMaster tent? ", "truth": "OutdoorLiving"}
    {"query": "How do I carry the TrailMaster tent around? ", "truth": " Carry bag included for convenient storage and transportation"}
    {"query": "What is the floor area for Floor Area? ", "truth": "80 square feet"}
    {"query": "What is the material for TrailBlaze Hiking Pants?", "truth": "Made of high-quality nylon fabric"}
    {"query": "What color does TrailBlaze Hiking Pants come in?", "truth": "Khaki"}
    {"query": "Can the warrenty for TrailBlaze pants be transfered? ", "truth": "The warranty is non-transferable and applies only to the original purchaser of the TrailBlaze Hiking Pants. It is valid only when the product is purchased from an authorized retailer."}
    {"query": "How long are the TrailBlaze pants under warranty for? ", "truth": " The TrailBlaze Hiking Pants are backed by a 1-year limited warranty from the date of purchase."}
    {"query": "What is the material for PowerBurner Camping Stove? ", "truth": "Stainless Steel"}
    {"query": "Is France in Europe?", "truth": "Sorry, I can only queries related to outdoor/camping gear and equipment"}
    

    References: JSONL format for evaluation datasets.
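Before you use the dataset, you can optionally confirm that every line parses as valid JSON and contains the query and truth fields the evaluation relies on. The following is a minimal sketch; it assumes the file path matches the assets folder layout from Part 2, so adjust the path if yours differs.

import json
from pathlib import Path

# Adjust this path if your assets folder lives somewhere else.
dataset_path = Path("assets") / "chat_eval_data.jsonl"

with open(dataset_path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f if line.strip()]

# Every row should have the "query" and "truth" fields used by the evaluation.
missing = [row for row in rows if "query" not in row or "truth" not in row]
print(f"Loaded {len(rows)} rows, {len(missing)} rows missing fields")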

Evaluate with Azure AI evaluators

Create an evaluation script that defines a target function wrapper around your chat function, loads your dataset, runs the evaluation, and logs the results to your Foundry project.

  1. Create a file named evaluate.py in your main folder.

  2. Add the following code to import the required libraries, create a project client, and configure some settings:

    import os
    import pandas as pd
    from azure.ai.projects import AIProjectClient
    from azure.ai.projects.models import ConnectionType
    from azure.ai.evaluation import evaluate, GroundednessEvaluator
    from azure.identity import DefaultAzureCredential
    
    from chat_with_products import chat_with_products
    
    # load environment variables from the .env file at the root of this repo
    from dotenv import load_dotenv
    
    load_dotenv()
    
    # create a project client using environment variables loaded from the .env file
    project = AIProjectClient.from_connection_string(
        conn_str=os.environ["AIPROJECT_CONNECTION_STRING"], credential=DefaultAzureCredential()
    )
    
    connection = project.connections.get_default(connection_type=ConnectionType.AZURE_OPEN_AI, include_credentials=True)
    
    evaluator_model = {
        "azure_endpoint": connection.endpoint_url,
        "azure_deployment": os.environ["EVALUATION_MODEL"],
        "api_version": "2024-06-01",
        "api_key": connection.key,
    }
    
    groundedness = GroundednessEvaluator(evaluator_model)
    

    References: AIProjectClient, DefaultAzureCredential, azure-ai-evaluation.

  3. Add code to create a wrapper function that implements the evaluation interface for query and response evaluation:

    def evaluate_chat_with_products(query):
        response = chat_with_products(messages=[{"role": "user", "content": query}])
        return {"response": response["message"].content, "context": response["context"]["grounding_data"]}
    

    References: azure-ai-evaluation, evaluation target functions.

  4. Finally, add code to run the evaluation, view the results locally, and get a link to the evaluation results in Foundry portal:

    # Evaluate must be called inside of __main__, not on import
    if __name__ == "__main__":
        from config import ASSET_PATH
    
        from pprint import pprint
        from pathlib import Path
        import multiprocessing
        import contextlib
    
        # workaround for a multiprocessing issue on Linux: force the "spawn" start method
        with contextlib.suppress(RuntimeError):
            multiprocessing.set_start_method("spawn", force=True)
    
        # run evaluation with a dataset and target function, log to the project
        result = evaluate(
            data=Path(ASSET_PATH) / "chat_eval_data.jsonl",
            target=evaluate_chat_with_products,
            evaluation_name="evaluate_chat_with_products",
            evaluators={
                "groundedness": groundedness,
            },
            evaluator_config={
                "default": {
                    "query": {"${data.query}"},
                    "response": {"${target.response}"},
                    "context": {"${target.context}"},
                }
            },
            azure_ai_project=project.scope,
            output_path="./myevalresults.json",
        )
    
        tabular_result = pd.DataFrame(result.get("rows"))
    
        pprint("-----Summarized Metrics-----")
        pprint(result["metrics"])
        pprint("-----Tabular Result-----")
        pprint(tabular_result)
        pprint(f"View evaluation results in AI Studio: {result['studio_url']}")
    

    References: azure-ai-evaluation, AIProjectClient.
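The script above configures only the groundedness evaluator. If you also want the relevance and coherence metrics mentioned at the start of this tutorial, the azure-ai-evaluation package provides RelevanceEvaluator and CoherenceEvaluator, which take the same model configuration. The following sketch shows how you could extend evaluate.py; it isn't part of the tutorial script.

from azure.ai.evaluation import RelevanceEvaluator, CoherenceEvaluator

# Reuse the evaluator_model configuration defined earlier in evaluate.py.
relevance = RelevanceEvaluator(evaluator_model)
coherence = CoherenceEvaluator(evaluator_model)

# Then pass the extra evaluators alongside groundedness in the evaluate() call:
#     evaluators={
#         "groundedness": groundedness,
#         "relevance": relevance,
#         "coherence": coherence,
#     },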

Configure the evaluation model

The evaluation script calls the model many times. Consider increasing the number of tokens per minute for the evaluation model.

In Part 1 of this tutorial series, you created an .env file that specifies the name of the evaluation model, gpt-4o-mini. Try to increase the tokens per minute limit for this model, if you have available quota. If you don't have enough quota to increase the value, don't worry. The script is designed to handle limit errors.

  1. In your project in Foundry portal, select Models + endpoints.
  2. Select gpt-4o-mini.
  3. Select Edit.
  4. If you have quota, increase the Tokens per Minute Rate Limit (measured in thousands of tokens) to 30 or more.
  5. Select Save and close.

Run the evaluation script

  1. From your console, sign in to your Azure account by using the Azure CLI:

    az login
    
  2. Install the required packages:

    pip install openai
    pip install azure-ai-evaluation[remote]
    

    References: azure-ai-evaluation SDK, Evaluation SDK documentation.
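If you want to confirm that the packages installed correctly, a quick import check is enough. This step is optional and assumes both packages expose a __version__ attribute, which recent releases do:

# Optional: confirm the evaluation packages import and report their versions.
import azure.ai.evaluation
import openai

print("azure-ai-evaluation:", azure.ai.evaluation.__version__)
print("openai:", openai.__version__)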

Verify your evaluation setup

Before running the full evaluation (which takes 5–10 minutes), verify that the SDK and your project connection are working by running this quick test:

import os

from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential
from dotenv import load_dotenv

# load the connection string from the .env file at the root of this repo
load_dotenv()

# Test that you can connect to your project
project = AIProjectClient.from_connection_string(
    conn_str=os.environ["AIPROJECT_CONNECTION_STRING"], credential=DefaultAzureCredential()
)
print("Evaluation SDK is ready! You can now run evaluate.py")

If you see "Evaluation SDK is ready!", your setup is complete and you can proceed.

References: AIProjectClient, DefaultAzureCredential.
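Optionally, you can also smoke-test the evaluator model configuration on a single hand-written example before starting the full run. This sketch imports the groundedness evaluator that evaluate.py builds at module level (the evaluation itself only runs under __main__); depending on your azure-ai-evaluation version, the query argument may be optional.

# Optional: score one made-up example to confirm the evaluator model responds.
from evaluate import groundedness

sample = groundedness(
    query="Which tent is the most waterproof?",
    context="The Alpine Explorer Tent has a rainfly waterproof rating of 3000mm.",
    response="The Alpine Explorer Tent is the most waterproof tent.",
)
print(sample)  # expect a dict that includes a groundedness score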

Start the evaluation

  • Run the evaluation script:

    python evaluate.py
    

The evaluation takes 5–10 minutes to complete. You might see timeout warnings and rate-limit errors. The script handles these errors automatically and continues processing.

Interpret the evaluation output

In the console output, you see an answer for each question, followed by a table with summarized metrics. For GPT-assisted quality metrics like groundedness, scores range from 1 (worst) to 5 (best). Look for low groundedness scores to identify responses that aren't well supported by the reference documents; if you added the optional relevance evaluator, low relevance scores also flag off-topic responses.

You might see many WARNING:opentelemetry.attributes: messages and timeout errors. You can safely ignore these messages. They don't affect the evaluation results. The evaluation script is designed to handle rate-limit errors and continue processing.

The evaluation results output also includes a link to view detailed results in the Foundry portal, where you can compare evaluation runs side-by-side and track improvements over time.

====================================================
'-----Summarized Metrics-----'
{'groundedness.gpt_groundedness': 1.6666666666666667,
 'groundedness.groundedness': 1.6666666666666667}
'-----Tabular Result-----'
                                     outputs.response  ... line_number
0   Could you specify which tent you are referring...  ...           0
1   Could you please specify which camping table y...  ...           1
2   Sorry, I only can answer queries related to ou...  ...           2
3   Could you please clarify which aspects of care...  ...           3
4   Sorry, I only can answer queries related to ou...  ...           4
5   The TrailMaster X4 Tent comes with an included...  ...           5
6                                            (Failed)  ...           6
7   The TrailBlaze Hiking Pants are crafted from h...  ...           7
8   Sorry, I only can answer queries related to ou...  ...           8
9   Sorry, I only can answer queries related to ou...  ...           9
10  Sorry, I only can answer queries related to ou...  ...          10
11  The PowerBurner Camping Stove is designed with...  ...          11
12  Sorry, I only can answer queries related to ou...  ...          12

[13 rows x 8 columns]
('View evaluation results in Foundry portal: '
 'https://xxxxxxxxxxxxxxxxxxxxxxx')
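Because evaluate() also wrote the full results to myevalresults.json (the output_path set in evaluate.py), you can inspect the weakest responses programmatically. The column names below follow the inputs.* / outputs.* convention visible in the tabular output, but they can vary slightly between SDK versions, so print the columns first if the names don't match.

import json
import pandas as pd

# Load the results file that evaluate() wrote to output_path.
with open("myevalresults.json", encoding="utf-8") as f:
    results = json.load(f)

rows = pd.DataFrame(results["rows"])
print(rows.columns.tolist())  # check the exact column names for your SDK version

# Surface the least-grounded responses first (column name may differ in your version).
score_col = "outputs.groundedness.groundedness"
if score_col in rows.columns:
    print(rows.sort_values(score_col)[["inputs.query", "outputs.response", score_col]].head())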

Iterate and improve

The evaluation results reveal that responses often aren't well-grounded in the reference documents. To improve groundedness, modify your system prompt in the assets/grounded_chat.prompty file to encourage the model to use the reference documents more directly.

Current prompt (problematic):

If the question is not related to outdoor/camping gear and clothing, just say 'Sorry, I only can answer queries related to outdoor/camping gear and clothing. So, how can I help?'
If the question is related to outdoor/camping gear and clothing but vague, ask clarifying questions.

Improved prompt:

If the question is related to outdoor/camping gear and clothing, answer based on the reference documents provided.
If you cannot find information in the reference documents, say: 'I don't have information about that specific topic. Let me help with related products or try a different question.'
For vague questions, ask clarifying questions to better assist.

After updating the prompt:

  1. Save the file.

  2. Run the evaluation script again:

    python evaluate.py
    
  3. Compare the new evaluation results to the previous run. You should see improved groundedness scores.

Try additional modifications like:

  • Changing the system prompt to focus on accuracy over completeness
  • Testing with a different model (for example, gpt-4-turbo if available)
  • Adjusting the context retrieval to return more relevant documents

Each iteration helps you understand which changes improve specific metrics.
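One lightweight way to compare iterations is to give each run its own output_path (the file names below are only examples) and then diff the summarized metrics:

import json

# Example file names; point these at the output_path values you used for each run.
with open("myevalresults_before.json", encoding="utf-8") as f:
    before = json.load(f)["metrics"]
with open("myevalresults_after.json", encoding="utf-8") as f:
    after = json.load(f)["metrics"]

for metric in sorted(set(before) | set(after)):
    print(f"{metric}: {before.get(metric)} -> {after.get(metric)}")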

Clean up resources

To avoid incurring unnecessary Azure costs, delete the resources you created in this tutorial if they're no longer needed. To manage resources, you can use the Azure portal.