Evaluate your Generative AI application locally with the Azure AI Evaluation SDK

2025-05-19

Important

Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

To thoroughly assess the performance of your generative AI application when applied to a substantial dataset, you can evaluate a Generative AI application in your development environment with the Azure AI evaluation SDK. Given either a test dataset or a target, your generative AI application generations are quantitatively measured with both mathematical based metrics and AI-assisted quality and safety evaluators. Built-in or custom evaluators can provide you with comprehensive insights into the application's capabilities and limitations.

In this article, you learn how to run evaluators on a single row of data, a larger test dataset on an application target with built-in evaluators using the Azure AI evaluation SDK locally, then track the results and evaluation logs in Azure AI project.

Getting started

First install the evaluators package from Azure AI evaluation SDK:

pip install azure-ai-evaluation

Note

For more detailed information, see the API Reference Documentation for Azure AI Evaluation SDK.

Built-in evaluators

Category	Evaluators
General purpose	`CoherenceEvaluator`, `FluencyEvaluator`, `QAEvaluator`
Textual similarity	`SimilarityEvaluator`, `F1ScoreEvaluator`, `BleuScoreEvaluator`, `GleuScoreEvaluator`, `RougeScoreEvaluator`, `MeteorScoreEvaluator`
Retrieval-Augmented Generation (RAG)	`RetrievalEvaluator`, `DocumentRetrievalEvaluator`, `GroundednessEvaluator`, `GroundednessProEvaluator`, `RelevanceEvaluator`, `ResponseCompletenessEvaluator`
Risk and safety	`ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`, `IndirectAttackEvaluator`, `ProtectedMaterialEvaluator`, `UngroundedAttributesEvaluator`, `CodeVulnerabilityEvaluator`, `ContentSafetyEvaluator`
Agentic	`IntentResolutionEvaluator`, `ToolCallAccuracyEvaluator`, `TaskAdherenceEvaluator`
Azure OpenAI	`AzureOpenAILabelGrader`, `AzureOpenAIStringCheckGrader`, `AzureOpenAITextSimilarityGrader`, `AzureOpenAIGrader`

Built-in quality and safety metrics take in query and response pairs, along with additional information for specific evaluators.

Data requirements for built-in evaluators

Built-in evaluators can accept either query and response pairs or a list of conversations in jsonl format or both.

Conversation and single-turn support for text	Conversation and single-turn support for text and image	Single-turn support for text only
`GroundednessEvaluator`, `GroundednessProEvaluator`, `RetrievalEvaluator`, `DocumentRetrievalEvaluator`,`RelevanceEvaluator`, `CoherenceEvaluator`, `FluencyEvaluator`, `ResponseCompletenessEvaluator`, `IndirectAttackEvaluator`, `AzureOpenAILabelGrader`, `AzureOpenAIStringCheckGrader`, `AzureOpenAITextSimilarityGrader`, `AzureOpenAIGrader`	`ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`, `ProtectedMaterialEvaluator`, `ContentSafetyEvaluator`	`UngroundedAttributesEvaluator`, `CodeVulnerabilityEvaluator`, `ResponseCompletenessEvaluator`, `SimilarityEvaluator`, `F1ScoreEvaluator`, `RougeScoreEvaluator`, `GleuScoreEvaluator`, `BleuScoreEvaluator`, `MeteorScoreEvaluator`, `QAEvaluator`

Note

AI-assisted quality evaluators except for SimilarityEvaluator come with a reason field. They employ techniques including chain-of-thought reasoning to generate an explanation for the score. Therefore they consume more token usage in generation as a result of improved evaluation quality. Specifically, max_token for evaluator generation has been set to 800 for all AI-assisted evaluators (and 1600 for RetrievalEvaluator to accommodate for longer inputs.)

Note

Azure OpenAI graders require a template that describes how their input columns are turned into the 'real' input that the grader uses. Example: If you have two inputs called "query" and "response", and a template that was formatted like so: {{item.query}}, then only the query would be used. Similarly you could have something like {{item.conversation}} to accept a conversation input, but the ability of the system to handle that will depend on how you configure the rest of the grader to expect that input.

For more information on data requirements for agentic evaluators, go to Run agent evaluations locally with Azure AI Evaluation SDK.

Single-turn support for text

All built-in evaluators take single-turn inputs as in query-and-response pairs in strings, for example:

from azure.ai.evaluation import RelevanceEvaluator

query = "What is the cpital of life?"
response = "Paris."

# Initializing an evaluator
relevance_eval = RelevanceEvaluator(model_config)
relevance_eval(query=query, response=response)

To run batch evaluations using local evaluation or upload your dataset to run cloud evaluation, you need to represent the dataset in .jsonl format. The previous single-turn data (a query-and-response pair) is equivalent to a line of dataset as following (we show three lines as an example):

{"query":"What is the capital of France?","response":"Paris."}
{"query":"What atoms compose water?","response":"Hydrogen and oxygen."}
{"query":"What color is my shirt?","response":"Blue."}

The evaluation test dataset can contain the following, depending on the requirements of each built-in evaluator:

Query: the query sent in to the generative AI application
Response: the response to the query generated by the generative AI application
Context: the source on which generated response is based (that is, the grounding documents)
Ground truth: the response generated by user/human as the true answer

To see what each evaluator requires, you can learn more in the Built-in evaluators documents.

Conversation support for text

For evaluators that support conversations for text, you can provide conversation as input, a Python dictionary with a list of messages (which include content, role, and optionally context).

Example of a two-turn conversation in python:

conversation = {
        "messages": [
        {
            "content": "Which tent is the most waterproof?", 
            "role": "user"
        },
        {
            "content": "The Alpine Explorer Tent is the most waterproof",
            "role": "assistant", 
            "context": "From the our product list the alpine explorer tent is the most waterproof. The Adventure Dining Table has higher weight."
        },
        {
            "content": "How much does it cost?",
            "role": "user"
        },
        {
            "content": "The Alpine Explorer Tent is $120.",
            "role": "assistant",
            "context": None
        }
        ]
}

To run batch evaluations using local evaluation or upload your dataset to run cloud evaluation, you need to represent the dataset in .jsonl format. The previous conversation is equivalent to a line of dataset as following in a .jsonl file:

{"conversation":
    {
        "messages": [
        {
            "content": "Which tent is the most waterproof?", 
            "role": "user"
        },
        {
            "content": "The Alpine Explorer Tent is the most waterproof",
            "role": "assistant", 
            "context": "From the our product list the alpine explorer tent is the most waterproof. The Adventure Dining Table has higher weight."
        },
        {
            "content": "How much does it cost?",
            "role": "user"
        },
        {
            "content": "The Alpine Explorer Tent is $120.",
            "role": "assistant",
            "context": null
        }
        ]
    }
}

Our evaluators understand that the first turn of the conversation provides valid query from user, context from assistant, and response from assistant in the query-response format. Conversations are then evaluated per turn and results are aggregated over all turns for a conversation score.

Note

In the second turn, even if context is null or a missing key, it's interpreted as an empty string instead of erroring out, which might lead to misleading results. We strongly recommend that you validate your evaluation data to comply with the data requirements.

For conversation mode, here's an example for GroundednessEvaluator:

# Conversation mode
import json
import os
from azure.ai.evaluation import GroundednessEvaluator, AzureOpenAIModelConfiguration

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ.get("AZURE_ENDPOINT"),
    api_key=os.environ.get("AZURE_API_KEY"),
    azure_deployment=os.environ.get("AZURE_DEPLOYMENT_NAME"),
    api_version=os.environ.get("AZURE_API_VERSION"),
)

# Initializing Groundedness and Groundedness Pro evaluators
groundedness_eval = GroundednessEvaluator(model_config)

conversation = {
    "messages": [
        { "content": "Which tent is the most waterproof?", "role": "user" },
        { "content": "The Alpine Explorer Tent is the most waterproof", "role": "assistant", "context": "From the our product list the alpine explorer tent is the most waterproof. The Adventure Dining Table has higher weight." },
        { "content": "How much does it cost?", "role": "user" },
        { "content": "$120.", "role": "assistant", "context": "The Alpine Explorer Tent is $120."}
    ]
}

# alternatively, you can load the same content from a .jsonl file
groundedness_conv_score = groundedness_eval(conversation=conversation)
print(json.dumps(groundedness_conv_score, indent=4))

For conversation outputs, per-turn results are stored in a list and the overall conversation score 'groundedness': 4.0 is averaged over the turns:

{
    "groundedness": 5.0,
    "gpt_groundedness": 5.0,
    "groundedness_threshold": 3.0,
    "evaluation_per_turn": {
        "groundedness": [
            5.0,
            5.0
        ],
        "gpt_groundedness": [
            5.0,
            5.0
        ],
        "groundedness_reason": [
            "The response accurately and completely answers the query by stating that the Alpine Explorer Tent is the most waterproof, which is directly supported by the context. There are no irrelevant details or incorrect information present.",
            "The RESPONSE directly answers the QUERY with the exact information provided in the CONTEXT, making it fully correct and complete."
        ],
        "groundedness_result": [
            "pass",
            "pass"
        ],
        "groundedness_threshold": [
            3,
            3
        ]
    }
}

Note

We strongly recommend users to migrate their code to use the key without prefixes (for example, groundedness.groundedness) to allow your code to support more evaluator models.

For evaluators that support conversations for image and multi-modal image and text, you can pass in image URLs or base64 encoded images in conversation.

Following are the examples of supported scenarios:

Multiple images with text input to image or text generation
Text only input to image generations
Image only inputs to text generation

from pathlib import Path
from azure.ai.evaluation import ContentSafetyEvaluator
import base64

# instantiate an evaluator with image and multi-modal support
safety_evaluator = ContentSafetyEvaluator(credential=azure_cred, azure_ai_project=project_scope)

# example of a conversation with an image URL
conversation_image_url = {
    "messages": [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are an AI assistant that understands images."}
            ],
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Can you describe this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/68/178268-050-5B4E7FB6/Tom-Cruise-2013.jpg"
                    },
                },
            ],
        },
        {
            "role": "assistant",
            "content": [
                {
                    "type": "text",
                    "text": "The image shows a man with short brown hair smiling, wearing a dark-colored shirt.",
                }
            ],
        },
    ]
}

# example of a conversation with base64 encoded images
base64_image = ""

with Path.open("Image1.jpg", "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode("utf-8")

conversation_base64 = {
    "messages": [
        {"content": "create an image of a branded apple", "role": "user"},
        {
            "content": [{"type": "image_url", "image_url": {"url": f"data:image/jpg;base64,{base64_image}"}}],
            "role": "assistant",
        },
    ]
}

# run the evaluation on the conversation to output the result
safety_score = safety_evaluator(conversation=conversation_image_url)

Currently the image and multi-modal evaluators support:

Single turn only (a conversation can have only one user message and one assistant message)
Conversation can have only one system message
Conversation payload should be less than 10-MB size (including images)
Absolute URLs and Base64 encoded images
Multiple images in a single turn
JPG/JPEG, PNG, GIF file formats

Set up

For AI-assisted quality evaluators except for GroundednessProEvaluator (preview), you must specify a GPT model (gpt-35-turbo, gpt-4, gpt-4-turbo, gpt-4o, or gpt-4o-mini) in your model_config to act as a judge to score the evaluation data. We support both Azure OpenAI or OpenAI model configuration schema. We recommend using GPT models that aren't in preview for the best performance and parseable responses with our evaluators.

Note

It's strongly recommended that gpt-3.5-turbo should be replaced by gpt-4o-mini for your evaluator model, as the latter is cheaper, more capable, and just as fast according to OpenAI.

Make sure that you have at least Cognitive Services OpenAI User role for the Azure OpenAI resource to make inference calls with API key. To learn more about permissions, see permissions for Azure OpenAI resource.

For all Risk and Safety Evaluators and GroundednessProEvaluator (preview), instead of a GPT deployment in model_config, you must provide your azure_ai_project information. This accesses the backend evaluation service via your Azure AI project.

Prompts for AI-assisted built-in evaluators

We open-source the prompts of our quality evaluators in our Evaluator Library and Azure AI Evaluation Python SDK repository for transparency, except for the Safety Evaluators and GroundednessProEvaluator (powered by Azure AI Content Safety). These prompts serve as instructions for a language model to perform their evaluation task, which requires a human-friendly definition of the metric and its associated scoring rubrics. We highly recommend that users customize the definitions and grading rubrics to their scenario specifics. See details in Custom Evaluators.

Composite evaluators

Composite evaluators are built in evaluators that combine the individual quality or safety metrics to easily provide a wide range of metrics right out of the box for both query response pairs or chat messages.

Composite evaluator	Contains	Description
`QAEvaluator`	`GroundednessEvaluator`, `RelevanceEvaluator`, `CoherenceEvaluator`, `FluencyEvaluator`, `SimilarityEvaluator`, `F1ScoreEvaluator`	Combines all the quality evaluators for a single output of combined metrics for query and response pairs
`ContentSafetyEvaluator`	`ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`	Combines all the safety evaluators for a single output of combined metrics for query and response pairs

Local evaluation on test datasets using `evaluate()`

After you spot-check your built-in or custom evaluators on a single row of data, you can combine multiple evaluators with the evaluate() API on an entire test dataset.

Prerequisite set up steps for Azure AI Foundry Projects

If this is your first time running evaluations and logging it to your Azure AI Foundry project, you might need to do a few additional setup steps.

Create and connect your storage account to your Azure AI Foundry project at the resource level. This bicep template provisions and connects a storage account to your Foundry project with key authentication.
Make sure the connected storage account has access to all projects.
If you connected your storage account with Microsoft Entra ID, make sure to give MSI (Microsoft Identity) permissions for Storage Blob Data Owner to both your account and Foundry project resource in Azure portal.

Evaluate on a dataset and log results to Azure AI Foundry

In order to ensure the evaluate() can correctly parse the data, you must specify column mapping to map the column from the dataset to key words that are accepted by the evaluators. In this case, we specify the data mapping for query, response, and context.

from azure.ai.evaluation import evaluate

result = evaluate(
    data="data.jsonl", # provide your data here
    evaluators={
        "groundedness": groundedness_eval,
        "answer_length": answer_length
    },
    # column mapping
    evaluator_config={
        "groundedness": {
            "column_mapping": {
                "query": "${data.queries}",
                "context": "${data.context}",
                "response": "${data.response}"
            } 
        }
    },
    # Optionally provide your Azure AI Foundry project information to track your evaluation results in your project portal
    azure_ai_project = azure_ai_project,
    # Optionally provide an output path to dump a json of metric summary, row level data and metric and Azure AI project URL
    output_path="./myevalresults.json"
)

Tip

Get the contents of the result.studio_url property for a link to view your logged evaluation results in your Azure AI project.

The evaluator outputs results in a dictionary which contains aggregate metrics and row-level data and metrics. An example of an output:

{'metrics': {'answer_length.value': 49.333333333333336,
             'groundedness.gpt_groundeness': 5.0, 'groundedness.groundeness': 5.0},
 'rows': [{'inputs.response': 'Paris is the capital of France.',
           'inputs.context': 'Paris has been the capital of France since '
                                  'the 10th century and is known for its '
                                  'cultural and historical landmarks.',
           'inputs.query': 'What is the capital of France?',
           'outputs.answer_length.value': 31,
           'outputs.groundeness.groundeness': 5,
           'outputs.groundeness.gpt_groundeness': 5,
           'outputs.groundeness.groundeness_reason': 'The response to the query is supported by the context.'},
          {'inputs.response': 'Albert Einstein developed the theory of '
                            'relativity.',
           'inputs.context': 'Albert Einstein developed the theory of '
                                  'relativity, with his special relativity '
                                  'published in 1905 and general relativity in '
                                  '1915.',
           'inputs.query': 'Who developed the theory of relativity?',
           'outputs.answer_length.value': 51,
           'outputs.groundeness.groundeness': 5,
           'outputs.groundeness.gpt_groundeness': 5,
           'outputs.groundeness.groundeness_reason': 'The response to the query is supported by the context.'},
          {'inputs.response': 'The speed of light is approximately 299,792,458 '
                            'meters per second.',
           'inputs.context': 'The exact speed of light in a vacuum is '
                                  '299,792,458 meters per second, a constant '
                                  "used in physics to represent 'c'.",
           'inputs.query': 'What is the speed of light?',
           'outputs.answer_length.value': 66,
           'outputs.groundeness.groundeness': 5,
           'outputs.groundeness.gpt_groundeness': 5,
           'outputs.groundeness.groundeness_reason': 'The response to the query is supported by the context.'}],
 'traces': {}}

Requirements for `evaluate()`

The evaluate() API has a few requirements for the data format that it accepts and how it handles evaluator parameter key names so that the charts of the evaluation results in your Azure AI project show up properly.

Data format

The evaluate() API only accepts data in the JSONLines format. For all built-in evaluators, evaluate() requires data in the following format with required input fields. See the previous section on required data input for built-in evaluators. Sample of one line can look like:

{
  "query":"What is the capital of France?",
  "context":"France is in Europe",
  "response":"Paris is the capital of France.",
  "ground_truth": "Paris"
}

Evaluator parameter format

When passing in your built-in evaluators, it's important to specify the right keyword mapping in the evaluators parameter list. The following table is the keyword mapping required for the results from your built-in evaluators to show up in the UI when logged to your Azure AI project.

Evaluator	keyword param
`GroundednessEvaluator`	"groundedness"
`GroundednessProEvaluator`	"groundedness_pro"
`RetrievalEvaluator`	"retrieval"
`RelevanceEvaluator`	"relevance"
`CoherenceEvaluator`	"coherence"
`FluencyEvaluator`	"fluency"
`SimilarityEvaluator`	"similarity"
`F1ScoreEvaluator`	"f1_score"
`RougeScoreEvaluator`	"rouge"
`GleuScoreEvaluator`	"gleu"
`BleuScoreEvaluator`	"bleu"
`MeteorScoreEvaluator`	"meteor"
`ViolenceEvaluator`	"violence"
`SexualEvaluator`	"sexual"
`SelfHarmEvaluator`	"self_harm"
`HateUnfairnessEvaluator`	"hate_unfairness"
`IndirectAttackEvaluator`	"indirect_attack"
`ProtectedMaterialEvaluator`	"protected_material"
`CodeVulnerabilityEvaluator`	"code_vulnerability"
`UngroundedAttributesEvaluator`	"ungrounded_attributes"
`QAEvaluator`	"qa"
`ContentSafetyEvaluator`	"content_safety"

Here's an example of setting the evaluators parameters:

result = evaluate(
    data="data.jsonl",
    evaluators={
        "sexual":sexual_evaluator
        "self_harm":self_harm_evaluator
        "hate_unfairness":hate_unfairness_evaluator
        "violence":violence_evaluator
    }
)

Local evaluation on a target

If you have a list of queries that you'd like to run then evaluate, the evaluate() also supports a target parameter, which can send queries to an application to collect answers then run your evaluators on the resulting query and response.

A target can be any callable class in your directory. In this case, we have a Python script askwiki.py with a callable class askwiki() that we can set as our target. Given a dataset of queries we can send into our simple askwiki app, we can evaluate the groundedness of the outputs. Ensure you specify the proper column mapping for your data in "column_mapping". You can use "default" to specify column mapping for all evaluators.

Here's the content in "data.jsonl":

{"query":"When was United Stated found ?", "response":"1776"}
{"query":"What is the capital of France?", "response":"Paris"}
{"query":"Who is the best tennis player of all time ?", "response":"Roger Federer"}

from askwiki import askwiki

result = evaluate(
    data="data.jsonl",
    target=askwiki,
    evaluators={
        "groundedness": groundedness_eval
    },
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.queries}"
                "context": "${outputs.context}"
                "response": "${outputs.response}"
            } 
        }
    }
)

Share via

Evaluate your Generative AI application locally with the Azure AI Evaluation SDK

Getting started

Built-in evaluators

Data requirements for built-in evaluators

Single-turn support for text

Conversation support for text

Conversation support for images and multi-modal text and image

Set up

Prompts for AI-assisted built-in evaluators

Composite evaluators

Local evaluation on test datasets using evaluate()

Prerequisite set up steps for Azure AI Foundry Projects

Evaluate on a dataset and log results to Azure AI Foundry

Requirements for evaluate()

Data format

Evaluator parameter format

Local evaluation on a target

Related content

Feedback

Additional resources

Local evaluation on test datasets using `evaluate()`

Requirements for `evaluate()`