Textual similarity evaluators

Note

This article refers to the Microsoft Foundry (classic) portal.

🔄 Switch to the Microsoft Foundry (new) documentation if you're using the new portal.

Note

This article refers to the Microsoft Foundry (new) portal.

Note

The Microsoft Foundry SDK for evaluation and Foundry portal are in public preview, but the APIs are generally available for model and dataset evaluation (agent evaluation remains in public preview). The Azure AI Evaluation SDK and evaluators marked (preview) in this article are currently in public preview everywhere.

It's important to compare how closely the textual response generated by your AI system matches the response you would expect. The expected response is called the ground truth.

Use an LLM-judge metric like Similarity to focus on the semantic similarity between the generated response and the ground truth. Or use metrics from the field of natural language processing (NLP), including F1 score, BLEU, GLEU, ROUGE, and METEOR, which focus on the overlap of tokens or n-grams between the two.

Model configuration for AI-assisted evaluators

For reference in the following code snippets, the AI-assisted evaluators use a model configuration for the LLM-judge:

import os

from azure.ai.evaluation import AzureOpenAIModelConfiguration
from dotenv import load_dotenv

# Load the AZURE_* settings from a local .env file into the environment.
load_dotenv()

# Configuration for the model that acts as the LLM-judge.
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_ENDPOINT"],
    api_key=os.environ.get("AZURE_API_KEY"),
    azure_deployment=os.environ.get("AZURE_DEPLOYMENT_NAME"),
    api_version=os.environ.get("AZURE_API_VERSION"),
)

Evaluator model support

Depending on the evaluator, the LLM-judge can be an Azure OpenAI or OpenAI reasoning model or a non-reasoning model:

| Evaluators | Reasoning models as judge (example: o-series models from Azure OpenAI / OpenAI) | Non-reasoning models as judge (example: gpt-4.1, gpt-4o, etc.) | To enable |
|---|---|---|---|
| IntentResolution, TaskAdherence, ToolCallAccuracy, ResponseCompleteness, Coherence, Fluency, Similarity, Groundedness, Retrieval, Relevance | Supported | Supported | Set the additional parameter is_reasoning_model=True when initializing the evaluator |
| Other evaluators | Not supported | Supported | -- |

For complex evaluation that requires refined reasoning, we recommend a model with strong reasoning capability, such as gpt-4.1-mini, which balances reasoning performance and cost efficiency.

Similarity

Similarity measures the degree of semantic similarity between the generated text and its ground truth with respect to a query. Unlike other text-similarity metrics that require ground truths, it focuses on the semantics of a response rather than simple overlap in tokens or n-grams, and it considers the broader context of the query.

Similarity example

from azure.ai.evaluation import SimilarityEvaluator

similarity = SimilarityEvaluator(model_config=model_config, threshold=3)
similarity(
    query="Is Marie Curie born in Paris?", 
    response="According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    ground_truth="Marie Curie was born in Warsaw."
)

Similarity output

The output is a numerical score on a Likert scale, an integer from 1 to 5. A higher score means a higher degree of similarity. Given a numerical threshold (default 3), this example also outputs pass if the score >= threshold, or fail otherwise. When the output includes a reason field, use it to understand why the score is high or low.

{
    "similarity": 4.0,
    "gpt_similarity": 4.0,
    "similarity_result": "pass",
    "similarity_threshold": 3
}
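
The pass/fail rule described above can be sketched as a small helper. This is only an illustration of the comparison, not the SDK's implementation; the function name `label_score` is hypothetical.

```python
def label_score(score: float, threshold: float = 3) -> str:
    """Return "pass" when the score meets the threshold, otherwise "fail"."""
    return "pass" if score >= threshold else "fail"


# Score and threshold taken from the Similarity example output.
similarity_label = label_score(4.0, threshold=3)
print(similarity_label)
```

The same comparison applies to the 0-1 metrics later in this article, with each evaluator's own threshold.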

F1 score

F1 score measures similarity by the tokens shared between the generated text and the ground truth. It balances precision and recall: the score is the harmonic mean of the two, F1 = 2 × precision × recall / (precision + recall). Both are computed from the number of words shared between the generated response and the ground truth answer.

  • Precision is the ratio of the number of shared words to the total number of words in the generation.
  • Recall is the ratio of the number of shared words to the total number of words in the ground truth.
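
The definitions above can be sketched in a few lines. This is a minimal sketch, assuming a simple normalization (lowercasing, punctuation removal, whitespace splitting); the evaluator's own normalization may differ, and the helper names `tokenize` and `f1_overlap` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed normalization: lowercase, drop punctuation, split on whitespace.
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def f1_overlap(response, ground_truth):
    """Return (precision, recall, f1) based on shared words."""
    resp = tokenize(response)
    truth = tokenize(ground_truth)
    # Multiset intersection: a repeated word counts only as often as it
    # appears in both texts.
    shared = sum((Counter(resp) & Counter(truth)).values())
    if shared == 0:
        return 0.0, 0.0, 0.0
    precision = shared / len(resp)
    recall = shared / len(truth)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For the Marie Curie example used in this article, the sketch gives a precision of 6/13, a recall of 1.0, and an F1 of about 0.632, which agrees with the sample output.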

F1 score example

from azure.ai.evaluation import F1ScoreEvaluator

f1_score = F1ScoreEvaluator(threshold=0.5)
f1_score(
    response="According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    ground_truth="Marie Curie was born in Warsaw."
)

F1 score output

The numerical score is a float between 0 and 1. A higher score is better. Given a numerical threshold (default 0.5), the evaluator also outputs pass if the score >= threshold, or fail otherwise.

{
    "f1_score": 0.631578947368421,
    "f1_result": "pass",
    "f1_threshold": 0.5
}

BLEU score

The BLEU score evaluator computes the Bilingual Evaluation Understudy (BLEU) score, which is commonly used in natural language processing and machine translation. It measures how closely the generated text matches the reference text.
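
BLEU is built from clipped ("modified") n-gram precisions combined with a brevity penalty. The following is a minimal sketch of those two ingredients, assuming a simple regex tokenizer. The helper names are hypothetical, and the evaluator's own tokenization and smoothing may differ, so the sketch isn't expected to reproduce the evaluator's score.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: words and punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(response, ground_truth, n):
    # Clip each response n-gram count by its count in the ground truth.
    resp = Counter(ngrams(tokenize(response), n))
    truth = Counter(ngrams(tokenize(ground_truth), n))
    total = sum(resp.values())
    return sum((resp & truth).values()) / total if total else 0.0


def brevity_penalty(response, ground_truth):
    # Penalize responses that are shorter than the ground truth.
    c, r = len(tokenize(response)), len(tokenize(ground_truth))
    if c == 0:
        return 0.0
    return 1.0 if c >= r else math.exp(1 - r / c)
```

A full BLEU score combines the geometric mean of these precisions (typically for n = 1 to 4) with the brevity penalty.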

BLEU example

from azure.ai.evaluation import BleuScoreEvaluator

bleu_score = BleuScoreEvaluator(threshold=0.3)
bleu_score(
    response="According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    ground_truth="Marie Curie was born in Warsaw."
)

BLEU output

The numerical score is a float between 0 and 1. A higher score is better. Given a numerical threshold (default 0.5), the evaluator also outputs pass if the score >= threshold, or fail otherwise.

{
    "bleu_score": 0.1550967560878879,
    "bleu_result": "fail",
    "bleu_threshold": 0.3
}

GLEU score

The GLEU score evaluator computes the Google-BLEU (GLEU) score. It measures similarity by the n-grams shared between the generated text and the ground truth. Similar to the BLEU score, it focuses on both precision and recall. It addresses the drawbacks of the BLEU score by using a per-sentence reward objective.
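
One common formulation of sentence-level GLEU counts all n-grams of length 1 to 4 in both texts and takes the minimum of n-gram precision and n-gram recall. The following is a minimal sketch of that formulation, assuming a simple regex tokenizer; the helper names are hypothetical and the evaluator's tokenization may differ.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: words and punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)


def all_ngrams(tokens, max_n=4):
    # Count every n-gram of length 1..max_n.
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def gleu(response, ground_truth, max_n=4):
    resp = all_ngrams(tokenize(response), max_n)
    truth = all_ngrams(tokenize(ground_truth), max_n)
    matches = sum((resp & truth).values())
    resp_total, truth_total = sum(resp.values()), sum(truth.values())
    if matches == 0 or resp_total == 0 or truth_total == 0:
        return 0.0
    # GLEU is the smaller of n-gram precision and n-gram recall.
    return min(matches / resp_total, matches / truth_total)
```

For the Marie Curie example used in this article, the sketch finds 14 matching n-grams out of 54 in the response and 22 in the ground truth, which gives about 0.259.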

GLEU score example

from azure.ai.evaluation import GleuScoreEvaluator

gleu_score = GleuScoreEvaluator(threshold=0.2)
gleu_score(
    response="According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    ground_truth="Marie Curie was born in Warsaw."
)

GLEU score output

The numerical score is a float between 0 and 1. A higher score is better. Given a numerical threshold (default 0.5), the evaluator also outputs pass if the score >= threshold, or fail otherwise.

{
    "gleu_score": 0.25925925925925924,
    "gleu_result": "pass",
    "gleu_threshold": 0.2
}

ROUGE score

The ROUGE score evaluator computes the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores, a set of metrics used to evaluate automatic summarization and machine translation. It measures the overlap between generated text and reference summaries. ROUGE focuses on recall-oriented measures to assess how well the generated text covers the reference text. The ROUGE score is composed of precision, recall, and F1 score.
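
The ROUGE-L variant bases precision, recall, and F1 on the longest common subsequence (LCS) of tokens. The following is a minimal sketch of that calculation, assuming lowercase alphanumeric tokens; the helper names are hypothetical and the evaluator's own tokenization may differ.

```python
import re


def tokenize(text):
    # Assumed normalization: lowercase alphanumeric tokens only.
    return re.findall(r"[a-z0-9]+", text.lower())


def lcs_length(a, b):
    # Dynamic-programming longest common subsequence, one row at a time.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(response, ground_truth):
    resp, truth = tokenize(response), tokenize(ground_truth)
    lcs = lcs_length(resp, truth)
    if lcs == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = lcs / len(resp)
    recall = lcs / len(truth)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

For the Marie Curie example used in this article, the LCS has six tokens, so the sketch gives a precision of about 0.462, a recall of 1.0, and an F1 of about 0.632.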

ROUGE score example

from azure.ai.evaluation import RougeScoreEvaluator, RougeType

rouge = RougeScoreEvaluator(
    rouge_type=RougeType.ROUGE_L,
    precision_threshold=0.6,
    recall_threshold=0.5,
    f1_score_threshold=0.55,
)
rouge(
    response="According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    ground_truth="Marie Curie was born in Warsaw."
)

ROUGE score output

Each of the three scores (precision, recall, and F1 score) is a float between 0 and 1. A higher score is better. Each score is compared with its own numerical threshold (default 0.5), and the evaluator outputs pass if the score >= threshold, or fail otherwise.

{
    "rouge_precision": 0.46153846153846156,
    "rouge_recall": 1.0,
    "rouge_f1_score": 0.631578947368421,
    "rouge_precision_result": "fail",
    "rouge_recall_result": "pass",
    "rouge_f1_score_result": "pass",
    "rouge_precision_threshold": 0.6,
    "rouge_recall_threshold": 0.5,
    "rouge_f1_score_threshold": 0.55
}

METEOR score

The METEOR score evaluator measures similarity by the n-grams shared between the generated text and the ground truth. Similar to the BLEU score, it focuses on precision and recall. It addresses limitations of other metrics like the BLEU score by considering synonyms, stemming, and paraphrasing for content alignment.

METEOR score example

from azure.ai.evaluation import MeteorScoreEvaluator

meteor_score = MeteorScoreEvaluator(threshold=0.9)
meteor_score(
    response="According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    ground_truth="Marie Curie was born in Warsaw."
)

METEOR score output

The numerical score is a float between 0 and 1. A higher score is better. Given a numerical threshold (default 0.5), the evaluator also outputs pass if the score >= threshold, or fail otherwise.

{
    "meteor_score": 0.8621140763997908,
    "meteor_result": "fail",
    "meteor_threshold": 0.9
}

Using textual similarity evaluators

Textual similarity evaluators compare generated responses against ground truth text using different algorithms:

  • Similarity - LLM-based semantic similarity evaluation
  • F1, BLEU, GLEU, ROUGE, METEOR - Algorithmic token/n-gram overlap metrics

Examples:

| Evaluator | What it measures | Required inputs | Required parameters | Output | Default threshold |
|---|---|---|---|---|---|
| builtin.similarity | Semantic similarity to ground truth | query, response, ground_truth | deployment_name | 1-5 integer | 3 |
| builtin.f1_score | Token overlap using precision and recall | ground_truth, response | (none) | 0-1 float | 0.5 |
| builtin.bleu_score | N-gram overlap (machine translation metric) | ground_truth, response | (none) | 0-1 float | 0.5 |
| builtin.gleu_score | Per-sentence reward variant of BLEU | ground_truth, response | (none) | 0-1 float | 0.5 |
| builtin.rouge_score | Recall-oriented n-gram overlap | ground_truth, response | rouge_type | 0-1 float | 0.5 |
| builtin.meteor_score | Weighted alignment with synonyms | ground_truth, response | (none) | 0-1 float | 0.5 |

Example input

Your test dataset should contain the fields referenced in your data mappings:

{"query": "What is the largest city in France?", "response": "Paris is the largest city in France.", "ground_truth": "The largest city in France is Paris."}
{"query": "Explain machine learning.", "response": "Machine learning is a subset of AI that enables systems to learn from data.", "ground_truth": "Machine learning is an AI technique where computers learn patterns from data."}
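
Before running an evaluation, it can help to confirm that every row contains the fields your data mappings reference. The following is a minimal sketch, assuming the dataset is available as JSON Lines text; the function name `missing_fields` and the default field list are assumptions for illustration, not part of any SDK.

```python
import json

REQUIRED_FIELDS = ("query", "response", "ground_truth")


def missing_fields(jsonl_text, required=REQUIRED_FIELDS):
    """Return {line_number: [missing field names]} for incomplete rows."""
    problems = {}
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # Ignore blank lines.
        row = json.loads(line)
        missing = [name for name in required if name not in row]
        if missing:
            problems[line_number] = missing
    return problems
```

An empty result means every row has the required fields. Evaluators that don't take a query, such as the algorithmic metrics, only need response and ground_truth, so pass a shorter `required` tuple for them.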

Configuration example

Data mapping syntax:

  • {{item.field_name}} references fields from your test dataset (for example, {{item.response}}).

testing_criteria = [
    {
        "type": "azure_ai_evaluator",
        "name": "Similarity",
        "evaluator_name": "builtin.similarity",
        "initialization_parameters": {"deployment_name": model_deployment},
        "data_mapping": {
            "query": "{{item.query}}",
            "response": "{{item.response}}",
            "ground_truth": "{{item.ground_truth}}",
        },
    },
    {
        "type": "azure_ai_evaluator",
        "name": "BLEUScore",
        "evaluator_name": "builtin.bleu_score",
        "data_mapping": {
            "ground_truth": "{{item.ground_truth}}",
            "response": "{{item.response}}",
        },
    },
    {
        "type": "azure_ai_evaluator",
        "name": "ROUGEScore",
        "evaluator_name": "builtin.rouge_score",
        "initialization_parameters": {"rouge_type": "rouge1"},
        "data_mapping": {
            "ground_truth": "{{item.ground_truth}}",
            "response": "{{item.response}}",
        },
    },
]

See Run evaluations in the cloud for details on running evaluations and configuring data sources.
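
To illustrate how a {{item.field_name}} placeholder relates to a dataset row, the following sketch resolves a data mapping locally. It's purely illustrative: the evaluation service performs the mapping itself, and `resolve_mapping` is a hypothetical helper, not an SDK function.

```python
import re

# Matches a template that consists of exactly one {{item.<field>}} placeholder.
PLACEHOLDER = re.compile(r"\{\{\s*item\.(\w+)\s*\}\}")


def resolve_mapping(data_mapping, item):
    """Replace each {{item.field}} placeholder with the value from `item`."""
    resolved = {}
    for target, template in data_mapping.items():
        match = PLACEHOLDER.fullmatch(template)
        if match is None:
            raise ValueError(f"Unsupported mapping template: {template!r}")
        field = match.group(1)
        if field not in item:
            raise KeyError(f"Dataset row has no field {field!r}")
        resolved[target] = item[field]
    return resolved
```

Resolving a mapping against a row this way is a quick check that the field names in your mappings match the field names in your dataset.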

Example output

LLM-based evaluators like similarity use a 1-5 Likert scale. Algorithmic evaluators output 0-1 floats. All evaluators output pass or fail based on their thresholds. Key output fields:

{
    "type": "azure_ai_evaluator",
    "name": "Similarity",
    "metric": "similarity",
    "score": 4,
    "label": "pass",
    "reason": "The response accurately conveys the same meaning as the ground truth.",
    "threshold": 3,
    "passed": true
}
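
After results are collected, the passed and score fields can be aggregated per evaluator. The following is a minimal sketch, assuming the results are already available as a list of dictionaries shaped like the object above. How the results are retrieved isn't shown, and `summarize` is a hypothetical helper.

```python
from collections import defaultdict


def summarize(results):
    """Return {evaluator name: (pass rate, mean score)}."""
    grouped = defaultdict(list)
    for result in results:
        grouped[result["name"]].append(result)
    summary = {}
    for name, rows in grouped.items():
        pass_rate = sum(1 for r in rows if r["passed"]) / len(rows)
        mean_score = sum(r["score"] for r in rows) / len(rows)
        summary[name] = (pass_rate, mean_score)
    return summary
```

Keep in mind that mean scores are only comparable within one evaluator, because Similarity uses a 1-5 scale and the algorithmic metrics use a 0-1 scale.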