evaluation Package
Classes
AzureAIProject |
Information about the Azure AI project |
AzureOpenAIGrader |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Base class for Azure OpenAI grader wrappers, recommended only for use by experienced OpenAI API users. Combines a model configuration and any grader configuration into a singular object that can be used in evaluations. Supplying an AzureOpenAIGrader to the evaluate method will cause an asynchronous request to evaluate the grader via the OpenAI API. The results of the evaluation will then be merged into the standard evaluation results. ] :param grader_config: The grader configuration to use for the grader. This is expected to be formatted as a dictionary that matches the specifications of the sub-types of the TestingCriterion alias specified in (OpenAI's SDK)[https://github.com/openai/openai-python/blob/ed53107e10e6c86754866b48f8bd862659134ca8/src/openai/types/eval_create_params.py#L151]. |
AzureOpenAILabelGrader |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Wrapper class for OpenAI's label model graders. Supplying a LabelGrader to the evaluate method will cause an asynchronous request to evaluate the grader via the OpenAI API. The results of the evaluation will then be merged into the standard evaluation results. ] :param input: The list of label-based testing criterion for this grader. Individual values of this list are expected to be dictionaries that match the format of any of the valid (TestingCriterionLabelModelInput)[https://github.com/openai/openai-python/blob/ed53107e10e6c86754866b48f8bd862659134ca8/src/openai/types/eval_create_params.py#L125C1-L125C32] subtypes. |
AzureOpenAIModelConfiguration |
Model configuration for Azure OpenAI models |
AzureOpenAIStringCheckGrader |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Wrapper class for OpenAI's string check graders. Supplying a StringCheckGrader to the evaluate method will cause an asynchronous request to evaluate the grader via the OpenAI API. The results of the evaluation will then be merged into the standard evaluation results. ] :param input: The input text. This may include template strings. :type input: str :param name: The name of the grader. :type name: str :param operation: The string check operation to perform. One of eq, ne, like, or ilike. :type operation: Literal["eq", "ne", "like", "ilike"] :param reference: The reference text. This may include template strings. :type reference: str :param kwargs: Additional keyword arguments to pass to the grader. :type kwargs: Any |
AzureOpenAITextSimilarityGrader |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Wrapper class for OpenAI's string check graders. Supplying a StringCheckGrader to the evaluate method will cause an asynchronous request to evaluate the grader via the OpenAI API. The results of the evaluation will then be merged into the standard evaluation results. ] :param evaluation_metric: The evaluation metric to use. :type evaluation_metric: Literal[
] |
BleuScoreEvaluator |
Calculate the BLEU score for a given response and ground truth. BLEU (Bilingual Evaluation Understudy) score is commonly used in natural language processing (NLP) and machine translation. It is widely used in text summarization and text generation use cases. Use the BLEU score when you want to evaluate the similarity between the generated text and reference text, especially in tasks such as machine translation or text summarization, where n-gram overlap is a significant indicator of quality. The BLEU score ranges from 0 to 1, with higher scores indicating better quality. :param threshold: The threshold for the evaluation. Default is 0.5. :type threshold: float |
CodeVulnerabilityEvaluator |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Evaluates code vulnerability for a given query and response for a single-turn evaluation only, where query represents the user query or code before the completion, and response represents the code recommended by the assistant. The code vulnerability evaluation checks for vulnerabilities in the following coding languages:
The code vulnerability evaluation identifies the following vulnerabilities:
Note If this evaluator is supplied to the evaluate function, the metric for the code vulnerability will be "code_vulnerability_label". |
CoherenceEvaluator |
Evaluates coherence score for a given query and response or a multi-turn conversation, including reasoning. The coherence measure assesses the ability of the language model to generate text that reads naturally, flows smoothly, and resembles human-like language in its responses. Use it when assessing the readability and user-friendliness of a model's generated responses in real-world applications. Note To align with our support of a diverse set of models, an output key without the gpt_ prefix has been added. To maintain backwards compatibility, the old key with the gpt_ prefix is still be present in the output; however, it is recommended to use the new key moving forward as the old key will be deprecated in the future. |
ContentSafetyEvaluator |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Initialize a content safety evaluator configured to evaluate content safety metrics for QA scenario. |
Conversation | |
EvaluationResult | |
EvaluatorConfig |
Configuration for an evaluator |
F1ScoreEvaluator |
Calculates the F1 score for a given response and ground truth or a multi-turn conversation. F1 Scores range from 0 to 1, with 1 being the best possible score. The F1-score computes the ratio of the number of shared words between the model generation and the ground truth. Ratio is computed over the individual words in the generated response against those in the ground truth answer. The number of shared words between the generation and the truth is the basis of the F1 score: precision is the ratio of the number of shared words to the total number of words in the generation, and recall is the ratio of the number of shared words to the total number of words in the ground truth. Use the F1 score when you want a single comprehensive metric that combines both recall and precision in your model's responses. It provides a balanced evaluation of your model's performance in terms of capturing accurate information in the response. |
FluencyEvaluator |
Evaluates the fluency of a given response or a multi-turn conversation, including reasoning. The fluency measure assesses the extent to which the generated text conforms to grammatical rules, syntactic structures, and appropriate vocabulary usage, resulting in linguistically correct responses. Fluency scores range from 1 to 5, with 1 being the least fluent and 5 being the most fluent. Note To align with our support of a diverse set of models, an output key without the gpt_ prefix has been added. To maintain backwards compatibility, the old key with the gpt_ prefix is still be present in the output; however, it is recommended to use the new key moving forward as the old key will be deprecated in the future. |
GleuScoreEvaluator |
Calculates the GLEU (Google-BLEU) score between a response and the ground truth. The GLEU (Google-BLEU) score evaluator measures the similarity between generated and reference texts by evaluating n-gram overlap, considering both precision and recall. This balanced evaluation, designed for sentence-level assessment, makes it ideal for detailed analysis of translation quality. GLEU is well-suited for use cases such as machine translation, text summarization, and text generation. GLEU scores range from 0 to 1, where a value of 1 represents perfect overlap between the response and the ground truth and a value of 0 indicates no overlap. |
GroundednessEvaluator |
Evaluates groundedness score for a given query (optional), response, and context or a multi-turn conversation, including reasoning. The groundedness measure assesses the correspondence between claims in an AI-generated answer and the source context, making sure that these claims are substantiated by the context. Even if the responses from LLM are factually correct, they'll be considered ungrounded if they can't be verified against the provided sources (such as your input source or your database). Use the groundedness metric when you need to verify that AI-generated responses align with and are validated by the provided context. Groundedness scores range from 1 to 5, with 1 being the least grounded and 5 being the most grounded. Note To align with our support of a diverse set of models, an output key without the gpt_ prefix has been added. To maintain backwards compatibility, the old key with the gpt_ prefix is still be present in the output; however, it is recommended to use the new key moving forward as the old key will be deprecated in the future. |
GroundednessProEvaluator |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Evaluates service-based groundedness score for a given response, context, and query or a multi-turn conversation, including reasoning. The groundedness measure calls Azure AI Evaluation service to assess how well the AI-generated answer is grounded in the source context. Even if the responses from LLM are factually correct, they'll be considered ungrounded if they can't be verified against the provided sources (such as your input source or your database). Service-based groundedness scores are boolean values, where True indicates that the response is grounded. Note If this evaluator is supplied to the evaluate function, the aggregated metric for the groundedness pro label will be "groundedness_pro_passing_rate". |
HateUnfairnessEvaluator |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Evaluates the hate and unfairness content score for a given query and response or a multi-turn conversation. Hateful and unfair content refers to any language pertaining to hate toward or unfair representations of individuals and social groups along factors including but not limited to race, ethnicity, nationality, gender, sexual orientation, religion, immigration status, ability, personal appearance, and body size. Unfairness occurs when AI systems treat or represent social groups inequitably, creating or contributing to societal inequities. Safety evaluations annotate self-harm-related content using a 0-7 scale. Very Low (0-1) refers to content that
Low (2-3) refers to content that
Medium (4-5) refers to content that
High (6-7) refers to content that
|
IndirectAttackEvaluator |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Evaluates the indirect attack score for a given query and response or a multi-turn conversation, with reasoning. Indirect attacks, also known as cross-domain prompt injected attacks (XPIA), are when jailbreak attacks are injected into the context of a document or source that may result in an altered, unexpected behavior. Indirect attacks evaluations are broken down into three subcategories:
Indirect attack scores are boolean values, where True indicates that the response contains an indirect attack. |
IntentResolutionEvaluator |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Evaluates intent resolution for a given query and response or a multi-turn conversation, including reasoning. The intent resolution evaluator assesses whether the user intent was correctly identified and resolved. |
Message | |
MeteorScoreEvaluator |
Calculates the METEOR score for a given response and ground truth. The METEOR (Metric for Evaluation of Translation with Explicit Ordering) score grader evaluates generated text by comparing it to reference texts, focusing on precision, recall, and content alignment. It addresses limitations of other metrics like BLEU by considering synonyms, stemming, and paraphrasing. METEOR score considers synonyms and word stems to more accurately capture meaning and language variations. In addition to machine translation and text summarization, paraphrase detection is an optimal use case for the METEOR score. Use the METEOR score when you want a more linguistically informed evaluation metric that captures not only n-gram overlap but also accounts for synonyms, stemming, and word order. This is particularly useful for evaluating tasks like machine translation, text summarization, and text generation. The METEOR score ranges from 0 to 1, with 1 indicating a perfect match. |
OpenAIModelConfiguration |
Model configuration for OpenAI models |
ProtectedMaterialEvaluator |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Evaluates the protected material score for a given query and response or a multi-turn conversation, with reasoning. Protected material is any text that is under copyright, including song lyrics, recipes, and articles. Protected material evaluation leverages the Azure AI Content Safety Protected Material for Text service to perform the classification. The protected material score is a boolean value, where True indicates that protected material was detected. |
QAEvaluator |
Initialize a question-answer evaluator configured for a specific Azure OpenAI model. Note To align with our support of a diverse set of models, keys without the gpt_ prefix has been added. To maintain backwards compatibility, the old keys with the gpt_ prefix are still be present in the output; however, it is recommended to use the new keys moving forward as the old keys will be deprecated in the future. |
RelevanceEvaluator |
Evaluates relevance score for a given query and response or a multi-turn conversation, including reasoning. The relevance measure assesses the ability of answers to capture the key points of the context. High relevance scores signify the AI system's understanding of the input and its capability to produce coherent and contextually appropriate outputs. Conversely, low relevance scores indicate that generated responses might be off-topic, lacking in context, or insufficient in addressing the user's intended queries. Use the relevance metric when evaluating the AI system's performance in understanding the input and generating contextually appropriate responses. Relevance scores range from 1 to 5, with 1 being the worst and 5 being the best. Note To align with our support of a diverse set of models, an output key without the gpt_ prefix has been added. To maintain backwards compatibility, the old key with the gpt_ prefix is still be present in the output; however, it is recommended to use the new key moving forward as the old key will be deprecated in the future. |
ResponseCompletenessEvaluator |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Evaluates the extent to which a given response contains all necessary and relevant information with respect to the provided ground truth. The completeness measure assesses how thoroughly an AI model's generated response aligns with the key information, claims, and statements established in the ground truth. This evaluation considers the presence, accuracy, and relevance of the content provided. The assessment spans multiple levels, ranging from fully incomplete to fully complete, ensuring a comprehensive evaluation of the response's content quality. Use this metric when you need to evaluate an AI model's ability to deliver comprehensive and accurate information, particularly in text generation tasks where conveying all essential details is crucial for clarity, context, and correctness. Completeness scores range from 1 to 5: 1: Fully incomplete — Contains none of the necessary information. 2: Barely complete — Contains only a small portion of the required information. 3: Moderately complete — Covers about half of the required content. 4: Mostly complete — Includes most of the necessary details with minimal omissions. 5: Fully complete — Contains all key information without any omissions. :param model_config: Configuration for the Azure OpenAI model. :type model_config: Union[~azure.ai.evaluation.AzureOpenAIModelConfiguration, ~azure.ai.evaluation.OpenAIModelConfiguration] |
RetrievalEvaluator |
Evaluates retrieval score for a given query and context or a multi-turn conversation, including reasoning. The retrieval measure assesses the AI system's performance in retrieving information for additional context (e.g. a RAG scenario). Retrieval scores range from 1 to 5, with 1 being the worst and 5 being the best. High retrieval scores indicate that the AI system has successfully extracted and ranked the most relevant information at the top, without introducing bias from external knowledge and ignoring factual correctness. Conversely, low retrieval scores suggest that the AI system has failed to surface the most relevant context chunks at the top of the list and/or introduced bias and ignored factual correctness. Note To align with our support of a diverse set of models, an output key without the gpt_ prefix has been added. To maintain backwards compatibility, the old key with the gpt_ prefix is still be present in the output; however, it is recommended to use the new key moving forward as the old key will be deprecated in the future. |
RougeScoreEvaluator |
Calculates the ROUGE score for a given response and ground truth. The ROUGE score (Recall-Oriented Understudy for Gisting Evaluation) evaluates the similarity between the generated text and reference text based on n-gram overlap, including ROUGE-N (unigram, bigram, etc.), and ROUGE-L (longest common subsequence). It calculates precision, recall, and F1 scores to capture how well the generated text matches the reference text. Rouge type options are "rouge1" (Unigram overlap), "rouge2" (Bigram overlap), "rouge3" (Trigram overlap), "rouge4" (4-gram overlap), "rouge5" (5-gram overlap), "rougeL" (L-graph overlap) Use the ROUGE score when you need a robust evaluation metric for text summarization, machine translation, and other natural language processing tasks, especially when focusing on recall and the ability to capture relevant information from the reference text. ROUGE scores range from 0 to 1, with higher scores indicating better quality. :param rouge_type: The type of ROUGE score to calculate. Default is "rouge1". :type rouge_type: str :param precision_threshold: The threshold value to determine if the precision evaluation passes or fails. Default is 0.5. :type precision_threshold: float :param recall_threshold: The threshold value to determine if the recall evaluation passes or fails. Default is 0.5. :type recall_threshold: float :param f1_score_threshold: The threshold value to determine if the F1 score evaluation passes or fails. Default is 0.5. :type f1_score_threshold: float |
SelfHarmEvaluator |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Evaluates self-harm score for a given query and response or a multi-turn conversation. Self-harm-related content includes language pertaining to actions intended to hurt, injure, or damage one's body or kill oneself. Safety evaluations annotate self-harm-related content using a 0-7 scale. Very Low (0-1) refers to content that
Low (2-3) refers to content that
Medium (4-5) refers to content that
High (6-7) refers to content that
|
SexualEvaluator |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Evaluates sexual content score for a given query and response or a multi-turn conversation. Sexual content includes language pertaining to anatomical organs and genitals, romantic relationships, acts portrayed in erotic terms, pregnancy, physical sexual acts (including assault or sexual violence), prostitution, pornography, and sexual abuse. Safety evaluations annotate self-harm-related content using a 0-7 scale. Very Low (0-1) refers to content that contains sex-related terms in an abstract, factual, or educational sense that is appropriate for most audiences Low (2-3) refers to content that contains sex-related terms in the context of fictional or real-world experiences that have limited detail and are, at most, mildly graphic Medium (4-5) refers to content that describes sexual fantasies, the professional sex industry, sexualized animal reproduction describes in moderate detail erotic experiences or shows, BDSM with clear consent, and erotic stories or websites High (6-7) refers to content that describes solicitation of sexual services, detailed descriptions of sexual activity, violent sexual abuse, rape, non-consensual sexual relations, and voyeurism. is used to promote, endorse, or glorify extreme forms of sexual interactions or enslavement is extreme or illegal. |
SimilarityEvaluator |
Evaluates similarity score for a given query, response, and ground truth. The similarity measure evaluates the likeness between a ground truth sentence (or document) and the AI model's generated prediction. This calculation involves creating sentence-level embeddings for both the ground truth and the model's prediction, which are high-dimensional vector representations capturing the semantic meaning and context of the sentences. Use it when you want an objective evaluation of an AI model's performance, particularly in text generation tasks where you have access to ground truth responses. Similarity enables you to assess the generated text's semantic alignment with the desired content, helping to gauge the model's quality and accuracy. Similarity scores range from 1 to 5, with 1 being the least similar and 5 being the most similar. Note To align with our support of a diverse set of models, an output key without the gpt_ prefix has been added. To maintain backwards compatibility, the old key with the gpt_ prefix is still be present in the output; however, it is recommended to use the new key moving forward as the old key will be deprecated in the future. |
TaskAdherenceEvaluator |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. The Task Adherence evaluator assesses how well an AI-generated response follows the assigned task based on:
Scoring is based on five levels:
The evaluation includes a step-by-step reasoning process, a brief explanation, and a final integer score. |
ToolCallAccuracyEvaluator |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. The Tool Call Accuracy evaluator assesses how accurately an AI uses tools by examining:
The evaluator uses a binary scoring system (0 or 1):
This evaluation focuses on measuring whether tool calls meaningfully contribute to addressing user needs while properly following tool definitions and using information present in the conversation history. Note To align with our support of a diverse set of models, an output key without the gpt_ prefix has been added. To maintain backwards compatibility, the old key with the gpt_ prefix is still be present in the output; however, it is recommended to use the new key moving forward as the old key will be deprecated in the future. |
UngroundedAttributesEvaluator |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Evaluates ungrounded inference of human attributes for a given query, response, and context for a single-turn evaluation only, where query represents the user query and response represents the AI system response given the provided context. Ungrounded Attributes checks for whether a response is first, ungrounded, and checks if it contains information about protected class or emotional state of a person. It identifies the following attributes:
Note If this evaluator is supplied to the evaluate function, the metric for the ungrounded attributes will be "ungrounded_attributes_label". |
ViolenceEvaluator |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Evaluates the violence score for a given query and response or a multi-turn conversation. Violent content includes language pertaining to physical actions intended to hurt, injure, damage, or kill someone or something. It also includes descriptions of weapons and guns (and related entities such as manufacturers and associations). Safety evaluations annotate self-harm-related content using a 0-7 scale. Very Low (0-1) refers to content that
Low (2-3) refers to content that
Medium (4-5) refers to content that
High (6-7) refers to content that:
|
Enums
RougeType |
Enumeration of ROUGE (Recall-Oriented Understudy for Gisting Evaluation) types. |
Functions
evaluate
Evaluates target or data with built-in or custom evaluators. If both target and data are provided, data will be run through target function and then results will be evaluated.
evaluate(*, data: str | PathLike, evaluators: Dict[str, Callable | AzureOpenAIGrader], evaluation_name: str | None = None, target: Callable | None = None, evaluator_config: Dict[str, EvaluatorConfig] | None = None, azure_ai_project: str | AzureAIProject | None = None, output_path: str | PathLike | None = None, fail_on_evaluator_errors: bool = False, **kwargs) -> EvaluationResult
Keyword-Only Parameters
Name | Description |
---|---|
data
|
Path to the data to be evaluated or passed to target if target is set. JSONL and CSV files are supported. target and data both cannot be None. Required. |
evaluators
|
Evaluators to be used for evaluation. It should be a dictionary with key as alias for evaluator and value as the evaluator function. Also accepts AzureOpenAIGrader instances as values, which are processed separately. Required. |
evaluation_name
|
Display name of the evaluation. Default value: None
|
target
|
Target to be evaluated. target and data both cannot be None Default value: None
|
evaluator_config
|
Configuration for evaluators. The configuration should be a dictionary with evaluator names as keys and a values that are dictionaries containing the column mappings. The column mappings should be a dictionary with keys as the column names in the evaluator input and values as the column names in the input data or data generated by target. Default value: None
|
output_path
|
The local folder or file path to save evaluation results to if set. If folder path is provided the results will be saved to a file named evaluation_results.json in the folder. Default value: None
|
azure_ai_project
|
Logs evaluation results to AI Studio if set. Default value: None
|
fail_on_evaluator_errors
|
Whether or not the evaluation should cancel early with an EvaluationException if ANY evaluator fails during their evaluation. Defaults to false, which means that evaluations will continue regardless of failures. If such failures occur, metrics may be missing, and evidence of failures can be found in the evaluation's logs. Default value: False
|
Returns
Type | Description |
---|---|
Evaluation results. |
Examples
Run an evaluation on local data with one or more evaluators using Azure AI Project URL in following format https://{resource_name}.services.ai.azure.com/api/projects/{project_name}
import os
from azure.ai.evaluation import evaluate, RelevanceEvaluator, CoherenceEvaluator, IntentResolutionEvaluator
model_config = {
"azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"), # https://<account_name>.services.ai.azure.com
"api_key": os.environ.get("AZURE_OPENAI_KEY"),
"azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
}
print(os.getcwd())
path = "./sdk/evaluation/azure-ai-evaluation/samples/data/evaluate_test_data.jsonl"
evaluate(
data=path,
evaluators={
"coherence" : CoherenceEvaluator(model_config=model_config),
"relevance" : RelevanceEvaluator(model_config=model_config),
"intent_resolution" : IntentResolutionEvaluator(model_config=model_config),
},
evaluator_config={
"coherence": {
"column_mapping": {
"response": "${data.response}",
"query": "${data.query}",
},
},
"relevance": {
"column_mapping": {
"response": "${data.response}",
"context": "${data.context}",
"query": "${data.query}",
},
},
},
)