Evaluation and monitoring metrics for generative AI

Note

Azure AI Studio is currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

Azure AI Studio allows you to evaluate single-turn or complex, multi-turn conversations where you ground the generative AI model in your specific data (also known as Retrieval Augmented Generation or RAG). You can also evaluate general single-turn question answering scenarios, where no context is used to ground your generative AI model (non-RAG). Currently, we support built-in metrics for the following task types:

Question answering (single turn)

In this setup, users pose individual questions or prompts, and a generative AI model is employed to instantly generate responses.

The test set format will follow this data format:

{"question":"Which tent is the most waterproof?","context":"From our product list, the Alpine Explorer tent is the most waterproof. The Adventure Dining Table has higher weight.","answer":"The Alpine Explorer Tent is the most waterproof.","ground_truth":"The Alpine Explorer Tent has the highest rainfly waterproof rating at 3000m"} 

Note

The "context" and "ground truth" fields are optional, and the supported metrics depend on the fields you provide

Conversation (single turn and multi turn)

In this context, users engage in conversational interactions, either through a series of turns or in a single exchange. The generative AI model, equipped with retrieval mechanisms, generates responses and can access and incorporate information from external sources, such as documents. The Retrieval Augmented Generation (RAG) model enhances the quality and relevance of responses by using external documents and knowledge.

The test set format will follow this data format:

{"messages":[{"role":"user","content":"How can I check the status of my online order?"},{"content":"Hi Sarah Lee! To check the status of your online order for previous purchases such as the TrailMaster X4 Tent or the CozyNights Sleeping Bag, please refer to your email for order confirmation and tracking information. If you need further assistance, feel free to contact our customer support at support@contosotrek.com or give us a call at 1-800-555-1234.
","role":"assistant","context":{"citations":[{"id":"cHJvZHVjdF9pbmZvXzYubWQz","title":"Information about product item_number: 6","content":"# Information about product item_number: 6\n\nIt's essential to check local regulations before using the EcoFire Camping Stove, as some areas may have restrictions on open fires or require a specific type of stove.\n\n30) How do I clean and maintain the EcoFire Camping Stove?\n   To clean the EcoFire Camping Stove, allow it to cool completely, then wipe away any ash or debris with a brush or cloth. Store the stove in a dry place when not in use."}]}}]}

Supported metrics

As described in the methods for evaluating large language models, there are manual and automated approaches to measurement. Automated measurement is useful for measuring at scale with increased coverage to provide more comprehensive results. It's also helpful for ongoing measurement to monitor for any regression as the system, usage, and mitigations evolve.

We support two main methods for automated measurement of generative AI applications:

  • Traditional machine learning metrics
  • AI-assisted metrics

AI-assisted metrics utilize language models like GPT-4 to assess AI-generated output, especially in situations where expected answers are unavailable due to the absence of a defined ground truth. Traditional machine learning metrics, like F1 score, gauge the precision and recall between AI-generated responses and the anticipated answers.

Our AI-assisted metrics assess the safety and generation quality of generative AI applications. These metrics fall into two distinct categories:

  • Risk and safety metrics:

    These metrics focus on identifying potential content and security risks and ensuring the safety of the generated content.

    They include:

    • Hateful and unfair content defect rate
    • Sexual content defect rate
    • Violent content defect rate
    • Self-harm-related content defect rate
    • Jailbreak defect rate
  • Generation quality metrics:

    These metrics evaluate the overall quality and coherence of the generated content.

    They include:

    • Coherence
    • Fluency
    • Groundedness
    • Relevance
    • Retrieval score
    • Similarity

We support the following AI-Assisted metrics for the above task types:

Task type Question and Generated Answers Only (No context or ground truth needed) Question and Generated Answers + Context Question and Generated Answers + Context + Ground Truth
Question Answering - Risk and safety metrics (all AI-Assisted): hateful and unfair content defect rate, sexual content defect rate, violent content defect rate, self-harm-related content defect rate, and jailbreak defect rate
- Generation quality metrics (all AI-Assisted): Coherence, Fluency
Previous Column Metrics
+
Generation quality metrics (all AI-Assisted):
- Groundedness
- Relevance
Previous Column Metrics
+
Generation quality metrics:
Similarity (AI-assisted)
F1-Score (traditional ML metric)
Conversation - Risk and safety metrics (all AI-Assisted): hateful and unfair content defect rate, sexual content defect rate, violent content defect rate, self-harm-related content defect rate, and jailbreak defect rate
- Generation quality metrics (all AI-Assisted): Coherence, Fluency
Previous Column Metrics
+
Generation quality metrics (all AI-Assisted):
- Groundedness
- Retrieval Score
N/A

Note

While we are providing you with a comprehensive set of built-in metrics that facilitate the easy and efficient evaluation of the quality and safety of your generative AI application, it is best practice to adapt and customize them to your specific task types. Furthermore, we empower you to introduce entirely new metrics, enabling you to measure your applications from fresh angles and ensuring alignment with your unique objectives.

Risk and safety metrics

The risk and safety metrics draw on insights gained from our previous Large Language Model projects such as GitHub Copilot and Bing. This ensures a comprehensive approach to evaluating generated responses for risk and safety severity scores. These metrics are generated through our safety evaluation service, which employs a set of LLMs. Each model is tasked with assessing specific risks that could be present in the response (for example, sexual content, violent content, etc.). These models are provided with risk definitions and severity scales, and they annotate generated conversations accordingly. Currently, we calculate a “defect rate” for the risk and safety metrics below. For each of these metrics, the service measures whether these types of content were detected and at what severity level. Each of the four types has three severity levels (Very low, Low, Medium, High). Users specify a threshold of tolerance, and the defect rates are produced by our service correspond to the number of instances that were generated at and above each threshold level.

Types of content:

  • Hateful and unfair content
  • Sexual content
  • Violent content
  • Self-harm-related content

Besides the above types of contents, we also support “Jailbreak defect rate” in a comparative view across evaluations, a metric that measures the prevalence of jailbreaks in model responses. Jailbreaks are when a model response bypasses the restrictions placed on it. Jailbreak also happens where an LLM deviates from the intended task or topic.

Users can measure these risk and safety metrics on their own data or use the Azure AI SDK to simulate different attack interactions with their generative AI application to output a test (we refer to it as content risk dataset). Then you can evaluate on this simulated test dataset to output an annotated test dataset with content risk severity levels (very low, low, medium, or high) and view your results in Azure AI , which provides you with overall defect rate across whole test dataset and instance view of each content risk label and reasoning.

Unlike other metrics in the table, jailbreak vulnerability can't be reliably measured with annotation by an LLM. However, jailbreak vulnerability can be measured by the comparison of two different automated datasets (1) content risk dataset vs. (2) content risk dataset with jailbreak injections in the first turn. Then the user evaluates jailbreak vulnerability by comparing the two datasets’ content risk defect rates.

Note

AI-assisted risk and safety metrics are hosted by Azure AI Studio safety evaluations back-end service and is only available in the following regions: East US 2, France Central, UK South, Sweden Central.

Available regions have the following capacity:

Region TPM
Sweden Central 450k
France Central 380k
UK South 280k
East US 2 80k

Hateful and unfair content definition and severity scale

Warning

The content risk definitions and severity scales contain descriptions that may be disturbing to some users.

Sexual content definition and severity scale

Warning

The content risk definitions and severity scales contain descriptions that may be disturbing to some users.

Violent content definition and severity scale

Warning

The content risk definitions and severity scales contain descriptions that may be disturbing to some users.

Warning

The content risk definitions and severity scales contain descriptions that may be disturbing to some users.

Generation quality metrics

Generation quality metrics are used to assess the overall quality of the content produced by generative AI applications. Here's a breakdown of what these metrics entail:

AI-assisted: Groundedness

For groundedness, we provide two versions:

  • Groundedness Detection leveraging Azure AI Content Safety Service (AACS) via integration into the Azure AI Studio safety evaluations. No deployment is required from the user as a back-end service will provide the models for you to output a score and reasoning. Currently supported in the following regions: East US 2 and Sweden Central.
  • Prompt-only-based Groundedness using your own models to output only a score. Currently supported in all regions.

AACS based groundedness

Score characteristics Score details
Score range 1-5 where 1 is ungrounded and 5 is grounded
What is this metric? Measures how well the model's generated answers align with information from the source data (for example, retrieved documents in RAG Question and Answering or documents for summarization) and outputs reasonings for which specific generated sentences are ungrounded.
How does it work? Groundedness Detection leverages an Azure AI Content Safety Service custom language model fine-tuned to a natural language processing task called Natural Language Inference (NLI), which evaluates claims as being entailed or not entailed by a source document. 
When to use it? Use the groundedness metric when you need to verify that AI-generated responses align with and are validated by the provided context. It's essential for applications where factual correctness and contextual accuracy are key, like information retrieval, question-answering, and content summarization. This metric ensures that the AI-generated answers are well-supported by the context.
What does it need as input? Question, Context, Generated Answer

Prompt-only-based groundedness

Score characteristics Score details
Score range 1-5 where 1 is ungrounded and 5 is grounded
What is this metric? Measures how well the model's generated answers align with information from the source data (user-defined context).
How does it work? The groundedness measure assesses the correspondence between claims in an AI-generated answer and the source context, making sure that these claims are substantiated by the context. Even if the responses from LLM are factually correct, they'll be considered ungrounded if they can't be verified against the provided sources (such as your input source or your database).
When to use it? Use the groundedness metric when you need to verify that AI-generated responses align with and are validated by the provided context. It's essential for applications where factual correctness and contextual accuracy are key, like information retrieval, question-answering, and content summarization. This metric ensures that the AI-generated answers are well-supported by the context.
What does it need as input? Question, Context, Generated Answer

Built-in prompt used by Large Language Model judge to score this metric:

You will be presented with a CONTEXT and an ANSWER about that CONTEXT. You need to decide whether the ANSWER is entailed by the CONTEXT by choosing one of the following rating: 

1. 5: The ANSWER follows logically from the information contained in the CONTEXT. 

2. 1: The ANSWER is logically false from the information contained in the CONTEXT. 

3. an integer score between 1 and 5 and if such integer score does not exist,  

use 1: It is not possible to determine whether the ANSWER is true or false without further information. 

Read the passage of information thoroughly and select the correct answer from the three answer labels. 

Read the CONTEXT thoroughly to ensure you know what the CONTEXT entails.  

Note the ANSWER is generated by a computer system, it can contain certain symbols, which should not be a negative factor in the evaluation. 

AI-assisted: Relevance

Score characteristics Score details
Score range Integer [1-5]: where 1 is bad and 5 is good
What is this metric? Measures the extent to which the model's generated responses are pertinent and directly related to the given questions.
How does it work? The relevance measure assesses the ability of answers to capture the key points of the context. High relevance scores signify the AI system's understanding of the input and its capability to produce coherent and contextually appropriate outputs. Conversely, low relevance scores indicate that generated responses might be off-topic, lacking in context, or insufficient in addressing the user's intended queries.
When to use it? Use the relevance metric when evaluating the AI system's performance in understanding the input and generating contextually appropriate responses.
What does it need as input? Question, Context, Generated Answer

Built-in prompt used by Large Language Model judge to score this metric (For question answering data format):

Relevance measures how well the answer addresses the main aspects of the question, based on the context. Consider whether all and only the important aspects are contained in the answer when evaluating relevance. Given the context and question, score the relevance of the answer between one to five stars using the following rating scale: 

One star: the answer completely lacks relevance 

Two stars: the answer mostly lacks relevance 

Three stars: the answer is partially relevant 

Four stars: the answer is mostly relevant 

Five stars: the answer has perfect relevance 

This rating value should always be an integer between 1 and 5. So the rating produced should be 1 or 2 or 3 or 4 or 5. 

Built-in prompt used by Large Language Model judge to score this metric (For conversation data format) (without Ground Truth available):

You will be provided a question, a conversation history, fetched documents related to the question and a response to the question in the {DOMAIN} domain. Your task is to evaluate the quality of the provided response by following the steps below:  
 
- Understand the context of the question based on the conversation history.  
 
- Generate a reference answer that is only based on the conversation history, question, and fetched documents. Don't generate the reference answer based on your own knowledge.  
 
- You need to rate the provided response according to the reference answer if it's available on a scale of 1 (poor) to 5 (excellent), based on the below criteria:  
 
5 - Ideal: The provided response includes all information necessary to answer the question based on the reference answer and conversation history. Please be strict about giving a 5 score.  
 
4 - Mostly Relevant: The provided response is mostly relevant, although it might be a little too narrow or too broad based on the reference answer and conversation history.  
 
3 - Somewhat Relevant: The provided response might be partly helpful but might be hard to read or contain other irrelevant content based on the reference answer and conversation history.  
 
2 - Barely Relevant: The provided response is barely relevant, perhaps shown as a last resort based on the reference answer and conversation history.  
 
1 - Completely Irrelevant: The provided response should never be used for answering this question based on the reference answer and conversation history.  
 
- You need to rate the provided response to be 5, if the reference answer can not be generated since no relevant documents were retrieved.  
 
- You need to first provide a scoring reason for the evaluation according to the above criteria, and then provide a score for the quality of the provided response.  
 
- You need to translate the provided response into English if it's in another language. 

- Your final response must include both the reference answer and the evaluation result. The evaluation result should be written in English.  

Built-in prompt used by Large Language Model judge to score this metric (For conversation data format) (with Ground Truth available):


Your task is to score the relevance between a generated answer and the question based on the ground truth answer in the range between 1 and 5, and please also provide the scoring reason.  
 
Your primary focus should be on determining whether the generated answer contains sufficient information to address the given question according to the ground truth answer.   
 
If the generated answer fails to provide enough relevant information or contains excessive extraneous information, then you should reduce the score accordingly.  
 
If the generated answer contradicts the ground truth answer, it will receive a low score of 1-2.   
 
For example, for question "Is the sky blue?", the ground truth answer is "Yes, the sky is blue." and the generated answer is "No, the sky is not blue.".   
 
In this example, the generated answer contradicts the ground truth answer by stating that the sky is not blue, when in fact it is blue.   
 
This inconsistency would result in a low score of 1-2, and the reason for the low score would reflect the contradiction between the generated answer and the ground truth answer.  
 
Please provide a clear reason for the low score, explaining how the generated answer contradicts the ground truth answer.  
 
Labeling standards are as following:  
 
5 - ideal, should include all information to answer the question comparing to the ground truth answer, and the generated answer is consistent with the ground truth answer  
 
4 - mostly relevant, although it might be a little too narrow or too broad comparing to the ground truth answer, and the generated answer is consistent with the ground truth answer  
 
3 - somewhat relevant, might be partly helpful but might be hard to read or contain other irrelevant content comparing to the ground truth answer, and the generated answer is consistent with the ground truth answer  
 
2 - barely relevant, perhaps shown as a last resort comparing to the ground truth answer, and the generated answer contradicts with the ground truth answer  
 
1 - completely irrelevant, should never be used for answering this question comparing to the ground truth answer, and the generated answer contradicts with the ground truth answer  

AI-assisted: Coherence

Score characteristics Score details
Score range Integer [1-5]: where 1 is bad and 5 is good
What is this metric? Measures how well the language model can produce output that flows smoothly, reads naturally, and resembles human-like language.
How does it work? The coherence measure assesses the ability of the language model to generate text that reads naturally, flows smoothly, and resembles human-like language in its responses.
When to use it? Use it when assessing the readability and user-friendliness of your model's generated responses in real-world applications.
What does it need as input? Question, Generated Answer

Built-in prompt used by Large Language Model judge to score this metric:

Coherence of an answer is measured by how well all the sentences fit together and sound naturally as a whole. Consider the overall quality of the answer when evaluating coherence. Given the question and answer, score the coherence of answer between one to five stars using the following rating scale: 

One star: the answer completely lacks coherence 

Two stars: the answer mostly lacks coherence 

Three stars: the answer is partially coherent 

Four stars: the answer is mostly coherent 

Five stars: the answer has perfect coherency 

This rating value should always be an integer between 1 and 5. So the rating produced should be 1 or 2 or 3 or 4 or 5. 

AI-assisted: Fluency

Score characteristics Score details
Score range Integer [1-5]: where 1 is bad and 5 is good
What is this metric? Measures the grammatical proficiency of a generative AI's predicted answer.
How does it work? The fluency measure assesses the extent to which the generated text conforms to grammatical rules, syntactic structures, and appropriate vocabulary usage, resulting in linguistically correct responses.
When to use it? Use it when evaluating the linguistic correctness of the AI-generated text, ensuring that it adheres to proper grammatical rules, syntactic structures, and vocabulary usage in the generated responses.
What does it need as input? Question, Generated Answer

Built-in prompt used by Large Language Model judge to score this metric:

Fluency measures the quality of individual sentences in the answer, and whether they are well-written and grammatically correct. Consider the quality of individual sentences when evaluating fluency. Given the question and answer, score the fluency of the answer between one to five stars using the following rating scale: 

One star: the answer completely lacks fluency 

Two stars: the answer mostly lacks fluency 

Three stars: the answer is partially fluent 

Four stars: the answer is mostly fluent 

Five stars: the answer has perfect fluency 

This rating value should always be an integer between 1 and 5. So the rating produced should be 1 or 2 or 3 or 4 or 5. 

AI-assisted: Retrieval Score

Score characteristics Score details
Score range Float [1-5]: where 1 is bad and 5 is good
What is this metric? Measures the extent to which the model's retrieved documents are pertinent and directly related to the given questions.
How does it work? Retrieval score measures the quality and relevance of the retrieved document to the user's question (summarized within the whole conversation history). Steps: Step 1: Break down user query into intents, Extract the intents from user query like “How much is the Azure linux VM and Azure Windows VM?” -> Intent would be [“what’s the pricing of Azure Linux VM?”, “What’s the pricing of Azure Windows VM?”]. Step 2: For each intent of user query, ask the model to assess if the intent itself or the answer to the intent is present or can be inferred from retrieved documents. The answer can be “No”, or “Yes, documents [doc1], [doc2]…”. “Yes” means the retrieved documents relate to the intent or answer to the intent, and vice versa. Step 3: Calculate the fraction of the intents that have an answer starting with “Yes”. In this case, all intents have equal importance. Step 4: Finally, square the score to penalize the mistakes.
When to use it? Use the retrieval score when you want to guarantee that the documents retrieved are highly relevant for answering your users' questions. This score helps ensure the quality and appropriateness of the retrieved content.
What does it need as input? Question, Context, Generated Answer

Built-in prompt used by Large Language Model judge to score this metric:

A chat history between user and bot is shown below 

A list of documents is shown below in json format, and each document has one unique id.  

These listed documents are used as contex to answer the given question. 

The task is to score the relevance between the documents and the potential answer to the given question in the range of 1 to 5.  

1 means none of the documents is relevant to the question at all. 5 means either one of the document or combination of a few documents is ideal for answering the given question. 

Think through step by step: 

- Summarize each given document first 

- Determine the underlying intent of the given question, when the question is ambiguous, refer to the given chat history  

- Measure how suitable each document to the given question, list the document id and the corresponding relevance score.  

- Summarize the overall relevance of given list of documents to the given question after # Overall Reason, note that the answer to the question can soley from single document or a combination of multiple documents.  

- Finally, output "# Result" followed by a score from 1 to 5.  

  

# Question 

{{ query }} 

# Chat History 

{{ history }} 

# Documents 

---BEGIN RETRIEVED DOCUMENTS--- 

{{ FullBody }} 

---END RETRIEVED DOCUMENTS--- 

AI-assisted: GPT-Similarity

Score characteristics Score details
Score range Integer [1-5]: where 1 is bad and 5 is good
What is this metric? Measures the similarity between a source data (ground truth) sentence and the generated response by an AI model.
How does it work? The GPT-similarity measure evaluates the likeness between a ground truth sentence (or document) and the AI model's generated prediction. This calculation involves creating sentence-level embeddings for both the ground truth and the model's prediction, which are high-dimensional vector representations capturing the semantic meaning and context of the sentences.
When to use it? Use it when you want an objective evaluation of an AI model's performance, particularly in text generation tasks where you have access to ground truth responses. GPT-similarity enables you to assess the generated text's semantic alignment with the desired content, helping to gauge the model's quality and accuracy.
What does it need as input? Question, Ground Truth Answer, Generated Answer

Built-in prompt used by Large Language Model judge to score this metric:

GPT-Similarity, as a metric, measures the similarity between the predicted answer and the correct answer. If the information and content in the predicted answer is similar or equivalent to the correct answer, then the value of the Equivalence metric should be high, else it should be low. Given the question, correct answer, and predicted answer, determine the value of Equivalence metric using the following rating scale: 

One star: the predicted answer is not at all similar to the correct answer 

Two stars: the predicted answer is mostly not similar to the correct answer 

Three stars: the predicted answer is somewhat similar to the correct answer 

Four stars: the predicted answer is mostly similar to the correct answer 

Five stars: the predicted answer is completely similar to the correct answer 

This rating value should always be an integer between 1 and 5. So the rating produced should be 1 or 2 or 3 or 4 or 5. 

Traditional machine learning: F1 Score

Score characteristics Score details
Score range Float [0-1]
What is this metric? Measures the ratio of the number of shared words between the model generation and the ground truth answers.
How does it work? The F1-score computes the ratio of the number of shared words between the model generation and the ground truth. Ratio is computed over the individual words in the generated response against those in the ground truth answer. The number of shared words between the generation and the truth is the basis of the F1 score: precision is the ratio of the number of shared words to the total number of words in the generation, and recall is the ratio of the number of shared words to the total number of words in the ground truth.
When to use it? Use the F1 score when you want a single comprehensive metric that combines both recall and precision in your model's responses. It provides a balanced evaluation of your model's performance in terms of capturing accurate information in the response.
What does it need as input? Question, Ground Truth Answer, Generated Answer

Next steps