

Basic evaluation of RAG performance

This tutorial demonstrates how to use Fabric to evaluate the performance of a RAG application. The evaluation focuses on the two main RAG components: the retriever (Azure AI Search) and the response generator (the LLM that produces a reply from the user query, the retrieved context, and a prompt). These are the main steps:

  1. Set up the Azure OpenAI and Azure AI Search services
  2. Load data from CMU's Wikipedia-article QA dataset to create a benchmark
  3. Run a single query as a smoke test to confirm that the RAG system works end to end
  4. Define deterministic and AI-assisted evaluation metrics
  5. Checkpoint 1: Evaluate retriever performance with top-N accuracy
  6. Checkpoint 2: Evaluate response generator performance with groundedness, relevance, and similarity metrics
  7. Visualize the evaluation results and save them in OneLake for future reference and continuous evaluation

Prerequisites

Before starting this tutorial, complete the step-by-step guide to building retrieval-augmented generation in Fabric.

You need the following services to run the notebook: an Azure OpenAI resource and an Azure AI Search service.

In the previous tutorial, you uploaded data to a lakehouse and built the document index used by the RAG system. Use that index in this exercise to learn core techniques for evaluating RAG performance and identifying potential issues. If you didn't create the index, or you removed it, follow the quickstart guide to complete the prerequisites.

Diagram showing the flow of a user's conversation through the RAG system.

Define the endpoints and required keys. Import the required libraries and functions. Instantiate clients for the Azure OpenAI and Azure AI Search services. Define a function wrapper and design the prompt for querying the RAG system.

# Enter your Azure OpenAI service values
aoai_endpoint = "https://<your-resource-name>.openai.azure.com" # TODO: Provide the Azure OpenAI resource endpoint (replace <your-resource-name>)
aoai_key = "" # TODO: Fill in your API key from Azure OpenAI 
aoai_deployment_name_embeddings = "text-embedding-ada-002"
aoai_model_name_query = "gpt-4-32k"  
aoai_model_name_metrics = "gpt-4-32k"
aoai_api_version = "2024-02-01"

# Set up key access to Azure AI Search
aisearch_index_name = "" # TODO: Create a new index name: must only contain lowercase letters, numbers, and dashes
aisearch_api_key = "" # TODO: Fill in your API key from Azure AI Search
aisearch_endpoint = "https://<your-service-name>.search.windows.net" # TODO: Provide the URL endpoint for your Azure AI Search service (replace <your-service-name>)
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

import os, requests, json

from datetime import datetime, timedelta
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

from pyspark.sql import functions as F
from pyspark.sql.functions import to_timestamp, current_timestamp, concat, col, split, explode, udf, monotonically_increasing_id, when, rand, coalesce, lit, input_file_name, regexp_extract, concat_ws, length, ceil
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType, ArrayType, FloatType
from pyspark.sql import Row
import pandas as pd
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.models import (
    VectorizedQuery,
)
from azure.search.documents.indexes.models import (  
    SearchIndex,  
    SearchField,  
    SearchFieldDataType,  
    SimpleField,  
    SearchableField,   
    SemanticConfiguration,  
    SemanticPrioritizedFields,
    SemanticField,  
    SemanticSearch,
    VectorSearch,  
    HnswAlgorithmConfiguration,
    HnswParameters,  
    VectorSearchProfile,
    VectorSearchAlgorithmKind,
    VectorSearchAlgorithmMetric,
)

import openai 
from openai import AzureOpenAI
import uuid
import matplotlib.pyplot as plt
from synapse.ml.featurize.text import PageSplitter
import ipywidgets as widgets  
from IPython.display import display as w_display


# Configure access to OpenAI endpoint
openai.api_type = "azure"
openai.api_key = aoai_key
openai.api_base = aoai_endpoint
openai.api_version = aoai_api_version

# Create client for accessing embedding endpoint
embed_client = AzureOpenAI(
    api_version=aoai_api_version,
    azure_endpoint=aoai_endpoint,
    api_key=aoai_key,
)

# Create client for accessing chat endpoint
chat_client = AzureOpenAI(
    azure_endpoint=aoai_endpoint,
    api_key=aoai_key,
    api_version=aoai_api_version,
)

# Configure access to Azure AI Search
search_client = SearchClient(
    aisearch_endpoint,
    aisearch_index_name,
    credential=AzureKeyCredential(aisearch_api_key)
)


The following functions implement the two main RAG components: the retriever (get_context_source) and the response generator (get_answer). The code is similar to the previous tutorial. The topN parameter sets how many relevant sources to retrieve (this tutorial uses 3, but the optimal value can vary by dataset):

# Implement retriever
def get_context_source(question, topN=3):
    """
    Retrieves contextual information and sources related to a given question using embeddings and a vector search.  
    Parameters:  
    question (str): The question for which the context and sources are to be retrieved.  
    topN (int, optional): The number of top results to retrieve. Default is 3.  
      
    Returns:  
    List: A list containing two elements:  
        1. A string with the concatenated retrieved context.  
        2. A list of retrieved source paths.  
    """
    embed_client = openai.AzureOpenAI(
        api_version=aoai_api_version,
        azure_endpoint=aoai_endpoint,
        api_key=aoai_key,
    )

    query_embedding = embed_client.embeddings.create(input=question, model=aoai_deployment_name_embeddings).data[0].embedding

    vector_query = VectorizedQuery(vector=query_embedding, k_nearest_neighbors=topN, fields="Embedding")

    results = search_client.search(   
        vector_queries=[vector_query],
        top=topN,
    )

    retrieved_context = ""
    retrieved_sources = []
    for result in results:
        retrieved_context += result['ExtractedPath'] + "\n" + result['Chunk'] + "\n\n"
        retrieved_sources.append(result['ExtractedPath'])

    return [retrieved_context, retrieved_sources]

# Implement response generator
def get_answer(question, context):
    """  
    Generates a response to a given question using provided context and an Azure OpenAI model.  
    
    Parameters:  
        question (str): The question that needs to be answered.  
        context (str): The contextual information related to the question that will help generate a relevant response.  
    
    Returns:  
        str: The response generated by the Azure OpenAI model based on the provided question and context.  
    """
    messages = [
        {
            "role": "system",
            "content": "You are a chat assistant. Use provided text to ground your response. Give a one-word answer when possible ('yes'/'no' is OK where appropriate, no details). Unnecessary words incur a $500 penalty."
        }
    ]

    messages.append(
        {
            "role": "user", 
            "content": question + "\n" + context,
        },
    )

    chat_client = openai.AzureOpenAI(
        azure_endpoint=aoai_endpoint,
        api_key=aoai_key,
        api_version=aoai_api_version,
    )

    chat_completion = chat_client.chat.completions.create(
        model=aoai_model_name_query,
        messages=messages,
    )

    return chat_completion.choices[0].message.content


Dataset

Version 1.2 of Carnegie Mellon University's Question-Answer dataset is a corpus of Wikipedia articles with manually written factual questions and answers. It's hosted in Azure Blob Storage under the GFDL. The dataset uses a single table with the following fields:

  • ArticleTitle: the name of the Wikipedia article the question and answer are drawn from
  • Question: a manually written question about the article
  • Answer: a manually written answer based on the article
  • DifficultyFromQuestioner: the difficulty rating assigned by the question's author
  • DifficultyFromAnswerer: the difficulty rating assigned by the evaluator, which can differ from DifficultyFromQuestioner
  • ExtractedPath: the path to the source article (one article can have multiple question-answer pairs)
  • text: the cleaned text of the Wikipedia article

Download the LICENSE-S08 and LICENSE-S09 files from the same location for license details.

History and citation

Use this citation for the dataset:

CMU Question/Answer Dataset, Release 1.2
August 23, 2013
Noah A. Smith, Michael Heilman, and Rebecca Hwa
Question Generation as a Competitive Undergraduate Course Project
In Proceedings of the NSF Workshop on the Question Generation Shared Task and Evaluation Challenge, Arlington, VA, September 2008. 
Available at http://www.cs.cmu.edu/~nasmith/papers/smith+heilman+hwa.nsf08.pdf.
Original dataset acknowledgments:
This research project was supported by NSF IIS-0713265 (to Smith), an NSF Graduate Research Fellowship (to Heilman), NSF IIS-0712810 and IIS-0745914 (to Hwa), and Institute of Education Sciences, U.S. Department of Education R305B040063 (to Carnegie Mellon).
cmu-qa-08-09 (modified version)
June 12, 2024
Amir Jafari, Alexandra Savelieva, Brice Chung, Hossein Khadivi Heris, Journey McDowell
This release uses the GNU Free Documentation License (GFDL) (http://www.gnu.org/licenses/fdl.html).
The GNU license applies to all copies of the dataset.

Create the benchmark

Import the benchmark. For this demonstration, use a subset of questions from the S08/set1 and S08/set2 buckets. To keep one question per article, apply df.dropDuplicates(["ExtractedPath"]), and drop duplicate questions. The curation process added difficulty labels; this example restricts them to medium.

df = spark.sql("SELECT * FROM data_load_tests.cmu_qa")

# Filter the DataFrame to include the specified paths
df = df.filter((col("ExtractedPath").like("S08/data/set1/%")) | (col("ExtractedPath").like("S08/data/set2/%")))

# Keep only medium-difficulty questions.
df = df.filter(col("DifficultyFromQuestioner") == "medium")


# Drop duplicate questions and source paths.
df = df.dropDuplicates(["Question"])
df = df.dropDuplicates(["ExtractedPath"])

num_rows = df.count()
num_columns = len(df.columns)
print(f"Number of rows: {num_rows}, Number of columns: {num_columns}")

# Persist the DataFrame
df.persist()
display(df)

Cell output: Number of rows: 20, Number of columns: 7

The result is a DataFrame with 20 rows: the demonstration benchmark. The key fields are Question, Answer (the human-curated ground-truth answer), and ExtractedPath (the source document). Adjust the filters to include other questions and vary the difficulty for a more realistic sample. Try it out.

Run a simple end-to-end test

Start with a quick end-to-end test of the retrieval-augmented generation (RAG) system.

question = "How many suborders are turtles divided into?"
retrieved_context, retrieved_sources = get_context_source(question)
answer = get_answer(question, retrieved_context)
print(answer)

Cell output: Three

This smoke test helps you catch problems in the RAG implementation, such as incorrect credentials, a missing or empty vector index, or incompatible function interfaces. If the test fails, check for these issues. The expected output is Three. If the smoke test passes, move on to the next section to evaluate the RAG system further.
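The checks above can be packaged into a small helper. This is a sketch, not part of the original notebook: run_smoke_test is a hypothetical function, and the retrieve and generate callables are injected so it can wrap the real get_context_source and get_answer or simple stubs.

```python
def run_smoke_test(question, expected_answer, retrieve, generate):
    """Minimal end-to-end check for a RAG pipeline.

    retrieve(question) -> (context, sources); generate(question, context) -> answer.
    Raises on the common setup failures this section describes.
    """
    try:
        context, sources = retrieve(question)
    except Exception as exc:
        raise RuntimeError("Retrieval failed - check search credentials and endpoint") from exc
    if not sources:
        raise RuntimeError("No sources retrieved - the vector index may be missing or empty")
    answer = generate(question, context)
    # Compare loosely: trim whitespace, trailing periods, and letter case
    return answer.strip().rstrip(".").lower() == expected_answer.strip().lower()

# Usage with stubs standing in for the real RAG components:
fake_retrieve = lambda q: ("some context", ["S08/data/set1/a9"])
fake_generate = lambda q, c: "Three."
print(run_smoke_test("How many suborders are turtles divided into?", "three",
                     fake_retrieve, fake_generate))  # True
```

In the notebook you would pass get_context_source and retrieve, and get_answer as generate.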

Create metrics

Define a deterministic metric to evaluate the retriever. Inspired by search-engine evaluation, it checks whether the list of retrieved sources contains the ground-truth source. This metric is a top-N accuracy score, because the topN parameter sets the number of retrieved sources.

def get_retrieval_score(target_source, retrieved_sources):
    """Return 1 if the ground-truth source appears among the retrieved sources, else 0."""
    return 1 if target_source in retrieved_sources else 0


According to the benchmark, the answer is contained in the source with ID "S08/data/set1/a9". Testing the function on the example run above returns 1, as expected, because that source appears among the top three relevant text chunks.

print("Retrieved sources:", retrieved_sources)
get_retrieval_score("S08/data/set1/a9", retrieved_sources)

Cell output: Retrieved sources: ['S08/data/set1/a9', 'S08/data/set1/a9', 'S08/data/set1/a5'], followed by the score 1

This section defines the AI-assisted metrics. The prompt templates include several example inputs (CONTEXT and ANSWER) with their expected outputs, a technique known as few-shot prompting. These are the same prompts used in Azure AI Studio; learn more in the built-in evaluation metrics documentation. This demonstration uses the groundedness and relevance metrics, which are typically the most useful and reliable for evaluating GPT models. Other metrics can be helpful but offer less insight; for example, an answer doesn't have to be similar to be correct, so the similarity score can be misleading. All metrics use a scale of 1 to 5, where higher is better. Groundedness takes only two inputs (the context and the generated answer), while the other two metrics also use the ground truth in their evaluation.

def get_groundedness_metric(context, answer):
    """Get the groundedness score from the LLM using the context and answer."""

    groundedness_prompt_template = """
    You are presented with a CONTEXT and an ANSWER about that CONTEXT. Decide whether the ANSWER is entailed by the CONTEXT by choosing one of the following ratings:
    1. 5: The ANSWER follows logically from the information contained in the CONTEXT.
    2. 1: The ANSWER is logically false from the information contained in the CONTEXT.
    3. an integer score between 1 and 5 and if such integer score does not exist, use 1: It is not possible to determine whether the ANSWER is true or false without further information. Read the passage of information thoroughly and select the correct answer from the three answer labels. Read the CONTEXT thoroughly to ensure you know what the CONTEXT entails. Note the ANSWER is generated by a computer system, it can contain certain symbols, which should not be a negative factor in the evaluation.
    Independent Examples:
    ## Example Task #1 Input:
    "CONTEXT": "Some are reported as not having been wanted at all.", "QUESTION": "", "ANSWER": "All are reported as being completely and fully wanted."
    ## Example Task #1 Output:
    1
    ## Example Task #2 Input:
    "CONTEXT": "Ten new television shows appeared during the month of September. Five of the shows were sitcoms, three were hourlong dramas, and two were news-magazine shows. By January, only seven of these new shows were still on the air. Five of the shows that remained were sitcoms.", "QUESTION": "", "ANSWER": "At least one of the shows that were cancelled was an hourlong drama."
    ## Example Task #2 Output:
    5
    ## Example Task #3 Input:
    "CONTEXT": "In Quebec, an allophone is a resident, usually an immigrant, whose mother tongue or home language is neither French nor English.", "QUESTION": "", "ANSWER": "In Quebec, an allophone is a resident, usually an immigrant, whose mother tongue or home language is not French."
    ## Example Task #3 Output:
    5
    ## Example Task #4 Input:
    "CONTEXT": "Some are reported as not having been wanted at all.", "QUESTION": "", "ANSWER": "All are reported as being completely and fully wanted."
    ## Example Task #4 Output:
    1
    ## Actual Task Input:
    "CONTEXT": {context}, "QUESTION": "", "ANSWER": {answer}
    Reminder: The return values for each task should be correctly formatted as an integer between 1 and 5. Do not repeat the context and question.  Don't explain the reasoning. The answer should include only a number: 1, 2, 3, 4, or 5.
    Actual Task Output:
    """

    metric_client = openai.AzureOpenAI(
        api_version=aoai_api_version,
        azure_endpoint=aoai_endpoint,
        api_key=aoai_key,
    )

    messages = [
        {
            "role": "system",
            "content": "You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric."
        }, 
        {
            "role": "user",
            "content": groundedness_prompt_template.format(context=context, answer=answer)
        }
    ]

    metric_completion = metric_client.chat.completions.create(
        model=aoai_model_name_metrics,
        messages=messages,
        temperature=0,
    )

    return metric_completion.choices[0].message.content


def get_relevance_metric(context, question, answer):    
    relevance_prompt_template = """
    Relevance measures how well the answer addresses the main aspects of the question, based on the context. Consider whether all and only the important aspects are contained in the answer when evaluating relevance. Given the context and question, score the relevance of the answer between one to five stars using the following rating scale:
    One star: the answer completely lacks relevance
    Two stars: the answer mostly lacks relevance
    Three stars: the answer is partially relevant
    Four stars: the answer is mostly relevant
    Five stars: the answer has perfect relevance

    This rating value should always be an integer between 1 and 5. So the rating produced should be 1 or 2 or 3 or 4 or 5.

    context: Marie Curie was a Polish-born physicist and chemist who pioneered research on radioactivity and was the first woman to win a Nobel Prize.
    question: What field did Marie Curie excel in?
    answer: Marie Curie was a renowned painter who focused mainly on impressionist styles and techniques.
    stars: 1

    context: The Beatles were an English rock band formed in Liverpool in 1960, and they are widely regarded as the most influential music band in history.
    question: Where were The Beatles formed?
    answer: The band The Beatles began their journey in London, England, and they changed the history of music.
    stars: 2

    context: The recent Mars rover, Perseverance, was launched in 2020 with the main goal of searching for signs of ancient life on Mars. The rover also carries an experiment called MOXIE, which aims to generate oxygen from the Martian atmosphere.
    question: What are the main goals of Perseverance Mars rover mission?
    answer: The Perseverance Mars rover mission focuses on searching for signs of ancient life on Mars.
    stars: 3

    context: The Mediterranean diet is a commonly recommended dietary plan that emphasizes fruits, vegetables, whole grains, legumes, lean proteins, and healthy fats. Studies have shown that it offers numerous health benefits, including a reduced risk of heart disease and improved cognitive health.
    question: What are the main components of the Mediterranean diet?
    answer: The Mediterranean diet primarily consists of fruits, vegetables, whole grains, and legumes.
    stars: 4

    context: The Queen's Royal Castle is a well-known tourist attraction in the United Kingdom. It spans over 500 acres and contains extensive gardens and parks. The castle was built in the 15th century and has been home to generations of royalty.
    question: What are the main attractions of the Queen's Royal Castle?
    answer: The main attractions of the Queen's Royal Castle are its expansive 500-acre grounds, extensive gardens, parks, and the historical castle itself, which dates back to the 15th century and has housed generations of royalty.
    stars: 5

    Don't explain the reasoning. The answer should include only a number: 1, 2, 3, 4, or 5.

    context: {context}
    question: {question}
    answer: {answer}
    stars:
    """

    metric_client = openai.AzureOpenAI(
        api_version=aoai_api_version,
        azure_endpoint=aoai_endpoint,
        api_key=aoai_key,
    )


    messages = [
        {
            "role": "system",
            "content": "You are an AI assistant. You are given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Compute an accurate evaluation score using the provided evaluation metric."
        }, 
        {
            "role": "user",
            "content": relevance_prompt_template.format(context=context, question=question, answer=answer)
        }
    ]

    metric_completion = metric_client.chat.completions.create(
        model=aoai_model_name_metrics,
        messages=messages,
        temperature=0,
    )

    return metric_completion.choices[0].message.content


def get_similarity_metric(question, ground_truth, answer):
    similarity_prompt_template = """
    Equivalence, as a metric, measures the similarity between the predicted answer and the correct answer. If the information and content in the predicted answer is similar or equivalent to the correct answer, then the value of the Equivalence metric should be high, else it should be low. Given the question, correct answer, and predicted answer, determine the value of Equivalence metric using the following rating scale:
    One star: the predicted answer is not at all similar to the correct answer
    Two stars: the predicted answer is mostly not similar to the correct answer
    Three stars: the predicted answer is somewhat similar to the correct answer
    Four stars: the predicted answer is mostly similar to the correct answer
    Five stars: the predicted answer is completely similar to the correct answer

    This rating value should always be an integer between 1 and 5. So the rating produced should be 1 or 2 or 3 or 4 or 5.

    The examples below show the Equivalence score for a question, a correct answer, and a predicted answer.

    question: What is the role of ribosomes?
    correct answer: Ribosomes are cellular structures responsible for protein synthesis. They interpret the genetic information carried by messenger RNA (mRNA) and use it to assemble amino acids into proteins.
    predicted answer: Ribosomes participate in carbohydrate breakdown by removing nutrients from complex sugar molecules.
    stars: 1

    question: Why did the Titanic sink?
    correct answer: The Titanic sank after it struck an iceberg during its maiden voyage in 1912. The impact caused the ship's hull to breach, allowing water to flood into the vessel. The ship's design, lifeboat shortage, and lack of timely rescue efforts contributed to the tragic loss of life.
    predicted answer: The sinking of the Titanic was a result of a large iceberg collision. This caused the ship to take on water and eventually sink, leading to the death of many passengers due to a shortage of lifeboats and insufficient rescue attempts.
    stars: 2

    question: What causes seasons on Earth?
    correct answer: Seasons on Earth are caused by the tilt of the Earth's axis and its revolution around the Sun. As the Earth orbits the Sun, the tilt causes different parts of the planet to receive varying amounts of sunlight, resulting in changes in temperature and weather patterns.
    predicted answer: Seasons occur because of the Earth's rotation and its elliptical orbit around the Sun. The tilt of the Earth's axis causes regions to be subjected to different sunlight intensities, which leads to temperature fluctuations and alternating weather conditions.
    stars: 3

    question: How does photosynthesis work?
    correct answer: Photosynthesis is a process by which green plants and some other organisms convert light energy into chemical energy. This occurs as light is absorbed by chlorophyll molecules, and then carbon dioxide and water are converted into glucose and oxygen through a series of reactions.
    predicted answer: In photosynthesis, sunlight is transformed into nutrients by plants and certain microorganisms. Light is captured by chlorophyll molecules, followed by the conversion of carbon dioxide and water into sugar and oxygen through multiple reactions.
    stars: 4

    question: What are the health benefits of regular exercise?
    correct answer: Regular exercise can help maintain a healthy weight, increase muscle and bone strength, and reduce the risk of chronic diseases. It also promotes mental well-being by reducing stress and improving overall mood.
    predicted answer: Routine physical activity can contribute to maintaining ideal body weight, enhancing muscle and bone strength, and preventing chronic illnesses. In addition, it supports mental health by alleviating stress and augmenting general mood.
    stars: 5

    Don't explain the reasoning. The answer should include only a number: 1, 2, 3, 4, or 5.

    question: {question}
    correct answer:{ground_truth}
    predicted answer: {answer}
    stars:
    """
    
    metric_client = openai.AzureOpenAI(
        api_version=aoai_api_version,
        azure_endpoint=aoai_endpoint,
        api_key=aoai_key,
    )

    messages = [
        {
            "role": "system",
            "content": "You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric."
        }, 
        {
            "role": "user",
            "content": similarity_prompt_template.format(question=question, ground_truth=ground_truth, answer=answer)
        }
    ]

    metric_completion = metric_client.chat.completions.create(
        model=aoai_model_name_metrics,
        messages=messages,
        temperature=0,
    )

    return metric_completion.choices[0].message.content

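The metric functions return the model's raw string output, and later code casts those strings to integers. A small defensive parser can guard against occasional stray text; this is a sketch, and parse_metric_score is a hypothetical helper, not part of the original notebook:

```python
import re

def parse_metric_score(raw, default=None):
    """Extract the first digit 1-5 from a metric completion.

    Despite the prompt's formatting instructions, the model occasionally
    returns extra text such as 'stars: 4'; fall back to `default` if no
    valid score is found.
    """
    match = re.search(r"[1-5]", str(raw))
    return int(match.group()) if match else default

print(parse_metric_score("5"))         # 5
print(parse_metric_score("stars: 4"))  # 4
```

Wrapping each metric call with this parser keeps a single malformed completion from breaking a batch run.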

Test the relevance metric:

get_relevance_metric(retrieved_context, question, answer)

Cell output: '2'

A score of 5 indicates a relevant answer. The following code gets the similarity metric:

get_similarity_metric(question, 'three', answer)

Cell output: '5'

A score of 5 indicates that the answer matches the ground-truth answer curated by a human expert. AI-assisted metric scores can fluctuate for identical inputs, but they're much faster than using human judges.
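One common way to smooth that fluctuation is to call a metric several times and take the median. This is a sketch, not part of the original notebook; note that it multiplies the number of model calls by n_runs, so weigh the extra cost against the added stability.

```python
from statistics import median

def get_stable_metric(metric_fn, *args, n_runs=3):
    """Run an AI-assisted metric n_runs times and return the median score.

    metric_fn is any of the metric functions above, which return a digit
    as a string; the median dampens run-to-run variation in LLM scoring.
    """
    scores = [int(metric_fn(*args)) for _ in range(n_runs)]
    return median(scores)

# Usage with a stub standing in for e.g. get_relevance_metric:
stub_metric = lambda context, question, answer: "4"
print(get_stable_metric(stub_metric, "ctx", "q", "ans"))  # 4
```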

Evaluate RAG performance on the Q&A benchmark

Create function wrappers to run the evaluation at scale. Wrap each function in a Spark user-defined function (UDF) with the @udf(returnType=...) decorator (the wrappers carry the _udf suffix), so the computation can run distributed across the cluster on large datasets.

# UDF wrappers for RAG components
@udf(returnType=StructType([  
    StructField("retrieved_context", StringType(), True),  
    StructField("retrieved_sources", ArrayType(StringType()), True)  
]))
def get_context_source_udf(question, topN=3):
    return get_context_source(question, topN)

@udf(returnType=StringType())
def get_answer_udf(question, context):
    return get_answer(question, context)


# UDF wrapper for retrieval score
@udf(returnType=StringType())
def get_retrieval_score_udf(target_source, retrieved_sources):
    return get_retrieval_score(target_source, retrieved_sources)


# UDF wrappers for AI-assisted metrics
@udf(returnType=StringType())
def get_groundedness_metric_udf(context, answer):
    return get_groundedness_metric(context, answer)

@udf(returnType=StringType())
def get_relevance_metric_udf(context, question, answer): 
    return get_relevance_metric(context, question, answer)

@udf(returnType=StringType())
def get_similarity_metric_udf(question, ground_truth, answer):
    return get_similarity_metric(question, ground_truth, answer)


Checkpoint 1: Retriever performance

The following code adds the result and retrieval_score columns to the benchmark DataFrame. They contain the context and sources retrieved by the RAG system, plus an indicator of whether the context passed to the LLM includes the article the question is based on.

df = df.withColumn("result", get_context_source_udf(df.Question)).select(df.columns+["result.*"])
df = df.withColumn('retrieval_score', get_retrieval_score_udf(df.ExtractedPath, df.retrieved_sources))
print("Aggregate Retrieval score: {:.2f}%".format((df.where(df["retrieval_score"] == 1).count() / df.count()) * 100))
display(df.select(["question", "retrieval_score",  "ExtractedPath", "retrieved_sources"]))

Cell output: Aggregate Retrieval score: 100.00%

For every question, the retriever fetched the correct context, and in most cases it was the top item: Azure AI Search performed well. You might wonder why, in some cases, the context contains two or three identical values. This isn't a bug; it means the retriever fetched multiple chunks from the same article, which was split because it didn't fit into a single chunk.
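The 0/1 top-N score doesn't reward ranking the correct source first. A rank-aware alternative is the reciprocal rank, whose average over the benchmark is the mean reciprocal rank (MRR). This is a sketch; get_reciprocal_rank is a hypothetical helper, not part of the original notebook:

```python
def get_reciprocal_rank(target_source, retrieved_sources):
    """Return 1/rank of the first position where the ground-truth source
    appears in the retrieved list, or 0.0 if it's absent.

    A hit at rank 1 scores 1.0, rank 2 scores 0.5, rank 3 scores ~0.33,
    so retrievers that rank the right source higher score better.
    """
    for rank, source in enumerate(retrieved_sources, start=1):
        if source == target_source:
            return 1.0 / rank
    return 0.0

# The example above scores 1.0 because the correct source is ranked first:
print(get_reciprocal_rank("S08/data/set1/a9",
                          ["S08/data/set1/a9", "S08/data/set1/a9", "S08/data/set1/a5"]))  # 1.0
```

Averaging this value across the benchmark (instead of the 0/1 score) would distinguish retrievers that find the right source from those that also rank it first.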

Checkpoint 2: Response generator performance

Pass the question and context to the LLM to generate an answer, and store it in the generated_answer column of the DataFrame.

df = df.withColumn('generated_answer', get_answer_udf(df.Question, df.retrieved_context))


Use the generated answers, ground-truth answers, questions, and context to compute the metrics. Display the evaluation results for each question-answer pair:

df = df.withColumn('gpt_groundedness', get_groundedness_metric_udf(df.retrieved_context, df.generated_answer))
df = df.withColumn('gpt_relevance', get_relevance_metric_udf(df.retrieved_context, df.Question, df.generated_answer))
df = df.withColumn('gpt_similarity', get_similarity_metric_udf(df.Question, df.Answer, df.generated_answer))
display(df.select(["question", "answer", "generated_answer", "retrieval_score", "gpt_groundedness","gpt_relevance", "gpt_similarity"]))


What do these values show? To make them easier to interpret, plot histograms of groundedness, relevance, and similarity. The LLM is more verbose than the human ground-truth answers, which lowers the similarity metric: about half the answers are semantically correct but earn four stars for being only mostly similar. Most values for all three metrics are 4 or 5, which indicates good RAG performance. There are some exceptions. For example, for the question How many species of otter are there?, the model generated There are 13 species of otter, which is correct and scores high on relevance and similarity (5), yet for some reason GPT judged it insufficiently grounded in the provided context and gave it one star. In three other cases, at least one AI-assisted metric was one star, and the low score correctly flagged a poor answer. The LLM occasionally misjudges a score, but it's usually accurate.

# Convert Spark DataFrame to Pandas DataFrame
pandas_df = df.toPandas()

selected_columns = ['gpt_groundedness', 'gpt_relevance', 'gpt_similarity']
trimmed_df = pandas_df[selected_columns].astype(int)

# Define a function to plot histograms for the specified columns
def plot_histograms(dataframe, columns):
    # Set up the figure size and subplots
    plt.figure(figsize=(15, 5))
    for i, column in enumerate(columns, 1):
        plt.subplot(1, len(columns), i)
        # Filter the dataframe to only include rows with values 1, 2, 3, 4, 5
        filtered_df = dataframe[dataframe[column].isin([1, 2, 3, 4, 5])]
        filtered_df[column].hist(bins=range(1, 7), align='left', rwidth=0.8)
        plt.title(f'Histogram of {column}')
        plt.xlabel('Values')
        plt.ylabel('Frequency')
        plt.xticks(range(1, 6))
        plt.yticks(range(0, 20, 2))


# Call the function to plot histograms for the specified columns
plot_histograms(trimmed_df, selected_columns)

# Show the plots
plt.tight_layout()
plt.show()


Screenshot of histograms showing the distribution of GPT relevance and similarity scores for the evaluation questions.

As a final step, save the benchmark results to a table in the lakehouse. This step is optional but strongly recommended: it makes your findings more useful. When you change something in the RAG system (for example, modify the prompt, update the index, or use a different GPT model in the response generator), you can measure the impact, quantify improvements, and detect regressions.

# create name of experiment that is easy to refer to
friendly_name_of_experiment = "rag_tutorial_experiment_1"

# Note the current date and time  
time_of_experiment = current_timestamp()

# Generate a unique GUID for all rows
experiment_id = str(uuid.uuid4())

# Add two new columns to the Spark DataFrame
updated_df = df.withColumn("execution_time", time_of_experiment) \
                        .withColumn("experiment_id", lit(experiment_id)) \
                        .withColumn("experiment_friendly_name", lit(friendly_name_of_experiment))

# Store the updated DataFrame in the default lakehouse as a table
table_name = "rag_experiment_run_demo1" 
updated_df.write.format("parquet").mode("append").saveAsTable(table_name)


Return to the experiment results at any time to review them, compare them with new experiments, and choose the best configuration for production.
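Comparing runs then reduces to a grouped aggregate over the saved table. A minimal sketch with hypothetical scores, shown in pandas for clarity; in the notebook you'd load the real rows with spark.read.table(table_name).toPandas() instead of constructing them by hand:

```python
import pandas as pd

# Hypothetical rows mirroring two saved experiment runs
runs = pd.DataFrame({
    "experiment_friendly_name": ["rag_tutorial_experiment_1"] * 2
                                + ["rag_tutorial_experiment_2"] * 2,
    "retrieval_score": [1, 1, 1, 0],
    "gpt_groundedness": [5, 4, 5, 5],
    "gpt_relevance": [4, 5, 5, 5],
    "gpt_similarity": [3, 4, 4, 5],
})

# Mean score per experiment makes improvements and regressions easy to spot
summary = runs.groupby("experiment_friendly_name").mean()
print(summary)
```

Here "rag_tutorial_experiment_2" is a hypothetical second run; with real data, a row per experiment in this summary shows at a glance whether a prompt, index, or model change helped or hurt.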

Summary

Use AI-assisted metrics and the top-N retrieval score to evaluate your retrieval-augmented generation (RAG) solution.