Custom evaluators

2025-05-19

Built-in evaluators are great for evaluating your application's generations out of the box. However, you might want to build your own code-based or prompt-based evaluator to meet your specific evaluation needs.

Code-based evaluators

Some evaluation metrics don't need a large language model. For those, code-based evaluators let you define metrics as functions or callable classes. For example, you can build a code-based evaluator as a simple Python class that calculates the length of an answer. Save it as answer_length.py in the directory answer_len/:

Code-based evaluator example: Answer length

class AnswerLengthEvaluator:
    def __init__(self):
        pass
    # A class becomes callable by implementing the special method __call__.
    def __call__(self, *, answer: str, **kwargs):
        return {"answer_length": len(answer)}
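Code-based metrics can also be plain functions instead of callable classes. The following minimal sketch writes the same metric as a function. The name `answer_length` is a hypothetical choice. The sketch assumes the function is called with the same keyword argument `answer` as the class.

```python
def answer_length(*, answer: str, **kwargs) -> dict:
    """Return the character count of the answer, like AnswerLengthEvaluator."""
    # Extra keyword arguments (for example, other dataset columns) are ignored.
    return {"answer_length": len(answer)}


print(answer_length(answer="What is the speed of light?"))
```

A class is useful when the evaluator needs setup state, such as a loaded model or configuration. A function is enough for stateless metrics like this one.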

Then import the callable class and run the evaluator on a single row of data:

from answer_len.answer_length import AnswerLengthEvaluator

answer_length_evaluator = AnswerLengthEvaluator()
answer_length = answer_length_evaluator(answer="What is the speed of light?")

Code-based evaluator output: Answer length

{"answer_length":27}
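To score more than one row, you can loop over the records yourself. The following sketch is a plain-Python illustration, not the SDK's batch evaluation API. The list `rows` and the helper `evaluate_rows` are hypothetical names. The block repeats the evaluator class so that it runs on its own.

```python
from statistics import mean


class AnswerLengthEvaluator:
    # Same metric as above: the character count of the answer.
    def __call__(self, *, answer: str, **kwargs):
        return {"answer_length": len(answer)}


def evaluate_rows(evaluator, rows):
    """Apply an evaluator to each row (a dict of keyword arguments)."""
    return [evaluator(**row) for row in rows]


rows = [
    {"answer": "What is the speed of light?"},
    {"answer": "About 299,792 km/s."},
]
results = evaluate_rows(AnswerLengthEvaluator(), rows)
average = mean(r["answer_length"] for r in results)
print(results, average)
```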

Prompt-based evaluators

To build your own prompt-based large language model evaluator or AI-assisted annotator, you can create a custom evaluator based on a Prompty file. A Prompty file has the .prompty extension and is used to develop a prompt template. It is a Markdown file with modified front matter. The front matter is written in YAML and contains metadata fields that define the model configuration and the expected inputs of the Prompty. As an example, let's create a custom evaluator, FriendlinessEvaluator, that measures the friendliness of a response.
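The following sketch makes the file layout concrete. It uses only the standard library to separate a Prompty document into its YAML front matter and its prompt body. It does not parse the YAML itself, which a real Prompty loader does. The helper `split_prompty` and the sample text are hypothetical.

```python
def split_prompty(text: str) -> tuple[str, str]:
    """Split a .prompty document into (front_matter, body).

    The front matter is the YAML between the first two '---' lines;
    everything after the second '---' line is the prompt template.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("Prompty file must start with a '---' line")
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            front_matter = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:]).strip()
            return front_matter, body
    raise ValueError("Closing '---' line for the front matter not found")


sample = """---
name: Friendliness Evaluator
inputs:
  response:
    type: string
---
system:
Rate the friendliness of {{response}}.
"""
front_matter, body = split_prompty(sample)
print(front_matter)
print(body)
```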

Prompt-based evaluator example: Friendliness evaluator

First, create a friendliness.prompty file that defines the friendliness metric and its grading rubric:

---
name: Friendliness Evaluator
description: Friendliness Evaluator to measure warmth and approachability of answers.
model:
  api: chat
  configuration:
    type: azure_openai
    azure_endpoint: ${env:AZURE_OPENAI_ENDPOINT}
    azure_deployment: gpt-4o-mini
  parameters:
    temperature: 0.1
inputs:
  response:
    type: string
outputs:
  score:
    type: int
  reason:
    type: string
---

system:
Friendliness assesses the warmth and approachability of the answer. Rate the friendliness of the response between one and five stars using the following scale:

One star: the answer is unfriendly or hostile

Two stars: the answer is mostly unfriendly

Three stars: the answer is neutral

Four stars: the answer is mostly friendly

Five stars: the answer is very friendly

Please assign a rating between 1 and 5 based on the tone and demeanor of the response.

**Example 1**
generated_query: I just don't feel like helping you! Your questions are getting very annoying.
output:
{"score": 1, "reason": "The response is not warm and refuses to provide helpful information."}
**Example 2**
generated_query: I'm sorry this watch is not working for you. Very happy to assist you with a replacement.
output:
{"score": 5, "reason": "The response is warm and empathetic, offering a resolution with care."}


**Here is the actual response to be scored:**
generated_query: {{response}}
output:
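At run time, the `{{response}}` placeholder in the prompt body is replaced by the input value before the prompt is sent to the model. Prompty delegates this step to a template engine. The following sketch imitates only the simple `{{name}}` case with a regular expression. It is an illustration, not the actual renderer, and `render_placeholders` is a hypothetical helper.

```python
import re


def render_placeholders(template: str, **inputs: str) -> str:
    """Replace each {{name}} in the template with inputs[name].

    Unknown placeholders raise KeyError, so missing inputs are caught early.
    """
    def substitute(match: re.Match) -> str:
        return str(inputs[match.group(1)])

    # Allow optional spaces inside the braces, e.g. {{ response }}.
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


tail = "generated_query: {{response}}\noutput:"
print(render_placeholders(tail, response="I will not apologize for my behavior!"))
```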

Then create a FriendlinessEvaluator class that loads the Prompty file and parses the model output as JSON. Save it as friend.py in a friendliness/ directory, next to friendliness.prompty:

import os
import json
from promptflow.client import load_flow


class FriendlinessEvaluator:
    def __init__(self, model_config):
        # Load friendliness.prompty from the same directory as this module.
        current_dir = os.path.dirname(__file__)
        prompty_path = os.path.join(current_dir, "friendliness.prompty")
        # Pass model_config as the model configuration for the loaded flow.
        self._flow = load_flow(source=prompty_path, model={"configuration": model_config})

    def __call__(self, *, response: str, **kwargs):
        llm_response = self._flow(response=response)
        try:
            # Parse the model's JSON answer, e.g. {"score": 1, "reason": "..."}.
            return json.loads(llm_response)
        except (json.JSONDecodeError, TypeError):
            # Fall back to the raw output if it is not a JSON string.
            return llm_response
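Calling the real flow requires an Azure OpenAI deployment. To check the JSON-handling logic offline, you can swap in a stub for the loaded flow. The following sketch assumes the same parse-with-fallback approach as the class above. `ParsingEvaluator` and `fake_flow` are hypothetical names used only for this check and are not part of promptflow.

```python
import json
from typing import Any, Callable


class ParsingEvaluator:
    """Same parsing approach as FriendlinessEvaluator, but with an injectable flow."""

    def __init__(self, flow: Callable[..., Any]):
        self._flow = flow

    def __call__(self, *, response: str, **kwargs):
        llm_response = self._flow(response=response)
        try:
            # Well-formed model output becomes a dict such as {"score": 1, ...}.
            return json.loads(llm_response)
        except (json.JSONDecodeError, TypeError):
            # Non-JSON text (or an already-parsed object) is returned unchanged.
            return llm_response


def fake_flow(*, response: str) -> str:
    # Stub standing in for the Prompty flow; always returns a fixed JSON string.
    return '{"score": 1, "reason": "Hostile tone."}'


evaluator = ParsingEvaluator(fake_flow)
print(evaluator(response="I will not apologize for my behavior!"))
```

Injecting the flow through the constructor keeps the parsing logic testable without network calls.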

Now you can create your own Prompty-based evaluator and run it on a row of data. Here, model_config is a dictionary with your Azure OpenAI connection settings:

from friendliness.friend import FriendlinessEvaluator

friendliness_eval = FriendlinessEvaluator(model_config)

friendliness_score = friendliness_eval(response="I will not apologize for my behavior!")

Prompt-based evaluator output: Friendliness evaluator

{
    'score': 1, 
    'reason': 'The response is hostile and unapologetic, lacking warmth or approachability.'
}
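The model may return text that isn't valid JSON, or a score outside the rubric. It can therefore help to validate each result before aggregating it. The following minimal sketch assumes that a valid result is a dict with an integer `score` from 1 to 5 and a string `reason`. `validate_friendliness` is a hypothetical helper.

```python
from typing import Any


def validate_friendliness(result: Any) -> dict:
    """Return the result if it matches the rubric's shape, else raise ValueError."""
    if not isinstance(result, dict):
        raise ValueError(f"Expected a JSON object, got: {result!r}")
    score = result.get("score")
    # bool is a subclass of int, so exclude it explicitly.
    if not isinstance(score, int) or isinstance(score, bool) or not 1 <= score <= 5:
        raise ValueError(f"Score must be an integer from 1 to 5, got: {score!r}")
    if not isinstance(result.get("reason"), str):
        raise ValueError("Missing string field 'reason'")
    return result


checked = validate_friendliness(
    {"score": 1, "reason": "The response is hostile and unapologetic."}
)
print(checked["score"])
```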

Learn how to run batch evaluation on a dataset and how to run batch evaluation on a target.
