Azure AI Search Evaluation - Incorrect scoring, almost always a score of 1
I believe I have followed the documentation here for running an evaluation on a dataset of questions and answers: https://learn.microsoft.com/en-us/azure/ai-studio/how-to/evaluate-generative-ai-app?pivots=programming-language-python
As described in the documentation, I have a jsonl file as the dataset I want to evaluate. Each line has the following structure:
{"question": "", "context": "", "answer": ""}
And then I run:
import os
from azure.ai.generative.evaluate import evaluate

# "client" is the AIClient created earlier, as in the linked documentation
result = evaluate(
    evaluation_name="my-qa-eval-with-data-gpt4",
    data=dataset,
    task_type="qa",
    model_config={
        "api_version": os.getenv('AZURE_OPENAI_API_VERSION'),
        "api_base": os.getenv('AZURE_OPENAI_ENDPOINT'),
        "api_type": "azure",
        "api_key": os.getenv('AZURE_OPENAI_API_KEY'),
        "deployment_id": "gpt-4"
    },
    metrics_list=["gpt_groundedness", "gpt_relevance", "gpt_coherence", "gpt_fluency", "gpt_similarity", "ada_similarity"],
    data_mapping={
        "questions": "question",
        "contexts": "context",
        "y_pred": "answer",
        "y_test": "ground_truth"
    },
    output_path="myevalresults",
    tracking_uri=client.tracking_uri
)
These mappings are the same as in the documentation, but I get a warning that "y_pred" and "y_test" are deprecated and should be replaced by "answer" and "ground_truth" respectively. After making that replacement I get the same results and the same problem.
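For reference, this is the mapping I use after swapping in the non-deprecated keys (key names taken from the warning); the scores come out the same either way:

# Mapping with the non-deprecated keys; results are unchanged.
data_mapping = {
    "questions": "question",         # column holding the input question
    "contexts": "context",           # column holding the retrieved context
    "answer": "answer",              # replaces the deprecated "y_pred"
    "ground_truth": "ground_truth",  # replaces the deprecated "y_test"
}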
These are the problems I am experiencing:
Firstly, the score for gpt_similarity, which I am mostly interested in, seems quite inaccurate and almost always comes out as either the highest score (5) or the lowest score (1). Some answers that I believe are very close to the ground_truth are scored 1. Here is an attached image of an example where a similarity score of 1 seems far too low, since the answer is correct and very similar to the ground_truth. The ada_similarity is more accurate here, with a score of about 0.96.
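To confirm this is not a one-off, I tallied the per-row gpt_similarity scores from the output written to output_path (the file name below is just a placeholder for whatever evaluate wrote there):

import json
from collections import Counter

# Tally the per-row gpt_similarity scores from the evaluation output.
# "myevalresults/eval_results.jsonl" is a placeholder for the actual results file.
score_counts = Counter()
with open("myevalresults/eval_results.jsonl", encoding="utf-8") as f:
    for line in f:
        row = json.loads(line)
        score_counts[row.get("gpt_similarity")] += 1

print(score_counts)  # in my runs this is almost entirely 5s and 1s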
Secondly, the scores for gpt_relevance, gpt_coherence, and gpt_fluency are always 1. The context is given as a string; here is the context for the example above:
"[{\"location\": {\"latitude\": 58.4048, \"longitude\": 15.6265}, \"lines\": [{\"lineNumber\": \"546\", \"lineNumberUnique\": \"546\"}, {\"lineNumber\": \"555\", \"lineNumberUnique\": \"555\"}, {\"lineNumber\": \"566\", \"lineNumberUnique\": \"566\"}, {\"lineNumber\": \"530\", \"lineNumberUnique\": \"530\"}, {\"lineNumber\": \"539\", \"lineNumberUnique\": \"539\"}, {\"lineNumber\": \"18\", \"lineNumberUnique\": \"218\"}, {\"lineNumber\": \"16\", \"lineNumberUnique\": \"216\"}, {\"lineNumber\": \"17\", \"lineNumberUnique\": \"217\"}, {\"lineNumber\": \"15\", \"lineNumberUnique\": \"215\"}, {\"lineNumber\": \"14\", \"lineNumberUnique\": \"214\"}, {\"lineNumber\": \"13\", \"lineNumberUnique\": \"213\"}, {\"lineNumber\": \"10\", \"lineNumberUnique\": \"210\"}, {\"lineNumber\": \"5\", \"lineNumberUnique\": \"205\"}, {\"lineNumber\": \"4\", \"lineNumberUnique\": \"204\"}, {\"lineNumber\": \"2\", \"lineNumberUnique\": \"202\"}, {\"lineNumber\": \"1\", \"lineNumberUnique\": \"201\"}, {\"lineNumber\": \"50\", \"lineNumberUnique\": \"50\"}, {\"lineNumber\": \"70\", \"lineNumberUnique\": \"70\"}, {\"lineNumber\": \"52\", \"lineNumberUnique\": \"52\"}, {\"lineNumber\": \"65\", \"lineNumberUnique\": \"65\"}, {\"lineNumber\": \"71\", \"lineNumberUnique\": \"71\"}, {\"lineNumber\": \"72\", \"lineNumberUnique\": \"72\"}, {\"lineNumber\": \"30\", \"lineNumberUnique\": \"30\"}], \"uid\": \"VGlubmVyYsOkY2tzYmFkZXQ=\", \"numberOfLines\": 23, \"name\": \"Tinnerb\\u00e4cksbadet\", \"@search.score\": 4.9410563, \"@search.reranker_score\": null, \"@search.highlights\": null, \"@search.captions\": null}, {\"LLM_INSTRUCTIONS\": \"\"}]"
Thanks for any help,