Azure AI Search Evaluation - Incorrect scoring, almost always a score of 1

Axel Nielsen 5 Reputation points
2024-03-07T14:37:07.55+00:00

I believe I have followed the documentation here for implementing evaluation on a dataset, using "questions and answers": https://learn.microsoft.com/en-us/azure/ai-studio/how-to/evaluate-generative-ai-app?pivots=programming-language-python

As mentioned in the documentation, I have a JSONL file as the dataset that I want to evaluate. Each line has the following structure:

{"question": "", "context": "", "answer": ""}

And then I run:

import os

from azure.ai.generative.evaluate import evaluate

# `dataset` (the JSONL data described above) and `client` are defined elsewhere (not shown).
result = evaluate(
    evaluation_name="my-qa-eval-with-data-gpt4",
    data=dataset,
    task_type="qa",
    model_config={
        "api_version": os.getenv("AZURE_OPENAI_API_VERSION"),
        "api_base": os.getenv("AZURE_OPENAI_ENDPOINT"),
        "api_type": "azure",
        "api_key": os.getenv("AZURE_OPENAI_API_KEY"),
        "deployment_id": "gpt-4",
    },
    metrics_list=[
        "gpt_groundedness",
        "gpt_relevance",
        "gpt_coherence",
        "gpt_fluency",
        "gpt_similarity",
        "ada_similarity",
    ],
    # Maps the evaluator's expected inputs to the column names in the dataset.
    data_mapping={
        "questions": "question",
        "contexts": "context",
        "y_pred": "answer",
        "y_test": "ground_truth",
    },
    output_path="myevalresults",
    tracking_uri=client.tracking_uri,
)

These mappings are the same as in the documentation. However, I get a warning that "y_pred" and "y_test" are deprecated and should be replaced by "answer" and "ground_truth" respectively. After replacing them, I still get the same results and the same problem.
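To make the key replacement concrete, here is a small sketch of renaming the deprecated mapping keys. The helper is hypothetical and not part of the SDK. It assumes each deprecated key is replaced by the name of the dataset column it maps to in the call above ("y_pred" becomes "answer", "y_test" becomes "ground_truth"), and it leaves "questions" and "contexts" untouched because the warning does not mention them.

```python
# Assumed old-to-new key names: each deprecated key is paired with the
# dataset column it maps to in the evaluate() call.
DEPRECATED_KEYS = {"y_pred": "answer", "y_test": "ground_truth"}


def migrate_data_mapping(mapping):
    """Return a copy of the mapping with deprecated keys renamed; other keys are kept."""
    return {DEPRECATED_KEYS.get(key, key): column for key, column in mapping.items()}


old_mapping = {
    "questions": "question",
    "contexts": "context",
    "y_pred": "answer",
    "y_test": "ground_truth",
}
new_mapping = migrate_data_mapping(old_mapping)
print(new_mapping)
```

The resulting dictionary would then be passed as data_mapping in place of the original one.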

These are the problems I am experiencing:

Firstly, the score for gpt_similarity, which is the metric I am most interested in, seems quite inaccurate: it is almost always either the highest score, 5, or the lowest score, 1. Some answers that I believe are quite close to the ground_truth are scored at 1. The attached image shows an example where a similarity score of 1 seems far too low, since the answer is correct and very similar to the ground_truth. The ada_similarity is more accurate here, with a score of about 0.96.

User's image
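For context on why the two numbers are hard to compare directly: as far as I understand the documentation, ada_similarity is a cosine similarity between embedding vectors (a value up to 1), whereas gpt_similarity is a model-assigned rating from 1 to 5. Below is a minimal sketch of cosine similarity under that assumption. The vectors are made up for illustration; real ones would come from an embeddings model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for the embeddings of an answer and its ground truth.
answer_vec = [0.20, 0.70, 0.10]
truth_vec = [0.25, 0.65, 0.05]
print(round(cosine_similarity(answer_vec, truth_vec), 3))
```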

Secondly, the scores for gpt_relevance, gpt_coherence and gpt_fluency are always 1. The context is given as a string; here is the context for the example above:

"[{\"location\": {\"latitude\": 58.4048, \"longitude\": 15.6265}, \"lines\": [{\"lineNumber\": \"546\", \"lineNumberUnique\": \"546\"}, {\"lineNumber\": \"555\", \"lineNumberUnique\": \"555\"}, {\"lineNumber\": \"566\", \"lineNumberUnique\": \"566\"}, {\"lineNumber\": \"530\", \"lineNumberUnique\": \"530\"}, {\"lineNumber\": \"539\", \"lineNumberUnique\": \"539\"}, {\"lineNumber\": \"18\", \"lineNumberUnique\": \"218\"}, {\"lineNumber\": \"16\", \"lineNumberUnique\": \"216\"}, {\"lineNumber\": \"17\", \"lineNumberUnique\": \"217\"}, {\"lineNumber\": \"15\", \"lineNumberUnique\": \"215\"}, {\"lineNumber\": \"14\", \"lineNumberUnique\": \"214\"}, {\"lineNumber\": \"13\", \"lineNumberUnique\": \"213\"}, {\"lineNumber\": \"10\", \"lineNumberUnique\": \"210\"}, {\"lineNumber\": \"5\", \"lineNumberUnique\": \"205\"}, {\"lineNumber\": \"4\", \"lineNumberUnique\": \"204\"}, {\"lineNumber\": \"2\", \"lineNumberUnique\": \"202\"}, {\"lineNumber\": \"1\", \"lineNumberUnique\": \"201\"}, {\"lineNumber\": \"50\", \"lineNumberUnique\": \"50\"}, {\"lineNumber\": \"70\", \"lineNumberUnique\": \"70\"}, {\"lineNumber\": \"52\", \"lineNumberUnique\": \"52\"}, {\"lineNumber\": \"65\", \"lineNumberUnique\": \"65\"}, {\"lineNumber\": \"71\", \"lineNumberUnique\": \"71\"}, {\"lineNumber\": \"72\", \"lineNumberUnique\": \"72\"}, {\"lineNumber\": \"30\", \"lineNumberUnique\": \"30\"}], \"uid\": \"VGlubmVyYsOkY2tzYmFkZXQ=\", \"numberOfLines\": 23, \"name\": \"Tinnerb\\u00e4cksbadet\", \"@search.score\": 4.9410563, \"@search.reranker_score\": null, \"@search.highlights\": null, \"@search.captions\": null}, {\"LLM_INSTRUCTIONS\": \"\"}]"

Thanks for any help,

Azure AI services