Azure AI Search Evaluation - Incorrect scoring, almost always a score of 1

Axel Nielsen 5 Reputation points

I believe that I have followed the documentation here for implementing evaluation on a dataset, using "questions and answers": https://learn.microsoft.com/en-us/azure/ai-studio/how-to/evaluate-generative-ai-app?pivots=programming-language-python

As mentioned in the documentation I have a jsonl document as a dataset that I want to evaluate. It has the following structure, for each line:

{"question": "", "context": "", "answer": ""}
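Before running the evaluation, it may be worth confirming that every line of the JSONL file actually contains the columns the data mapping refers to. The following is only a sketch under stated assumptions: the helper name `find_missing_fields` is hypothetical, and the required field list adds "ground_truth" because the data mapping in this post refers to that column. Neither is part of the SDK.

```python
import json

# Assumed column names; "ground_truth" is included because the data mapping references it.
REQUIRED_FIELDS = ("question", "context", "answer", "ground_truth")


def find_missing_fields(lines, required=REQUIRED_FIELDS):
    """Return {line_number: [missing or empty fields]} for an iterable of JSONL lines."""
    problems = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        # A field counts as missing when it is absent, None, or only whitespace.
        missing = [f for f in required if not str(record.get(f) or "").strip()]
        if missing:
            problems[number] = missing
    return problems


# In-memory example; with a real file, pass the open file handle instead.
sample_lines = [
    '{"question": "q", "context": "c", "answer": "a", "ground_truth": "g"}',
    '{"question": "q", "context": "c", "answer": "a"}',
]
print(find_missing_fields(sample_lines))
```

An empty result means every record has all four fields populated. Any line reported here would be evaluated without the value that one of the metrics expects.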

And then I run:

import os

from azure.ai.generative.evaluate import evaluate

# `dataset` (the JSONL data) and `client` are defined elsewhere.
result = evaluate(
    evaluation_name="my-qa-eval-with-data-gpt4",
    data=dataset,
    task_type="qa",
    model_config={
        "api_version": os.getenv('AZURE_OPENAI_API_VERSION'),
        "api_base": os.getenv('AZURE_OPENAI_ENDPOINT'),
        "api_type": "azure",
        "api_key": os.getenv('AZURE_OPENAI_API_KEY'),
        "deployment_id": "gpt-4",
    },
    metrics_list=["gpt_groundedness", "gpt_relevance", "gpt_coherence", "gpt_fluency", "gpt_similarity", "ada_similarity"],
    # Maps the names the evaluator expects to the column names in the dataset.
    data_mapping={
        "questions": "question",
        "contexts": "context",
        "y_pred": "answer",
        "y_test": "ground_truth",
    },
    output_path="myevalresults",
    tracking_uri=client.tracking_uri,
)

These mappings are the same as in the documentation. However, I get a warning that "y_pred" and "y_test" are deprecated and should be replaced by "answer" and "ground_truth" respectively. After replacing them I still get the same results and the same problem.
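The key rename described by the warning can be sketched as a small dictionary transformation. This is only an illustration: `migrate_data_mapping` is a hypothetical helper, not an SDK function, and it renames only the two keys named in the warning. Whether "questions" and "contexts" also have newer names is not stated in this thread, so they are left unchanged.

```python
# Deprecated key -> replacement, as named in the SDK warning.
DEPRECATED_KEYS = {"y_pred": "answer", "y_test": "ground_truth"}


def migrate_data_mapping(mapping):
    """Return a copy of the mapping with deprecated keys renamed; other keys are kept."""
    return {DEPRECATED_KEYS.get(key, key): column for key, column in mapping.items()}


old_mapping = {
    "questions": "question",
    "contexts": "context",
    "y_pred": "answer",
    "y_test": "ground_truth",
}
new_mapping = migrate_data_mapping(old_mapping)
print(new_mapping)
```

The resulting dictionary could then be passed as `data_mapping`. The values (the dataset column names) stay the same; only the keys change.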

So these are the problems I am experiencing:

Firstly, the score for gpt_similarity, which I am mostly interested in, seems to be quite inaccurate and is mostly either the highest score, 5, or the lowest score, 1. Some answers that I believe are quite close to the ground_truth are scored at 1. Attached is an image of an example where a similarity score of 1 seems way too low, since the answer is correct and very similar to the ground_truth. The ada_similarity is more accurate here, with a score of about 0.96.

User's image
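For context on why the two similarity metrics can disagree: the working assumption here (not confirmed anywhere in this thread) is that ada_similarity is a cosine similarity between embedding vectors of the answer and the ground truth, while gpt_similarity is a 1 to 5 rating produced by prompting the GPT deployment. The sketch below shows only the cosine computation, on made-up toy vectors. Real embeddings would come from an embedding model and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for the embeddings of an answer and its ground truth.
answer_vec = [0.2, 0.7, 0.1]
truth_vec = [0.25, 0.65, 0.05]
score = cosine_similarity(answer_vec, truth_vec)
print(round(score, 3))
```

Because this kind of score is continuous and depends only on the vectors, it tends to vary smoothly between examples, whereas a prompted rating can jump between the ends of its scale.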

Secondly, the scores for gpt_relevance, gpt_coherence and gpt_fluency are always 1. The context is given as a string; here is the context of the example given above:

"[{\"location\": {\"latitude\": 58.4048, \"longitude\": 15.6265}, \"lines\": [{\"lineNumber\": \"546\", \"lineNumberUnique\": \"546\"}, {\"lineNumber\": \"555\", \"lineNumberUnique\": \"555\"}, {\"lineNumber\": \"566\", \"lineNumberUnique\": \"566\"}, {\"lineNumber\": \"530\", \"lineNumberUnique\": \"530\"}, {\"lineNumber\": \"539\", \"lineNumberUnique\": \"539\"}, {\"lineNumber\": \"18\", \"lineNumberUnique\": \"218\"}, {\"lineNumber\": \"16\", \"lineNumberUnique\": \"216\"}, {\"lineNumber\": \"17\", \"lineNumberUnique\": \"217\"}, {\"lineNumber\": \"15\", \"lineNumberUnique\": \"215\"}, {\"lineNumber\": \"14\", \"lineNumberUnique\": \"214\"}, {\"lineNumber\": \"13\", \"lineNumberUnique\": \"213\"}, {\"lineNumber\": \"10\", \"lineNumberUnique\": \"210\"}, {\"lineNumber\": \"5\", \"lineNumberUnique\": \"205\"}, {\"lineNumber\": \"4\", \"lineNumberUnique\": \"204\"}, {\"lineNumber\": \"2\", \"lineNumberUnique\": \"202\"}, {\"lineNumber\": \"1\", \"lineNumberUnique\": \"201\"}, {\"lineNumber\": \"50\", \"lineNumberUnique\": \"50\"}, {\"lineNumber\": \"70\", \"lineNumberUnique\": \"70\"}, {\"lineNumber\": \"52\", \"lineNumberUnique\": \"52\"}, {\"lineNumber\": \"65\", \"lineNumberUnique\": \"65\"}, {\"lineNumber\": \"71\", \"lineNumberUnique\": \"71\"}, {\"lineNumber\": \"72\", \"lineNumberUnique\": \"72\"}, {\"lineNumber\": \"30\", \"lineNumberUnique\": \"30\"}], \"uid\": \"VGlubmVyYsOkY2tzYmFkZXQ=\", \"numberOfLines\": 23, \"name\": \"Tinnerb\\u00e4cksbadet\", \"@search.score\": 4.9410563, \"@search.reranker_score\": null, \"@search.highlights\": null, \"@search.captions\": null}, {\"LLM_INSTRUCTIONS\": \"\"}]"
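Since the context is a JSON-encoded string rather than prose, it can help to look at what the evaluator model actually receives. The sketch below parses such a string and lists the keys of each entry. The helper name `summarize_context` is hypothetical, and the sample is an abbreviated copy of the context above (most of the "lines" entries and the search metadata are omitted). Whether the shape of the context affects the scores is not established in this thread.

```python
import json


def summarize_context(context):
    """Parse a JSON-encoded context string and return the sorted keys of each entry."""
    entries = json.loads(context)
    return [sorted(entry.keys()) for entry in entries]


# Abbreviated version of the context from the post, re-encoded as a JSON string.
sample_context = json.dumps([
    {
        "location": {"latitude": 58.4048, "longitude": 15.6265},
        "lines": [{"lineNumber": "546", "lineNumberUnique": "546"}],
        "numberOfLines": 23,
        "name": "Tinnerb\u00e4cksbadet",
    },
    {"LLM_INSTRUCTIONS": ""},
])

for keys in summarize_context(sample_context):
    print(keys)
```

Listing the keys this way makes it easy to see how much of the context is search metadata and how much is information an answer could be grounded in.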

Thanks for any help,

romungi-MSFT 48,916 Reputation points Microsoft Employee Moderator

2024-03-08T08:43:00.3333333+00:00

@Axel Nielsen With respect to the warnings seen while using the SDK, this mapping is deprecated and the current version of the SDK expects the new names; see here from the SDK repo. To avoid the warnings, you can directly use "answer": value and "ground_truth": value in your data mapping. Could you retry with the updated mapping and then check the evaluation again?

With respect to the metrics, are the same results seen when you use AI Studio? I am not totally familiar with the metrics, but going by the reasoning provided in the documentation, a low score could mean that the evaluator was unable to verify the answer against the ground truth.

You could also raise a support case to seek clarification from the service team on how this evaluation arrived at a low score if it has access to all the required data, and then make the required changes and retest. Thanks!!

Axel Nielsen 5 Reputation points

2024-03-11T10:24:09.7366667+00:00

@romungi-MSFT Thanks a lot for the quick answer!

I have now removed the mappings for "answers" and "ground_truth" as you said, and unfortunately I still get the same problem with inaccurate metrics. However, I get this warning when running the evaluate function, but I suspect that it is not related to the problems I am experiencing:

C:\[project-path]\env\Lib\site-packages\promptflow\_sdk\operations\_local_storage_operations.py:472: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '(Failed)' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  outputs.fillna(value="(Failed)", inplace=True)  # replace nan with explicit prompt

The RAG-powered LLM I want to evaluate has been developed locally in Python, not using prompt flow, so I don't think there is a way to generate results directly in AI Studio, if that is what you mean?

That is true. Maybe my next step should be to create a support request?

romungi-MSFT 48,916 Reputation points Microsoft Employee Moderator

2024-03-13T07:56:20.1966667+00:00

@Axel Nielsen Yes, please raise a support case to understand how the metrics are scored for the data that you have provided. Thanks!!
