Model benchmarks

Important

Some of the features described in this article might only be available in preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

In Azure AI Studio, you can compare benchmarks across the models and datasets available in the industry to assess which one meets your business scenario. You can find Model benchmarks under Get started in the left menu of Azure AI Studio.

Screenshot of dashboard view graph of model accuracy.

Model benchmarks help you make informed decisions about the suitability of models and datasets before you initiate any job. The benchmarks are a curated list of the best-performing models for a given task, based on a comprehensive comparison of benchmarking metrics. Currently, Azure AI Studio provides benchmarking across the following types of models, based on our model catalog collections:

  • Benchmarks across LLMs and SLMs
  • Benchmarks across embeddings models

You can switch between the Quality benchmarks and Embeddings benchmarks views by clicking on the corresponding tabs within the model benchmarks experience in AI Studio.

Benchmarking of LLMs and SLMs

Model benchmarks assess the quality of LLMs and SLMs across the metrics listed in the following table:

Metric | Description
Accuracy | Accuracy scores are available at the dataset and the model levels. At the dataset level, the score is the average value of an accuracy metric computed over all examples in the dataset. The accuracy metric used is exact-match in all cases, except for the HumanEval dataset, which uses a pass@1 metric. Exact match compares model-generated text with the correct answer according to the dataset, reporting one if the generated text matches the answer exactly and zero otherwise. Pass@1 measures the proportion of model solutions that pass a set of unit tests in a code generation task. (See the code sketch after this table.) At the model level, the accuracy score is the average of the dataset-level accuracies for that model.
Coherence | Coherence evaluates how well the language model can produce output that flows smoothly, reads naturally, and resembles human-like language.
Fluency | Fluency evaluates the language proficiency of a generative AI's predicted answer. It assesses how well the generated text adheres to grammatical rules, syntactic structures, and appropriate usage of vocabulary, resulting in linguistically correct and natural-sounding responses.
GPTSimilarity | GPTSimilarity quantifies the similarity between a ground truth sentence (or document) and the prediction sentence generated by an AI model. It's calculated by first computing sentence-level embeddings, using the embeddings API, for both the ground truth and the model's prediction. These embeddings represent high-dimensional vector representations of the sentences, capturing their semantic meaning and context. The similarity between the two embeddings is then computed to produce the score.
Groundedness | Groundedness measures how well the language model's generated answers align with information from the input source.
Relevance | Relevance measures the extent to which the language model's generated responses are pertinent and directly related to the given questions.
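
To make the exact-match and pass@1 definitions above concrete, here's a minimal Python sketch. The dataset layout and function names are illustrative assumptions for this article, not the Azure AI evaluation pipeline's actual code.

```python
# Illustrative only: simple versions of the two accuracy metrics described above.

def exact_match(prediction: str, reference: str) -> int:
    """Return 1 if the generated text matches the reference answer exactly, else 0."""
    return int(prediction == reference)

def pass_at_1(problems_passed: list[bool]) -> float:
    """Proportion of code-generation problems whose single generated solution
    passes all of its unit tests (pass@1 with one sample per problem)."""
    return sum(problems_passed) / len(problems_passed)

# Dataset-level accuracy is the average exact-match score over all examples.
predictions = ["Paris", "4", "blue"]
references = ["Paris", "5", "blue"]
dataset_accuracy = sum(
    exact_match(p, r) for p, r in zip(predictions, references)
) / len(references)
print(dataset_accuracy)  # 0.666..., because two of the three answers match exactly
```

The model-level score shown in the benchmarks is then the average of these dataset-level accuracies for a given model.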

The benchmarks are updated regularly as new metrics and datasets are added to existing models, and as new models are added to the model catalog.

Benchmarking of embedding models

Model benchmarks assess embeddings models across the metrics listed in the following table:

Metric | Description
Accuracy | Accuracy is the proportion of correct predictions among the total number of predictions processed.
F1 Score | F1 Score is the harmonic mean of precision and recall, where the best value is 1 (perfect precision and recall) and the worst is 0.
Mean Average Precision (MAP) | MAP evaluates the quality of ranking and recommender systems. It measures both the relevance of suggested items and how good the system is at placing more relevant items at the top. Values range from 0 to 1; the higher the MAP, the better the system can place relevant items high in the list.
Normalized Discounted Cumulative Gain (NDCG) | NDCG evaluates a machine learning algorithm's ability to sort items based on relevance. It compares the produced ranking to an ideal order in which all relevant items are at the top of the list, considering only the top k items when evaluating ranking quality. In our benchmarks, k=10, indicated by the metric name ndcg_at_10, meaning that we look at the top 10 items. (See the code sketch after this table.)
Precision | Precision measures the model's ability to identify instances of a particular class correctly. It shows how often the model is correct when predicting the target class.
Spearman Correlation | Spearman Correlation based on cosine similarity is calculated by first computing the cosine similarity between variables, then ranking these scores, and finally using the ranks to compute the Spearman Correlation.
V-measure | V-measure evaluates the quality of clustering. It's calculated as the harmonic mean of homogeneity and completeness, ensuring a balance between the two for a meaningful score. Possible scores lie between 0 and 1, with 1 being a perfectly complete labeling.
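
As an illustration of how a ranking metric such as ndcg_at_10 is computed, the following generic Python sketch applies the standard NDCG formula (relevance discounted by log2 of rank, normalized by the ideal ordering). It's a sketch for understanding the metric, not the exact code behind the benchmark numbers.

```python
import math

def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain: each item's relevance is discounted by log2 of its rank."""
    return sum(rel / math.log2(rank + 2)  # rank 0 -> divide by log2(2) = 1
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """Normalize DCG by the DCG of the ideal order (items sorted by descending relevance)."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance of each retrieved item, in the order the system ranked them:
ranked_relevances = [3, 2, 3, 0, 1, 2]
print(round(ndcg_at_k(ranked_relevances, k=10), 3))  # ndcg_at_10 ≈ 0.961 for this ranking
```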

How the scores are calculated

The benchmark results originate from public datasets that are commonly used for language model evaluation. In most cases, the data is hosted in GitHub repositories maintained by the creators or curators of the data. Azure AI evaluation pipelines download data from their original sources, extract prompts from each example row, generate model responses, and then compute relevant accuracy metrics.

Prompt construction follows best practices for each dataset, as defined by the paper that introduced the dataset and by industry standards. In most cases, each prompt contains several examples of complete questions and answers, or "shots," to prime the model for the task. The evaluation pipelines create shots by sampling questions and answers from a portion of the data that is held out from evaluation.
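
As a rough illustration of the few-shot ("shots") prompt construction described above, the sketch below prepends question-and-answer pairs sampled from a held-out split to the question under evaluation. The template and sampling strategy here are simplified assumptions; the real pipelines follow each dataset's own prompt conventions.

```python
import random

def build_prompt(eval_question: str,
                 heldout_examples: list[tuple[str, str]],
                 n_shots: int = 5,
                 seed: int = 0) -> str:
    """Build a few-shot prompt: n_shots sampled Q&A pairs, then the question to answer."""
    rng = random.Random(seed)
    shots = rng.sample(heldout_examples, k=min(n_shots, len(heldout_examples)))
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in shots]
    parts.append(f"Question: {eval_question}\nAnswer:")
    return "\n\n".join(parts)

# Held-out examples (illustrative) are used only to create the shots, never for scoring.
heldout = [("What is 2 + 2?", "4"),
           ("What is the capital of France?", "Paris"),
           ("Which planet is the largest?", "Jupiter")]
print(build_prompt("What is the boiling point of water at sea level?", heldout, n_shots=2))
```

The model's completion for the final "Answer:" slot is what gets scored against the reference answer when the accuracy metrics are computed.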

View options in the model benchmarks

The benchmarks include both a dashboard view and a list view of the data for ease of comparison, along with helpful information that explains what the calculated metrics mean.

Dashboard view allows you to compare the scores of multiple models across datasets and tasks. You can view models side by side (horizontally along the x-axis) and compare their scores (vertically along the y-axis) for each metric.

You can filter the dashboard view by task, model collection, model name, dataset, and metric.

You can switch from dashboard view to list view by following these quick steps:

  1. Select the models you want to compare.

  2. Select List on the right side of the page.

    Screenshot of dashboard view graph with question answering filter applied and 'List' button identified.

In list view you can find the following information:

  • Model name, description, version, and aggregate scores.
  • Benchmark datasets (such as AGIEval) and tasks (such as question answering) that were used to evaluate the model.
  • Model scores per dataset.

You can also filter the list view by task, model collection, model name, dataset, and metric.

Screenshot of list view table displaying accuracy metrics in an ordered list.

Next steps