Evaluating the performance of LLM summarization prompts

Abstractive summarization evaluation remains an area in which traditional automatic metrics correlate poorly with human judgements. G-Eval is a technique that uses GPT-4 to evaluate the quality of summaries without a ground truth reference. Based on meta-evaluation against the SummEval benchmark, it achieves state-of-the-art agreement with human judgements.

G-Eval builds on the four key dimensions of abstractive summary quality proposed by Kryscinski et al. (2019):

  • Coherence - the collective quality of all sentences in the summary. The summary should be well-structured and well-organized, building a coherent body of information about the topic.
  • Consistency - factual alignment between the summary and source document. The summary should contain only statements that are entailed by the source document.
  • Fluency - the quality of the individual sentences of the summary. The summary should have no formatting problems or grammatical errors that make it difficult to read.
  • Relevance - selection of the most important content from the source document. The summary should include only important information from the source document.

The G-Eval technique uses four separate, detailed prompts, one for each of these dimensions, each asking for a score on a 1-5 Likert scale (1-3 for Fluency). These prompts, together with the source document and the summary to be evaluated, are fed to GPT-4; the score outputs are collected and the final scores are calculated.
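As an illustration, here is a minimal sketch of what one such dimension prompt and its evaluation call might look like, assuming the OpenAI Python SDK. The prompt wording paraphrases the coherence definition above; it is not the exact prompt shipped in the prompt flow example.

```python
# Minimal sketch of a form-filling coherence prompt, assuming the OpenAI
# Python SDK (openai>=1.0). The prompt text paraphrases the coherence
# definition above; it is not the exact prompt from the prompt flow example.
from openai import OpenAI

client = OpenAI()

COHERENCE_PROMPT = """You will be given one summary written for a source document.
Your task is to rate the summary on one metric.

Evaluation Criteria:
Coherence (1-5) - the collective quality of all sentences. The summary should be
well-structured and well-organized, building a coherent body of information.

Evaluation Steps:
1. Read the source document and identify its main topic and key points.
2. Read the summary and check how well it presents them in a clear, logical order.
3. Assign a coherence score from 1 to 5.

Source Document:
{document}

Summary:
{summary}

Evaluation Form (scores ONLY):
- Coherence:"""


def evaluate_coherence(document: str, summary: str) -> str:
    """Ask GPT-4 to fill in the coherence score for one (document, summary) pair."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": COHERENCE_PROMPT.format(
            document=document, summary=summary)}],
    )
    return response.choices[0].message.content
```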

Key Features

  • State-of-the-art abstractive summarization evaluation method
  • Reference-free evaluation
  • Adopts Chain-of-Thought (CoT): a set of intermediate instructions, generated by the LLM, that describe the detailed evaluation steps and give the model more context and guidance when evaluating the generated summary
  • Evaluates in a form-filling paradigm
  • Uses a probability-weighted summation of the output scores as the final score, yielding finer-grained, continuous scores that better reflect the quality and diversity of the generated texts (see the sketch below)
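
On the last point, the final score is simply the expected value of the score distribution. A minimal sketch, assuming the per-score probabilities have already been estimated:

```python
# Probability-weighted final score: the expectation over the Likert scale,
# given estimated probabilities for each integer score.
def weighted_score(score_probs: dict[int, float]) -> float:
    """score_probs maps each Likert score (e.g. 1-5) to its estimated probability."""
    return sum(score * prob for score, prob in score_probs.items())


# A summary judged mostly a 4 but sometimes a 5 gets 4.3 rather than a flat 4.
print(weighted_score({3: 0.1, 4: 0.5, 5: 0.4}))  # 4.3
```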

Implementation

G-Eval now has an example implementation in the official prompt flow repository. In this implementation, the original G-Eval prompts have been made more generic and agnostic to the domain of the source data being evaluated. The score parser has also been improved, with the gains verified by meta-evaluation on the SummEval benchmark. Because GPT-4 does not expose output token probabilities, G-Eval sets n=20, temperature=2 and top_p=1, sampling 20 times per evaluation to estimate them.
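
The following is a hedged sketch of this sampling workaround, again assuming the OpenAI Python SDK; parse_score is a hypothetical helper, not the parser shipped in the prompt flow example.

```python
# Sketch of estimating the score distribution by sampling, since GPT-4 does not
# expose token probabilities: draw 20 samples at high temperature, parse an
# integer score from each reply, and average them. The mean of the samples
# approximates the probability-weighted summation described above.
import re

from openai import OpenAI

client = OpenAI()


def parse_score(text: str, low: int = 1, high: int = 5) -> int | None:
    """Hypothetical parser: return the first in-range integer found in the reply."""
    for token in re.findall(r"\d+", text):
        value = int(token)
        if low <= value <= high:
            return value
    return None


def estimate_score(prompt: str) -> float:
    """Sample GPT-4 twenty times and return the mean of the parsed scores."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        n=20,           # 20 samples per (document, summary) pair
        temperature=2,  # high temperature spreads probability mass across scores
        top_p=1,
    )
    scores = [parse_score(choice.message.content or "") for choice in response.choices]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores)
```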

Results

G-Eval, and in particular this generalized implementation, shows state-of-the-art results when compared to other abstractive summarization evaluation methods.

Spearman correlations (ρ) between different methods and human judgements in the SummEval benchmark

| Method | Fluency (ρ) | Consistency (ρ) | Coherence (ρ) | Relevance (ρ) | Average |
|---|---|---|---|---|---|
| G-Eval - GPT-4 0613 8k + original prompts in paper | 0.455 | 0.507 | 0.582 | 0.547 | 0.514 |
| G-Eval - GPT-4 0613 8k + updated prompts + updated parser (Ours) | 0.5402 | 0.5215 | 0.5137 | 0.4897 | 0.516 |
| G-Eval - GPT-4 0613 32k + updated prompts + updated parser (Ours) | 0.4985 | 0.4914 | 0.5038 | 0.4921 | 0.496 |
| ROUGE-1 | 0.115 | 0.160 | 0.167 | 0.326 | 0.192 |
| ROUGE-2 | 0.159 | 0.187 | 0.184 | 0.290 | 0.205 |
| ROUGE-L | 0.105 | 0.115 | 0.128 | 0.311 | 0.165 |
| BERTScore | 0.193 | 0.110 | 0.284 | 0.312 | 0.225 |
| MoverScore | 0.129 | 0.157 | 0.159 | 0.318 | 0.191 |
| BARTScore | 0.356 | 0.382 | 0.448 | 0.356 | 0.385 |
| UniEval | 0.449 | 0.446 | 0.575 | 0.426 | 0.474 |
| GPTScore | 0.403 | 0.449 | 0.434 | 0.381 | 0.417 |