
Evaluation of LLM Models — A general Survey

KEEP IN TOUCH | THE GEN AI SERIES

Rahul S
9 min read · Jan 19, 2024

TASK-SPECIFIC EVALUATION

SUMMARIZATION

A popular family of metrics used to assess text summarization is ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE compares the model-generated summary against a “golden”, human-written reference summary.

The simplest ROUGE metrics are ROUGE-1 Recall and ROUGE-1 Precision. To calculate them, we count the number of unigrams (words) that match between the two summaries.

  • ROUGE-1 Recall is the number of unigram matches divided by the total number of unigrams in the reference summary.
  • Similarly, ROUGE-1 Precision is the number of unigram matches divided by the total number of unigrams in the model’s summary.
  • Just like regular recall and precision, the two can be combined via the harmonic mean to give an F1 score; a short worked example follows this list.
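
Here is a minimal sketch of these definitions in plain Python. It tokenizes on whitespace only and skips stemming and stopword handling, which standard ROUGE implementations typically add, so treat it as illustrative rather than reference code.

```python
from collections import Counter

def rouge1(reference: str, candidate: str) -> dict:
    """ROUGE-1 recall, precision, and F1 from clipped unigram overlap."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # A word only counts as a match as many times as it appears in both summaries.
    overlap = sum((ref_counts & cand_counts).values())
    recall = overlap / sum(ref_counts.values()) if ref_counts else 0.0
    precision = overlap / sum(cand_counts.values()) if cand_counts else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}

reference = "the cat sat on the mat"
generated = "the cat lay on the mat"
print(rouge1(reference, generated))
# 5 of 6 unigrams match in each direction -> recall = precision = F1 ≈ 0.83
```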

But there are issues with the ROUGE-1 metrics. They only measure the overlap of unigrams between the generated summary and the reference summary. A perfect ROUGE-1 score only means that the two summaries contain the same individual words; because unigram overlap ignores word order entirely, it says nothing about whether the generated summary is coherent or meaningful.
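
To see this failure mode concretely, the snippet below scores a word-for-word shuffle of a reference sentence. It is a sketch using Google’s open-source rouge-score package, and the scrambled sentence is an invented example: ROUGE-1 comes out perfect even though the “summary” is word salad, while ROUGE-2, which counts bigram matches instead of unigrams, drops to zero and exposes the scramble.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)

reference = "the quick brown fox jumps over the lazy dog"
scrambled = "dog lazy the over jumps fox brown quick the"  # same words, zero coherence

scores = scorer.score(reference, scrambled)
print(scores["rouge1"])  # perfect unigram overlap despite the word salad
print(scores["rouge2"])  # no bigram survives the shuffle, so ROUGE-2 falls to zero
```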
