Evaluation of LLM Models — A general Survey
TASK SPECIFIC EVALUATION
SUMMARIZATION
A popular family of metrics used to assess text summarization is ROUGE, or Recall-Oriented Understudy for Gisting Evaluation. ROUGE compares the model-generated summary to a “golden”, human-written reference summary.
The simplest ROUGE metrics are ROUGE-1 Recall and ROUGE-1 Precision. To calculate them, we count the number of unigrams (words) that match between the two summaries.
- ROUGE-1 Recall is the number of unigram matches divided by the total number of unigrams in the reference summary.
- Similarly, ROUGE-1 Precision is the number of unigram matches divided by the total number of unigrams in the model-generated summary.
- Just like regular precision and recall, the two ROUGE metrics can be combined with the harmonic mean to get an F1 score (see the sketch after this list).
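As a rough illustration, here is a minimal sketch of these calculations in plain Python, assuming simple lowercasing and whitespace tokenization. Real evaluations typically use a dedicated library (for example, the rouge-score package), which also handles stemming and the ROUGE-2/ROUGE-L variants.

```python
from collections import Counter


def rouge1(reference: str, candidate: str) -> dict:
    """Compute ROUGE-1 recall, precision, and F1 from unigram overlap."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Clipped overlap: each word is counted at most as many times
    # as it appears in both summaries.
    overlap = sum((ref_counts & cand_counts).values())

    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) > 0
        else 0.0
    )
    return {"recall": recall, "precision": precision, "f1": f1}


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge1(reference, candidate))
# {'recall': 0.833..., 'precision': 0.833..., 'f1': 0.833...}
```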
But there are issues with the ROUGE-1 metric. It only measures the overlap of unigrams between the generated summary and the reference summary. A perfect ROUGE-1 score means that the two summaries contain the same individual words. However, this doesn’t necessarily mean that the generated summary is coherent or meaningful.
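To make the limitation concrete, the sketch above gives a perfect ROUGE-1 score to a word-for-word scramble of the reference, even though the scrambled text is nonsense:

```python
reference = "the cat sat on the mat"
scrambled = "mat the on sat cat the"

# Same unigrams, different (meaningless) order -> perfect ROUGE-1.
print(rouge1(reference, scrambled))
# {'recall': 1.0, 'precision': 1.0, 'f1': 1.0}
```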