Evaluation of LLM Models — A general Survey
TASK SPECIFIC EVALUATION
SUMMARIZATION
A popular family of metrics used to assess text summarization is ROUGE, or Recall-Oriented Understudy for Gisting Evaluation. ROUGE compares the model-generated summary to a “golden”, human-written reference summary.
The simplest ROUGE metrics are ROUGE-1 Recall and ROUGE-1 Precision. To calculate them, we count the number of unigrams (words) that match between the two summaries.
- ROUGE-1 Recall is the number of unigram matches divided by the total number of unigrams in the reference summary.
- Similarly, ROUGE-1 Precision is the number of unigram matches divided by the total number of unigrams in the model-generated summary.
- Just like regular precision and recall, the two ROUGE metrics can be combined with the harmonic mean to get an F1 score (see the sketch after this list).
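As a rough illustration, here is a minimal sketch of these calculations in plain Python, assuming simple lowercasing and whitespace tokenization. Real evaluations typically use a dedicated library (for example, the rouge-score package), which also handles stemming and the ROUGE-2/ROUGE-L variants.

```python
from collections import Counter


def rouge1(reference: str, candidate: str) -> dict:
    """Compute ROUGE-1 recall, precision, and F1 from unigram overlap."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Clipped overlap: each word is counted at most as many times
    # as it appears in both summaries.
    overlap = sum((ref_counts & cand_counts).values())

    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) > 0
        else 0.0
    )
    return {"recall": recall, "precision": precision, "f1": f1}


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge1(reference, candidate))
# {'recall': 0.833..., 'precision': 0.833..., 'f1': 0.833...}
```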
But there are issues with the ROUGE-1 metric. It only measures the overlap of unigrams between the generated summary and the reference summary. A perfect ROUGE-1 score means that the two summaries contain the same individual words. However, this doesn’t necessarily mean that the generated summary is coherent or meaningful.
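To make the limitation concrete, the sketch above gives a perfect ROUGE-1 score to a word-for-word scramble of the reference, even though the scrambled text is nonsense:

```python
reference = "the cat sat on the mat"
scrambled = "mat the on sat cat the"

# Same unigrams, different (meaningless) order -> perfect ROUGE-1.
print(rouge1(reference, scrambled))
# {'recall': 1.0, 'precision': 1.0, 'f1': 1.0}
```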