
Large Language Model Evaluation Metrics

Derrick Mwiti · Towards AI · Oct 26, 2024


The most common evaluation metrics for large language models are:

  • Perplexity
  • BLEU
  • ROUGE
  • BERTScore
  • COMET
  • METEOR
  • BLEURT
  • GPTScore
  • PRISM
  • BARTScore
  • G-Eval
  • Human Evaluation

Evaluating large language models (LLMs) is difficult because they can perform such a wide range of tasks.

Perplexity

Perplexity measures how well a model predicts the next word. The lower the score, the better: a high perplexity means the model is performing poorly at coming up with the next word, so the objective is to minimize the language model’s perplexity. The word “perplex” means to baffle or confuse; intuitively, a model that is good at predicting the next token is not baffled or confused by the text.

The perplexity metric is better suited to auto-regressive models that generate text than to masked language models such as BERT, which are typically used for classification. The metric is computed as the exponentiated average negative log-likelihood of a sequence. Since perplexity measures how well the model predicts the…
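As a rough illustration of that definition, here is a minimal sketch of computing perplexity with the Hugging Face transformers library, assuming GPT-2 as the auto-regressive model (an illustrative choice, not one specified in this article). The idea is simply to take the average negative log-likelihood the model assigns to a sequence and exponentiate it.

```python
# Minimal perplexity sketch, assuming the Hugging Face transformers library
# and GPT-2 as an example auto-regressive model (illustrative assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # hypothetical model choice for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Large language models predict the next token in a sequence."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # When labels are supplied, the model returns the average negative
    # log-likelihood (cross-entropy loss) over the sequence.
    outputs = model(**inputs, labels=inputs["input_ids"])
    avg_nll = outputs.loss

# Perplexity is the exponentiated average negative log-likelihood.
perplexity = torch.exp(avg_nll)
print(f"Perplexity: {perplexity.item():.2f}")
```

A lower value here means the model assigned higher probability, on average, to each next token in the text.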
