Evaluating LLMs is a minefield

Annotated slides from a recent talk

Oct 04, 2023

We have released annotated slides for a talk titled "Evaluating LLMs is a minefield." We show that current ways of evaluating chatbots and large language models don't work well, especially for questions about their societal impact. There are no quick fixes, and research is needed to improve evaluation methods.

The challenges we highlight are somewhat distinct from those faced by builders of LLMs or by developers comparing LLMs for adoption. Those challenges are better understood and are addressed by evaluation frameworks such as HELM.

You can view the annotated slides here.

The slides were originally presented at a launch event for Princeton Language and Intelligence, a new initiative to strengthen LLM access and expertise in academia.

The talk is based on the following previous posts from our newsletter:

Dan

Oct 5, 2023. Liked by Arvind Narayanan

This is a great deck on multiple levels. The information is extremely helpful but almost more impressive is that it is presented in such a clear, easy-to-follow style. Thank you for sharing it.

MrDecentralize

Decentralize Your Life

Oct 8, 2023

Good topics
