AI Snake Oil

Evaluating LLMs is a minefield

Annotated slides from a recent talk

Arvind Narayanan
and
Sayash Kapoor
Oct 04, 2023

We have released annotated slides for a talk titled Evaluating LLMs is a minefield. We show that current ways of evaluating chatbots and large language models don't work well, especially for questions about their societal impact. There are no quick fixes, and research is needed to improve evaluation methods.

The challenges we highlight are somewhat distinct from those faced by builders of LLMs or by developers comparing LLMs for adoption. Those builder and developer challenges are better understood, and evaluation frameworks such as HELM already tackle them.

You can view the annotated slides here.

The slides were originally presented at a launch event for Princeton Language and Intelligence, a new initiative to strengthen LLM access and expertise in academia.

You’re reading AI Snake Oil, a blog about our upcoming book. Subscribe to get new posts.

The talk is based on the following previous posts from our newsletter:

  • Is GPT-4 getting worse over time?

  • Does ChatGPT have a liberal bias?

  • Generative AI companies must publish transparency reports

  • GPT-4 and professional benchmarks: the wrong answer to the wrong question

  • ML is useful for many things, but not for predicting scientific replicability

  • OpenAI’s policies hinder reproducible research on language models

  • Licensing is neither feasible nor effective for addressing AI risks

  • Is the future of AI open or closed?

6 Comments
Dan
Oct 5, 2023 · Liked by Arvind Narayanan

This is a great deck on multiple levels. The information is extremely helpful but almost more impressive is that it is presented in such a clear, easy-to-follow style. Thank you for sharing it.

MrDecentralize
Decentralize Your Life
Oct 8, 2023

Good topics

© 2024 Sayash Kapoor and Arvind Narayanan