AI Interpretability

Rigorously understanding how ML models function may allow us to identify and train against misalignment. Can we reverse engineer neural nets from their weights, or identify structures corresponding to “goals” or dangerous capabilities within a model and surgically alter them?
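As a loose illustration of what "surgically altering" model internals can mean (a hypothetical sketch, not any mentor's specific method), the snippet below uses a PyTorch forward hook to project a chosen "concept" direction out of one layer's activations; the model, layer, and direction are all stand-ins.

```python
# Hypothetical sketch: ablate a single direction from a layer's activations
# with a PyTorch forward hook -- one simple form of editing model internals.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a transformer block's hidden states.
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))

# Hypothetical unit direction we suspect encodes an unwanted behaviour.
direction = torch.randn(16)
direction = direction / direction.norm()

def ablate_direction(module, inputs, output):
    # Remove the component of the activation along `direction`.
    coeff = output @ direction                     # (batch,) projections
    return output - coeff.unsqueeze(-1) * direction

# Register the hook on the first layer and run an edited forward pass.
handle = model[0].register_forward_hook(ablate_direction)
x = torch.randn(4, 16)
edited_out = model(x)
handle.remove()
print(edited_out.shape)  # torch.Size([4, 16])
```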

Mentors

Neel Nanda
Research Engineer, Google DeepMind

Neel leads the mechanistic interpretability team at Google DeepMind, focusing on reverse-engineering the algorithms learned by neural networks in order to distinguish helpful models from deceptively aligned ones and to better understand language model cognition.

Adrià Garriga-Alonso
Research Scientist, FAR AI

Adrià is a Research Scientist at FAR AI, focusing on advancing neural network interpretability and developing rigorous methods for the field.

Arthur Conmy
Research Engineer, Google DeepMind

Arthur’s research focuses on discovering and improving methods for automating interpretability and on applying model internals to critical safety tasks.

Lee Sharkey
Chief Strategy Officer, Apollo Research

Lee is Chief Strategy Officer at Apollo Research. His main research interests are mechanistic interpretability and “inner alignment.”

Nina Panickssery
Member of Technical Staff, Anthropic

Nina is open to a variety of projects in LLM interpretability and adversarial robustness.

Hidenori Tanaka
Group Leader, Harvard/NTT Research

Hidenori Tanaka leads the Science of AI for Alignment group in the CBS-NTT Program in Physics of Intelligence at Harvard University, where he integrates concepts and scientific methods from physics, neuroscience, and psychology to advance our understanding of AI models for alignment and safety.