Applications for all Summer 2025 streams are now open!
Applications due April 18
Application process
Steps in the application process
Create an application account. You’ll use this to access all applications materials.
Submit a MATS pre-application by April 18. This is required by all streams.
Submit applications to the MATS stream(s) you want to work with. You must submit at least one stream-specific application for your MATS application to be considered. You can and should apply to all of the MATS streams that interest you! Most stream applications will be due April 18. See a list of the streams and their applications below.
Complete additional evaluations. Depending on the streams you apply to, you may be required to complete a coding screen, interviews, or other evaluations after submitting your application. The process, however, is not standardized between streams; not being contacted for an interview does not necessarily mean that your application is not in consideration.
Tips for applying
Make sure to check your spam folders for emails! You may wish to automatically filter for emails from applications@matsprogram.org to ensure you don’t miss any emails.
Submit your application materials early. In the past, some applicants have had technical problems in the hour leading up to the application deadline. Additionally, applications are reviewed on a rolling basis.
Mentors will primarily evaluate candidates based on the submission of their own stream-specific applications, though all mentors will have access to application materials submitted to other streams.
Summer 2025 Tracks
To decide which mentor(s) to apply to, applicants are able to filter below by track and research interest. We recommend reading through the different streams’ proposed research projects and mentorship style to assess personal fit.
Click on each track title to read a brief description.
-
As model develop potential dangerous behaviors, can we develop and evaluate methods to monitor and regulate AI systems, ensuring they adhere to desired behaviors while minimally undermining their efficiency or performance?
-
Many stories of AI accident and misuse involve potentially dangerous capabilities, such as sophisticated deception and situational awareness, that have not yet been demonstrated in AI. Can we evaluate such capabilities in existing AI systems to form a foundation for policy and further technical work?
-
Rigorously understanding how ML models function may allow us to identify and train against misalignment. Can we reverse engineer neural nets from their weights, or identify structures corresponding to “goals” or dangerous capabilities within a model and surgically alter them?
-
As AI systems continue to advance and develop even stronger capabilities, can we develop policies, standards, and frameworks to guide the ethical development, deployment, and regulation of AI technologies, focusing on ensuring safety and societal benefit?
-
As models continue to scale, they become more agentic and, as such, we need methods to study their newfound agency. How do we study modeling optimal agents, how those agents interact with each other, and how some agents can be aligned with each other?
-
As AI systems become more capable, they will become higher-value targets for theft, and more able to undermine cybersecurity protocols. How do we ensure that the weights of valuable ML models remain under the control of developers, and that AI improves rather than degrades the state of cybersecurity?
Summer 2025 Streams
A previous version of this website had a stream for just Nicholas Carlini. If you want to work with Nicholas Carlini, please apply to this stream instead.
Oversight & control (8 streams)
Jason Gross, Rajashree Agrawal
Software verification has been labor bottlenecked, often requiring PhD-level engineers to write 10x - 100x more verification code than the code being verified. Now, progress in advanced mathematical reasoning can be leveraged for progress in software verification. Our work is geared at using formal methods to build provably robust oversight methods for AI-generated software.
There are two main directions for projects: verifying existing critical software stacks (government, AI, and financial infrastructure), and developing scalable oversight for AI-driven software development. Here are some example projects that scholars might work on:
- Measuring software verification capability of frontier models. For example, we’re working on projects in translating CompCert from Coq to Lean, which helps to uncover challenges in real world use of LLMs for software verification, and developing a benchmark suite based on the Software Foundations textbook to track progress in frontier models capability.
- Formalizing the ML tech stack using AI assistants. For example, we’re working on formalizing StableHLO, so that we can prove security properties of custom backends for optimizing machine learning workloads.
Representative Papers
Mentorship Process
Scholars will have weekly group meeting to present results and sync on next steps. We will pair program more in the first few weeks, and move on to more higher level discussions as things progress.
Meeting Frequency
Project Selection Process
Preferred Characteristics
The concrete work you’ll do will be some combination of evaluation, finetuning, and reasoning about programs and security. You’ll be a good fit if the following sounds like you, or someone you’re excited to become.
- Strong programming fundamentals. You should be comfortable reading and writing code in multiple languages, thinking about code design, and reasoning about code correctness and specifications.
- Deep experience with using AI agents for coding, understanding their strengths and weaknesses. You should enjoy prompt engineering, iterating on agent scaffolds, finetuning models for hobby projects, etc.
- Bonus points for enjoying working on “systems” projects. For example, experience with programming languages research (OCaml/Haskell, Rust, LLVM IR) or software verification (Coq, Lean, etc.), or working on machine learning infrastructure such as training frameworks (PyTorch, TensorFlow, JAX) or accelerator backends (CUDA, TPU).
I expect to spend 2-3 hours per week with each scholar through a combination of weekly meetings and asynchronous communication. Scholars will have access to compute necessary for their projects.
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
What’s your experience with software development using AI? (Please keep your answers brief. Bullet points are fine, share links if possible.)
Examples might include using Cursor to build projects, reviewing LLM-generated PRs, or running evaluations on coding agents.
What’s the most technically impressive project you’ve built? (Please keep your answers brief. Bullet points are fine, share links if possible.)
We’re particularly interested in projects involving ML, security, systems, or math that demonstrate your ability to work across the tech stack with full-system understanding.
What is a specific fact about your worldview / aesthetics / skills that makes you want to work on this project? Alternatively, this is a space to share whatever else seems relevant to your application.
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
Samuel Albanie
This stream will focus primarily on projects relating to AI monitoring and control.
There is some flexibility, and I’m open to scholar proposals. However, I’m primarily interested in research that relates to monitor and control measures for oversight of complex, agentic tasks. For examples in this direction, see the papers below.
Representative Papers
- AI control: arXiv:2312.06942
- Distributed control: arXiv:2411.17693
- Sabotage evals: arXiv:2410.21514
- Subversion evals: arXiv:2412.12480
Mentorship Process
Project structure:
- The project will (ideally) be collaborative (with at least 2 individuals working on it full time).
- We will hold weekly meetings (ramping up if/when we approach a research output). My slack response time is typically <= 24 hours.
Meeting Frequency
Project Selection Process
Preferred Characteristics
You are likely a good fit for this collaboration if you thrive in a scenario that involves iterating quickly with empirical experiments. This style of research typically involves many cycles of prototyping and testing in Python (often with LLM APIs to maximize speed). Consequently the following are requirements:
- Several years of ML engineering or research experience;
- The ability to own and execute a project, taking the initiative to unblock technical challenges proactively when they arise;
- The ability to get to grip with relevant research quickly (primarily by reading papers).
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
In a Google doc, write a project proposal (at most one page) for a project related to AI control or monitoring. Paste a link for the Google doc below. Make sure that the doc is publicly viewable.
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
Ethan Perez, Buck Shlegeris, Samuel Marks, Joe Benton, Evan Hubinger, Mrinank Sharma, Fabien Roger, Kyle Fish, Stephen McAleer, Nicholas Carlini
The Anthropic-Redwood stream spans a range of empirical research areas in AI safety on LLMs, ranging from AI control to scalable oversight and model organisms. You’ll be pitched, and have the option to pitch, a variety of safety research projects, and then be matched to projects and mentors based on your interests/preferences on research and what you’d like to get out of MATS. Scholars in this stream frequently receive funding and continued mentorship after MATS to complete their research project, usually leading to a (co-)first author paper. People in this stream often end up in long-term homes for safety research after MATS (e.g. Anthropic).
This stream is focused on reducing catastrophic risks from large language models (LLMs). Their research spans several areas:
- Developing model organisms of misalignment, e.g. of deceptive alignment, to build a better understanding of what aspects of training are more likely to lead to deceptive alignment.
- Developing techniques for process-based supervision, such as learning from languagefeedback.
- Finding tasks where scaling up models result in worse behavior (inverse scaling), to gain an understanding of how current training objectives actively incentivize the wrong behavior (e.g., sycophancy or reward-tampering or alignment-faking).
- Improving the robustness of LLMs to red teaming (e.g., by red teaming with language models or pretraining with human preferences or red teaming with best-of-n jailbreaks)
- Investigating the risks and benefits of training predictive models over training agents, e.g., understanding the extent to which the benefits of RLHF can be obtained by predictive models, and the extent to which RLHF models can be viewed as predictive models.
- Scalable oversight – the problem of supervising systems that are more capable than human overseers
- Advancing security through investigating adversarial machine learning, cybersecurity evals, and understanding currently possible real-world attacks
These projects involve running a large number of machine learning experiments, to gain empirical feedback on alignment techniques and failures.
Representative Papers
- Alignment faking in large language models (arXiv:2412.14093)
- Constitutional Classifiers (arXiv:2501.18837)
- Debating with More Persuasive LLMs Leads to More Truthful Answers (Best paper, ICML 2024; arXiv:2402.06782)
- Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models (arXiv:2406.10162)
- Best-of-N Jailbreaking (arXiv:2412.03556)
Mentorship Process
Mentorship starts with the “Project Pitch Session” Anthropic runs at the start-of/before the program. During this session, dozens of researchers from Anthropic and other AI Safety orgs pitch projects they’d be excited to work on. Scholars get ~1 week to derisk and trial projects before submitting their preferences. Starting on week 2, scholars are assigned projects where the primary mentor is whoever pitched it (e.g. Ethan, Buck S, Evan, etc.). Some projects are assigned co-mentors who are other supervisors who want to join the project.
During the program, scholars meet weekly with their project mentors and collaborators. Some projects meet more often without mentors (e.g., daily standups with the peers on the project). Each project will have primary mentor, who is also the main decision-maker on key milestones for the project and who is the default person to go to for feedback, advice, etc. Co-mentors also attend project meetings as needed and provide feedback throughout the program. Some project co-mentors can be as involved as the primary mentor.
Meeting Frequency
Project Selection Process
Preferred Characteristics
See the top of this post: https://www.alignmentforum.org/posts/dZFpEdKyb9Bf4xYn7/tips-for-empirical-alignment-research
Generally someone who can run a lot of experiments quickly.
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
Please provide the name, email, and relation for 1-2 references.
If possible, suggest referees involved in AI Safety.
Is there anything else we should know? (If the rest of your application speaks for itself, feel free to leave this blank.)
How excited are you to pursue a well-scoped research project (vs. pursue a project that you scope out)? (~1 sentence)
As a group, we have a pretty strong sense of what research is important – but are pretty open to working on a wide range of directions, but people who are working with us would need to be able to be guided by our feedback on what kinds of projects are important to work on.
If you feel like you already have strong research taste and takes on what to work on, in ways that seem to be different than mine (based on our past work), it’s plausibly not a great fit.
What are your odds of being interested in continuing to work together with this stream full-time beyond the ~2-month MATS program?
I’m currently only seeking applications from people who are at least 25% likely to want to continue working together full-time post-MATS (e.g., 4-6 additional months post-MATS until the research project runs to completion). (Include a %, possibly with some explanation of your number as needed.)
In what ways are you opinionated on what you work on (if any)? (~1-3 sentences)
What programming languages are you fluent in? (1 sentence)
Please provide a rough estimate of how many hours you’ve spent programming in each language you list.
Please talk briefly about an area of technical work right now you’re most interested in or excited about. (~3-5 sentences)
Day-to-day, what are you looking for in your next role, and what’s your ideal breakdown of your time in a working week, in terms of hours per week spent on research meetings & discussions, coding, reading papers, brainstorming, etc.? (~3-5 sentences)
What kinds of skills in your teammates would best complement yours (or have best complemented yours in the past)? (~3-5 sentences)
Can you describe a paper you’re excited about and say why it’s exciting? (~3-5 sentences)
What’s a technical achievement you’re proud of? (~3-5 sentences)
Please share links to any relevant public information if applicable.
Why are you interested in participating in this stream? Which of this streams’ mentors would you be most excited to be directly supervised by? (~3-5 sentences)
You can expect to work with all of the mentors in this stream, but it helps us to onboard you if we know what sorts of projects you’d be most excited to work on.
[Optional] Please share code samples from past projects (e.g. GitHub links or uploaded zip files), ideally a substantial project and (if possible) a machine learning project (these could be the same or different projects).
[Optional] Other links to information about you and your technical contributions (e.g., personal website, LinkedIn, GitHub, Google Scholar, and/or blog posts).
SAT/ACT scores. Please share your highest total scores for each test. If you can’t answer this question, please provide a 1-sentence explanation on why (e.g., why you haven’t taken the SAT or ACT, or why you’re unable to find your scores).
Undergraduate major and GPA / final grade. Please also include the percentile (or rank) in your class that this corresponds to, if this is straightforward to look up or estimate. If you haven’t started an undergraduate degree, please state so instead of providing a GPA.
GRE scores: Please provide all 3 scores, in the following format: [verbal reasoning score] / [quantitative reasoning score] / [analytical writing score]. If you haven’t taken the GRE, please state so instead.
[Optional] Other standardized test scores: If there are other standardized test scores you would like to share, feel free to list them here.
[Optional] If you applied to work with this stream for the last MATS cohort (~6 months ago) and were rejected, it would be optional but helpful if you are able to elaborate here to highlight changes/additions/improvements to your application and experience since you last applied.
[Optional] Is there anything else you would like to share?
Check this box if you have applied to Anthropic in the last year.
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
Micah Carroll
Studying and demonstrating manipulative behaviors which emerge due to RL and/or mitigating such behaviors. Studying CoT faithfulness and how it’s affected by different forms of training.
Projects I’d be particularly interested in having MATS scholars work on span the following topics:
- Reward hacking in LLMs trained with RLAIF or reasoning approaches: are misleading behaviors emergent from RL (such as those in this paper) going to be present with reasoning models (and in RLAIF models)? If so, how? The goal would be to have a good misalignment demo and try to find mitigations.
- We have found that RL can heavily distort CoT reasoning, leading to extreme “motivated reasoning”. We’re interested in studying this phenomenon in more depth. How robust is the emergence of motivated reasoning to different training setups? May this partially explain why we see scheming behaviors or alignment faking? Some reasons why motivated reasoning may be a possible explanation are here.
- AI alignment under changing and influenceable values and preferences: we want to align AI systems to values and preferences that can change (and be changed by AI systems). Doing so is hard. Can we leverage LLM’s knowledge of in this domain (potentially optimizing against them as reward models)? Can we use them to better elicitmeta-preferences, and try to reduce naturally emerging influence incentives?
I’d also be open to scholars proposing their own projects.
Representative Papers
- On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback (arXiv:2411.02306): done during MATS 6.0 and representative of the kinds of empirical work I find compelling
- Some thoughts on alignment faking
- AI Alignment with Changing and Influenceable Reward Functions (arXiv:2405.17713): conceptual foundations for issues of emergent manipulation from RL, and discussion of open challenges.
Mentorship Process
What mentorship will look exactly will depend a bit on my availability and what project scholars end up working on. At the very minimum, I’d have 1 hour a week group meetings with scholars, and would have quick turnaround (<24h) on slack messages. I’d also be open to meeting much more frequently, insofar as there are things to discuss and progress being made. As an upper bound, my involvement would look like collaborating closely with my scholars for an average of 15h/20h a week (as I did with MATS 6.0).
I’m happy to provide references of previous scholars so that you can hear their experience in working with me.
Meeting Frequency
Project Selection Process
Preferred Characteristics
I’d be particularly excited by candidates who have one or more of the following:
- A strong machine learning engineering background
- Previous research experience, ideally in machine learning or related fields
- Experience with human subjects experiments is a nice to have
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
Why do you think you’d be a good fit for the stream and project? This can be based on prior experience, personal preferences, career aims, or anything else you personally think makes you a good fit. (max 100 words)
Please provide the name, email, and relation for 1-2 references.
If possible, suggest referees involved in AI Safety.
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
Joshua Clymer
Alignment evaluations and control evaluations (e.g. evals for collusion, white-box techniques, etc)
My MATS stream is putting interp (and other white box tools) to the test. We’re going to train hundreds of misaligned models in a red-team/blue-team capture-the-flag game.
See more here: https://www.lesswrong.com/posts/jWFvsJnJieXnWBb9r/alignment-faking-ctfs-apply-to-my-mats-stream
Representative Papers
Some materials related to the above:
- Catching AIs red handed: https://www.lesswrong.com/posts/i2nmBfCXnadeGmhzW/catching-ais-red-handed
- How to catch a liar: arXiv:2309.15840
- Representation engineering: arXiv:2310.01405
Mentorship Process
I’ll probably talk to each of my scholars approximately once every 1 - 2 days. The routine will generally be “we talk about your results and what you’ve tried” then “we talk about what to do next.” I think this will be a great way to learn research skills for many people.
Meeting Frequency
Project Selection Process
Preferred Characteristics
Strong software engineering skills, good at ‘getting things to work,’ good at being paranoid about what their results mean
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
Conditional on scheming / deceptive alignment arising naturally (i.e. not in a model organism / demo), how capable do you think models will be when it first emerges, and why? (250 words max)
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
Marius Hobbhahn
This stream is generally focused on projects related to scheming. In this cohort, we will build a black-box monitor for scheming in complex agentic settings.
The goal of this project is to build a great LM agent scheming black-box monitor that works in general settings with large action spaces and for a very general definition of scheming. You can find a detailed project description here: https://docs.google.com/document/d/1FDgh4ioygjqVionP3i8i163zaXT9hqAV-gEikCNhlTE/edit?usp=sharing
Representative Papers
- AI Control: Improving Safety Despite Intentional Subversion (arXiv:2312.06942): as a general intro to control and a reference for which techniques have already been tried.
- Shell Games: Control Protocols for Adversarial AI Agents: as control work in agentic setting (note, this is an old version)
- Towards evaluations-based safety cases for AI scheming (arXiv:2411.03336): for general context on why we want to build monitors to begin with
- Frontier Models are Capable of In-context Scheming (arXiv:2412.04984): for more context on scheming and potential ideas for environments.
- Alignment faking in large language models (arXiv:2412.14093): for more context on scheming and potential ideas for environments
- Detecting Strategic Deception Using Linear Probes (arXiv:2502.03407): for ideas about measurements and general understanding of white-box measurements.
- Coup probes: Catching catastrophes with probes trained off-policy: for ideas about measurement and general context
https://docs.google.com/document/d/1FDgh4ioygjqVionP3i8i163zaXT9hqAV-gEikCNhlTE/edit?usp=sharing if on-text links don’t work
Mentorship Process
We have two weekly 60-minute calls by default. Since everyone will work on the same project, these calls will be with all participants of the stream. I respond on slack on a daily basis for asynchronous messages. Scholars will have a lot of freedom for day-to-day decisions and direction setting. In the best case, you will understand the project better than me after a few weeks and have a clear vision for where it should be heading. I recommend scholars focus 100% of their work time on the project and not pursue anything on the side. I think this way people will learn the most in MATS.
Meeting Frequency
Project Selection Process
Preferred Characteristics
- You are interested in this project in particular, i.e. building black-box monitors for scheming in agentic settings.
- You enjoy tinkering with LLMs, e.g. prompting or building basic LM agents.
- You like quick empirical iteration and direct feedback loops.
- I expect that you will spend 20% on conceptual work (e.g., think which environments could work or what techniques to try) and 80% on hands-on empirical work (e.g., implementing and running experiments).
- I prefer if scholars focus 100% of their work time on the project and not pursue any side projects. In general, I’m happy to support highly ambitious scholars who want to make a lot of progress during MATS.
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
What’s your motivation for applying to this stream and project? (max 100 words)
Why do you think you’d be a good fit for the stream and project? This can be based on prior experience, personal preferences, career aims, or anything else you personally think makes you a good fit. (max 100 words)
Write a project proposal for the scheming monitoring project (https://docs.google.com/document/d/1FDgh4ioygjqVionP3i8i163zaXT9hqAV-gEikCNhlTE/edit?usp=sharing). The project proposal should be maximally one page (ca. 400 words) excluding references. I expect that the best applications will spend 3-5 hours on this and do a brief empirical investigation in addition to writing. Figures and empirical results are welcome. I encourage you to write it in a Google Doc and copy-paste the full link in the response (make sure the sharing setting is correct and copy-paste the full link since links on text do not work; be careful because links that include the character “_” twice will make stuff italic). If you don’t have the time for a detailed proposal, feel free to write a 100 word description of your proposal.
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
Scott Emmons, David Lindner, Erik Jenner
This stream will focus on monitoring, stress-testing safety methods, and evals, likely with a focus on risks from scheming AIs. Examples include (black-box) AI control techniques, white-box monitors (probes etc.), chain-of-thought monitoring/faithfulness, as well as building evaluation environments for all of these.
Since the field moves quickly, it’s too early to list exact projects. But here are rough directions to give some flavor:
- Creating environments that can be used to evaluate and stress-test monitoring techniques (such as e.g. the APPS AI control environment)
- Creating model organisms of dangerous behavior (e.g., https://arxiv.org/abs/2501.13011 finetune a model to execute bad policies in the type of environment mentioned above, rather than requiring human-designed scaffolds)
- Building evals for capabilities related to scheming (such as Subversion Strategy Eval)
- Adversarially stress-testing monitoring techniques (e.g. extensions of Obfuscated Activations for white-box monitors)
- Improved white-box monitoring methods
- Chain of thought monitoring (e.g. building/testing monitors, studying faithfulness/steganography, testing for which tasks CoT is needed or helps improve performance)
We’ll discuss projects in areas like this with scholars at the beginning of MATS. So we encourage you to apply if a decent fraction of these directions (or the general theme) sound exciting to you; you do not need to be excited about every single potential project.
Representative Papers
- Obfuscated Activations Bypass LLM Latent-Space Defenses arXiv:2412.09565
- Subversion Strategy Eval: Evaluating AI’s stateless strategic capabilities against control protocols arXiv:2412.12480
- AI Control: Improving Safety Despite Intentional Subversion arXiv:2312.06942
- Shell Games: Control Protocols for Adversarial AI Agents https://openreview.net/forum?id=oycEeFXX74
- A Toy Evaluation of Inference Code Tampering https://alignment.anthropic.com/2024/rogue-eval/
Mentorship Process
We will most likely have a joint project selection phase, where we present a list of projects (with the option for scholars to iterate on them). Afterward, each project will have at least one main mentor, but we might also co-mentor some projects.
We design our stream to be highly collaborative. We encourage scholars to work in pairs or larger teams and possibly with external collaborators.
We’ll have a weekly meeting for each project to discuss the overall project direction and prioritize next steps for the upcoming week. On a day-to-day basis, you will discuss experiments and write code with other mentees on the project (though we are available on Slack for quick feedback between meetings or to address things that are blocking you).
Meeting Frequency
Project Selection Process
Preferred Characteristics
We are looking for scholars with strong software and machine learning engineering skills, as well as a background in technical research. While we’ll provide weekly guidance on research, we expect scholars to be able to run experiments and decide on low-level details fairly independently most of the time. We’ll also encourage collaboration between scholars, so good communication and team skills are important. Finally, we will have a strong preference for building a team that works together in-person.
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
[≤ 200 words; answering this is optional but recommended] What’s an AI safety paper/post you’re excited about? (doesn’t need to be your own) Link it, explain why you’re excited about it, and describe at least one weakness or limitation you still see. Focus on explaining your view, not summarizing the paper.
Which topic(s) would you be excited to work on as part of this stream and why? Feel free to select one of the example project ideas provided or propose your own. [150 words max]
If you know anyone who could provide a reference for you, please list name, email, and relation to you for 1-2 such people.
References from people working in AI safety who know you well are most informative to us, but we also encourage you to apply if you don’t know anyone like that.
Are you able to work out of Berkeley during the program? Please respond:
(A) if the answer is “Yes, I am a US citizen and/or have work authorization”
(B) if the answer is “Yes, I will apply to get a US visa or ESTA and expect to get one”
(c) if the answer is “No”, and explain where you would work out of.
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
Tomek Korbak
I think AI control is the most tractable approach to reducing risks from misaligned AI. I’m excited to work with mentees interested in empirical projects building and evaluating control measures for LLM agents. An ideal project ends with a paper submitted to NeurIPS/ICML/ICLR.
UK AISI might be able to provide additional financial, logistics and engineering support to projects.
I’m mostly interested in empirical projects around control and security measures for LLM agents, automated red teaming, and capability and alignment evals for LLM agents. Some ideas of projects I might be interested in:
- How does the offense-defense balance between malicious LLM agents and monitors shift as we increase model capability or inference-time compute?
- How to build and evaluate agentic monitors that take actions to investigate whether a given trajectory of a suspicious LLM agent is malicious?
- A human study of how to use a limited budget of human audits when overseeing an LLM agent
- How far can blackbox and whitebox AI control monitors generalize?
- How faithful does a chain-of-thought reasoning trace have to be enable oversight via chain-of-thought monitoring?
- How capable are LLMs of latent reasoning (without chain-of-thought) and how does it scale with model size?
- Do LLM agents have consistent goals? Evaluations of LLM agents propensity for different behaviors and other topics in blackbox LLM psychology.
Representative Papers
- For my current thinking on what an ideal control evaluation done by a frontier lab should look like, see A sketch of an AI control safety case (arXiv:2501.17315). Note that this is quite different from how a MATS project should proceed.
- Towards evaluations-based safety cases for ai scheming (arXiv:2411.03336)
- AI Control: Improving Safety Despite Intentional Subversion (arXiv:2312.06942)
- Eliciting Language Model Behaviors with Investigator Agents
- LLM Critics Help Catch LLM Bugs (arXiv:2407.00215)
Mentorship Process
I’ll meet with mentees once a week and will be available on Slack daily. By default, I’ll try to pair mentees to work on a project together.
Meeting Frequency
Project Selection Process
Preferred Characteristics
An ideal mentee has a strong AI research and/or software engineering background. A mentee can be a PhD student and they can work on a paper that will be part of their thesis.
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
What do you want to work on? Please give a 1-2 paragraph pitch for your research idea that fits this stream.
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
Evaluations (5 streams)
Oliver Sourbut (Oly), Sid Black
We are interested in supporting scholars who evaluate autonomy and loss of control risks from AI systems. Our focus areas include sandbagging and deception, long-horizon agentic evaluations, control evaluations, and predicting future evaluation results. We strongly prefer scholars work out of London, but this is not an absolute deal-breaker.
We are interested in supporting scholars building evaluations or conducting research projects that measure autonomy and loss of control risks from AI systems. This includes, but is not limited to, threat modelling work, developing evaluations, as well as more focused research projects. Our focus areas include sandbagging and deception, complex; long-horizon agentic evaluations, multi-agent evaluations, and control evaluations. Additionally, we value any research projects that will help us be better informed about future progress in autonomy capabilities - e.g. those which assess barriers to autonomy, identify potential autonomy overhangs, or help predict models’ future capabilities as they relate to autonomy capabilities.
Representative Papers
- arXiv:2305.15324 - Model evaluation for extreme risks
- arXiv:2403.13793 - Evaluating Frontier Models for Dangerous Capabilities
- https://metr.org/blog/2024-03-13-autonomy-evaluation-resources/
- arXiv:2405.10938
- Van der Weij et al., 2024: An introduction to AI sandbagging
- Witten et al., 2025: Won’t vs. Can’t: Sandbagging-like Behavior from Claude Models
- Van der Weij et al., 2024: AI Sandbagging: Language Models can Strategically Underperform on Evaluations (arXiv:2406.07358)
- Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents (arXiv:2502.15840)
Mentorship Process
Both Sid and Oly are willing to devote a few hours per week to this, so the overall scheme will depend on how many successful applicants there are.
Meeting Frequency
Project Selection Process
Preferred Characteristics
- experienced software engineers
- unusual backgrounds and domain expertise relevant to specific threats (e.g. dark web, networking, deception e.g. sociology skills) encouraged for threat modelling
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
What do you want to work on? Please give a 1-2 paragraph pitch for your research idea that fits this stream.
What’s your motivation for applying to this stream and project? (max 100 words)
Why do you think you’d be a good fit for the stream and project? This can be based on prior experience, personal preferences, career aims, or anything else you personally think makes you a good fit. (max 100 words)
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
Owain Evans
We empirically research topics related to emergent misalignment, self-awareness for LLMs, and faithfulness in reasoning models. I have mentored 35+ researchers in AI safety, and past MATS projects have resulted in papers “Emergent Misalignment”, “The Reversal Curse” and many other papers. My past mentees (https://www.truthfulai.org/about#mentees) have gone on to work at UK & US AISIs, Anthropic, Apollo Research, OpenAI, Transluce, GDM, etc, as well as joining my group full-time.
Note: We will send out an assignment to selected candidates during the candidate evaluation phase. This assignment will involve coding and writing a short research experiment. Please be able to spend 1-2 days to complete the assignment. Because of this, we encourage applicants to apply earlier so that we can evaluate them on a rolling basis.
- Emergent Misalignment. We’ve recently published a well received paper about Emergent Misalignment. This is a surprising phenomenon where training models on a narrow task of writing insecure code makes them broadly misaligned. There are some open problems we want to investigate.
- Faithfulness. Do models explain factors that influence their decisions? See our latest paper on faithfulness in reasoning models (e.g. DeepSeek).
- Steganography. Can models encode hidden information that looks “normal” to humans?
- Detecting deception. Through black-box means, can we detect when models “lie” to us? See our paper on catching an AI liar. See also Reversal Curse.
- Limits of multi-hop and out-of-context reasoning. To what extent can models reason without step-by-step reasoning? This can inform us of the extent of the dangerous capabilities of the models. See our paper on connecting the dots.
- Situational Awareness. How aware are current models of themselves and their environment? See our paper on the Situational Awareness Dataset.
Representative Papers
See TruthfulAI’s website for my previous papers. You can see my past mentees here.
Mentorship Process
I will meet with you at least an hour per week. We will discuss the project directions and ways to improve experiments towards writing a paper.
Depending on the topic and your level of experience, you may also work with my colleagues Jan Betley and James Chua. They have experience in running technical ML experiments and have published ML papers with me.
Meeting Frequency
Project Selection Process
Preferred Characteristics
One of below
- Experience running ML experiments in industry (2-3 + years)
- Experience as a first author on ML conference papers.
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
Please provide the name, email, and relation for 1-2 references.
If possible, suggest referees involved in AI Safety.
What are your chances of being interested in continuing to work together with Owain full-time beyond the ~2-month MATS program? E.g. You are funded for an extension of 2-4 months until the project is completed. Include a percentage, with some explanation of your number as needed.
Are you a US citizen or permanent resident? Or do you have a work or student visa for the US? (Note: We are very happy to take non-US persons for MATS but it is helpful for our planning to know if someone can remain in the US after MATS is completed.)
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
Mary Phuong
This stream will run in-person in London, with scholars working in pairs or small groups. Before the program starts, I’ll share a few project proposals with potential scholars for consideration. Topic wise the focus will be on scheming precursor capabilities (and how to measure them) and control protocols (resolving design uncertainties, prototyping, red-teaming).
I will share more detailed project proposal docs closer to the program start. Example research projects I’m excited about right now (these may change):
- Building a comprehensive “scheming thoughts” dataset for evaluating chain-of-thought monitors
- Developing a methodology for chain-of-thought monitoring control evaluations
- Building evaluations for scheming precursor capabilities, e.g. (opaque) stealth reasoning, cross-instance coordination
- Developing honeypots for upfront audits
Mentorship Process
As a baseline, we’ll meet once a week, where we go through any updates / results, and your plans for the next week. I’m happy to comment on docs, respond on Slack, or have additional meetings as needed.
Meeting Frequency
Project Selection Process
Preferred Characteristics
- Python programming: You should be comfortable writing high-quality code independently and quickly. This is what you’ll spend most of your time doing, plus your codebase might end up being the main output of the project.
- Familiarity with deceptive alignment, safety cases, capability evaluations, control, and related topics. This makes it much easier for us to be on the same page about the goals of the project, and is important for making day-to-day project decisions in a conceptually sound way.
- Communication: You should be able to clearly communicate your research ideas, experimental methodology, results, etc, verbally or in writing (writing is more important). I also appreciate mentees being able to notice and express confusion, uncertainty, disagreement, or dissatisfaction so we can work through things together.
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
[≤ 200 words; answering this is optional but recommended] What’s an AI safety paper/post you’re excited about? (doesn’t need to be your own) Link it, explain why you’re excited about it, and describe at least one weakness or limitation you still see. Focus on explaining your view, not summarizing the paper.
Are you open to working in a small team? We plan to prioritize people that answer “yes” to this question, although it is not a strict requirement.
If the project is going well, and you receive funding to continue your work afterward, do you expect to choose to participate in the MATS extension (between 6 and 12 months after the main program)? This could be remote. It’s okay if you aren’t sure if you’ll be available for the entire period.
Would you be available for in-person collaboration in London for the duration of the program?
How much time do you anticipate being able to spend on your MATS project for the duration of the program?
Pick an example project from the ones listed in the stream description and projects, or propose your own in the same general research area. Write a concrete outline of how you would tackle the project in 150-250 words. What are the riskiest parts of the plan? What would you prioritise doing in the first week? Please be as concrete and specific as possible. (To clarify, you are not committing to a particular project here, it’s just an exercise.)
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
Francis Rhys Ward
Currently I’m focusing on projects which evaluate sabotage, sandbagging, and oversight subversion in frontier agents. Additionally, I’m interested in coming up with novel alignment or control plans.
Three concrete project proposals:
- Realistic benchmarks for evaluating oversight subversion
- A core bottleneck to scalable/automated oversight research is high-quality, realistic benchmarks for studying oversight subversion. To help build effective oversight protocols, this project aims to evaluate AI agents’ capabilities to autonomously subvert oversight mechanisms in realistic environments.
- Collusion in automated oversight regimes
- Past work typically assumes that the automated overseer is trustworthy, for example, because it is not capable enough to pose a threat. We aim to relax this assumption to study the problem that frontier AI systems might collude with one-another to subvert human control.
- Automated shutdown protocols and honeypots
- We have developed an automated shutdown mechanism, wherein, frontier AI agents are trained to carry out a “shutdown protocol” if, and only if, they see a specific password. By leaving shutdown-passwords in honeypot style traps, AI companies can use automated shutdown protocols as an extra layer of defense against scheming systems which go undetected by oversight in internal deployment.
Mentorship Process
Weekly calls and messages on slack whenever you need. Potentially we’ll collaborate with other researchers I’m working with, if there’s a good fit.
Meeting Frequency
Project Selection Process
Preferred Characteristics
- Good communication – it’s important that you keep me updated with how you’re doing and how I can help!
- Experience with LM experiments (e.g., fine-tuning).
- Clear thinker and writer.
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
What’s your motivation for applying to this stream and project? (max 100 words)
For one of the projects you are interested in working on, please provide a concrete outline of how you would tackle the project in 150-250 words. What hypotheses do you have? How would you test them? What if you’re wrong? What existing results/implementations might you make use of? What will you do if these things don’t work? Please be as a concrete and specific as possible and avoid repeating the project description.
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
Dawn Song, Yiyou Sun, Xuandong Zhao
Research in our group spans multiple areas in AI, focusing on improving the reliability, interpretability, and security of large language models (LLMs).
Our group’s goal is to establish foundational principles and practical techniques for building safe and secure AI systems. We focus on understanding how the complex reasoning, multimodal understanding, and interactive capabilities of these systems may introduce novel safety and security issues or offer unique defense opportunities. Our work is organized around these key areas:
-
Evaluation & Automated Red Teaming: Developing robust methodologies and automated tools—including benchmark creation and stress-testing—to proactively identify and address safety and security issues.
-
Interpretability & Control: Enhancing transparency to reveal the decision-making processes of AI models, thereby identifying root causes of failures and bolstering trust in high-stakes applications.
-
Robust Defenses: Designing novel mechanisms (e.g., with powerful reasoning models) to detect, mitigate, and defend against adversarial attacks and other safety and security threats, ensuring reliable operation in complex environments.
-
Hallucinations: Develop advanced tools to identify, analyze, and understand the underlying causes of hallucinations in LLM models, enabling more reliable outputs.
-
Alignment & Post-Training: Investigating advanced techniques to align AI systems with human guidelines after the pretraining phase. This includes reinforcement learning from human feedback, alignment fine-tuning, and continuous post-deployment evaluations to guard against drifts and misalignments over time.
Research topics for MATS scholars can be flexible depending on individual interests and alignment with our goals. We are broadly interested in the areas outlined above and on our website: secure-ai.github.io/
Mentorship Process
MATS scholars will join our group Slack workspace to facilitate effective communication.
- Weekly one-on-one meetings with Xuandong and Yiyou
- Ad-hoc short meetings as needed
- Support with experiments, writing, and other research-related tasks
Meeting Frequency
Project Selection Process
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
What’s your motivation for applying to this stream and project? (max 100 words)
Which topic(s) would you be excited to work on as part of this stream and why? Feel free to select one of the example project ideas provided or propose your own. [150 words max]
What is the report/paper/post that you’ve written that you’re most proud of?
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
Governance (9 streams)
Girish Sastry, Steven Adler
What technical governance measures can AI labs implement to improve the safety of their own models? What technical governance measures can governments lean on to make sure that labs’ AI systems remain safe? What are useful ways to identify or control these risks of an AI system in-practice?
We are interested in mentoring projects on technical governance solutions to AI risks, which could be applied at either an ecosystem- or lab-level. Exciting projects might include choosing an open problem in technical governance (see here) and making progress against this problem, or making progress on understanding what exactly makes this challenging and what a viable solution might entail. We are also interested in mentoring projects related to demonstrating and/or mitigating specific risks of AI systems, such as through building novel evaluations or demonstrating a protocol for catching these risks in-practice.
Representative Papers
- This paper offers an overview of open problems in technical governance: arXiv:2407.14981
- This paper describes dangerous capability evals and offers a rough taxonomy of capabilities one might wish to evaluate: arXiv:2305.15324
- This paper shows a series of results from benchmarking frontier models against these evals: arXiv:2403.13793
Mentorship Process
Girish & Steven intend to have a weekly video-call with a mentee, in which we review progress that has been made, set goals for the upcoming week, and talk through any tricky issues. In addition, Steven intends to be available for sporadic messages, though this will depend on the complexity of the topics (some will be best-discussed in the weekly meeting).
Meeting Frequency
Project Selection Process
Preferred Characteristics
Strong reasoning and writing abilities (especially about technical topics) and excitement about AI governance research.
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
What do you want to work on? Please give a 1-2 paragraph pitch for your research idea that fits this stream.
What’s your motivation for applying to this stream and project? (max 100 words)
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
Mauricio Baker
This stream focuses on AI policy, especially technical governance topics. Tentative project options include: technical projects for verifying AI treaties, metascience for AI safety and governance, and proposals for tracking AI-caused job loss. Scholars can also propose their own projects.
In general, by default, projects will aim to publish an arXiv preprint and will ideally be (small) team projects. Below are several tentative project options. Scholars are also welcome to propose their own; we’d aim to find a project that all involved are excited about.
-
Technical projects for verifying AI treaties: As outlined in a forthcoming report, there are many technical open problems in verifying compliance with a hypothetical international agreement on AI. This project will aim to make progress on one such problem, to help enable international cooperation on AI. The following are several examples of more specific potential projects. (1) If an untrusted user trains an LLM on a compute cluster with trusted system software, verify that the model weights are in claimed memory locations. (2) Model, in detail, the relationship between a GPU’s MFU and its power use. (3) Check if ML workload code “egregiously” wastes substantial model FLOP without affecting results (distinct from having inefficient MFU).
-
Metascience for AI safety and governance: How can we effectively make progress on the many R&D challenges involved in AI safety and governance? These challenges include R&D for technical AI safety, security, evals, and verification. There has been significant academic study of R&D progress itself (i.e. metascience), but these insights don’t yet seem to have been applied to AI safety and governance. In this project, a team will review metascience research to develop recommendations for how policymakers and private funders can effectively advance R&D on AI safety and governance. For example, how do different funding structures, such as ARPAs, NSF grants, and advance market commitments compare?
-
Proposals for tracking AI’s impacts on jobs: Today, governments and the public have little visibility into the economic impacts of AI—perhaps mostly general economic statistics and anecdotes. This situation could potentially be improved, e.g. by large regular surveys. That could make the coming economic impacts of AI more foreseeable and manageable. If these impacts turn out to involve mass job loss, this foresight might also change government and public attitudes on AI acceleration. This project would develop concrete recommendations for tracking AI’s impacts on jobs, such as proposing a draft survey and budget to a well-suited organization.
Representative Papers
A few papers by others in a similar spirit:
Mentorship Process
We’ll meet once or twice a week (~1 hr/wk total, as a team if it’s a team project). I’m based in DC, so we’ll meet remotely. I (Mauricio) will also be available for async discussion, career advising, and detailed feedback on research plans and drafts.
Meeting Frequency
Project Selection Process
Preferred Characteristics
No hard requirements. Bonus points for research experience, AI safety and governance knowledge, writing and analytical reasoning skills, and experience relevant to specific projects.
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
Why do you think you’d be a good fit for the stream and project? This can be based on prior experience, personal preferences, career aims, or anything else you personally think makes you a good fit. (max 100 words)
Please provide 1-4 research or writing samples. (Please share a number in the higher side of that range, if you already have enough samples that are easy to share.) These can be of any type: research papers, class papers, blog posts, popular articles, etc. Share by pasting the URL (with link-sharing on if applicable), but not by directly pasting. Samples would ideally reflect your writing or research skills, relevant knowledge and interests, etc. Samples do not need to be formally published; you can also share a link to a non-public document or PDF.
For one of the projects you are interested in working on, please provide a concrete outline of how you would tackle the project in 150-250 words. What problem would you tackle? How might you divide the problem into more tractable subquestions, and how would you go about answering these subquestions? What is at least one major obstacle or failure mode you foresee, and how would you try to avoid it? Please be concrete and avoid repeating the project description.
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
Benjamin Bucknall
This stream will focus on questions in and about technical AI governance – that is, technical analysis and tools for supporting the effective governance of AI. Particular focus will be placed on questions regarding third-party scrutiny into AI systems and developers.
The following is a non-comprehensive list of projects I’d be interested in, though I’m also open to scholars suggesting their own projects that align with my interests.
- Audit cards: Building on model, system, and dataset cards, this project would propose ‘audit cards’ – templates for what should be reported when conducting third-party audits or evaluations. This would not specify how audits or evals should be carried out, but rather, ask ‘what information about how audits were carried out should be publicly reported?’ For example, specifying what access external evaluators were given and for how long; or specifying what actions were taken off the back of audit results.
- IDs for 3rd-party evaluations: One major issue in current external evaluations is that it can be hard to validate the system/model (or version thereof) that was the subject of some evaluation. It would be beneficial to have a unique identifier for model versions, potentially linked to other public knowledge about that system e.g. in a model card, that can be used to precisely refer to which model was evaluated. External evaluators should be able to verify that the model they’re interacting with corresponds to the ‘model ID’ they’ve been told, and thus has the stated properties or characteristics. Furthermore, evaluations themselves could also be given identifiers which are linked to the model/system/version on which they were performed, allowing for verification that a particular evaluation was conducted on the stated model.
- What needs to be done to implement secure external audits at scale: We’ve now seen some pilot tests of secure, privacy-preserving external access given for flexible AI audits. However, these tests are still very small-scale (on GPT-2, with a few benchmarking samples). What will be needed, both technically and institutionally, if these things are to be implemented at scale? For example, are there technical challenges that need to be overcome? What actors/stakeholders need to be involved to build out the necessary infrastructure or processes?
Mentorship Process
I will meet with the scholar on a weekly basis to discuss progress and any uncertainties or challenges being faced, as well as provide asynchronous advice and feedback. Research will largely be desk-based, though depending on the project, some amount of technical tinkering may also be good. By default, the scholar will mostly work independently and be self-guided, though I may be more proactively involved in research and writing if the project strongly aligns with my research priorities.
Meeting Frequency
Project Selection Process
Preferred Characteristics
An ideal scholar for this stream would have a strong technical background (CS, maths, engineering), with a solid understanding of contemporary ML systems and some hands-on coding experience. They would have some amount of experience carrying out independent research. They would have very good writing skills, and have some awareness/knowledge of AI governance & policy topics (though not necessarily experience working on such topics)
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
Which topic(s) would you be excited to work on as part of this stream and why? Feel free to select one of the example project ideas provided or propose your own. [150 words max]
What would you most like to gain from carrying out this research project? (E.g. is there a skill you want to develop, something you want to learn?) What is the most likely way in which you do not achieve this personal objective? [150 words max]
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
Alan Chan
New policies and technical tools will be needed to prepare for and manage a world with ubiquitous, human-level or above AI agents. This stream is about developing such policies and tools.
Project can range from the more technical or conceptual to the more policy-oriented. Some potential directions include:
- Demos or implementation of agent infrastructure: This direction would involve implementing, or at least sketching an implementation of, a specific type of agent infrastructure (but don’t feel limited just to the examples in the paper). Potential goals could be getting a working prototype to seed a startup or other organization, or catalyzing future work to build upon the prototype.
- Policy around agent infrastructure: What types of agent infrastructure might we expect the market to under provide? What policies can best help to provide it?
- Getting concrete about societal resiliencem: What should specific actors do to improve societal resilience? What policy levers are there to pull? An output could be, for example, a report or a series of memos directed to a particular decision-maker.
- Adapting existing regulation: What modifications of existing regulation are needed to stably and safely integrate AI agents into society? Things to consider might include liability, regulated professions, or economic policies.
I’m also generally open to other ideas that scholars may want to work on.
Representative Papers
Mentorship Process
We will meet at least once a week. I will also give you feedback on your written work outside of meetings (e.g., Google doc comments). Depending on your project and preferences, I may be more (e.g., driving more of the direction) or less involved (e.g., mainly giving feedback).
Meeting Frequency
Project Selection Process
Preferred Characteristics
Open-minded, intellectually curious, decent understanding of both technical ML and policy
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
Which topic(s) would you be excited to work on as part of this stream and why? Feel free to select one of the example project ideas provided or propose your own. [150 words max]
What’s your motivation for applying to the stream? What do you want to get out of it? [150 words max]
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
Matthew Gentzel
Escalation risks from state perceptions of AI capability, AI-enabled targeting, AI-enabled decision manipulation, and the impact of AI integration into nuclear command and control.
Core areas of research of relevance to scholars here include:
- Escalation and ambiguity risks related to state perceptions of AGI and AI-enabled military or manipulation capabilities. This line of research would aim to assess escalation risks related to how states are likely to react to advancing AI capabilities. Ideal projects will map these risks and pathways to avoid high-intensity great power conflict.
- AI-enabled military capabilities including nuclear counterforce targeting, missile defense, and other potentially transformative or destabilizing applications. This line of research would aim to assess the timelines of different transformative AI-enabled military technologies, their impact on negotiations, and their relative stability risks.
- AI-enabled decision manipulation risks in terms of cyber or adversarial attack, influence operations (disinfo, censorship, tailored threats, etc.) This research effort would investigate means to defend extremely high-stakes decision making processes from manipulation by increasingly capable AI models.
- Risks and benefits of AI integration into nuclear command and control systems. This effort would aim to identify the highest and lowest risk applications of AI in nuclear command and control systems, evaluating their impact on false alarm resolution, decision speed, pre-emption incentives, and decision alignment.
Mentorship Process
Mentorship will mostly consist of calls, sorting through research ideas and providing feedback. I’ll be up to review papers, and potentially to meet in person depending on timing.
Meeting Frequency
Project Selection Process
Preferred Characteristics
Looking for intellectually curious and honest scholars, with some background on topics related to national security, game theory, or AI-enabled military and influence capabilities.
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
What do you want to work on? Please give a 1-2 paragraph pitch for your research idea that fits this stream.
What’s your motivation for applying to this stream and project? (max 100 words)
Why do you think you’d be a good fit for the stream and project? This can be based on prior experience, personal preferences, career aims, or anything else you personally think makes you a good fit. (max 100 words)
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
Eli Lifland, Daniel Kokotajlo
We are interested in mentoring projects in AI forecasting and governance. We are currently working on a detailed mainline scenario forecast, and this work would build on that to either do more scenario forecasting or explore how to positively affect key decision points, informed by our scenario.
We are interested in mentoring projects in AI forecasting and governance. We are currently working on a detailed mainline scenario forecast that we will publish in early April. This is a time of particularly high uncertainty for us as we haven’t decided on a direction for after the project. Potential projects include but aren’t limited to:
- Scenario forecasting: Writing scenarios that branch off of our mainline scenario, exploring what might happen if important variables are changed. This could include detailed concrete threat modeling or detailed theories of victory.
- Improve and run tabletop exercises (TTXs): We’ve created a TTX that many have found valuable, for which we’d like to expand the range of audiences that enjoy it and scale it to more people.
- Policy strategy research e.g. menu for the intelligence explosion: Creating a menu of policy options for soon after AGI is developed.
- Targeted policy research: Do research to inform policymakers based on what seems important and in-demand.
Representative Papers
- Once it’s released: ai-2027.com
- In the meantime: How AI Takeover Might Happen Within 2 Years (not written by us but similar to our work)
Mentorship Process
We will have meetings each week to check in and discuss next steps. We will be consistently available on Slack in between meetings to discuss your research, project TODOs, etc. Eli will be the most consistently involved and do the 1-1 meetings. Daniel will give feedback on the research and be available for some discussions but won’t be as engaged in the day-to-day.
Meeting Frequency
Project Selection Process
Preferred Characteristics
The most important characteristics include:
- Strong reasoning and writing abilities
- Excitement about AI forecasting/governance research
- Autonomous research skills
- Ability to learn quickly
Also important, but not required characteristics include:
- Significant background knowledge in at least one area relevant to AI forecasting and governance (e.g. government experience, technical background, etc.)
- Significant background knowledge in AGI and existential risks
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
What’s your motivation for applying to this stream and project? (max 100 words)
What is the report/paper/post that you’ve written that you’re most proud of?
Please provide the name/email/relation for 1-2 references.
Answer at least one of the following:
- What is the likelihood a government is in operational control of the first project to reach top-human-expert-level AGI across every remote work field, and why? (250 words max)
- Conditional on scheming / deceptive alignment arising naturally (i.e. not in a model organism / demo), how capable do you think models will be when it first emerges, and why? (250 words max)
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
David Krueger
Project building on gradual disempowerment OR choose your own adventure.
Applicants are welcome to propose their own projects. I’m broadly interested in any work that seems to reduce AI x-risk, including technical work, but also comms, outreach, advocacy, activism, governance, and strategy.
================================
I’m interested in extending our work on gradual disempowerment topics. I’m interested in expanding on our recent work on gradual disempowerment (GD), see https://gradual-disempowerment.ai/. Other coauthors are also likely to be involved as co-supervisors for these projects. A few specific directions are:
- Responding to counterarguments.
People have argued against GD on grounds such as the strategy stealing assumption, the sufficiency of single-single alignment (e.g. should we expect it to “solve coordination”?), or the idea that humanity will become richer / more powerful in absolute terms, even if we have a shrinking “slice of the pie”. An ongoing project is to provide satisfying responses to all reasonable criticisms. - Mitigations for gradual disempowerment.
There are a few broad classes of work that might help prevent gradual disempowerment: (a) figuring out measurable indicators of gradual disempowerment; (b) preventing AIs from accumulating too much influence; and (c) actually giving humans more influence over large systems, possibly with the help of AI systems. It would be valuable to develop one or more of these, and figure out more specific interventions. - Visions for positive AGI futures and ways AI could help address gradual disempowerment.
There are few concrete visions for positive and plausible futures that seem robust to GD and preserve substantial human value. Possible elements of a solution include: (a) Improving coordination / cooperation, (b) maintaining political power in the hands of humans and political equality between humans, (c) addressing issues of cultural / value drift. A related question is: how might transformative AI play into establishing and maintaining such futures?
================================
In terms of strategy/governance, I’m interested in understanding how various levels of government or public verification capacity (e.g. knowing how many chips each side has and/or where they are and/or how they are being used, or knowing things about projects/teams/government plans, etc.) enable different security and coordination possibilities under different assumptions / future scenarios. For instance, under what assumptions would knowing where 99% of the compute is be sufficient to incentivize a coordinated global pause/slow-down? Under what assumptions would it be sufficient to protect against rogue actors/AIs? One particular assumption to examine is how safety and competitiveness trade-off (but there may be many other to consider).
Another project could be model regulation on use case reporting (including for internal deployments). This is a non-trivial transparency ask that might have fairly broad support, and could help with safety e.g. by: 1) identifying the most problematic/risky use cases, 2) facilitating assurance of common/established use cases, and 3) encouraging monitoring of unknown/unanticipated use cases. One basic pitch is: “We have model cards for intended use cases and evals for risks of anticipated use cases, but what about the ACTUAL use cases??”.
================================
In terms of technical topics, I’m most interested in work that highlights safety/security issues with models and/or techniques.
One particular technical project is to develop more flexible, learning-based jailbreaking methods that exploit the full attack surface: specifically, unlike with adversarial examples, jailbreaking permits using multiple queries and combining results in arbitrary ways. (cf “Fundamental Limitations in Defending LLM Finetuning APIs”, Adversaries Can Misuse Combinations of Safe Models", “LLM Censorship: A Machine Learning Challenge or a Computer Security Problem”).
I’m also concerned that the level of alignment observed in LLMs may not transfer to agentic systems acting in the real world (where there are more opportunities for insatiable power-seeking), and am interested in research that can help shed light on this. There are many avenues. One example might be seeing whether models describe themselves as likely to seek power, and whether these descriptions are accurate when they are actually presented opportunities for power seeking.
Recently, I’ve been thinking about alternative interpretations of the Alignment Faking result and how we might use formal notions of intent (cf https://arxiv.org/abs/2402.07221) to elaborate and test these hypotheses.
Another half-baked idea is (roughly) to try and distinguish between capabilities vs. alignment generalization failures based on whether more CoT decreases or increases the failure effect.
Representative Papers
Mentorship Process
Weekly meetings, and probably not too much outside of that. I expect scholars will be more successful if we can quickly arrive at a shared vision for the project, and if they are able to pinpoint their uncertainties, ask probing questions, pro-actively interrogate and challenge my views and visions, and execute independently.
Meeting Frequency
Project Selection Process
Preferred Characteristics
The key requirement is a high degree of motivation, independence, and alignment of research interests.
I may co-supervise projects with some of my group members or collaborators; this is something for us to discuss and decide on together. In this case, our weekly meetings may be shorter and you may interact more with them.
Coding and ML skills, etc. are important for the technical projects. Strong communication skills are required for any of them.
Only applicants for technical project need complete the CodeSignal test!
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
What’s your motivation for applying to this stream and project? (max 100 words)
Why do you think you’d be a good fit for the stream and project? This can be based on prior experience, personal preferences, career aims, or anything else you personally think makes you a good fit. (max 100 words)
Please provide the name, email, and relation for 1-2 references.
If possible, suggest referees involved in AI Safety.
What do you want to work on? Please give a 3-5 paragraph pitch for your research idea that fits this stream.
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
Gabriel Kulp, Jacob Lagerros
Our team explores the hardware hacking abilities of present AI models by studying threat-models like self-exfiltration, covert communication, and sabotage. We measure side-channels of GPUs (like power consumption or electromagnetic radiation) while the GPU is running an LLM, then tune the LLM to control these measurements.
There is a cluster of potential projects to choose from. We think many people far underestimate the extent of shenanigans current models are capable of: sending data from an airgapped server to a nearby phone by turning the GPU into a makeshift antenna or leaking information from a locked down data center by affecting the phase of input power to generate ripples that can be read out from normal wall outlets in nearby residential apartments. Properly considering these capabilities is critical for building robust alignment, control and security schemes.
As a team, we will decide which to pursue based on individual interest and skills. Broadly, we want to demonstrate information leaving a GPU in unexpected or surprising ways. Additionally, we are interested in training a model to induce leakage, thus forming a covert channel from the side-channel. Here are some things which we hypothesize are leakable, especially if the model is tuned to do so:
- The full text of the prompt
- Presence/absence of a specific token in the prompt
- Argmax of the output logits
- Semantic match in some middle layer
- The “true” response “behind” a refusal
The experimental setup is a frontier accelerator (H100 or B100) with various sensors attached, such as an oscilloscope on the power supply and an electromagnetic probe near the GPU die. These and more sensors are software-accessible from the same Jupyter notebook which runs PyTorch code on the GPU, meaning that you can correlate sensor readings with code execution, performance counters, and built-in sensors (as in nvidia-smi).
Representative Papers
- A blog post exploring the sensitivity of GPU power draw to the content of matrices is multiplies: https://www.thonking.ai/p/strangely-matrix-multiplications
- A rundown of common methods in hardware security: https://www.newae.com/embedded-security-101
- A paper on leaking model weights from GPUs: arXiv:2312.07783
- A paper on leaking information from existing on-chip sensors: arXiv:2309.11894
- Another paper on leaking model weights: arXiv:2211.05590
Mentorship Process
A typical week would involve an async check-in, a video check-in, and sometimes synchronous exchanges on Slack to debug, plan, or share things. Gabriel and Jacob will both be available for periodic check-ins.
Field-deployed scholars might also be able to travel to data center sites with Jacob, or alternatively work with the Ulyssean team from their lab in San Francisco.
Meeting Frequency
Project Selection Process
Preferred Characteristics
Please note: experience with hardware is not a requirement for this stream, as long as you are willing to work hard and learn fast, and can show other evidence of exceptional ability. If in doubt: we encourage you to apply!
We will provide you with a lot of autonomy and plug-and-play access to a rare combination of tools and equipment—in exchange we expect you to have a strong self-direction, intellectual ambition, and a lot of curiosity. We think the right kind of scholar could flourish in this lab environment and produce work of exceptional academic and engineering calibre, but it will require you to proactively show up to make the most of the opportunity. This stream requires you to have a tight experiment loop to form and test hypotheses on the fly.
Motivated scholars might also have the opportunity to work field-deployed with the Ulyssean team (who builds mission-critical infrastructure for AGI data centers), including visiting major data center build-outs across the US, UK, or UAE, and conducting on-site research on state-of-the-art AI clusters.
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
What’s your motivation for applying to this stream and project? (max 100 words)
Why do you think you’d be a good fit for the stream and project? This can be based on prior experience, personal preferences, career aims, or anything else you personally think makes you a good fit. (max 100 words)
What project(s) would you be interested in working on, and why? This can be a project I listed, or your own project. If your own project, please provide a 100-150 word description.
For one of the projects you are interested in working on, please provide a concrete outline of how you would tackle the project in 150-250 words. What hypotheses do you have? How would you test them? What if you’re wrong? What existing results/implementations might you make use of? What will you do if these things don’t work? Please be as a concrete and specific as possible and avoid repeating the project description.
Please give a brief overview (max 200 words) of your aims, weaknesses, and uncertainties when it comes to the project(s) you are interested in. What are you most interested in gaining by completing the project (e.g., a new skill, knowledge, any particular kinds of concrete outputs)? What would you most need my support on? What are you most uncertain about?
What are your chances of being interested in continuing to work together beyond the ~2-month MATS program? E.g. Extending for another ~2 months until the project runs to completion. Include a %, with some explanation of your number as needed.
Tell us about something technical you built. (If available, please link to a portfolio / GitHub / other documentation. If the linked project is sufficiently self-contained, feel free to only answer this question with a single sentence summarizing it.) We’re looking for evidence that you can get really impressive things done with computers when given the tools and opportunity to pursue your curiosity and ambition. We are especially interested in projects which span layers of the tech stack and demonstrate a full-system understanding.
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
Lisa Thiergart
Contributing to technical projects of the Security Level 5 Task Force, which include (non exhaustive list) research to help define industry standards for SL5 together with frontier labs, building prototypes of particular components such as inter accelerator bandwidth limiters, contributing to designing components of air-gapped ML dev environments, contributing to testing and iteration processes, researching security-productivity tradeoffs and designing and testing mitigations.
Representative Papers
If you’re seriously interested and want to read more, please email me at lisa@intelligence.org to get access to a more confidential description of the goals and projects of the SL5 Task Force. Please include your CV / github and 2-3 sentences about why you’re interested and what relevant types of contributions you could imagine (I know you have limited info, feel free to make a best guess).
You can check out pages 31-32 of this RAND report which introduces security levels as a standard for AI Security.
Mentorship Process
We’ll meet ~1h/w, sometimes together with others from the task force who your work interfaces with. Depending on the project, we might have some kind of code review process (if its not a standalone project) as well as some intermediate demos. In terms of communication outside of meeting times, it depends a bit on your preference but I think its usually effective to have a daily standup with your research manager and share some written project updates via slack/signal outside of meetings.
Meeting Frequency
Project Selection Process
Preferred Characteristics
You take pride in building something that is close to the best thing you can produce. You’re reliable, communicate if you need help, and you’re not shy to try something hard you haven’t done before. You’re excited to learn new things fast, are humble about what you don’t know yet and you’re mission-aligned (ie, care about reducing security risks that exist around the development of frontier AI and take the work seriously).
- I’m looking for people with a security mindset. I care less about years of experience, and more about demonstrated ability to meaningfully contribute to the projects. Impressive projects mean more than degrees in CS / electrical engineering, but if you don’t have a degree in those fields I’ll want to see more evidence of your contributions.
- The project description really excited you. You already started thinking about what that might involve.
- Having a solid high level understanding of how ML training works, bonus to have practical ML Eng experience such a having fine-tuned a model / trained a model
- Security experience is a strong bonus, but SL5 means securing against nation state actors, which I don’t expect anyone to have experience with.
- Hardware experience (with NICs, accelerators, etc) is a bonus, but not a requirement. There will also be SW only projects.
- Experience with iterative / user-focussed development
- Ability to work on a team & communicate well, general self-management skills
If you feel unsure if you fit this, I encourage you to apply and to keep in mind that many people underestimate themselves.
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
Security (2 streams)
Florian Tramèr, Daniel Paleka
The desired output of the stream is a first draft of a good academic research paper. In our typical paper, we break claimed soft guarantees on AI systems by designing attacks that probe the worst-case performance of a system.
See recent publications on https://spylab.ai/publications/. A decent MATS project would be a follow-up to one of the papers there. We will make a doc with potential projects and scholars will zero in on the project in the first week of the stream.
Representative Papers
Mentorship Process
Florian will be available for a call on a weekly basis. Daniel should be somewhat more available for questions and chats, and likely in-person for a part of the stream.
Meeting Frequency
Project Selection Process
Preferred Characteristics
High productivity (delivers intermediate results more frequently than on a weekly basis), and able to communicate the state of the project and current uncertainties quickly.
Consider the following posts:
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
What’s a thing that you’re proud of that you’ve done working on your own in a single day? (Explain, and link a demo / figure in a pdf / a range of commits in a repo)
Do you have a research idea of your own that you want to work on and would benefit from our mentorship? Please give a 1-2 paragraph pitch for your research idea that fits this stream. Do not try to convince us that it’s relevant to our lab’s research agenda. Focus on making it a good pitch for a good research idea. If you need to, include links to any relevant materials, but convey your core idea clearly and concisely.
Note: this is to understand your research interests and thinking, not a commitment – if we match, we’ll work together to figure out what you’ll work on at the start of the stream.
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
Keri Warr
Implementing SL4/5 and searching for differentially defense-favored security tools.
Making direct contributions to lab-relevant technologies such as confidential compute or kubernetes security, or prototyping differentially defense-favored security tools e.g. an AI-powered Intrusion Detection System (IDS).
Mentorship Process
I am a first time mentor, not entirely sure. I love asynchronous collaboration, and I’m happy to provide frequent small directional nudges, or with a bit more lead time do a thorough review of a design doc, etc. The average week hopefully looks like either trying out a new angle on a problem, or making meaningful strides towards productionizing an existing solution.
Meeting Frequency
Project Selection Process
Preferred Characteristics
Excited about doing practical, immediately useful hands-on work that could plausibly be deployed within O(months). Uses tight devloops to iterate quickly. Willing to be very confused.
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
What’s your motivation for applying to this stream and project? (max 100 words)
Which topic(s) would you be excited to work on as part of this stream and why? Feel free to select one of the example project ideas provided or propose your own. [150 words max]
Please provide the name/email/relation for 1-2 references.
Present an argument for and an argument against building AI-automated software vulnerability discovery tools. Spend <15 minutes on this total
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
Interpretability (4 streams)
Neel Nanda
Applications for Neel’s stream closed on February 28th.
When training an ML model, we may know that it will learn an algorithm with good performance, but it can be very hard to tell which one. This is particularly concerning when “be genuinely helpful and aligned” and “deceive your operators by acting helpful and aligned, until you can decisively act to take power” look behaviorally similar. Mechanistic interpretability is the study of taking a trained neural network, and analysing the weights to reverse engineer the algorithms learned by the model. In contrast to more conventional approaches to interpretability, there is a strong focus on understanding model internals and what they represent, understanding the model’s “cognition”, and putting a high premium on deep and rigorous understanding even if this involves answering very narrow and specific questions. Better interpretability tools seem useful in many ways for alignment, but mechanistic approaches in particular may let us better distinguish deceptive models from aligned ones.
Mentorship structure information not available for this research stream.
Preferred Characteristics
Talented at research, empirically-minded, very productive
Adam Shai, Paul Riechers
In this stream we will explore extensions and implications of our discovery that neural networks pretrained on next-token prediction represent belief-state geometry in their activations. We will build on this fundamental theory of neural network representations in order to discover what AI systems are thinking, and understand their emergent behaviors.
Work in this stream will build on our discovery of belief-state geometry in the activations of pretrained neural networks. The project can leverage applicants’ strengths in mathematical modeling and/or ML engineering. We welcome highly driven and relatively autonomous researchers that would like to benefit from our mentorship while taking the lead on a relevant project of their choice. Some possible projects will study in-context learning, out of distribution generalization, theoretically benchmarking SAEs, designing new methods for finding features, methods of compressing representations in transformers, and interpretability on RL models.
Representative Papers
- Our NeurIPS paper establishes that transformers represent belief geometry in their residual stream: arXiv:2405.15943
- Our recent follow-up paper shows mechanistically how these belief states—that one would expect from iterative Bayesian updating—are constructed in transformers despite their feedforward architecture: arXiv:2502.01954
Mentorship Process
Early in the program, Paul and Adam will meet in person with scholars to help them get up to speed on the theoretical and technical background needed to understand and contribute to our framework. Subsequent weekly meetings with mentees aim to answer questions, unblock research, explore project ideas, and give feedback and suggestions on research.
Meeting Frequency
Project Selection Process
Preferred Characteristics
The project can leverage applicants’ strengths in mathematical modeling and/or ML engineering. We welcome highly driven and relatively autonomous researchers that would like to benefit from our mentorship while taking the lead on a relevant project of their choice. The ideal scholar has experience in either research (e.g., PhD in any field), or software/ML engineering.
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
What’s your motivation for applying to this stream and project? (max 100 words)
In a Google Colab notebook, please recreate the fractal result by training a neural network architecture of your choice (need not be a transformer) on Mess3 data as described in our original paper: https://arxiv.org/abs/2405.15943
Please make sure that the link is publicly viewable before submitting.
Please propose a project outline that extends our work on belief state geometry. (This is to help us see how you think. You are not restricted to this project proposal if accepted.) What is the fundamental question you aim to answer? What are the experiments you are going to run? What are the possible outcomes and their interpretations? What are the ways in which results might be misleading, and are their ways to control or otherwise deal with that?
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
Lee Sharkey
Lee’s stream will focus primarily on improving mechanistic interpretability methods (sometimes known as ‘fundamental interpretability’ research).
- Understanding what AI systems are thinking seems important for ensuring their safety, especially as they become more capable. For some dangerous capabilities like deception, it’s likely one of the few ways we can get safety assurances in high stakes settings.
- Safety motivations aside, capable AI systems are extremely interesting objects of study, and doing digital neuroscience on them is comparatively much easier than studying biological neural systems.
- A majority of mech interp work in the last two years has involved sparse autoencoders, which have various issues. In response to these issues, my team and I at Apollo Research developed a new approach to decomposing neural networks, called Attribution-based Parameter Decomposition (APD).
- We think APD-like approaches are the most likely candidate to overcome the issues with current methods. We believe it may offer solutions to many issues in mech interp, such as attention head-distributed representations, circuit identification, and understanding feature geometry. But APD needs to be made more scalable and robust to hyperparameters before it can be applied reliably to (whole) LLMs. APD is also suggestive of approaches that may make it possible to develop intrinsically interpretable architectures. MATS projects in my stream should at least be informed by APD if not build on it directly.
Representative Papers
- I will expect candidates to be familiar with the paper Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition (arXiv:2501.14926)
And ideally some of the surrounding literature, which can be found in the references section of the above paper.
Mentorship Process
Mentorship looks like a 1 h weekly meeting by default with slack messages in between. Usually these meetings are just for updates about how the project is going, where I’ll provide some input and steering if necessary and desired. If there are urgent bottlenecks I’m more than happy to meet in between the weekly interval or respond on slack in (almost always) less than 24h. In some cases, projects might be of a nature that they’ll work well as a collaboration with external researchers. In these cases, I’ll usually encourage collaborations.
Meeting Frequency
Project Selection Process
Preferred Characteristics
As an indicative guide (this is not a score sheet), in no particular order, I evaluate candidates according to:
-
Science background: What indicators are there that the candidate can think scientifically, can run their own research projects or help others effectively in theirs?
-
Quantitative skills: How likely is the candidate to have a good grasp of mathematics that is relevant for interpretability research? Note that this may include advanced math topics that have not yet been widely used in interpretability research but have potential to be.
-
Engineering skills: How strong is the candidates’ engineering background? Have they worked enough with python and the relevant libraries? (e.g. Pytorch, scikit learn)
-
Other interpretability prerequisites: How likely is the candidate to have a good grasp of the content gestured at in Neel Nanda’s list of interpretability prerequisites?
-
Safety research interests: How deeply has the candidate engaged with the relevant AI safety literature? Are the research directions that they’ve landed on consistent with a well developed theory of impact?
-
Conscientiousness: In interpretability, as in art, projects are never finished, only abandoned. How likely is it that the candidate will have enough conscientiousness to bring a project to a completed-enough state?
In the past cohort I chose a diversity of candidates with varying strengths and I think this worked quite well. Some mentees were outstanding in particular dimensions, others were great all rounders.
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
What’s your motivation for applying to this stream and project? (max 100 words)
What do you want to work on? Please give a 3-5 paragraph pitch for your research idea that fits this stream.
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
Hidenori Tanaka
As AI systems become more capable and human-like, ensuring alignment requires going beyond traditional benchmarking methods and interpretability techniques. Our research stream introduces a novel paradigm, “cognitive alignment,” blending cognitive science, physics, neuroscience, and psychology to develop rigorous mathematical frameworks. Our goal is to precisely characterize how AI systems represent and interpret their environment, enabling control over AI behavior to ensure safety and trustworthiness.
Building frameworks of cognitive alignment
As the capabilities of AI systems become increasingly agentic, we need frameworks to characterize the “structure” of how AI systems represent the environment they interact with and how that informs their behavioral choices. We will achieve this by developing mathematical formulations inspired by cognitive science more broadly. Specifically, we will utilize inverse reinforcement learning to characterize how AI agents interpret the states they are in and how they make subsequent behavioral decisions accordingly.
Mentorship Process
Scholars will have regular weekly meetings with mentors to discuss research progress, obstacles, and ideas. These meetings emphasize collaborative problem-solving, deep discussion, and clear conceptual development. Mentors will actively support scholars in building clarity and self-sufficiency in their research process.
Meeting Frequency
Project Selection Process
Preferred Characteristics
Scholars in our stream should:
- Be passionate about interdisciplinary research, particularly integrating cognitive science, physics, psychology, and neuroscience.
- Enjoy developing precise mathematical models and theoretical frameworks.
- Demonstrate curiosity and a commitment to rigor in experimental design and analysis.
- Have an interest in exploring fundamental questions about the nature of intelligence, both artificial and human.
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
AI agency (6 streams)
Andreea Bobu
We are broadly interested in AI agents learning to do tasks for, with, and around humans. Our main research motivations are ensuring that these agents are value aligned with the humans they are meant to support, whether the human is an expert designer, a novice end user, or a stakeholder of the AI system. The work that we do involves reward learning, learning from (potentially multiple) kinds of human feedback, active learning, representation learning, quantifying misalignment.
We are broadly interested in AI agents learning to do tasks for, with, and around humans. Our main research motivation is ensuring that these agents are value aligned with the humans they are meant to support, whether the human is an expert designer, a novice end user, or a stakeholder of the AI system. This is challenging for many reasons. Values are notoriously difficult to specify to our AI agents.
Not only that, but learning values (for instance, as reward functions) — which is the current popular alternative to specifying them by hand — is also difficult:
- getting the right data to supervise the learning (via RLHF-style methods) is nontrivial because humans are imperfect, not infinitely queryable, and have unique and changing preferences and values;
- the representations we choose to mathematically express values may themselves be misaligned, thus preventing us from ever being able to capture the “true” desired values;
- reliably quantifying misalignment to be able to robustly tell when the AI system is safe for operation is still under explored.
To tackle these challenges, in our research we bring cross-disciplinary expertise from deep learning, human-robot interaction, mathematical human modeling, cognitive psychology, and probabilistic uncertainty, with the hope of creating more aligned, generalizable, and robust learning algorithms and AI systems.
With this context in mind, research topics can be flexible depending on interests and fit but we are broadly interested in the following three thrusts of work (which you can also read more about on our lab website):
- Getting (enough of) the right data to align our AI agents’ values with humans. Typical RLHF methods treat humans as infinitely queryable oracles. While we do have large amounts of data on the internet, each individual human the AI agent will interact with will have unique and changing preferences, values, and biases that canned internet data alone may not reflect. Since we want AI agents to quickly adapt and align their existing values with each human user, how can we relax the infinitely queryable oracle assumption?
- How do we create data efficient algorithms that ask for the right input (maximize information gain) while not over-burdening the human (minimize human effort)? Can we make use of active learning techniques that combine these objectives in order to minimize the amount of queries required from the human?
- What kind of input do we want to ask humans for? Preference queries are low effort but contain relatively little information for the agent. Could we enhance the information gain by appending targeted explanations to the preferences (e.g. “I prefer A to B because A has less political content”)? Could we combine different feedback modalities (showing inputs like examples of correct behavior vs telling inputs like language corrections) for increased agent alignment?
- How do we make use of large pre-trained models, while still adapting to the specific human’s needs? LLMs contain very powerful human priors that we have found to substantially reduce human burden when learning how to execute tasks from them. Can we make use of these LLM priors to amplify the data we receive from humans? Can we prompt LLMs to think about the reason underlying a human input or decision, and then use that causal reason to generate new situations where the human would likely behave in a similar way? Can we use LLMs to elicit preferences from humans?
- Aligning agent representations with humans for increased downstream value alignment. We show in recent work that if the representations we choose to express values are misaligned with those of humans, the downstream learned values will also be inevitably misaligned. Misaligned representations lead to unintended, unexpected, and even catastrophic behavior. How can we learn agent representations of the world that are aligned and in agreement with those of the humans they are cooperating with? Our previous efforts have looked at methods that learn representations one concept at a time vs all-at-once, but each approach has important tradeoffs: one concept at a time leads to more interpretable and structure-inducing representations, but they require the human to think of and explicate every dimension of their representation; all-at-once is more scalable and effortless but results in more entangled and less interpretable representations. Can we somehow get the best of both worlds: can we obtain more interpretable and disentangled representations while making it easier for the human to teach the AI agent about their representation? Moreover, if these representations are indeed interpretable, can we use them to facilitate smoother cooperation and communication between humans and their AI agent?
- Reliably quantifying misalignment. In order to even know when to stop current (potentially unsafe) behavior and initiate the process of alignment with the human, the AI system needs to be able to quantify misalignment. Our past work has looked at quantifying misalignment on small Bayesian models, but there is little principled work on quantifying misalignment on large models like LLMs. How can we quantify and detect misalignment by monitoring inconsistencies between the human’s feedback and the AI agent’s behavior? How can AI agents disentangle between inconsistencies due to incorrect models vs due to noisy human judgements? How should AI agents resolve that misalignment once detected? Should AI systems always assume that the human is right and thus they need to adjust their model, or should there be a debate process where the human and the agent arrive at a shared model of the world together?
air
Mentorship Process
Mentorship with us will look like:
- Weekly meetings on Zoom for 30-60 minutes depending on need (sometimes together, other times individual with one mentor)
- Frequent Slack communication with expected response time <24h. We are happy to unblock you whenever you’re stuck.
- Mostly medium- and high-level feedback as we make progress through the project. We won’t be able to help you with the specific parameters for a model or with specific lines of code, but we can provide guidance for how to debug something.
- Detailed feedback on paper drafts.
- Detailed feedback on presentation material (e.g. talk, poster) once a paper gets in.
Meeting Frequency
Project Selection Process
Preferred Characteristics
The best candidates are those who are passionate about exploring how thinking about the human element (designers, end users, or stakeholders of AI) in the human-AI interaction equation can lead to building more aligned, generalizable, and robust learning algorithms and AI systems.
In terms of hard skills, ideal candidates should have one or more of the following:
- Previous research experience in machine learning (or related fields), ideally with published work.
- Strong software engineering background and familiarity with machine learning tools and techniques.
- Experience with running human subject experiments and statistical analyses is a plus.
As for soft skills:
- The ability to communicate clearly.
- Strong intrinsic motivation and curiosity for learning.
- A desire to publish an academic paper.
- A sense for how to build a story is useful, in that doing good impactful research also involves communicating said research in a way that the community is drawn to and compelled by.
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
What do you want to work on? Please give a 1-2 paragraph pitch for your research idea that fits this stream.
Why do you think you’d be a good fit for the stream and project? This can be based on prior experience, personal preferences, career aims, or anything else you personally think makes you a good fit. (max 100 words)
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
Alex Turner, Alex Cloud
In the shard theory stream, we create qualitatively new methods and fields of inquiry, from steering vectors to gradient routing to unsupervised capability elicitation. If you’re theory-minded, maybe you’ll help us formalize shard theory itself.
Discovering qualitatively new techniques
Steering GPT-2-XL by adding an activation vector opened up a new way to cheaply steer LLMs at runtime. Subsequentwork has reinforced the promise of this technique, and steering vectors have become a small research subfield of its own. Unsupervised discovery of model behaviors may now be possible thanks to Andrew Mack’s method for unsupervised steering vector discovery. Gradient routing potentially unlocks the ability to isolate undesired circuits to known parts of the network, after which point they can be ablated or studied.
What other subfields can we find together?
Formalizing shard theory
Shard theory has helped unlock a range of empirical insights, including steering vectors. The time seems ripe to put the theory on firmer mathematical footing. For initial thoughts, see this comment.
Mentorship Process
Turner is available for 45-minute weekly 1:1s and a weekly team meeting, with additional communication over Slack. He regularly stops by the office to hang out with scholars. Cloud expects to have similar availability. Both are deeply committed to scholar flourishing and have a long track record of mentorship.
Meeting Frequency
Project Selection Process
Preferred Characteristics
Ideal candidates would have:
- Academic background in machine learning, computer science, statistics, or a related quantitative field.
- Familiarity with ML engineering.
- Proven experience working on machine learning projects, either academically or professionally.
- Strong programming skills, preferably in Python, and proficiency in data manipulation and analysis.
- Ability to write up results into a paper.
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
Briefly describe your most relevant skills and experience other than AI research (e.g. software engineering, other research, teamwork, writing, or anything else) (max 250 words).
Propose a follow-up experiment to section 4.2 of the gradient routing paper (https://arxiv.org/abs/2410.04332) and explain the relevance of the experiment to AI safety efforts. (Suggested length: ~250 words for experiment, 1-4 sentences for relevance.)
Explain your strongest disagreement with other alignment thinkers. Consider only your own inside-view understanding, and don’t defer to others’ expertise.
Are you open to working in a small team? We plan to prioritize people that answer “yes” to this question, although it is not a strict requirement.
If you’re open to working in a team, briefly reflect on how you might make this go well. What would a great collaboration look like to you? In your answer, you might draw on prior experience in teams, but you don’t have to. (max 250 words).
If accepted, would you participate in the program full-time, in person in Berkeley (Jun 16 - Aug 22)? If you’re not sure, please explain. We are happy to accept people who have other commitments, as long as MATS would be their primary focus.
If the project is going well, and you receive funding to continue your work afterward, do you expect to choose to participate in the MATS extension (between 6 and 12 months after the main program)? This could be remote. It’s okay if you aren’t sure if you’ll be available for the entire period.
Is there anything else we should know? (If the rest of your application speaks for itself, feel free to leave this blank.)
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
Michael Dennis
Progress in AI is driven towards better solving problem specifications like next-token prediction, the policy gradient loss, the diffusion denoising-loss, the objective in RLHF or the debate. In this stream we will build towards an understanding of how the objectives of a system predict its behavior and failure modes, and aim to design objectives whose failures are more likely to be benign. The approach tends to be a mixture of game and decision theoretic analysis of objectives, and empirical implementation of the systems to demonstrate the intended effects.
I am interested in understanding the failure modes of current popular problem specifications and designing problem specifications which do not have those failure modes. I believe we can jointly develop ideas in this area into a successful research program.
A couple of example projects include:
-
A classification scheme for misspecification errors. Every loss function is a misspecification of what the designer wants out of the model. For instance, GAN losses end up focusing on the parts of the space of images that are sufficiently easy to model, and thus can avoid hard-to-generate images like those containing hands. However, it’s not always clear which aspects of the misspecification of this loss actually mater. In this work we would develop methodology for quickly, and ideally automatically, understanding the sorts of misspecification inherent in a loss function. Such a system would have allowed us to quickly deduce from the original gan loss that hands would be systematically under-represented.
-
Policy-gradient with increased specification flexibility. The field of AI often chooses a specific specification and dedicates a years or decades optimising it. If these specifications are wrong, it is a recipe for technological lock-in where the specification would be impractical to change even if we agree it ought to. In this project, we would investigate the extent to which we can use the existing technology stack to optimize objectives which are distinct from the expected discounted sum of scalar rewards. This could either be through using policy-gradient in carefully adaptive environments as in UED, or it could be through determining a new bellman backup and policy gradient. In any case, it would be focused on a solution which is scalable and adds useful flexibility.
Representative Papers
- The level of abstraction we would be trying to intervene on would be the one in these papers, or one meta-level up (A theory of how to more easily make or understand these papers): PAIRED (arXiv:2012.02096), Debate (arXiv:1805.00899), CIRL (arXiv:1606.03137), Refining Regret (arXiv:2402.12284), Other Play (arXiv:2402.12284), RLHF (arXiv:1706.03741)
Mentorship Process
Mentorship looks like:
- Weekly 1-on-1s
- Slack presense
- Ad-hoc short meetings
- Reviewing of proofs or writing as needed
Meeting Frequency
Project Selection Process
Preferred Characteristics
Ideal candidates would have:
- Academic background in machine learning, computer science, mathematics, or a related quantitative field.
- Familiarity with ML engineering.
- An interest in theoretical tools. (Ideally, has written a proof)
- Strong intrinsic motivation and curiosity for learning.
- A desire to publish an academic paper.
- Strong communication skills
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
What’s your motivation for applying to this stream and project? (max 100 words)
Please provide the name, email, and relation for 1-2 references.
If possible, suggest referees involved in AI Safety.
What do you want to work on? Please give a 3-5 paragraph pitch for your research idea that fits this stream.
What are your chances of being interested in continuing to work together beyond the ~2-month MATS program? E.g. Extending for another ~2 months until the project runs to completion. Include a %, with some explanation of your number as needed.
Explain your strongest disagreement with other alignment thinkers. Consider only your own inside-view understanding, and don’t defer to others’ expertise.
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
Lewis Hammond
Projects in my stream largely focus on multi-agent safety, cooperative AI, and/or governing AI agents. Several projects are also relevant to scalable oversight. They range from conceptual/theoretical to empirical, and include a couple of AI governance/ethics projects for those with less technical backgrounds.
Please feel free to contact me for access to a non-public database of around 15 projects that I am currently interested in mentoring. I am also happy to consider proposals from mentees. The broader themes of my proposals are:
- Detecting collusion, the emergence of ‘collective agents’, and novel, dangerous collective goals or capabilities;
- Analysing the importance of differing capability levels in strategic interactions between AI agents;
- Understanding how to measure the ‘cooperative intelligence’ of AI agents, as well as differential progress on cooperation;
- Creating new proposals for governing AI agents, potentially inspired by existing domains such as trading algorithms in financial markets.
If you have research ideas in these areas, or more generally in multi-agent safety, cooperative AI, and/or governing AI agents, then please feel free to get in touch with me.
Representative Papers
- Multi-Agent Risks from Advanced AI (arXiv:2502.14143)
- Cooperation and Control in Delegation Games
- Secret Collusion among AI Agents: Multi-Agent Deception via Steganography
- Neural Interactive Proofs
- Open Problems in Technical AI Governance (arXiv:2407.14981)
- Visibility into AI Agents
- Reasoning about Causality in Games
Mentorship Process
I aim to be a supportive and responsive mentor, though I do have many other demands on my time outside of mentorship. On average, I would expect to have a one hour meeting per week at a regular slot, as well as occasional check-in calls and quick catch-ups that I can use to help unblock mentees on a more ad hoc basis, if needed. I will be fairly responsive via Slack, but often prefer short calls to long message exchanges. Mentees should send me material in advance if they want my feedback, and prepare specific questions and updates for each meeting.
Meeting Frequency
Project Selection Process
Preferred Characteristics
Each of my listed projects describes the background I would expect a mentee to have, and the difficulty level of the projects along different axes. In general, I would expect scholars to be interested in multi-agent problems and have some relevant background (e.g., having taken a course on game theory, or an equivalent level of self-study). Scholars should also be reasonably autonomous, proactive, and good at managing their own time and projects (though I understand that MATS provides some assistance in this regard).
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
What project(s) would you be interested in working on, and why? This can be a project I listed, or your own project. If your own project, please provide a 100-150 word description.
For one of the projects you are interested in working on, please provide a concrete outline of how you would tackle the project in 150-250 words. What hypotheses do you have? How would you test them? What if you’re wrong? What existing results/implementations might you make use of? What will you do if these things don’t work? Please be as a concrete and specific as possible and avoid repeating the project description.
Please give a brief overview (max 200 words) of your aims, weaknesses, and uncertainties when it comes to the project(s) you are interested in. What are you most interested in gaining by completing the project (e.g., a new skill, knowledge, any particular kinds of concrete outputs)? What would you most need my support on? What are you most uncertain about?
What are your chances of being interested in continuing to work together beyond the ~2-month MATS program? E.g. Extending for another ~2 months until the project runs to completion. Include a %, with some explanation of your number as needed.
Please provide the name, email, and relation for 1 -2 references. If possible, referees should work on AI safety, and be people you have collaborated with on research before.
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
Richard Ngo
This stream includes two projects. The first is an agent foundations project aiming to further develop a unified theory of intelligent agency. The second is an AI governance project on designing unprecedentedly trustworthy institutions.
My stream includes two projects, corresponding to the two broad ideas that I’m spending most of my time researching. The main advantage of this is that scholars will work closely with me on my main priorities. The main disadvantage is that the projects won’t be very clearly-defined or very optimized for producing legible pieces of work (e.g. research papers). Instead, they’ll be very open-ended; once scholars start we’ll work together to figure out what aspects of the projects are most worth exploring.
-
Project 1: Coalitional Agency. This is a theoretical project that will work on unifying the expected utility maximization and active inference frameworks. The goal is to make progress towards a single unified theory of intelligent agency which can characterize the types of goals that arise in neural-network-based agents. You can read more about the ideas behind this project here: https://docs.google.com/document/d/1sLpW15FFowO4w3YmSPGya1hHRKszwzY5FPLDULDcdos/edit?tab=t.0
-
Project 2: AI Trust. This project will work towards designing unprecedentedly trustworthy institutions, as a way of cutting through tradeoffs in AI governance. Specifically, we’ll be trying to figure out how institutions built around AI agents could provide much stronger guarantees of honesty than current human institutions. For more about this project, see the last four minutes of this talk: https://x.com/RichardMCNgo/status/1863627124801737105.
Mentorship Process
I’ll come meet scholars in person around 2 days a week on average. On those days I’ll be broadly available for discussions and brainstorming. On other days scholars can message me for guidance (though I’d prefer to spend most of my effort on this during the in-person days).
Meeting Frequency
Project Selection Process
Preferred Characteristics
My main criterion for selecting scholars will be clarity of reasoning (and, for project 1, mathematical reasoning in particular). In your application, please share pieces of writing or research that display this.
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
Please provide 1-4 research or writing samples. (Please share a number in the higher side of that range, if you already have enough samples that are easy to share.) These can be of any type: research papers, class papers, blog posts, popular articles, etc. Share by pasting the URL (with link-sharing on if applicable), but not by directly pasting. Samples would ideally reflect your writing or research skills, relevant knowledge and interests, etc. Samples do not need to be formally published; you can also share a link to a non-public document or PDF.
If interested in Project 1 (coalitional agency), read the following post (https://docs.google.com/document/d/1sLpW15FFowO4w3YmSPGya1hHRKszwzY5FPLDULDcdos/edit?usp=sharing) and write a response of around one page. This may include critiques of the post, filling in gaps, ideas for follow-up work, etc.
If interested in Project 2 (AI trust), watch the last four minutes of this talk (https://x.com/RichardMCNgo/status/1875310288955961794, starting from the slide on “AI as extreme concentration of power”). Write around half a page on an AI trust building block (either one I listed or one of your own) and how it might be used. Then watch this talk (https://x.com/RichardMCNgo/status/1863627124801737105) and write around half a page of analysis/critique of the claims made.
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300
Fernando Rosas
Interested in applications of computational mechanics to AI interpretability, with particular interest in RL and also transformers. I’m also interest in the application of formal notions of emergence to interpretablity
In general, I’m interested in any formal approach to understand how AI systems build internal representations. My main interest is to extend the line of work started by Shai et al (2025), which identifies the internal world model of a transformer using tools of computational mechanics. I’m most interested of extending it in two directions:
- Applying this to RL scenarios, where the belief structure would characterise how agents in partially observed setting form beliefs about the world.
- Apply to transformers by using ideas from Rosas et all (2024) related to coarse-grainings to understand generalisation and compositionality.
More generally, I’m interested in using various existent formal notions of what “emergence” means to interpretability. There are various approaches to formalise emergence (see eg. (Rossa 2024) and (Mediano 2022)), which focuses on different aspects. I’d be interested in exploring if the computational tools related to these approaches could be used to quantitatively investigate aspects of emergence in various AI systems.
Representative Papers
- Transformers represent belief state geometry in their residual stream (arXiv:2405.15943) (2025)
Learning diverse causally emergent representations from time series data (2025) - Software in the natural world: A computational approach to hierarchical emergence (arXiv:2402.09090) (2024)
- Synergistic information supports modality integration and flexible learning in neural networks solving multiple tasks (arXiv:2210.02996) (2024)
Greater than the parts: a review of the information decomposition approach to causal emergence (2022) - Reconciling emergences: An information-theoretic approach to identify causal emergence in multivariate data (arXiv:2004.08220) (2020)
Quantifying high-order interdependencies via multivariate extensions of the mutual information (2019)
Mentorship Process
1 hour online meetings once per week (towards the end of the period it could be more). Additional communication via email would also possible.
Meeting Frequency
Project Selection Process
Preferred Characteristics
Good understanding and some experience working with either transformers or RL is required. It would be ideal if the candidate is familiar with computational mechanics, but this is not compulsory. Familiarity with information theory is also a plus.
In general, the candidate should have either strong theory skills (good knowledge of maths and/or physics and/or theoretical CS) and reasonable coding skills, or strong coding skills with a big interest in theory.
Scholar Collaboration Expectations
Self-Sufficiency Expectations
Questions
What do you want to work on? Please give a 1-2 paragraph pitch for your research idea that fits this stream.
What’s your motivation for applying to this stream and project? (max 100 words)
Which topic(s) would you be excited to work on as part of this stream and why? Feel free to select one of the example project ideas provided or propose your own. [150 words max]
[optional] This question was added by the MATS team due to the strong signal it has provided some mentors in the past.
What are 1-3 pieces of evidence that you’d be able to do good research in this stream? (These don’t have to be standard credentials!) Please concisely describe them and why they’re relevant. Aim for 50-100 words, max 300