METR
@METR_Evals
We work to scientifically measure whether and when AI systems might threaten catastrophic harm to society. Nonprofit.

METR’s posts

Pinned
Since early 2025, we've been studying how AI tools impact productivity among developers. Previously, we found a 20% slowdown. That finding is now outdated. Speedups now seem likely, but changes in developer behavior make our new results unreliable. We’re working to address this.
A conjecture about time horizons: an LLM's probability of success at a task is better predicted by the time it takes a *team* to complete that task than by the time it takes an individual.
very measured piece on the extraordinary exponential progress we’ve seen in AI capabilities recently: what it means, the difficulty of keeping up, and where it might go in the future. great job! and thank you for having me on!
Quote
Rowland Manthorpe
@rowlsmanthorpe
Are software developers the horses of the twenty-first century? New piece on the METR chart and everything it means with @Ele_Chiarella and @joel_bkr youtu.be/tYvYYFJ3Gww?si
some current views on AI-driven speedup of software engineering: 🧵
Quote
Eli Lifland
@eli_lifland
Replying to @METR_Evals
@joel_bkr obviously these results have high error bars but I'm curious if you have updated; if these are significant underestimates, I feel like this would imply a central uplift estimate that is higher than my impression of your views.
our existing uplift study design is broken: devs decline to work without AI (they join the experiment and submit tasks for which they expect AI to be helpful, but then don't complete the tasks randomized to no AI), leading us to underestimate uplift.
Quote
METR
@METR_Evals
Since early 2025, we've been studying how AI tools impact productivity among developers. Previously, we found a 20% slowdown. That finding is now outdated. Speedups now seem likely, but changes in developer behavior make our new results unreliable. We’re working to address this.
We are adjusting the design of our experiment going forward to address these concerns and to better estimate uplift. We expect that our measurement techniques will need to continually evolve as AI capabilities improve.
We estimate that GPT-5.3-Codex with reasoning effort `high` (not `xhigh`) has a 50%-time-horizon of around 6.5 hours (95% CI of 3 hrs to 17 hrs) on our suite of software tasks. OpenAI provided API access for this evaluation.
For this measurement, we used our Triframe scaffold as usual (not Codex). We did a partial measurement with a Codex scaffold, and our results did not seem very different. This is in line with similar comparisons we’ve run in the past.
Quote
Nikola Jurkovic
@nikolaj2030
I looked into how Claude Code and Codex compare to the default scaffolds METR uses for time horizon measurements. It looks like they don't significantly outperform our defaults on any of the models we've tried them on so far.
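For intuition about what a "50% time horizon" means, here is a minimal hypothetical sketch (not METR's actual pipeline; the data, scale, and fitting choices are invented for illustration): fit a logistic curve of success probability against log task length, then solve for the length at which predicted success is 50%.

```python
# Hypothetical sketch of a "50% time horizon" estimate (NOT METR's
# actual pipeline; the data and fitting choices are invented).
# Fit p(success) = sigmoid(a * log2(t) + b) to per-task outcomes,
# then solve for the task length t where p = 0.5.
import math

def fit_logistic(durations_min, successes, lr=0.2, steps=50_000):
    """Fit p(success) = sigmoid(a * log2(t) + b) by gradient descent."""
    xs = [math.log2(t) for t in durations_min]
    a = b = 0.0
    n = len(xs)
    for _ in range(steps):
        ga = gb = 0.0
        for x, y in zip(xs, successes):
            p = 1.0 / (1.0 + math.exp(-(a * x + b)))
            ga += (p - y) * x / n
            gb += (p - y) / n
        a -= lr * ga
        b -= lr * gb
    return a, b

def time_horizon_50(a, b):
    """sigmoid(a * log2(t) + b) = 0.5  =>  t = 2 ** (-b / a)."""
    return 2.0 ** (-b / a)

# Toy outcomes: mostly solved below ~30 minutes, mostly failed above,
# with mixed results near the boundary.
durations = [1, 2, 4, 8, 15, 30, 60, 120, 240, 480]  # minutes
successes = [1, 1, 1, 1, 1,  0,  1,  0,   0,   0]
a, b = fit_logistic(durations, successes)
print(f"estimated 50% time horizon: {time_horizon_50(a, b):.0f} minutes")
```

In this framing, a longer time horizon just means the fitted curve crosses 50% at a longer task length; the estimate inherits all the noise of which tasks happen to sit near that crossing point.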
Seems like a lot of people are taking this as gospel. When we say the measurement is extremely noisy, we really mean it. Concretely, if the task distribution we're using here were just a tiny bit different, we could've measured a time horizon of 8 hours, or 20 hours.
Quote
METR
@METR_Evals
We estimate that Claude Opus 4.6 has a 50%-time-horizon of around 14.5 hours (95% CI of 6 hrs to 98 hrs) on software tasks. While this is the highest point estimate we’ve reported, this measurement is extremely noisy because our current task suite is nearly saturated.
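The "tiny bit different task distribution" point can be made concrete with a toy bootstrap (again, not METR's methodology; the estimator and data below are invented): resample the task suite with replacement, re-estimate the horizon each time, and read off a percentile interval.

```python
# Toy bootstrap over a task suite (NOT METR's methodology; the
# estimator and data are invented for illustration). Resampling tasks
# with replacement shows how sensitive a time-horizon estimate is to
# exactly which tasks were sampled.
import math
import random

def horizon_estimate(tasks):
    """Crude 50% horizon: geometric midpoint between the longest task
    the agent solved and the shortest task it failed (toy estimator)."""
    solved = [t for t, ok in tasks if ok]
    failed = [t for t, ok in tasks if not ok]
    if not solved:
        return min(t for t, _ in tasks)
    if not failed:
        return max(t for t, _ in tasks)
    return math.sqrt(max(solved) * min(failed))

def bootstrap_ci(tasks, n_boot=2000, seed=0):
    """95% percentile interval over resampled task suites."""
    rng = random.Random(seed)
    estimates = sorted(
        horizon_estimate([rng.choice(tasks) for _ in tasks])
        for _ in range(n_boot)
    )
    return estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot)]

tasks = [(1, True), (4, True), (15, True), (60, True),
         (120, False), (240, True), (480, False), (960, False)]  # (min, success)
lo, hi = bootstrap_ci(tasks)
print(f"point estimate: {horizon_estimate(tasks):.0f} min, "
      f"95% bootstrap CI: [{lo:.0f}, {hi:.0f}] min")
```

Even with this tiny toy suite, the interval spans a wide multiple of the point estimate, which is the same qualitative behavior as the wide CIs in the posts above.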
Our team is stretched thin at the moment! To continue upper-bounding the autonomy of AI agents, and developing evaluations for monitoring AI systems and their propensity to subvert human control, we need more great engineering and research staff. Please apply below or DM me!
We are working on updated methods to better track state-of-the-art AI capabilities. However, these are still in development so they don't address our immediate measurement gap. In the meantime, we advise caution in interpreting and comparing our recent time-horizon measurements.