METR
@METR_Evals
We work to scientifically measure whether and when AI systems might threaten catastrophic harm to society. Nonprofit.

METR’s posts

Pinned
Since early 2025, we've been studying how AI tools impact productivity among developers. Previously, we found a 20% slowdown. That finding is now outdated. Speedups now seem likely, but changes in developer behavior make our new results unreliable. We’re working to address this.
A conjecture about time horizons: an LLM's probability of success at a task is better predicted by the time it takes a *team* to complete that task than by the time it takes an individual.
very measured piece on the extraordinary exponential progress we’ve seen in AI capabilities recently: what it means, the difficulty of keeping up, and where it might go in the future. great job! and thank you for having me on!
Quote
Rowland Manthorpe
@rowlsmanthorpe
Are software developers the horses of the twenty-first century? New piece on the METR chart and everything it means with @Ele_Chiarella and @joel_bkr youtu.be/tYvYYFJ3Gww?si
some current views on AI-driven speedup of software engineering: 🧵
Quote
Eli Lifland
@eli_lifland
Replying to @METR_Evals
@joel_bkr obviously these results have high error bars but I'm curious if you have updated; if these are significant underestimates, I feel like this would imply a central uplift estimate that is higher than my impression of your views.
our existing uplift study design is broken: devs decline to work without AI (they join the experiment and submit tasks for which they expect AI to be helpful, but then don't complete the tasks randomized to no AI), leading us to underestimate uplift.
Quote
METR
@METR_Evals
Since early 2025, we've been studying how AI tools impact productivity among developers. Previously, we found a 20% slowdown. That finding is now outdated. Speedups now seem likely, but changes in developer behavior make our new results unreliable. We’re working to address this.
We are adjusting the design of our experiment going forward to address these concerns and to better estimate uplift. We expect that our measurement techniques will need to continually evolve as AI capabilities improve.
We estimate that GPT-5.3-Codex with reasoning effort `high` (not `xhigh`) has a 50%-time-horizon of around 6.5 hours (95% CI of 3 hrs to 17 hrs) on our suite of software tasks. OpenAI provided API access for this evaluation.
For this measurement, we used our Triframe scaffold as usual (not Codex). We did a partial measurement with a Codex scaffold, and our results did not seem very different. This is in line with similar comparisons we’ve run in the past.
Quote
Nikola Jurkovic
@nikolaj2030
I looked into how Claude Code and Codex compare to the default scaffolds METR uses for time horizon measurements. It looks like they don't significantly outperform our defaults on any of the models we've tried them on so far.
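For intuition about what a "50% time horizon" means, here is a minimal hypothetical sketch (not METR's actual pipeline; the data, scale, and fitting choices are invented for illustration): fit a logistic curve of success probability against log task length, then solve for the length at which predicted success is 50%.

```python
# Hypothetical sketch of a "50% time horizon" estimate (NOT METR's
# actual pipeline; the data and fitting choices are invented).
# Fit p(success) = sigmoid(a * log2(t) + b) to per-task outcomes,
# then solve for the task length t where p = 0.5.
import math

def fit_logistic(durations_min, successes, lr=0.2, steps=50_000):
    """Fit p(success) = sigmoid(a * log2(t) + b) by gradient descent."""
    xs = [math.log2(t) for t in durations_min]
    a = b = 0.0
    n = len(xs)
    for _ in range(steps):
        ga = gb = 0.0
        for x, y in zip(xs, successes):
            p = 1.0 / (1.0 + math.exp(-(a * x + b)))
            ga += (p - y) * x / n
            gb += (p - y) / n
        a -= lr * ga
        b -= lr * gb
    return a, b

def time_horizon_50(a, b):
    """sigmoid(a * log2(t) + b) = 0.5  =>  t = 2 ** (-b / a)."""
    return 2.0 ** (-b / a)

# Toy outcomes: mostly solved below ~30 minutes, mostly failed above,
# with mixed results near the boundary.
durations = [1, 2, 4, 8, 15, 30, 60, 120, 240, 480]  # minutes
successes = [1, 1, 1, 1, 1,  0,  1,  0,   0,   0]
a, b = fit_logistic(durations, successes)
print(f"estimated 50% time horizon: {time_horizon_50(a, b):.0f} minutes")
```

In this framing, a longer time horizon just means the fitted curve crosses 50% at a longer task length; the estimate inherits all the noise of which tasks happen to sit near that crossing point.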
Seems like a lot of people are taking this as gospel. When we say the measurement is extremely noisy, we really mean it. Concretely, if the task distribution we're using here were just a tiny bit different, we could've measured a time horizon of 8 hours, or 20 hours.
Quote
METR
@METR_Evals
We estimate that Claude Opus 4.6 has a 50%-time-horizon of around 14.5 hours (95% CI of 6 hrs to 98 hrs) on software tasks. While this is the highest point estimate we’ve reported, this measurement is extremely noisy because our current task suite is nearly saturated.
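The "tiny bit different task distribution" point can be made concrete with a toy bootstrap (again, not METR's methodology; the estimator and data below are invented): resample the task suite with replacement, re-estimate the horizon each time, and read off a percentile interval.

```python
# Toy bootstrap over a task suite (NOT METR's methodology; the
# estimator and data are invented for illustration). Resampling tasks
# with replacement shows how sensitive a time-horizon estimate is to
# exactly which tasks were sampled.
import math
import random

def horizon_estimate(tasks):
    """Crude 50% horizon: geometric midpoint between the longest task
    the agent solved and the shortest task it failed (toy estimator)."""
    solved = [t for t, ok in tasks if ok]
    failed = [t for t, ok in tasks if not ok]
    if not solved:
        return min(t for t, _ in tasks)
    if not failed:
        return max(t for t, _ in tasks)
    return math.sqrt(max(solved) * min(failed))

def bootstrap_ci(tasks, n_boot=2000, seed=0):
    """95% percentile interval over resampled task suites."""
    rng = random.Random(seed)
    estimates = sorted(
        horizon_estimate([rng.choice(tasks) for _ in tasks])
        for _ in range(n_boot)
    )
    return estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot)]

tasks = [(1, True), (4, True), (15, True), (60, True),
         (120, False), (240, True), (480, False), (960, False)]  # (min, success)
lo, hi = bootstrap_ci(tasks)
print(f"point estimate: {horizon_estimate(tasks):.0f} min, "
      f"95% bootstrap CI: [{lo:.0f}, {hi:.0f}] min")
```

Even with this tiny toy suite, the interval spans a wide multiple of the point estimate, which is the same qualitative behavior as the wide CIs in the posts above.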
Our team is stretched thin at the moment! To continue upper-bounding the autonomy of AI agents, and developing evaluations for monitoring AI systems and their propensity to subvert human control, we need more great engineering and research staff. Please apply below or DM me!
We are working on updated methods to better track state-of-the-art AI capabilities. However, these are still in development so they don't address our immediate measurement gap. In the meantime, we advise caution in interpreting and comparing our recent time-horizon measurements.