
This is the latest in a series of essays on AI Scaling. 
You can find the others on my site.

Summary: RL-training for LLMs scales surprisingly poorly. Most of its gains come from allowing LLMs to productively use longer chains of thought, i.e. from letting them think longer about a problem. There is some improvement for a fixed length of answer, but not enough to drive AI progress. Given that the scaling up of pre-training compute has also stalled, we'll see less AI progress via compute scaling than you might have thought, and more of it will come from inference scaling (which has different effects on the world). That lengthens timelines and affects strategies for AI governance and safety.

 

The current era of improving AI capabilities using reinforcement learning (from verifiable rewards) involves two key types of scaling:

  1. Scaling the amount of compute used for RL during training
  2. Scaling the amount of compute used for inference during deployment

We can see (1) as training the AI in more effective reasoning techniques and (2) as allowing the model to think for longer. I’ll call the first RL-scaling, and the second inference-scaling. Both new kinds of scaling were present all the way back in OpenAI’s announcement of their first reasoning model, o1, when they showed this famous chart:

I’ve previously shown that in the initial move from a base-model to a reasoning model, most of the performance gain came from unlocking the inference-scaling. The RL training did provide a notable boost to performance, even holding the number of tokens in the chain of thought fixed. You can see this RL boost in the chart below as the small blue arrow on the left that takes the base model up to the trend-line for the reasoning model. But this RL also unlocked the ability to productively use much longer chains of thought (~30x longer in this example). And these longer chains of thought contributed a much larger boost.

The question of where these capability gains come from is important because scaling up the inference compute has very different implications than scaling up the training compute. In this first round of reasoning models, they were trained with a very small amount of RL compute compared to the compute used in pre-training, meaning that the total cost of training was something like 1.01x that of the base-model. But if most of the headline performance results require 30x as much inference compute, then the cost of deploying those capabilities is 30x higher. Since frontier AI developers are already spending more money deploying their models than they did training them, multiplying those costs by 30x is a big deal. Moreover, these are costs that have to be paid every time you want to use the model at this level of capability, so they can't be made up in volume.

But that was just the initial application of RL to LLMs. What happens as companies create more advanced reasoning models, using more RL?

The seeds of the answer can be found all the way back in that original o1 chart.

The chart shows steady improvements for both RL-scaling and inference-scaling, but they are not the same. Both graphs have the same y-axis and (despite the numbers being removed from the x-axis) we can see that they are both on a logarithmic x-axis covering almost exactly two orders of magnitude of scaling (100x). In both cases, the datapoints lie on a relatively straight line, which is presumably the central part of a larger S-curve. However, the slope of the RL-scaling graph (on the left) is almost exactly half that of the inference-scaling graph (on the right). When the x-axis is logarithmic, this has dramatic consequences.

The graph on the right shows that scaling inference-compute by 100x is enough to drive performance from roughly 20% to 80% on the AIME benchmark. This is pretty typical for inference scaling, where quite a variety of different models and benchmarks see performance improve from 20% to 80% when inference is scaled by 100x. 

For instance, this is what was found with Anthropic’s first reasoning model (Sonnet 3.7) on another AIME benchmark, with almost exactly the same scaling behaviour:

And ability on the ARC-AGI 1 benchmark also scales in a similar way for many of OpenAI’s different reasoning models:

We don’t always see this scaling behaviour for inference: some combinations of LLM, inference-scaling technique, and benchmark see the performance plateau below 80% or exhibit a different slope (often worse). But this climb from 20 to 80 with 100x more inference compute is pretty common (especially for reasoning-intensive benchmarks) and almost certainly what is happening on that original o1 graph.

In contrast, the slope of the RL-scaling trend is half as large, which means that it requires twice as many orders of magnitude to achieve the exact same improvement in capabilities. Increasing the RL training compute by 100x as shown in the o1 chart only improved performance from about 33% to 66%. At that rate, going from 20 to 80 would require scaling up the RL training compute by 10,000x.
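
To make the arithmetic explicit, here is a minimal back-of-the-envelope sketch (my own illustration using the rough figures read off the charts, not OpenAI's data), assuming accuracy is roughly linear in log-compute over the straight part of the S-curve:

```python
# Illustrative figures only: the slopes are approximations read off the o1 chart.
# Assume accuracy rises roughly linearly in log10(compute) over the straight
# part of the S-curve.

inference_slope = (80 - 20) / 2      # ~30 points of AIME per order of magnitude of inference
rl_slope = inference_slope / 2       # the RL-scaling curve climbs about half as steeply

target_gain = 80 - 20                # the 60-point climb discussed above

ooms_inference = target_gain / inference_slope   # 2.0 -> 10**2 = 100x inference compute
ooms_rl = target_gain / rl_slope                 # 4.0 -> 10**4 = 10,000x RL compute

print(10 ** ooms_inference, 10 ** ooms_rl)       # 100.0 10000.0
```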

We can confirm this trend — and that it continued beyond o1 — by looking at the following graph from the o3 launch video (with a line added showing the slope corresponding to going from 20 to 80 in 10,000x):

Using another version of the AIME benchmark, this shows o1’s training progress over 3 orders of magnitude and o3’s training over a further order of magnitude. In total, we see that scaling up the RL-training by 4 orders of magnitude takes the model from about 26% to 88%. This provides some confirmation for the rule-of-thumb that a 10,000x scale-up in RL training compute is required to improve this benchmark performance from 20 to 80.

To my knowledge, OpenAI hasn’t provided RL-training curves for other benchmarks, but they do have charts comparing o1 with o3 and o3 with GPT-5 at different inference-scaling levels on several benchmarks. Given that o3 used about 10x as much RL training as o1, we’d expect the RL boost going from o1 to o3 to be worth about the same as the inference boost of giving o1 just half an order of magnitude more inference (~3x as many tokens). And this is indeed what one sees on their performance/token graph comparing the two:

Similarly, o3 also requires about 3x as many tokens to match GPT-5 on the SWE-bench and GPQA Diamond benchmarks. This would fit the expected pattern of GPT-5 having been trained with a further 10x as much RL training compute as o3:

It is hard to verify that this trend holds for models from other companies, as this data on training curves for cutting-edge models is often treated as confidential. But the fact that other leading labs’ base models and reasoning models are roughly on par with OpenAI’s suggests none of them are scaling notably better than this.

So the evidence on RL-scaling and inference-scaling supports a general pattern:

  • a 10x scaling of RL is required to get the same performance boost as a 3x scaling of inference
  • a 10,000x scaling of RL is required to get the same performance boost as a 100x scaling of inference

In general, getting the same benefit from RL-scaling as from inference-scaling requires twice as many orders of magnitude of compute. That's not good.
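
Stated as a rule of thumb (a rough sketch of the pattern above, not an exact law): an RL scale-up of k is worth about the same as an inference scale-up of the square root of k, since it takes twice as many orders of magnitude.

```python
import math

def equivalent_inference_multiplier(rl_multiplier: float) -> float:
    """Rough rule of thumb: the same performance boost needs half as many
    orders of magnitude of inference as of RL, so an RL scale-up of k is
    worth roughly an inference scale-up of sqrt(k)."""
    return math.sqrt(rl_multiplier)

print(equivalent_inference_multiplier(10))       # ~3.16  (10x RL ≈ 3x inference)
print(equivalent_inference_multiplier(10_000))   # 100.0  (10,000x RL ≈ 100x inference)
```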

How do these compare to pre-training scaling?

The jumps from GPT-1 to 2 to 3 to 4 each involved scaling up the pre-training compute by about 100x. How much RL-scaling or inference-scaling would be required to give a similar boost? While I can't say for sure, we can put together the clues we have and take an educated guess.

Jones (2021) and EpochAI both estimate that you need to scale up inference by roughly 1,000x to reach the same capability you'd get from a 100x scale-up of training. And since the evidence from o1 and o3 suggests we need about twice as many orders of magnitude of RL-scaling compared with inference-scaling, this implies we need something like a 1,000,000x scale-up of total RL compute to give a boost similar to a GPT level.
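
Chaining the two rough conversion rates together gives the 1,000,000x figure (both rates are estimates, so this is an order-of-magnitude sketch rather than a measurement):

```python
# Two rough conversion rates, chained together:
#   * Jones (2021) / EpochAI: ~3 orders of magnitude of inference ≈ 2 orders of
#     magnitude of pre-training (1,000x inference ≈ 100x pre-training).
#   * The pattern above: RL needs ~2x as many orders of magnitude as inference.

pretraining_ooms = 2                          # one "GPT level" ≈ 100x pre-training
inference_ooms = pretraining_ooms * 3 / 2     # 3.0 -> ~1,000x inference
rl_ooms = inference_ooms * 2                  # 6.0 -> ~1,000,000x RL compute

print(10 ** inference_ooms, 10 ** rl_ooms)    # 1000.0 1000000.0
```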

This is breathtakingly inefficient scaling. But it fits with the extreme information inefficiency of RL training, which (compared to next-token-prediction) receives less than a ten-thousandth as much information to learn from per FLOP of training compute.

Yet despite the poor scaling behaviour, RL training has so far been a good deal. This is solely because the scaling of RL compute began from such a small base compared with the massive amount of pre-training compute invested in today's models. While AI labs are reluctant to share information about how much compute has actually been spent on RL (witness the removal of all numbers from the twin o1 scaling graphs), it is widely believed that even the 10,000x RL-scaling we saw for o3's training still ended up using much less compute than was spent on pre-training. This means that OpenAI (and their competitors) have effectively got those early gains from RL-training for free.

For example, if the 10x scaling of RL compute from o1 to o3 took them from a total of 1.01x the pre-training compute to 1.1x, then the 10x scale-up came at the price of a 1.1x scale-up in overall training costs. If that gives the same performance boost as using 3x as many reasoning tokens (which would multiply all deployment costs of reasoning models by 3) then it is a great deal for a company that deploys its model so widely.
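
Here is that cost arithmetic as a sketch (the 1.01x starting point is the illustrative figure used above, not a measured one):

```python
def total_training_cost(pretrain: float, rl: float) -> float:
    # Total training compute, in units of the pre-training compute.
    return pretrain + rl

before = total_training_cost(1.0, 0.01)   # RL at 1% of pre-training (illustrative)
after = total_training_cost(1.0, 0.10)    # the same RL compute scaled up 10x

print(after / before)   # ~1.09: the 10x RL scale-up adds only ~10% to overall training costs
```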

But this changes dramatically once RL-training reaches and then exceeds the size of the pre-training compute. In July 2025, xAI’s Grok 4 launch video included a chart suggesting that they had reached this level (where pre-training compute is shown in white and RL-training compute in orange):

Scaling RL by another 10x beyond this point increases the total training compute by 5.5x, and beyond that it is basically the full 10x increase to all training costs. So this is the point where the fact that they get much less for a 10x scale-up of RL compute compared with 10x scale-ups in pre-training or inference really bites. I estimate that at the time of writing (Oct 2025), we’ve already seen something like a 1,000,000x scale-up in RL training and it required ≤2x the total training cost. But the next 1,000,000x scale-up would require 1,000,000x the total training cost, which is not possible in the foreseeable future.
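
The same arithmetic once RL compute has caught up with pre-training compute (again an illustrative sketch, with compute measured in units of the pre-training compute):

```python
def total_cost_multiplier(rl_share: float, rl_scaleup: float) -> float:
    """Factor by which total training compute grows when the RL portion is
    scaled up by `rl_scaleup`, given RL currently amounts to `rl_share`
    of the pre-training compute (so total = 1 + rl_share)."""
    return (1 + rl_share * rl_scaleup) / (1 + rl_share)

print(total_cost_multiplier(1.0, 10))         # 5.5: a 10x RL scale-up at parity
print(total_cost_multiplier(1.0, 100))        # ~50.5: already close to the full 100x
print(total_cost_multiplier(1.0, 1_000_000))  # ~500,000: essentially the full multiplier
```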

Grok 4 was trained on 200,000 GPUs located in xAI's vast Colossus datacenter. Achieving the equivalent of a GPT-level jump through RL would (according to the rough scaling relationships above) require 1,000,000x the total training compute. To put that in perspective, it would require replacing every GPU in their datacenter with 5 entirely new datacenters of the same size, then using 5 years' worth of the entire world's electricity production to train the model. So it looks infeasible for further scaling of RL-training compute to give even a single GPT-level boost.
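
Checking the GPU-count part of that illustration (holding training time per GPU fixed and ignoring hardware efficiency gains, both big simplifications; the electricity figure isn't checked here):

```python
current_gpus = 200_000              # Colossus, per the Grok 4 launch
required_multiplier = 1_000_000     # ~GPT-level jump via RL, per the estimate above

required_gpus = current_gpus * required_multiplier   # 2e11 GPUs at fixed training time
colossus_equivalents = required_gpus / current_gpus  # 1,000,000 datacenters of the same size
per_existing_gpu = colossus_equivalents / current_gpus

print(per_existing_gpu)             # 5.0: five Colossus-sized datacenters per existing GPU
```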

I don’t think OpenAI, Google, or Anthropic have quite reached the point where RL training compute matches the pre-training compute. But they are probably not far off. So while we may see another jump in reasoning ability beyond GPT-5 by scaling RL training a further 10x, I think that is the end of the line for cheap RL-scaling.

Conclusion

The shift towards RL allowed the scaling era to continue even after pre-training scaling had stalled. It did so via two different mechanisms: scaling up the RL training compute and scaling up the inference compute. 

Scaling RL training allowed the model to learn for itself how to achieve better performance. Unlike the imitation learning of next-token-prediction, RL training has a track record of allowing systems to burst through the human level — finding new ways of solving problems that go beyond their training data. But in the context of LLMs, it scales poorly. We've seen impressive gains, but these were only viable when starting from such a low base. We have reached the point where it is too expensive to go much further.

This leaves us with inference-scaling as the remaining form of compute-scaling. RL helped enable inference-scaling via longer chains of thought and, when it comes to LLMs, that may be its most important legacy. But inference-scaling has very different dynamics to scaling up the training compute. For one thing, it scales up the flow of ongoing costs instead of scaling the one-off training cost. This has many consequences for AI deployment, AI risk, and AI governance.

But perhaps more importantly, inference-scaling is really a way of improving capabilities by allowing the model more time to solve the problem, rather than by increasing its intelligence. Now that RL-training is nearing its effective limit, we may have lost the ability to effectively turn more compute into more intelligence.
