

Note: This is a fairly rough post adapted from some comments I recently wrote. I put enough work into those comments that I figured I should probably turn them into a post. So, although this post is technically not a draft, it isn't written the way I would normally write a post: it's less polished and more off the cuff. If you think I should remove the Draft Amnesty tag, please say so, and I will!

I very, very strongly believe there’s essentially no chance of artificial general intelligence (AGI) being developed within the next 7 years.[1] Previously, I wrote a succinct list of reasons to doubt near-term AGI here. For example, it might come as a surprise that around three quarters of AI experts don’t think scaling up large language models (LLMs) will lead to AGI!

For a more technical and theoretical view (from someone with better credentials), I highly recommend this recent video by an AI researcher who makes a strong case for skepticism of near-term AGI and of LLMs as a path to AGI.

In this post, I will give more detailed arguments about why I think near-term AGI is so unlikely and why I think LLMs won't scale to AGI. 

Clarification of terms

By essentially no chance, I mean less than the chance of Jill Stein running as the Green Party candidate in 2028 and winning the U.S. Presidency. Or, if you like, less than the chance Jill Stein had of winning the presidency in 2024. I mean it’s an incredible long shot, significantly less than a 0.1% chance. (And if I had to give a number on my confidence of this, it would be upwards of 95%.)

By artificial general intelligence (AGI), I mean a system that can think, plan, learn, and solve problems just like humans do, with:

  • At least an equal level of data efficiency (e.g. if a human can learn from one example, AGI must also be able to learn from one example, not, say, one million)
  • At least an equal level of reliability (e.g. if humans do a task correctly 99.999% of the time, AGI must match or exceed that)
  • At least an equal level of fluidity or adaptability to novel problems and situations (e.g. if a human can solve a problem with zero training examples, AGI must be able to as well)
  • At least an equal ability to generate novel and creative ideas

This is the only kind of AI system that could plausibly automate all human labour or cunningly take over the world. No existing AI system is anything like this. 

Some recent AI predictions that turned out to be wrong

In March 2025, Dario Amodei, the CEO of Anthropic, predicted that 90% of code would be written by AI as early as June 2025 and no later than September 2025. This turned out to be dead wrong. 

In 2016, the renowned AI researcher Geoffrey Hinton predicted that AI would automate away all radiology jobs by 2021, and that turned out to be dead wrong. Even 9 years later, the trend has moved in the opposite direction, and there is no indication radiology jobs will be automated away anytime soon. 

Many executives and engineers working in autonomous driving predicted we’d have widespread fully autonomous vehicles long ago; some of them have thrown in the towel. Cruise Automation, for example, is no more. 

Given this, we should be skeptical when people argue that AI will soon have capabilities far exceeding the ability to code, the ability to do radiology, and the ability to drive. 

Benchmarks do not test general intelligence

One of the main arguments for near-term AGI is the performance of LLMs on various benchmarks. However, LLM benchmarks don't accurately represent what real intellectual tasks are actually like. 

First, the benchmarks are designed to be solvable by LLMs because they are primarily intended to measure LLMs against each other and to measure improvements in subsequent versions of the same model line (e.g. GPT-5 vs GPT-4o). There isn’t much incentive to create LLM benchmarks where LLMs stagnate around 0%.[2]

Second, for virtually all LLM tests or benchmarks, the definition of success or failure on the tasks has to be reduced to something simple enough that software can grade the task automatically. This is a big limitation. When I think about the sort of intellectual tasks that humans do, not a lot of them can be graded automatically. 
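To make concrete what "graded automatically" usually means, here is a minimal sketch of the scoring loop most benchmarks boil down to. The toy items, the grade() helper, and dummy_model are made up for illustration and aren't taken from any particular benchmark:

```python
# Minimal sketch of the scoring loop most LLM benchmarks reduce to: normalise the
# model's output and compare it to a reference string. The items, the grade()
# helper, and dummy_model are made up for illustration.

def grade(model_answer: str, reference: str) -> bool:
    """Exact-match grading after trivial normalisation."""
    return model_answer.strip().lower() == reference.strip().lower()

benchmark = [
    {"question": "What is the capital of France?", "reference": "Paris"},
    {"question": "2 + 2 = ?", "reference": "4"},
]

def dummy_model(question: str) -> str:
    # Stand-in for an LLM call; always answers "Paris" to keep the sketch runnable.
    return "Paris"

def evaluate(model, items) -> float:
    correct = sum(grade(model(item["question"]), item["reference"]) for item in items)
    return correct / len(items)

print(evaluate(dummy_model, benchmark))  # 0.5
```

Anything whose success criterion can't be collapsed into a comparison like grade(), such as an original research idea or a sound judgement call in a messy real-world situation, doesn't fit into this loop.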

Of course, there are written exams and tests with multiple choice answers, but these are primarily tests of memorization. We want AI systems that go beyond just memorizing things from huge numbers of examples. We want AI systems that can solve completely novel problems that aren’t a close match for anything in their training dataset. That’s where LLMs are incredibly brittle and start just generating nonsense, saying plainly false (and often ridiculous) things, contradicting themselves, hallucinating, etc. 

François Chollet, an AI researcher formerly at Google, gives some great examples of LLM brittleness in a talk here. Chollet also explains how these holes in LLM reasoning get manually patched by paying large workforces to write new training examples specifically to fix them. This creates an impression of increased intelligence, but the improvement isn't from scaling in these cases or from increases in general intelligence, it's from large-scale special casing.

AI in the economy

I think the most robust tests of AI capabilities are tasks that have real world value. If AI systems are actually doing the same intellectual tasks as human beings, then we should see AI systems either automating labour or increasing worker productivity. But we don’t see that.

I’m aware of two studies that looked at the impact of AI assistance on human productivity. One study on customer support workers found mixed results, including a negative impact on productivity for the most experienced employees. Another study, by METR, found a 19% reduction in productivity when coders used an AI coding assistant.

Non-AI companies that have invested in applying AI to the work they do are not seeing that much payoff. There might be some modest benefits in some niches. I’m sure there are at least a few. But LLMs are mostly turning out to be a disappointment in terms of their economic usefulness.[3]

If you think an LLM scoring more than 100 on an IQ test means it's AGI, then we've had AGI for several years. But clearly there's a problem with that inference, right? Memorizing the answers to IQ tests, or memorizing similar answers to similar questions that you can interpolate, doesn't mean a system actually has the kind of intelligence to solve completely novel problems that have never appeared on any test, or in any text. The same general critique applies to the inference that LLMs are intelligent from their results on virtually any LLM benchmark. Memorization is not intelligence.

If we instead look at performance on practical, economically valuable tasks as the test for AI's competence at intellectual tasks, then its competence appears quite poor. People who make the flawed inference from benchmarks just described say that LLMs can do basically anything. If they instead derived their assessment from LLMs' economic usefulness, it would be closer to the truth to say LLMs can do almost nothing. 

AI vs. puzzles

There is also some research on non-real-world tasks that supports the idea that LLMs are mass-scale memorizers with a modicum of interpolation or generalization to examples similar to what they've been trained on, rather than genuinely intelligent (in the sense that humans are intelligent). The Apple paper on "reasoning" models found surprisingly mixed results on common puzzles. The finding that sticks out most in my mind is that the models' performance on the Tower of Hanoi puzzle did not improve even after they were given the algorithm for solving it. Is that real intelligence?
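For context on how simple the relevant algorithm is, here is the standard textbook recursive solution. This is just the well-known algorithm, not code from the Apple paper:

```python
# The standard recursive Tower of Hanoi algorithm: a few lines of textbook code.
def hanoi(n: int, source: str, target: str, spare: str, moves: list) -> None:
    """Move n disks from source to target, using spare as the auxiliary peg."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # move the top n-1 disks out of the way
    moves.append((source, target))              # move the largest disk
    hanoi(n - 1, spare, target, source, moves)  # move the n-1 disks back on top

moves: list = []
hanoi(3, "A", "C", "B", moves)
print(len(moves), moves)  # 7 moves; in general 2**n - 1, so long sequences expose unreliability
```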

Problems with scaling up LLMs

Predictions of near-term AGI typically rely on scaling up LLMs. However, there is evidence that scaling LLMs is running out of steam:

  • Toby Ord's interview on the 80,000 Hours Podcast in June covered this topic really well. I highly recommend it.
  • Renowned AI researcher Ilya Sutskever, formerly the chief scientist at OpenAI (prior to voting to fire Sam Altman), has said he thinks the benefits from scaling LLM pre-training have plateaued.
  • There have been reports that, internally, employees at AI labs like OpenAI are disappointed with their models' progress.
  • GPT-5 doesn't seem like that much of an improvement over previous models.
  • Epoch AI's median estimate of when LLMs will run out of data to train on is 2028.
  • Epoch AI also predicts that compute scaling will slow down mainly due to how expensive it is and how wasteful it would be to overbuild.

It seems like there is less juice to squeeze from further scaling at the same time that squeezing is getting harder. And there may be an absolute limit to data scaling.
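As a rough back-of-envelope on the data ceiling (my own arithmetic with round assumed numbers, not Epoch AI's actual analysis): if compute-optimal training needs roughly 20 tokens per parameter, and the stock of usable public human-generated text is on the order of a few hundred trillion tokens, then data alone caps how far the current recipe can be pushed.

```python
# Back-of-envelope sketch (my own, with rough assumed numbers, not Epoch AI's analysis):
# how big a model could the public text stock support at compute-optimal training?

TOKENS_PER_PARAM = 20        # rough "Chinchilla-style" compute-optimal ratio
TEXT_STOCK_TOKENS = 3e14     # assumed ~300 trillion tokens of usable public text

max_params = TEXT_STOCK_TOKENS / TOKENS_PER_PARAM
max_flop = 6 * max_params * TEXT_STOCK_TOKENS   # training compute ~ 6 * params * tokens

print(f"Largest compute-optimal model the data supports: ~{max_params:.0e} parameters")
print(f"Implied training-compute ceiling: ~{max_flop:.0e} FLOP")
# With frontier training runs already estimated in the 1e25-1e26 FLOP range, that
# leaves only around two or three orders of magnitude of headroom from data alone
# (less if much of the stock is low quality or already effectively used).
```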

Agentic AI is unsuccessful

If you expand the scope of LLM performance beyond written prompts and responses to "agentic" applications, I think LLMs' failures are more stark and the models do not seem to be gaining mastery of these tasks particularly quickly. Journalists generally say that companies' demos of agentic AI don't work. 

I don't expect that performance on agentic tasks will rapidly improve. To train on text-based tasks, AI labs can get data from millions of books and large-scale scrapes of the Internet. There aren't similarly sized datasets for agentic tasks. 

In principle, you can use pure reinforcement learning without bootstrapping from imitation learning. However, while this approach has succeeded in domains with smaller spaces of possible actions, like Go, it has failed in domains with larger spaces of possible actions, like StarCraft. There hasn’t been any recent progress in reinforcement learning that would change this, and if someone is expecting a breakthrough sometime soon, they should explain why. 
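To give a sense of why the size of the action space matters so much for pure reinforcement learning, here is a toy comparison. The branching-factor figures are the rough ones commonly quoted (roughly 250 legal moves per position in Go, and an action space per time step often cited on the order of 10^26 for StarCraft II), not exact counts:

```python
# Toy illustration: the space of possible play sequences grows as branching_factor ** depth,
# so random exploration stumbles onto useful reward far less often when the per-step
# action space is huge. Figures are rough, commonly quoted ones, not exact counts.

from math import log10

def log10_sequences(branching_factor: float, depth: int) -> float:
    # log10(branching_factor ** depth), computed in log space to avoid overflow
    return depth * log10(branching_factor)

print(f"Go (~250 moves/position, ~150-move game): ~10^{log10_sequences(250, 150):.0f} sequences")
print(f"StarCraft II (~1e26 actions/step, ~1000 steps): ~10^{log10_sequences(1e26, 1000):.0f} sequences")
```

With a search space that much larger, reward signals discovered by random exploration become vanishingly rare, which is why bootstrapping from human demonstrations was needed in the StarCraft case.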

The current discrepancy between LLM performance on text-based tasks and agentic tasks also tells us something about whether LLMs are genuinely intelligent. What kind of PhD student can't use a computer? 

Conclusion

So, to summarize the core points of this post:

  • LLM benchmarks don't really tell us how genuinely intelligent LLMs are. They are designed to be easy for LLMs and to be automatically graded, which limits what can be tested.
  • On economically valuable tasks in real world settings, which I believe are much better tests than benchmarks, LLMs do quite poorly. This makes near-term AGI seem very unlikely.
  • LLMs fail all the time at tasks we would not expect them to fail at if they were genuinely intelligent, as opposed to relying on mass-scale memorization.
  • Scaling isn't a solution to the fundamental flaws in LLMs and, in any case, the benefits of scaling are diminishing at the same time that LLM companies are encountering practical limits that may slow compute scaling and slow or even stop data scaling.
  • LLMs are terrible at agentic tasks and there isn't enough training data for them to improve, if training data is what it takes. If LLMs are genuinely intelligent, we should ask why they can't learn agentic tasks from a small number of examples, since this is what humans do.

Eventually, there will be some AI paradigm beyond LLMs that is better at generality or generalization. However, we don't know what that paradigm is yet and there's no telling how long it will take to be discovered. Even if, by chance, it were discovered soon, it's extremely unlikely it would make it all the way from conception to working AGI system within 7 years. 

This is particularly true if running an instance of AGI requires a comparable amount of computation to a human brain. We don't really know how much computation the brain uses or what the equivalent would be for computers. Estimates vary a lot. A common figure that gets thrown around is that a human brain does about 1 exaflop of computation, which is roughly what each of the top three supercomputers in the world does. While building a supercomputer or the equivalent in a datacentre is totally feasible for AI labs deploying commercial products, it's not feasible for most AI researchers who want to test and iterate on new, unproven ideas.
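As a very rough sketch of what the 1-exaflop figure would mean in hardware terms (my own arithmetic, using approximate assumed figures for accelerator throughput, power, and price):

```python
# Very rough arithmetic on what "the brain does ~1 exaflop" would imply in hardware.
# All figures are approximate assumptions (accelerator throughput, power, price),
# chosen for order-of-magnitude flavour rather than precision.

BRAIN_FLOPS = 1e18       # assumed ~1 exaflop/s for the brain (the commonly quoted figure)
GPU_FLOPS = 1e15         # assumed ~1 petaflop/s per high-end accelerator (low precision)
GPU_POWER_W = 700        # assumed power draw per accelerator
GPU_PRICE_USD = 30_000   # assumed rough price per accelerator

n_gpus = BRAIN_FLOPS / GPU_FLOPS
print(f"Accelerators needed: ~{n_gpus:,.0f}")
print(f"Power draw: ~{n_gpus * GPU_POWER_W / 1e3:,.0f} kW (a brain runs on ~20 W)")
print(f"Hardware cost: ~${n_gpus * GPU_PRICE_USD / 1e6:,.0f} million")
# Plausible for a lab running one flagship system; much less so for a research group
# that wants to iterate cheaply on many unproven ideas in parallel.
```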

A much more pessimistic estimate is that to match the computation of a human brain, we would need to run computers that consume somewhere between about 300 and 300 billion times as much electricity as the entire United States. I hope this isn't true because it seems like it would make progress on AI and brain simulation really hard, but I don't know.

There are more points I could touch on, some of which I briefly mentioned in a previous post. But I think, for now, this is enough.

  1. ^

    I'm focused on the 7-year time horizon because this is what seems most relevant for effective altruism, given that a lot of people in EA seem to believe AGI will be created in 2-7 years. 

  2. ^

    However, it would be easy to do so, especially if you're willing to do manual grading. For example, task an LLM with making stock picks that achieve alpha (that one you could even grade automatically). Try to coax LLMs into coming up with a novel scientific discovery or theoretical insight; despite trillions of tokens generated, it hasn't happened yet. Tasks related to computer use and "agentic" use cases are also sure to lead to failures. For example, have a model play a video game it's never seen before (e.g. because the game just came out). Or, if the game is slow-paced enough, have it simply give you instructions on how to play. You can abstract out the computer vision aspect of these tests if you want, although it's worth asking how we're going to have AGI if it can't see. 
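    To illustrate the stock-picking example, grading for alpha really can be mechanical once you pin down a definition. A minimal sketch, with made-up return series, might look like this:

```python
# Minimal sketch of automatically grading stock picks for "alpha": compare the picks'
# returns against a benchmark index. The return series below are made up.

portfolio_returns = [0.010, -0.020, 0.005, 0.030]   # hypothetical daily returns of the picks
benchmark_returns = [0.008, -0.010, 0.004, 0.020]   # hypothetical daily returns of an index

excess = [p - b for p, b in zip(portfolio_returns, benchmark_returns)]
mean_excess = sum(excess) / len(excess)
print(f"Average daily excess return: {mean_excess:.4%}")
# A pass/fail criterion could be a statistically significant positive excess return
# over a long enough window, which needs no human judgement to apply.
```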

  3. ^

    From a Reuters article published recently: 

    A BofA Global Research's monthly fund manager survey revealed that 54% of investors said they thought that AI stocks were in a bubble compared with 38% who do not believe that a bubble exists.

    However, you'd think if this accurately reflected the opinions of people in finance, the bubble would have popped already.

Comments (7)



Eventually, there will be some AI paradigm beyond LLMs that is better at generality or generalization. However, we don't know what that paradigm is yet and there's no telling how long it will take to be discovered. Even if, by chance, it were discovered soon, it's extremely unlikely it would make it all the way from conception to working AGI system within 7 years.

Suppose someone said to you in 2018:

There’s an AI paradigm that almost nobody today has heard of or takes seriously. In fact, it’s little more than an arxiv paper or two. But in seven years, people will have already put hundreds of billions of dollars and who knows how many gazillions of hours into optimizing and running the algorithms; indeed, there will be literally 40,000 papers about this paradigm already posted on arxiv. Oh and y’know how right now world experts deploying bleeding-edge AI technology cannot make an AI that can pass an 8th grade science test? Well y’know, in seven years, this new paradigm will lead to AIs that can nail not only PhD qualifying exams in every field at once, but basically every other written test too, including even the international math olympiad with never-before-seen essay-proof math questions. And in seven years, people won’t even be talking about the Turing test anymore, because it’s so obviously surpassed. And… [etc. etc.]

I think you would have read that paragraph in 2018, and described it as “extremely unlikely”, right? It just sounds completely absurd. How could all that happen in a mere seven years? No way.

But that’s what happened!

So I think you should have wider error bars around how long it takes to develop a new AI paradigm from obscurity to AGI. It can be long, it can be short, who knows.

(My actual opinion is that this kind of historical comparison understates how quickly a new AI paradigm could develop, because right now we have lots of resources that did not exist in 2018, like dramatically more compute, better tooling and frameworks like PyTorch and JAX, armies of experts on parallelization, and on and on. These were bottlenecks in 2018, without which we presumably would have gotten the LLMs of today years earlier.)

(My actual actual opinion is that superintelligence will seem to come almost out of nowhere, i.e. it will be just lots of obscure arxiv papers until superintelligence is imminent. See here. But if you don’t buy that strong take, fine, go with the weaker argument above.)

This is particularly true if running an instance of AGI requires a comparable amount of computation as a human brain.

My own controversial opinion is that the human brain requires much less compute than the LLMs of today. Details here. You don’t have to believe me, but you should at least have wide error bars around this parameter, which makes it harder to argue for a bottom line of “extremely unlikely”. See also Joe Carlsmith’s report which gives a super wide range.

At least an equal level of data efficiency

...

This is the only kind of AI system that could plausibly automate all human labour

Your bar is too high, you can automate all human labour with less data efficiency.

AI will hunt down the last remaining human, and with his last dying breath, humanity will end - not with a bang, but with a "you don't really count as AGI"

This apparently isn’t true for autonomous driving and it’s probably even less true in a lot of other domains. If an AI system can’t respond well to novelty, it can’t function in the world because novelty occurs all the time. For example, how can AI automate the labour of scientists, philosophers, and journalists if it can’t understand novel ideas?

For autonomous driving, current approaches which "can't deal with novelty" are already far safer than human drivers.

For example, how can AI automate the labour of scientists, philosophers, and journalists if it can’t understand novel ideas?

The bar is much lower because they are 100x faster and 1000x cheaper than me. They open up a bunch of brute-forceable techniques, in the same way that you can open up https://projecteuler.net/ and solve many of Euler's discoveries with little math knowledge but basic Python and for loops. 

Math -> re-read every arXiv paper -> translate them all into Lean -> aggregate every open, well-specified math problem -> use the database of all previous learnings to see if you can chain chunks of previous problems together to solve them. 

Clinical medicine -> re-read every RCT ever done and comprehensively rank intervention effectiveness by disease -> find cost data where available and rank the cost/QALY across the whole disease/intervention space.

Econometrics -> aggregate every natural experiment and instrumental variable ever used in an econometrics paper -> think about other use cases for these tools -> search if other use cases have available data -> reapply the general theory of the original paper with the new data. 

In March 2025, Dario Amodei, the CEO of Anthropic, predicted that 90% of code would be written by AI as early as June 2025 and no later than September 2025. This turned out to be dead wrong. 

Amodei claims that 90% of code at Anthropic (and some companies they work with) is being written by AI.
