Note: This is a fairly rough post adapted from some comments I recently wrote. I put enough work into them that I figured I should probably turn them into a post. So, although this post is technically not a draft, it isn't written the way I would normally write a post: it's less polished and more off the cuff. If you think I should remove the Draft Amnesty tag, please say so, and I will!
I very, very strongly believe there’s essentially no chance of artificial general intelligence (AGI) being developed within the next 7 years.[1] Previously, I wrote a succinct list of reasons to doubt near-term AGI here. For example, it might come as a surprise that around three quarters of AI experts don’t think scaling up large language models (LLMs) will lead to AGI!
For a more technical and theoretical view (from someone with better credentials), I highly recommend this recent video by an AI researcher that makes a strong case for skepticism of near-term AGI and of LLMs as a path to AGI:
In this post, I will give more detailed arguments about why I think near-term AGI is so unlikely and why I think LLMs won't scale to AGI.
Clarification of terms
By essentially no chance, I mean less than the chance of Jill Stein running as the Green Party candidate in 2028 and winning the U.S. Presidency. Or, if you like, less than the chance Jill Stein had of winning the presidency in 2024. I mean it’s an incredible long shot, significantly less than a 0.1% chance. (And if I had to give a number on my confidence of this, it would be upwards of 95%.)
By artificial general intelligence (AGI), I mean a system that can think, plan, learn, and solve problems just like humans do, with:
- At least an equal level of data efficiency (e.g. if a human can learn from one example, AGI must also be able to learn from one example, not, say, one million)
- At least an equal level of reliability (e.g. if humans do a task correctly 99.999% of the time, AGI must match or exceed that)
- At least an equal level of fluidity or adaptability to novel problems and situations (e.g. if a human can solve a problem with zero training examples, AGI must be able to as well)
- At least an equal ability to generate novel and creative ideas
This is the only kind of AI system that could plausibly automate all human labour or cunningly take over the world. No existing AI system is anything like this.
Some recent AI predictions that turned out to be wrong
In March 2025, Dario Amodei, the CEO of Anthropic, predicted that 90% of code would be written by AI as early as June 2025 and no later than September 2025. This turned out to be dead wrong.
In 2016, the renowned AI researcher Geoffrey Hinton predicted that AI would automate away all radiology jobs by 2021. That turned out to be dead wrong. Even 9 years later, the trend has moved in the opposite direction, and there is no indication radiology jobs will be automated away anytime soon.
Many executives and engineers working in autonomous driving predicted we’d have widespread fully autonomous vehicles long ago; some of them have thrown in the towel. Cruise Automation, for example, is no more.
Given this, we should be skeptical when people argue that AI will soon have capabilities far exceeding the ability to code, the ability to do radiology, and the ability to drive.
Benchmarks do not test general intelligence
One of the main arguments for near-term AGI is the performance of LLMs on various benchmarks. However, LLM benchmarks don't accurately represent what real intellectual tasks are actually like.
First, the benchmarks are designed to be solvable by LLMs because they are primarily intended to measure LLMs against each other and to measure improvements in subsequent versions of the same LLM model line (e.g. GPT-5 vs GPT-4o). There isn’t much incentive to create LLM benchmarks where LLMs stagnate around 0%.[2]
Second, for virtually all LLM tests or benchmarks, the definition of success or failure on the tasks has to be reduced to something simple enough that software can grade the task automatically. This is a big limitation. When I think about the sort of intellectual tasks that humans do, not a lot of them can be graded automatically.
Of course, there are written exams and tests with multiple choice answers, but these are primarily tests of memorization. We want AI systems that go beyond just memorizing things from huge numbers of examples. We want AI systems that can solve completely novel problems that aren't a close match for anything in their training dataset. That's where LLMs are incredibly brittle and start just generating nonsense, saying plainly false (and often ridiculous) things, contradicting themselves, hallucinating, etc.
François Chollet, an AI researcher formerly at Google, gives some great examples of LLM brittleness in a talk here. Chollet also explains how these holes in LLM reasoning get manually patched by paying large workforces to write new training examples specifically to fix them. This creates an impression of increased intelligence, but the improvement in these cases isn't from scaling or from increases in general intelligence; it's from large-scale special-casing.
AI in the economy
I think the most robust tests of AI capabilities are tasks that have real world value. If AI systems are actually doing the same intellectual tasks as human beings, then we should see AI systems either automating labour or increasing worker productivity. But we don’t see that.
I’m aware of two studies that looked at the impact of AI assistance on human productivity. One study on customer support workers found mixed results, including a negative impact on productivity for the most experienced employees. Another study, by METR, found a 19% reduction in productivity when coders used an AI coding assistant.
Non-AI companies that have invested in applying AI to the work they do are not seeing that much payoff. There might be some modest benefits in some niches. I’m sure there are at least a few. But LLMs are mostly turning out to be a disappointment in terms of their economic usefulness.[3]
If you think an LLM scoring more than 100 on an IQ test means it's AGI, then we've had AGI for several years. But clearly there's a problem with that inference, right? Memorizing the answers to IQ tests, or memorizing similar answers to similar questions that you can interpolate, doesn't mean a system actually has the kind of intelligence to solve completely novel problems that have never appeared on any test, or in any text. The same general critique applies to the inference that LLMs are intelligent from their results on virtually any LLM benchmark. Memorization is not intelligence.
If we instead look at performance on practical, economically valuable tasks as the test for AI's competence at intellectual tasks, then its competence appears quite poor. People who make the flawed inference from benchmarks just described say that LLMs can do basically anything. If they instead derived their assessment from LLMs' economic usefulness, it would be closer to the truth to say LLMs can do almost nothing.
AI vs. puzzles
There is also some research on non-real world tasks that supports the idea that LLMs are mass-scale memorizers with a modicum of interpolation or generalization to examples similar to what they've been trained on, rather than genuinely intelligent (in the sense that humans are intelligent). The Apple paper on "reasoning" models found surprisingly mixed results on common puzzles. The finding that sticks out most in my mind is that the LLM's performance on the Tower of Hanoi puzzle did not improve after being told the algorithm for solving the puzzle. Is that real intelligence?
Problems with scaling up LLMs
Predictions of near-term AGI typically rely on scaling up LLMs. However, there is evidence that scaling LLMs is running out of steam:
- Toby Ord's interview on the 80,000 Hours Podcast in June covered this topic really well. I highly recommend it.
- Renowned AI researcher Ilya Sutskever, formerly the chief scientist at OpenAI (prior to voting to fire Sam Altman), has said he thinks the benefits from scaling LLM pre-training have plateaued.
- There have been reports that, internally, employees at AI labs like OpenAI are disappointed with their models' progress.
- GPT-5 doesn't seem like that much of an improvement over previous models.
- Epoch AI's median estimate of when LLMs will run out of data to train on is 2028.
- Epoch AI also predicts that compute scaling will slow down mainly due to how expensive it is and how wasteful it would be to overbuild.
It seems like there is less juice to squeeze from further scaling at the same time that squeezing is getting harder. And there may be an absolute limit to data scaling.
Agentic AI is unsuccessful
If you expand the scope of LLM performance beyond written prompts and responses to "agentic" applications, I think LLMs' failures are more stark and the models do not seem to be gaining mastery of these tasks particularly quickly. Journalists generally say that companies' demos of agentic AI don't work.
I don't expect that performance on agentic tasks will rapidly improve. To train on text-based tasks, AI labs can get data from millions of books and large-scale scrapes of the Internet. There aren't similarly sized datasets for agentic tasks.
In principle, you can use pure reinforcement learning without bootstrapping from imitation learning. However, while this approach has succeeded in domains with smaller spaces of possible actions, like Go, it has failed in domains with larger spaces of possible actions, like StarCraft. There hasn't been any recent progress in reinforcement learning that would change this, and if someone is expecting a breakthrough sometime soon, they should explain why.
The current discrepancy between LLM performance on text-based tasks and agentic tasks also tells us something about whether LLMs are genuinely intelligent. What kind of PhD student can't use a computer?
Conclusion
So, to summarize the core points of this post:
- LLM benchmarks don't really tell us how genuinely intelligent LLMs are. They are designed to be easy for LLMs and to be automatically graded, which limits what can be tested.
- On economically valuable tasks in real world settings, which I believe are much better tests than benchmarks, LLMs do quite poorly. This makes near-term AGI seem very unlikely.
- LLMs fail all the time at tasks we would not expect them to fail at if they were genuinely intelligent, as opposed to relying on mass-scale memorization.
- Scaling isn't a solution to the fundamental flaws in LLMs and, in any case, the benefits of scaling are diminishing at the same time that LLM companies are encountering practical limits that may slow compute scaling and slow or even stop data scaling.
- LLMs are terrible at agentic tasks and there isn't enough training data for them to improve, if training data is what it takes. If LLMs are genuinely intelligent, we should ask why they can't learn agentic tasks from a small number of examples, since this is what humans do.
Eventually, there will be some AI paradigm beyond LLMs that is better at generality or generalization. However, we don't know what that paradigm is yet and there's no telling how long it will take to be discovered. Even if, by chance, it were discovered soon, it's extremely unlikely it would make it all the way from conception to working AGI system within 7 years.
This is particularly true if running an instance of AGI requires a comparable amount of computation as a human brain. We don't really know how much computation the brain uses or what the equivalent would be for computers. Estimates vary a lot. A common figure that gets thrown around is that a human brain does about 1 exaflop of computation, which is about what each of the top three supercomputers in the world does. While building a supercomputer or the equivalent in a datacentre is totally feasible for AI labs deploying commercial products, it's not feasible for most AI researchers who want to test and iterate on new, unproven ideas.
A much more pessimistic estimate is that to match the computation of a human brain, we would need to run computers that consume somewhere between about 300 and 300 billion times as much electricity as the entire United States. I hope this isn't true because it seems like it would make progress on AI and brain simulation really hard, but I don't know.
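To give a rough sense of scale at the optimistic end of these estimates, here's a back-of-envelope sketch. All the numbers in it are illustrative assumptions I've plugged in (the contested 1 exaflop brain estimate, a roughly petaflop-class accelerator drawing around 700 W), not established facts:

```python
# Back-of-envelope: how much hardware might one brain-equivalent need,
# IF the ~1 exaFLOP/s estimate of brain computation is right?
# All constants below are rough, contested assumptions for illustration.

BRAIN_FLOPS = 1e18   # assumed brain compute: ~1 exaFLOP/s (heavily disputed)
GPU_FLOPS = 1e15     # assumed per-accelerator throughput: ~1 petaFLOP/s
GPU_POWER_W = 700    # assumed per-accelerator power draw, in watts

gpus_needed = BRAIN_FLOPS / GPU_FLOPS
total_power_mw = gpus_needed * GPU_POWER_W / 1e6  # megawatts

print(f"~{gpus_needed:.0f} accelerators, ~{total_power_mw:.1f} MW")
```

Under these assumptions you get a supercomputer-scale cluster drawing under a megawatt: plausible for a well-funded lab, out of reach for an individual researcher iterating on unproven ideas. The pessimistic estimates in the next paragraph multiply this by many orders of magnitude.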
There are more points I could touch on, some of which I briefly mentioned in a previous post. But I think, for now, this is enough.
- ^
I'm focused on the 7-year time horizon because this is what seems most relevant for effective altruism, given that a lot of people in EA seem to believe AGI will be created in 2-7 years.
- ^
However, it would be easy to do so, especially if you're willing to do manual grading. Task an LLM with making stock picks that achieve alpha — you could grade that automatically. Try to coax LLMs into coming up with a novel scientific discovery or theoretical insight. Despite trillions of tokens generated, it hasn't happened yet. Tasks related to computer use and "agentic" use cases are also sure to lead to failures. For example, make it play a video game it's never seen before (e.g. because the game just came out). Or, if the game is slow-paced enough, simply give you instructions on how to play. You can abstract out the computer vision aspect of these tests if you want, although it's worth asking how we're going to have AGI if it can't see.
- ^
From a Reuters article published recently:
A BofA Global Research's monthly fund manager survey revealed that 54% of investors said they thought that AI stocks were in a bubble compared with 38% who do not believe that a bubble exists.
However, you'd think if this accurately reflected the opinions of people in finance, the bubble would have popped already.