AGI by 2032 is extremely unlikely

Yarrow Bouchard🔸

Note: This is a fairly rough post I adapted from some comments I recently wrote that I worked hard enough on that I figured I should probably make them into a post. So, although this post is technically not a draft, it isn't written how I would write a post — it's less polished and more off the cuff. If you think I should remove the Draft Amnesty tag, please say so, and I will!

I very, very strongly believe there’s essentially no chance of artificial general intelligence (AGI) being developed within the next 7 years.^[1] Previously, I wrote a succinct list of reasons to doubt near-term AGI here. For example, it might come as a surprise that around three quarters of AI experts don’t think scaling up large language models (LLMs) will lead to AGI!

For a more technical and theoretical view (from someone with better credentials), I highly recommend this recent video by an AI researcher that makes a strong case for skepticism of near-term AGI and of LLMs as a path to AGI:

In this post, I will give more detailed arguments about why I think near-term AGI is so unlikely and why I think LLMs won't scale to AGI.

Clarification of terms

By essentially no chance, I mean less than the chance of Jill Stein running as the Green Party candidate in 2028 and winning the U.S. Presidency. Or, if you like, less than the chance Jill Stein had of winning the presidency in 2024. I mean it’s an incredible long shot, significantly less than a 0.1% chance. (And if I had to give a number on my confidence of this, it would be upwards of 95%.)

By artificial general intelligence (AGI), I mean a system that can think, plan, learn, and solve problems just like humans do, with:

At least an equal level of data efficiency (e.g. if a human can learn from one example, AGI must also be able to learn from one example and, not, say, one million)
At least an equal level of reliability (e.g. if humans do a task correctly 99.999% of the time, AGI must match or exceed that)
At least an equal level of fluidity or adaptability to novel problems and situations (e.g. if a human can solve a problem with zero training examples, AGI must be able to as well)
At least an equal ability to generate novel and creative ideas

This is the only kind of AI system that could plausibly automate all human labour or cunningly take over the world. No existing AI system is anything like this.

Some recent AI predictions that turned out to be wrong

In March 2025, Dario Amodei, the CEO of Anthropic, predicted that 90% of code would be written by AI as early as June 2025 and no later than September 2025. This turned out to be dead wrong.

In 2016, the renowned AI researcher Geoffrey Hinton predicted that by 2021, AI would automate away all radiology jobs and that turned out to be dead wrong. Even 9 years later, the trend has moved in the opposite direction and there is no indication radiology jobs will be automated away anytime soon.

Many executives and engineers working in autonomous driving predicted we’d have widespread fully autonomous vehicles long ago; some of them have thrown in the towel. Cruise Automation, for example, is no more.

Given this, we should be skeptical when people argue that AI will soon have capabilities far exceeding the ability to code, the ability to do radiology, and the ability to drive.

Benchmarks do not test general intelligence

One of the main arguments for near-term AGI is the performance of LLMs on various benchmarks. However, LLM benchmarks don't accurately represent what real intellectual tasks are actually like.

First, the benchmarks are designed to be solvable by LLMs because they are primarily intended to measure LLMs against each other and to measure improvements in subsequent versions of the same LLM model line (e.g. GPT-5 vs GPT-4o). There isn’t much incentive to create LLM benchmarks where LLMs stagnate around 0%.^[2]

Second, for virtually all LLMs tests or benchmarks, the definition of success or failure on the tasks has to be reduced to something simple enough that software can grade the task automatically. This is a big limitation. When I think about the sort of intellectual tasks that humans do, not a lot of them can be graded automatically.

Of course, there are written exams and tests with multiple choice answers, but these are primarily tests of memorization. We want AI systems that go beyond just memorizing things from huge numbers of examples. We want AI systems that can solve completely novel problems that aren’t a close match for anything in their training dataset. That’s where LLMs are incredibly brittle and start just generating nonsense, saying plainly false (and often ridiculous) things, contradicting themselves, hallucinating, etc.

François Chollet, an AI researcher formerly at Google, gives some great examples of LLM brittleness in a talk here. Chollet also explains how these holes in LLM reasoning get manually patched by paying large workforces to write new training examples specifically to fix them. This creates an impression of increased intelligence, but the improvement isn't from scaling in these cases or from increases in general intelligence, it's from large-scale special casing.

AI in the economy

I think the most robust tests of AI capabilities are tasks that have real world value. If AI systems are actually doing the same intellectual tasks as human beings, then we should see AI systems either automating labour or increasing worker productivity. But we don’t see that.

I’m aware of two studies that looked at the impact of AI assistance on human productivity. One study on customer support workers found mixed results, including a negative impact on productivity for the most experienced employees. Another study, by METR, found a 19% reduction in productivity when coders used an AI coding assistant.

Non-AI companies that have invested in applying AI to the work they do are not seeing that much payoff. There might be some modest benefits in some niches. I’m sure there are at least a few. But LLMs are mostly turning out to be a disappointment in terms of their economic usefulness.^[3]

If you think an LLM scoring more than 100 on an IQ test means it's AGI, then we've had AGI for several years. But clearly there's a problem with that inference, right? Memorizing the answers to IQ tests, or memorizing similar answers to similar questions that you can interpolate, doesn't mean a system actually has the kind of intelligence to solve completely novel problems that have never appeared on any test, or in any text. The same general critique applies to the inference that LLMs are intelligent from their results on virtually any LLM benchmark. Memorization is not intelligence.

If we instead look at performance on practical, economically valuable tasks as the test for AI's competence at intellectual tasks, then its competence appears quite poor. People who make the flawed inference from benchmarks just described say that LLMs can do basically anything. If they instead derived their assessment from LLMs' economic usefulness, it would be closer to the truth to say LLMs can do almost nothing.

AI vs. puzzles

There is also some research on non-real world tasks that supports the idea that LLMs are mass-scale memorizers with a modicum of interpolation or generalization to examples similar to what they've been trained on, rather than genuinely intelligent (in the sense that humans are intelligent). The Apple paper on "reasoning" models found surprisingly mixed results on common puzzles. The finding that sticks out most in my mind is that the LLM's performance on the Tower of Hanoi puzzle did not improve after being told the algorithm for solving the puzzle. Is that real intelligence?

Problems with scaling up LLMs

Predictions of near-term AGI typically rely on scaling up LLMs. However, there is evidence that scaling LLMs is running out of steam:

Toby Ord's interview on the 80,000 Hours Podcast in June covered this topic really well. I highly recommend it.
Renowned AI researcher Ilya Sutskever, formerly the chief scientist at OpenAI (prior to voting to fire Sam Altman), has said he thinks the benefits from scaling LLM pre-training have plateaued.
There have been reports that, internally, employees at AI labs like OpenAI are disappointed with their models' progress.
GPT-5 doesn't seem like that much of an improvement over previous models.
Epoch AI's median estimate of when LLMs will run out of data to train on is 2028.
Epoch AI also predicts that compute scaling will slow down mainly due to how expensive it is and how wasteful it would be to overbuild.

It seems like there is less juice to squeeze from further scaling at the same time that squeezing is getting harder. And there may be an absolute limit to data scaling.

Agentic AI is unsuccessful

If you expand the scope of LLM performance beyond written prompts and responses to "agentic" applications, I think LLMs' failures are more stark and the models do not seem to be gaining mastery of these tasks particularly quickly. Journalists generally say that companies' demos of agentic AI don't work.

I don't expect that performance on agentic tasks will rapidly improve. To train on text-based tasks, AI labs can get data from millions of books and large-scale scrapes of the Internet. There aren't similarly sized datasets for agentic tasks.

In principle, you can use pure reinforcement learning without bootstrapping from imitation learning. However, while this approach has succeeded in domains with smaller spaces of possible actions like go, it has failed in domains with larger spaces of possible actions like StarCraft. There hasn’t been any recent progress in reinforcement learning that would change this and if someone is expecting a breakthrough sometime soon, they should explain why.

The current discrepancy between LLM performance on text-based tasks and agentic tasks also tells us something about whether LLMs are genuinely intelligent. What kind of PhD student can't use a computer?

Conclusion

So, to summarize the core points of this post:

LLM benchmarks don't really tell us how genuinely intelligent LLMs are. They are designed to be easy for LLMs and to be automatically graded, which limits what can be tested.
On economically valuable tasks in real world settings, which I believe are much better tests than benchmarks, LLMs do quite poorly. This makes near-term AGI seem very unlikely.
LLMs fail all the time at tasks we would not expect them to fail at if they were genuinely intelligent, as opposed to relying on mass-scale memorization.
Scaling isn't a solution to the fundamental flaws in LLMs and, in any case, the benefits of scaling are diminishing at the same time that LLM companies are encountering practical limits that may slow compute scaling and slow or even stop data scaling.
LLMs are terrible at agentic tasks and there isn't enough training data for them to improve, if training data is what it takes. If LLMs are genuinely intelligent, we should ask why they can't learn agentic tasks from a small number of examples, since this is what humans do.

Eventually, there will be some AI paradigm beyond LLMs that is better at generality or generalization. However, we don't know what that paradigm is yet and there's no telling how long it will take to be discovered. Even if, by chance, it were discovered soon, it's extremely unlikely it would make it all the way from conception to working AGI system within 7 years.^[4]

This is particularly true if running an instance of AGI requires a comparable amount of computation as a human brain. We don't really know how much computation the brain uses or what the equivalent would be for computers. Estimates vary a lot. A common figure that gets thrown around is that a human brain does about 1 exaflop of computation, which is about what each of the top three supercomputers in the world do. While building a supercomputer or the equivalent in a datacentre is totally feasible for AI labs deploying commercial products, it's not feasible for most AI researchers who want to test and iterate on new, unproven ideas.

A much more pessimistic estimate is that to match the computation of a human brain, we would need to run computers that consume somewhere between about 300 and 300 billion times as much electricity as the entire United States. I hope this isn't true because it seems like it would make progress on AI and brain simulation really hard, but I don't know.

There are more points I could touch on, some of which I briefly mentioned in a previous post. But I think, for now, this is enough.

^{^}
I'm focused on the 7-year time horizon because this is what seems most relevant for effective altruism, given that a lot of people in EA seem to believe AGI will be created in 2-7 years.
^{^}
However, it would be easy to do so, especially if you're willing to do manual grading. Task an LLM with making stock picks that achieve alpha — you could grade that automatically. Try to coax LLMs into coming up with a novel scientific discovery or theoretical insight. Despite trillions of tokens generated, it hasn't happened yet. Tasks related to computer use and "agentic" use cases are also sure to lead to failures. For example, make it play a video game it's never seen before (e.g. because the game just came out). Or, if the game is slow-paced enough, simply ask the LLM to give you instructions on how to play. You can abstract out the computer vision aspect of these tests if you want, although it's worth asking how we're going to have AGI if it can't see.
^{^}
From a Reuters article published recently:
A BofA Global Research's monthly fund manager survey revealed that 54% of investors said they thought that AI stocks were in a bubble compared with 38% who do not believe that a bubble exists.
However, you'd think if this accurately reflected the opinions of people in finance, the bubble would have popped already.

^{^}

See this comment below where I explain some of my reasoning for thinking this. [I added this footnote on October 19, 2025 at 9:19 AM Eastern Time.]

Show all footnotes

21 Reactions

More posts like this

On January 1, 2030, there will be no AGI (and AGI will still not be imminent)

Yarrow Bouchard🔸

196

Results from an Adversarial Collaboration on AI Risk (FRI)

Forecasting Research Institute+4 more

The case for multi-decade timelines [Linkpost]

Sharmake

Comments35

Sorted by

New & upvoted

Click to highlight new comments since: Today at 8:18 AM

Steven Byrnes3d68

Eventually, there will be some AI paradigm beyond LLMs that is better at generality or generalization. However, we don't know what that paradigm is yet and there's no telling how long it will take to be discovered. Even if, by chance, it were discovered soon, it's extremely unlikely it would make it all the way from conception to working AGI system within 7 years.

Suppose someone said to you in 2018:

There’s an AI paradigm that almost nobody today has heard of or takes seriously. In fact, it’s little more than an arxiv paper or two. But in seven years, people will have already put hundreds of billions of dollars and who knows how many gazillions of hours into optimizing and running the algorithms; indeed, there will be literally 40,000 papers about this paradigm already posted on arxiv. Oh and y’know how right now world experts deploying bleeding-edge AI technology cannot make an AI that can pass an 8th grade science test? Well y’know, in seven years, this new paradigm will lead to AIs that can nail not only PhD qualifying exams in every field at once, but basically every other written test too, including even the international math olympiad with never-before-seen essay-proof math questions. And in seven years, people won’t even be talking about the Turing test anymore, because it’s so obviously surpassed. And… [etc. etc.]

I think you would have read that paragraph in 2018, and described it as “extremely unlikely”, right? It just sounds completely absurd. How could all that happen in a mere seven years? No way.

But that’s what happened!

So I think you should have wider error bars around how long it takes to develop a new AI paradigm from obscurity to AGI. It can be long, it can be short, who knows.

(My actual opinion is that this kind of historical comparison understates how quickly a new AI paradigm could develop, because right now we have lots of resources that did not exist in 2018, like dramatically more compute, better tooling and frameworks like PyTorch and JAX, armies of experts on parallelization, and on and on. These were bottlenecks in 2018, without which we presumably would have gotten the LLMs of today years earlier.)

(My actual actual opinion is that superintelligence will seem to come almost out of nowhere, i.e. it will be just lots of obscure arxiv papers until superintelligence is imminent. See here. But if you don’t buy that strong take, fine, go with the weaker argument above.)

This is particularly true if running an instance of AGI requires a comparable amount of computation as a human brain.

My own controversial opinion is that the human brain requires much less compute than the LLMs of today. Details here. You don’t have to believe me, but you should at least have wide error bars around this parameter, which makes it harder to argue for a bottom line of “extremely unlikely”. See also Joe Carlsmith’s report which gives a super wide range.

Yarrow Bouchard🔸20h5

right now we have lots of resources that did not exist in 2018, like dramatically more compute, better tooling and frameworks like PyTorch and JAX, armies of experts on parallelization, and on and on. These were bottlenecks in 2018, without which we presumably would have gotten the LLMs of today years earlier.

I fear this may be pointless nitpicking, but if I'm getting the timeline right, PyTorch's initial alpha release was in September 2016, its initial proper public release was in January 2017, and PyTorch version 1.0 was released in October 2018. I'm much less familiar with JAX, but apparently it was released in December 2018. Maybe you simply intended to say that PyTorch and JAX are better today than they were in 2018. I don't know. This just stuck out to me as I was re-reading your comment just now.

For context, OpenAI published a paper about GPT-1 (or just GPT) in 2018, released GPT-2 in 2019, and released GPT-3 in 2020. (I'm going off the dates on the Wikipedia pages for each model.) GPT-1 apparently used TensorFlow, which was initially released in 2015, the same year OpenAI was founded. TensorFlow had a version 1.0 release in 2017, the year before the GPT-1 paper. (In 2020, OpenAI said in a blog post they would be switching to using PyTorch exclusively.)

Steven Byrnes11h2

Maybe you simply intended to say that PyTorch and JAX are better today than they were in 2018.

Yup! E.g. torch.compile “makes code run up to 2x faster” and came out in PyTorch 2.0 in 2023.

More broadly, what I had in mind was: open-source software for everything to do with large-scale ML training—containerization, distributed training, storing checkpoints, hyperparameter tuning, training data and training environments, orchestration and pipelines, dashboards for monitoring training runs, on and on—is much more developed now compared to 2018, and even compared to 2022, if I understand correctly (I’m not a practitioner). Sorry for poor wording. :)

titotal8h9

Presumably a lot of these are all optimised for the current gen-AI paradigm, though. But we're talking about what happens if the current paradigm fails. I'm sure some of it would carry over to a different AI paradigm, but also it's pretty likely there would be other bottleneck we would have to tune to get things working.

I feel like what you're saying is the equivalent of pointing out in 2020 that we have had so many optimisations and computing resources that went into, say, google searches, and then using that as evidence that surely the big data that goes into LLM's should be instantaneous as well.

Steven Byrnes3h2

Presumably a lot of these are all optimised for the current gen-AI paradigm, though. But we're talking about what happens if the current paradigm fails. I'm sure some of it would carry over to a different AI paradigm, but also it's pretty likely there would be other bottleneck we would have to tune to get things working.

Yup, some stuff will be useful and others won’t. The subset of useful stuff will make future researchers’ lives easier and allow them to work faster. For example, here are people using JAX for lots of computations that are not deep learning at all.

I feel like what you're saying is the equivalent of pointing out in 2020 that we have had so many optimisations and computing resources that went into, say, google searches, and then using that as evidence that surely the big data that goes into LLM's should be instantaneous as well.

In like 2010–2015, “big data” and “the cloud” were still pretty hot new things, and people developed a bunch of storage formats, software tools, etc. for distributed data, distributed computing, parallelization, and cloud computing. And yes I do think that stuff turned out to be useful when deep learning started blowing up (and then LLMs after that), in the sense that ML researchers would have made slower progress (on the margin) if not for all that development. I think Docker and Kubernetes are good examples here. I’m not sure exactly how different the counterfactual would have been, but I do think it made more than zero difference.

Yarrow Bouchard🔸3h1

Things like Docker containers or cloud VMs that can be, in principle, applied to any sort of software or computation could be helpful for all sorts of applications we can't anticipate. They are very general-purpose. That makes sense to me.

The extent to which things designed for deep learning, such as PyTorch, could be applied to ideas outside deep learning seems much more dubious.

And if we're thinking about ideas that fall within deep learning, but outside what is currently mainstream and popular, then I simply don't know.

Yarrow Bouchard🔸2d4

If you think that LLMs are a path to AGI, then your analogy actually somewhat hurts your case because it's been 7 years and LLMs aren't AGI. And the near-term AGI belief is that it will take another 2-7 years to get from current LLMs to AGI, with most people thinking it's more than 2 years. So, we should say that it takes at least 9 years and probably more than 9 years — up to 14 years — from the first proof of concept to AGI. (But then this is conditional both on LLMs being a path to AGI and the short timelines I'm arguing against being correct, so, regardless of the exact timing, it doesn't really make sense as an argument against my post.)

If you don't think that LLMs are a path to AGI, the analogy isn't really persuasive one way or the other. 7 years from GPT-1 to GPT-5 shows that a lot of progress in AI can happen in 7 years, which is indeed something I already took for granted in 2018 (although I didn't pay attention to natural language processing until GPT-2 in 2019), but a lot of progress doesn't automatically mean enough progress to get from proof of concept for a new paradigm to AGI, so the analogy doesn't make for a persuasive argument. The argument only makes sense if what it's saying is: LLMs aren't a path to AGI, but if LLMs were a path to AGI, this amount of progress definitely would be enough to get AGI. Which is not persuasive to me and I don't think would be (or should be) persuasive to anyone.

In principle, anything's possible and no one knows what's going to happen with science and technology (as David Deutsch cleverly points out, to know future science/technology is intellectually equivalent to discovering/inventing it), so it's hard to argue against hypothetical scenarios involving speculative future science/technology. But to plan your entire life around your conviction in such hypothetical scenarios seems extremely imprudent and unwise.

I don't think the Turing test actually has been passed, at least in terms of how I think a rigorous Turing test should be conducted. On a weak enough version of the Turing test, then, okay, sure, ELIZA passed it, but a more rigorous Turing test would most likely fail ChatGPT and other LLMs. I'm not aware of anyone who actually conducts such tests, which is kind of interesting.

There is a connection, in my mind, between the Turing test and economically valuable labour. A really advanced, really hard version of the Turing test would be to put an AI in the role of a remote office worker who communicates entirely by Slack, email, and so on. And have them pass as a human for, say, three months. But current AI systems can't do that competently. Their outputs are not actually indistinguishable from humans', except narrowly and superficially.

To give the context I'm coming from, the Turing test has been discussed a lot in philosophy of mind in debates around functionalism. If functionalism is true, then if two systems — say, a biological human and an AI running on a computer — have identical inputs and outputs, then they should also have identical internal mental states. People who argue against functionalism argue that having identical outputs to the same inputs wouldn't demonstrate the existence of any mental states inside the entity. People who argue for functionalism argue it would.

It turns out the Turing test is a concept that I always had a certain idea of which I assumed was shared, but I guess isn't. Similar to how it turns out some people have a fairly weak conception of AGI, e.g. Tyler Cowen saying that o3 is AGI. If o3 is AGI, then it turns out AGI isn't as important as people have been saying it was. Similarly, if the Turing test is as weak a test as some people seem to imagine, then it isn't a particularly important test, and it probably never should have been seen as such.

Steven Byrnes2d19

I don’t think that LLMs are a path to AGI.

Based on your OP, you ought to be trying to defend the claim:

STRONG CLAIM: The probability that a new future AI paradigm would take as little as 7 years to go from obscure arxiv papers to AGI, is extremely low (say, <10%).

But your response seems to have retreated to a much weaker claim:

WEAK CLAIM: The probability that an AI paradigm would take as little as 7 years to go from obscure arxiv papers to AGI, is not overwhelmingly high (say, it’s <90%). Rather, it’s plausible that it would take longer than that.

See what I mean? I think the weak claim is fine. As extremist as I am, I’m not sure even I would go above 90% on that.

Whereas I think the strong claim is implausible, and I don’t think your comment even purports to defend it.

Maybe I shouldn’t have brought up the Turing Test, since it’s a distraction. For what it’s worth, my take is: for any reasonable operationalization of the Turing Test (where “reasonable” means “in the spirit of what Turing might have had in mind”, or “what someone in 2010 might have had in mind”, as opposed to moving the goalposts after knowing the particular profile of strengths and weaknesses of LLMs), a researcher could pass that Turing Test today with at most a modest amount of work and money. I think this fact is so obvious to everyone, that it’s not really worth anyone’s time to even think about the Turing Test anymore in the first place. I do think this is a valid example of how things can be a pipe dream wildly beyond the AI frontier in Year X, and totally routine in Year X+7.

I do not think the Turing Test (as described above) is sufficient to establish AGI, and again, I don’t think AGI exists right now, and I don’t think LLMs will ever become AGI, as I use the term.

In principle, anything's possible and no one knows what's going to happen with science and technology (as David Deutsch cleverly points out, to know future science/technology is intellectually equivalent to discovering/inventing it), so it's hard to argue against hypothetical scenarios involving speculative future science/technology. But to plan your entire life around your conviction in such hypothetical scenarios seems extremely imprudent and unwise.

I don’t “plan [my] entire life around [a] conviction” that AGI will definitely arrive before 2032 (my median guess is that it will be somewhat later than that, and my own technical alignment research is basically agnostic to timelines).

…But I do want to defend the reasonableness of people contingency-planning for AGI very soon. Copying from my comment here:

Pascal’s wager is a scenario where people prepare for a possible risk because there’s even a slight chance that it will actualize. I sometimes talk about “the insane bizarro-world reversal of Pascal’s wager”, in which people don’t prepare for a possible risk because there’s even a slight chance that it won’t actualize. Pascal’s wager is dumb, but “the insane bizarro-world reversal of Pascal’s wager” is much, much dumber still. :) “Oh yeah, it’s fine to put the space heater next to the curtains—there’s no guarantee that it will burn your house down.” :-P

If a potential threat is less than 100% likely to happen, that’s not an argument against working to mitigate it. A more reasonable threshold would be 10%, even 1%, and in some circumstances even less than that. For example, it is not 100% guaranteed that there is any terrorist in the world right now who is even trying to get a nuclear weapon, let alone who has a chance of success, but it sure makes sense for people to be working right now to prevent that “hypothetical scenario” from happening.

Speaking of which, I also want to push back on your use of the term “hypothetical”. Superintelligent AI takeover is a “hypothetical future risk”. What does that mean? It means there’s a HYPOTHESIS that there’s a future risk. Some hypotheses are false. Some hypotheses are true. I think this one is true.

I find it disappointing that people treat “hypothetical” as a mocking dismissal, and I think that usage is a red flag for sloppy thinking. If you think something is unlikely, just call it “unlikely”! That’s a great word! Or if you think it’s overwhelmingly unlikely, you can say that too! When you use words like “unlikely” or “overwhelmingly unlikely”, you’re making it clear that you are stating a belief, perhaps a quite strong belief, and then other people may argue about whether that belief is reasonable. This is all very good and productive. Whereas the term “hypothetical” is kinda just throwing shade in a misleading way, I claim.

Yarrow Bouchard🔸2d1

I don't think I'm retreating into a weaker claim. I'm just explaining why, from my point of view, your analogy doesn't seem to make sense as an argument against my post and why I don't find it persuasive at all (and why I don't think anyone in my shoes would or should find it persuasive). I don't understand why you would interpret this as me retreating into a weaker claim.

If you think the analogy supports a sound argument, then maybe it would be helpful for me if you spelled out the logic step-by-step, including noting what's a premise and what's an inference.

I disagree that a rigorous Turing test is passable by any current AI systems, but that disagreement might just cash out as a disagreement about how rigorous a Turing test ought to be. For what it's worth, in Ray Kurzweil's 2010 movie The Singularity is Near, the AI character is grilled by a panel of judges for hours. That's what I have in mind for a proper Turing test and I think if you did that sort of test with competent judges, no LLM could pass it. Even with a lot of resources.

If images are allowed in the Turing test, then you could send the test subjects the ARC-AGI-2 tasks. Humans can solve these tasks with a ~100% success rate and the best AI systems are under 30%. So, it's in fact known for a certainty that current AI systems would fail a Turing test if it's allowed to be as rigorous as I'm imagining, or as Kurzweil imagined in 2010.

I just mean the word "hypothetical" to mean hypothetical in the conventional way that term is used. The more important and substantive part of what I'm saying is about David Deutsch's point that predicting the content of new science is equivalent to creating it, so, in some sense, the content of new science is unpredictable (unless you count creating new science as predicting it). I'm not sure I can even say what new science is likely or not. Just that I don't know and that no one can know, and maybe scientists working on new ideas have good reasons for their convictions, but there is a scientific process in place for them convincing other people of their convictions and they should follow that process. I think trying to guess probabilities about the content of new science is nearly a pointless exercise, especially when you consider the opportunity cost.

Steven Byrnes2d2

I don't think I'm retreating into a weaker claim. I'm just explaining why, from my point of view, your analogy doesn't seem to make sense as an argument against my post and why I don't find it persuasive at all (and why I don't think anyone in my shoes would or should find it persuasive). I don't understand why you would interpret this as me retreating into a weaker claim.

If you’re making the claim:

The probability that a new future AI paradigm would take as little as 7 years to go from obscure arxiv papers to AGI, is extremely low (say, <10%).

…then presumably you should have some reason to believe that. If your position is “nobody can possibly know how long it will take”, then that obviously is not a reason to believe that claim above. Indeed, your OP didn’t give any reason whatsoever, it just said “extremely unlikely” (“Even if, by chance, it were discovered soon, it's extremely unlikely it would make it all the way from conception to working AGI system within 7 years.”)

Then my top comment was like:

Gee, a lot can happen in 7 years in AI, including challenges transitioning from ‘this seems wildly beyond SOTA and nobody has any clue where to even start’ to ‘this is so utterly trivial that we take it for granted and collectively forget it was ever hard’, and including transitioning from ‘kinda the first setup of this basic technique that anyone thought to try’ to ‘a zillion iterations and variations of the technique have been exhaustively tested and explored by researchers around the world’, etc. That seems like a reason to start somewhere like, I dunno, 50-50 on ≤7 years, as opposed to <10%. 50-50 is like saying ‘some things in AI take less than 7 years, and other things take more than 7 years, who knows, shrug’.

Then you replied here that “your analogy is not persuasive”. I kinda took that to mean: my example of LLM development does not prove that a future “obscure arxiv papers to AGI” transition will take ≤7 years. Indeed it does not! I didn’t think I was offering proof of anything. But you are still making a quite confident claim of <10%, and I am still waiting to see any reason at all explaining where that confidence is coming from. I think the LLM example above is suggestive evidence that 7 years is not some crazy number wildly outside the range of reasonable guesses for “obscure arxiv papers to AGI”, whereas you are saying that 7 years is in fact a pretty crazy number, and that sane numbers would be way bigger than 7 years. How much bigger? You didn’t say. Why? You didn’t say.

So that’s my evidence, and yes it’s suggestive not definitive evidence, but OTOH you have offered no evidence whatsoever, AFAICT.

Yarrow Bouchard🔸2d*3

Okay, I think I understand now, hopefully. Thank you for explaining. Your complaint is that I didn't try to substantiate why I think it's extremely unlikely for a new paradigm in AI to go from conception to a working AGI system in 7 years. That's a reasonable complaint.

I would never hold any of these sorts of arguments to the standard of "proving" something or establishing certainty. By saying the argument is not persuasive, I mean it didn't really shift me in one direction or the other.

The reason I didn't find your analogy persuasive is that I'm already aware of the progress there's been in AI since 2012 in different domains including computer vision, natural language processing, games (imitation learning and reinforcement learning in virtual environments), and robotics. So, your analogy didn't give me any new information to update on.

My reason for thinking it's extremely unlikely is just an intuition from observing progress in AI (and, to some extent, other fields). It seems like your analogy is an attempt to express your own intuition about this from watching AI progress. I can understand the intention now and I can respect that as a reasonable attempt at persuasion. It might be persuasive to someone in my position who is unaware of how fast some AI progress has been.

I think I was misinterpreting it too much as an argument with a clear logical structure and not enough as an attempt to express an intuition. I think as the latter it's perfectly fine, and it would be too much to expect the former in such a context.

I can't offer much in this context (I don't think anyone can). The best I can do is just try to express my intuition, like you did. What you consider fast or slow in terms of progress depends where you start and end and what examples you choose. If you pick deep learning as your example, and if you start at the invention of backpropagation in 1970 and end at AlexNet in 2011, that's 41 years from conception to realization.

A factor that makes a difference is there just seems little interest in funding fundamental AI research outside of the sort of ideas that are already in the mainstream. For example, Richard Sutton has said it's hard to get funding for fundamental AI research. It's easier for him given his renown as an AI researcher, but the impression I get is that fundamental research funding overall is scarce, and it's especially scarce if you're working on novel, unusual, off-the-beaten-path ideas. So, even if there is an arXiv paper out there somewhere that has the key insight or key insights needed to get to AGI, the person who wrote it probably can't get funded and they're probably now working on a McDonald's drive-through LLM.

[Edit: See my reply to this comment below for an important elaboration on why I think getting from an arXiv paper to AGI within 7 years is unlikely.]

Out of curiosity, what do you think of my argument that LLMs can't pass a rigorous Turing test because a rigorous Turing test could include ARC-AGI 2 as a subset (and, indeed, any competent panel of judges should include it) and LLMs can't pass that? Do you agree? Do you think that's a higher level of rigour than a Turing test should have and that's shifting the goal posts?

Yarrow Bouchard🔸10h*3

I should add, fairly belatedly, another point of comparison. Two Turing Award-winning AI researchers, Yann LeCun and Richard Sutton, each have novel fundamental ideas — not based on scaling LLMs or other comparably mainstream ideas — for how to get to AGI. (A few days ago, I wrote a comment about this here.)

In a 2024 interview, Yann LeCun said he thought it would take "at least a decade and probably much more" to get to AGI or human-level AI by executing his research roadmap. Trying to pinpoint when ideas first started is a fraught exercise. If we say the start time is the 2022 publication of LeCun's position paper "A Path Towards Autonomous Machine Intelligence", then by LeCun's own estimate, the time from publication to human-level AI is at least 12 years and "probably much more".

In another 2024 interview, Richard Sutton said he thinks there's a 25% chance by 2030 we'll "understand intelligence", although it's unclear to me if he imagines by 2030 there's a 25% chance we'll actually build AGI (or be in a position to do so straightforwardly) or just have the fundamental theoretical knowledge required to do so. The equivalent paper co-authored by Sutton is "The Alberta Plan for AI Research", coincidentally also published in 2022. So, Sutton's own estimate is a 25% chance of success in 8 years, although it's not clear if success here means actually building AGI or a different goal.

But, crucially, I also definitely don't think we should just automatically accept these numbers. (I also discussed this in my previous comment about this here.) Researchers like Yann LeCun and Richard Sutton have a very high level of self-belief, which I think is psychologically healthy and rational. It is good to be this ambitious. But we shouldn't think of these as predictions or forecasts, but rather as goals.

LeCun himself has explicitly said you should be skeptical of anyone who says they have found the secret to AGI and will deliver it ten years, including him (as I discussed here). Which of course is very reasonable!

In the 2024 interview, Sutton said:

I think we should strive for, like, you know, 2030, and knowing that we probably won't succeed, but you have to try.

This was in response to one of the interviewers noting that Sutton had said "decades", plural, when he said "these are the decades when we're going to figure out how the mind works."

We have good reason to be skeptical if we look at predictions from people in AI that have now come false, such as Dario Amodei's incorrect prediction about AI writing 90% of code by mid-September 2025 or, for that matter, his prediction made 2 years and 2 months ago that we could have something that sounds a lot like AGI in 2 or 3 years, which still has 10 months left to go but looks extremely dubious. As I mentioned in the post, there's also Geoffrey Hinton's prediction about radiology getting automated and various wrong predictions from various people in AI about widespread fully autonomous driving.

So, to summarize: what Yann LeCun and Richard Sutton are saying is already much more conservative than a trajectory from publishing a paper to building AGI within 7 years. They both tell us to be skeptical of even the timelines they lay out. And, independent of whether they tell us to be skeptical or not, based on the track record of similar predictions, we have good reason to be skeptical.

To me, this seems to be the much more apt point of comparison than the progress of LLMs from 2018 to 2025.

Steven Byrnes2d4

Thanks!

Out of curiosity, what do you think of my argument that LLMs can't pass a rigorous Turing test because a rigorous Turing test could include ARC-AGI 2 as a subset (and, indeed, any competent panel of judges should include it) and LLMs can't pass that? Do you agree? Do you think that's a higher level of rigour than a Turing test should have and that's shifting the goal posts?

I think we both agree that there are ways to tell apart a human from an LLM of 2025, including handing ARC-AGI-2 to each.

Whether or not that fact means “LLMs of 2025 cannot pass the Turing Test” seems to be purely an argument about the definition / rules of “Turing Test”. Since that’s a pointless argument over definitions, I don’t really care to hash it out further. You can have the last word on that. Shrug :-P

Yarrow Bouchard🔸2d3

Okay, since you're giving me the last word, I'll take it.

There are some ambiguities in terms of how to interpret the concept of the Turing test. People have disagreed about what the rules should be. I will say that in Turing's original paper, he did introduce the concept of testing the computer via sub-games:

Q: Do you play chess?

A: Yes.

Q: I have K at my K1, and no other pieces. You have only K at K6 and R at R1. It is your move. What do you play?

A: (After a pause of 15 seconds) R-R8 mate.

Including other games or puzzles, like the ARC-AGI 2 puzzles, seems in line with this.

My understanding of the Turing test has always been that there should be basically no restrictions at all — no time limit, no restrictions on what can be asked, no word limit, no question limit.

In principle, I don't see why you wouldn't allow sending of images, but if you only allowed text-based questions, I suppose even then a judge could tediously write out the ARC-AGI 2 tasks, since they consist of coloured squares in a 30 x 30 grid, and ask the interlocutor to re-create them in Paint.

To be clear, I don't think ARC-AGI 2 is nearly the only thing you could use to make an LLM fail the Turing test, it's just an easy example.

In Daniel Dennett's 1985 essay "Can Machines Think?" on the Turing test (included in the anthology Brainchildren), Dennett says that "the unrestricted test" is "the only test that is of any theoretical interest at all". He emphasizes that judges should be able to ask anything:

People typically ignore the prospect of having the judge ask off-the-wall questions in the Turing test, and hence they underestimate the competence a computer would have to have to pass the test. But remember, the rules of the imitation game as Turing presented it permit the judge to ask any question that could be asked of a human being—no holds barred.

He also warns:

Cheapened versions of the Turing test are everywhere in the air. Turing's test is not just effective, it is entirely natural—this is, after all, the way we assay the intelligence of each other every day. And since incautious use of such judgments and such tests is the norm, we are in some considerable danger of extrapolating too easily, and judging too generously, about the understanding of the systems we are using.

It's true that before we had LLMs we had lower expectations of what computers can do and asked easier questions. But it doesn't seem right to me to say that as computers get better at natural language, we shouldn't be able to ask harder questions.

I do think the definition and conception of the Turing test is important. If people say that LLMs have passed the Turing test and that's not true, it gives a false impression of LLMs' capabilities, just like when people falsely claim LLMs are AGI.

You could qualify this by saying LLMs can pass a restricted, weak version of the Turing test — but not an unrestricted, adversarial Turing test — which was also true of older computer systems before deep learning. This would sidestep the question of defining the "true" Turing test and still give accurate information.

NunoSempere3d20

At least an equal level of data efficiency
...
This is the only kind of AI system that could plausibly automate all human labour

Your bar is too high, you can automate all human labour with less data efficiency.

Yarrow Bouchard🔸3d-1

This apparently isn’t true for autonomous driving and it’s probably even less true in a lot of other domains. If an AI system can’t respond well to novelty, it can’t function in the world because novelty occurs all the time. For example, how can AI automate the labour of scientists, philosophers, and journalists if it can’t understand novel ideas?

Davidmanheim3d9

For autonomous driving, current approaches which "can't deal with novelty" are already far safer than human drivers.

Yarrow Bouchard🔸2d-5

Charlie_Guthmann3d3

For example, how can AI automate the labour of scientists, philosophers, and journalists if it can’t understand novel ideas?

The bar is much lower because they are 100x faster and 1000x cheaper than me. They open up a bunch of brute forceable techniques in the same way that you can open up https://projecteuler.net/ solve many of eulers discoveries with little math knowledge but basic python and for loops.

Math -> re read every arxiv paper -> translate them all into lean -> aggregate every open well specificied math problem -> use the database of all previous learnings to see if you can chain chunks of previous problems together to solve.

clinical medicine -> re-read every RCT ever done and comprehensively rank intervention effectiveness by disease -> find cost data where available and rank the cost/qaly of all disease/intervention space

Econometrics -> aggregate every natural experiment and instrumental variable ever used in an econometrics paper -> think about other use cases for these tools -> search if other use cases have available data -> reapply the general theory of the original paper with the new data.

Yarrow Bouchard🔸2d2

I'm not sure if I understand what you're arguing.

First, why do you think LLMs haven't already done of any of these things?

Second, even if LLMs could do these things, they couldn't automate all of human labour, and this isn't an argument that they could. This is an argument that LLMs could do some really useful things, not that they could do all the useful things that human workers do.

Unless, I guess, if you think there's no such thing as something so novel it can't be understood by LLMs based on existing knowledge, but then this would be equivalent to arguing that LLMs have or will have a very high level of data efficiency.

Charlie_Guthmann2d1

i'm fleshing out nunos point a bit. Basically AI have so many systematic advantages with their cost/speed/seemless integration into the digital world that they can afford to be worse than humans at a variety of things and still automate (most/all/some) work. Just as a plane doesn't need to flap it's wings. Of course I wasn't saying I solved automating the economy. I'm just showing you ways in which something lacking some top level human common sense/iq/whatever could replace still.

FWIW I basically disagree with every point you made in the summary. This mostly just comes from using these tools every day and getting utility out of them + seeing how fast they are improving + seeing how many different routes there are to improvement (i was quite skeptical a year ago, not so anymore). But I wanted to keep the argument contained and isolate a point of disagreement.

Yarrow Bouchard🔸2d2

I want to try to separate out a few different ideas because I worry they might get confused together.

Are actual existing LLMs good at discovering novel ideas? No. They haven't discovered anything useful in any domain yet. They haven't come up with any interesting new idea in science, math, economics, medicine, or anything.
Could LLMs eventually discover novel ideas in the way you described? I don't think so. I think you're saying you think this will happen. Okay, so, why? What are LLMs missing now that they will have in, say, 5 years that will mean they make the jump from zero novel ideas to lots of novel ideas? Is it just scale?
Would an AI system that can't learn new ideas from one example or a few examples count as AGI? No, I don't think so.
Would an AI system that can't learn new ideas from one example or a few examples be able to automate all human labour? No, I don't think so because this kind of learning is part of many different jobs, such as scientist, philosopher, and journalist, and also taxi driver (per the above point about autonomous vehicles).

I do use ChatGPT every day and find it to be a useful tool for what I use it for, which is mainly a form of search engine. I used ChatGPT when it first launched, as well as GPT-4 when it first launched, and have been following the progress.

Everything is relative to expectations. If I'm judging ChatGPT based on the expectations of a typical consumer tech product, or even a cool AI science experiment, then I do find the progress impressive. On the other hand, if I'm judging ChatGPT as a potential precursor to AGI, I don't find the progress particularly impressive.

I guess I don't see the potential routes to improvement that you see. The ones that I've seen discussed don't strike me as promising.

Charlie_Guthmann2d2

https://x.com/slow_developer/status/1979157947529023997
I would bet a lot of money you are going to see exactly what I described for math in the next two years. The capabilities literally just exploded. It took us like 20 years to start using the lightbulb but you are expecting results from products that came out in the last few weeks/months.

I can also confidently say because I am working on a project with doctors that the work I described for clinical medicine is being tested and happening right now. It's exact usefulness remains to be seen but like people are trying exactly what I described, there will be some lag as people need to learn how to use the tools best and then distribute their results.

Again, I don't think most of this stuff was particularly useful with the tools available to use >1 year ago.

>Would an AI system that can't learn new ideas from one example or a few examples count as AGI?

https://www.anthropic.com/news/skills
you are going to need to be a lot more precise in your definitions imo otherwise we are going to talk past each other.

Yarrow Bouchard🔸2d1

The math example you cited doesn't seem to an example of an LLM coming up with a novel idea in math. It just sounds like mathematicians are using an LLM as a search tool. I agree that LLMs are really useful for search, but this is a far cry from an LLM actually coming up with a novel idea itself.

The point you raise about LLMs doing in-context learning is ably discussed the video I embedded in the post.

Charlie_Guthmann2d1

"novel idea" means almost nothing to me. A math proof is simply a->b. It doesn't matter how you figure out a->b. If you can figure it out by reading 16 million papers and clicking them together that still counts. There are many ways to cook an egg.

Yarrow Bouchard🔸2d1

I don't think the LLMs in this case are clicking them together. Rather, it seems like the LLMs are being used as a search tool for human mathematicians who are clicking them together.

If you could give the LLM a prompt along the lines of, "Read the mathematics literature and come up with some new proofs based on that," and it could do it, then I would count that as an LLM successfully coming up with a proof, and with a novel idea.

Based on the tweets you linked to, what seems to be happening is that the LLMs are being used as a search tool like Google Scholar, and it's the mathematicians coming up with the proofs, not the search engine.

Charlie_Guthmann2d1

Sure that's a fair point. I'd guess I hope you would feel at least a little pushed in the direction after this thread that AIs need not take a similar route to humans to automating large amounts of our current work.

Yarrow Bouchard🔸1d*1

LLMs may have some niches in which they enhance productivity, such as by serving as an advanced search engine or text search tool for mathematicians. This is quite different than AGI and quite different from either:

a) LLMs having a broad impact on productivity across the economy (which would not necessarily amount to AGI but which would be economically significant)

b) LLMs fully automating jobs by acting autonomously and doing hierarchical planning over very long time horizons (which is the sort of thing AGI would have to be capable of doing to meet the conventional definition of AGI).

If you want to argue LLMs will get from their current state where they can’t do (a) or (b) to a state where they will be able to do (a) and/or (b), then I think you have to address my arguments in the post about LLMs’ apparent fundamental weaknesses (e.g. the Tower of Hanoi example seems stark to me) and what I said about the obstacles to scaling LLMs further (e.g. Epoch AI estimates that data may run out around 2028).

Davidmanheim3d-1

AI will hunt down the last remaining human, and with his last dying breath, humanity will end - not with a bang, but with a "you don't really count as AGI"

Rasool2d3

In March 2025, Dario Amodei, the CEO of Anthropic, predicted that 90% of code would be written by AI as early as June 2025 and no later than September 2025. This turned out to be dead wrong.

Amodei claims that 90% of code at Anthropic (and some companies they work with) is being written by AI

Yarrow Bouchard🔸2d1

His prediction was about all code, not Anthropic's code, so his prediction is still false. The article incorrectly states in the italicized section under the title (I believe it's called the deck) the prediction was about Anthropic's code, but this is what he said in March 2025:

I think we will be there in three to six months, where AI is writing 90% of the code. And then, in 12 months, we may be in a world where AI is writing essentially all of the code

There was no qualifier that this was only about Anthropic's code. It's about all code.

I'll be blunt: I think Dario saying "Some people think that prediction is wrong" is dishonest. If you make a prediction and it's wrong, you should just admit that it's wrong.

niplav1h3

The relevant is this timestamp in an interview. Relevant part of the interview:

But now, getting to the job side of this, I do have a fair amount of concern about this. On one hand, I think comparative advantage is a very powerful tool. If I look at coding, programming, which is one area where AI is making the most progress, what we are finding is we are not far from the world—I think we'll be there in three to six months—where AI is writing 90 percent of the code. And then in twelve months, we may be in a world where AI is writing essentially all of the code. But the programmer still needs to specify what the conditions of what you're doing are, what the overall app you're trying to make is, what the overall design decision is. How do we collaborate with other code that's been written? How do we have some common sense on whether this is a secure design or an insecure design? So as long as there are these small pieces that a programmer, a human programmer, needs to do, the AI isn't good at, I think human productivity will actually be enhanced. But on the other hand, I think that eventually all those little islands will get picked off by AI systems. And then we will eventually reach the point where the AIs can do everything that humans can. And I think that will happen in every industry.

For what it's worth at the time I thought he was talking about code at Anthropic, and another commenter agreed. The "we are finding" indicates to me that it's at Anthropic. Claude 4.5 Sonnet disagree with me and says that it can be read as being about the entire world.

(I really hope you're right and the entire AI industry goes up in flames next year.)

Yarrow Bouchard🔸35m4

To me, that quote really sounds like it's about code in general, not code at Anthropic.

Dario's own interpretation of the prediction, even now that it's come false, seems to be about code in general, based on this defense:

I made this prediction that, you know, in six months, 90% of code would be written by AI models. Some people think that prediction is wrong, but within Anthropic and within a number of companies that we work with, that is absolutely true now.

If the prediction was just about Anthropic's code, you'd think he would just say:

I made this prediction that in six months 90% of Anthropic's code would be written by AI and now within Anthropic that is absolutely true now.

What he actually said comes across as a defense of a prediction he knows was at least partially falsified or is at least in doubt. If he just meant 90% of Anthropic's code would be written by AI, he could just say he was unambiguously right and there's no doubt about it.

Edit:

To address the part of your comment that changed after you edited it, in my interpretation, "we are finding" just means "we are learning" or "we are gaining information that" and is general enough that it doesn't by itself support any particular interpretation. For example, he could have said:

...what we are finding is we are not far from the world—I think we'll be there in three to six months—where AI is writing 90 percent of grant applications.

I wouldn't interpret this to mean that Anthropic is writing any grant applications at all. My interpretation wouldn't be different with or without the "what we are finding" part. If he just said, "I think we are not far from the world...", to me, that would mean exactly the same thing.