

Note: This is a fairly rough post adapted from some comments I recently wrote. I put enough work into those comments that I figured I should probably turn them into a post. So, although this post is technically not a draft, it isn't written the way I would normally write a post: it's less polished and more off the cuff. If you think I should remove the Draft Amnesty tag, please say so, and I will!

I very, very strongly believe there’s essentially no chance of artificial general intelligence (AGI) being developed within the next 7 years.[1] Previously, I wrote a succinct list of reasons to doubt near-term AGI here. For example, it might come as a surprise that around three quarters of AI experts don’t think scaling up large language models (LLMs) will lead to AGI!

For a more technical and theoretical view (from someone with better credentials), I highly recommend this recent video by an AI researcher who makes a strong case for skepticism of near-term AGI and of LLMs as a path to AGI.

In this post, I will give more detailed arguments about why I think near-term AGI is so unlikely and why I think LLMs won't scale to AGI. 

Clarification of terms

By essentially no chance, I mean less than the chance of Jill Stein running as the Green Party candidate in 2028 and winning the U.S. Presidency. Or, if you like, less than the chance Jill Stein had of winning the presidency in 2024. I mean it’s an incredible long shot, significantly less than a 0.1% chance. (And if I had to give a number on my confidence of this, it would be upwards of 95%.)

By artificial general intelligence (AGI), I mean a system that can think, plan, learn, and solve problems just like humans do, with:

  • At least an equal level of data efficiency (e.g. if a human can learn from one example, AGI must also be able to learn from one example, not, say, one million)
  • At least an equal level of reliability (e.g. if humans do a task correctly 99.999% of the time, AGI must match or exceed that)
  • At least an equal level of fluidity or adaptability to novel problems and situations (e.g. if a human can solve a problem with zero training examples, AGI must be able to as well)
  • At least an equal ability to generate novel and creative ideas

This is the only kind of AI system that could plausibly automate all human labour or cunningly take over the world. No existing AI system is anything like this. 

Some recent AI predictions that turned out to be wrong

In March 2025, Dario Amodei, the CEO of Anthropic, predicted that 90% of code would be written by AI as early as June 2025 and no later than September 2025. This turned out to be dead wrong. 

In 2016, the renowned AI researcher Geoffrey Hinton predicted that AI would automate away all radiology jobs by 2021, and that turned out to be dead wrong. Even 9 years later, the trend has moved in the opposite direction, and there is no indication radiology jobs will be automated away anytime soon. 

Many executives and engineers working in autonomous driving predicted we’d have widespread fully autonomous vehicles long ago; some of them have thrown in the towel. Cruise Automation, for example, is no more. 

Given this, we should be skeptical when people argue that AI will soon have capabilities far exceeding the ability to code, the ability to do radiology, and the ability to drive. 

Benchmarks do not test general intelligence

One of the main arguments for near-term AGI is the performance of LLMs on various benchmarks. However, LLM benchmarks don't accurately represent what real intellectual tasks are actually like. 

First, the benchmarks are designed to be solvable by LLMs because they are primarily intended to measure LLMs against each other and to measure improvements in subsequent versions of the same model line (e.g. GPT-5 vs GPT-4o). There isn’t much incentive to create LLM benchmarks where LLMs stagnate around 0%.[2]

Second, for virtually all LLM tests or benchmarks, the definition of success or failure on the tasks has to be reduced to something simple enough that software can grade the task automatically. This is a big limitation. When I think about the sort of intellectual tasks that humans do, not a lot of them can be graded automatically. 
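To make concrete what "graded automatically" usually means, here is a minimal sketch of the scoring loop most benchmarks boil down to. The toy items, the grade() helper, and dummy_model are made up for illustration and aren't taken from any particular benchmark:

```python
# Minimal sketch of the scoring loop most LLM benchmarks reduce to: normalise the
# model's output and compare it to a reference string. The items, the grade()
# helper, and dummy_model are made up for illustration.

def grade(model_answer: str, reference: str) -> bool:
    """Exact-match grading after trivial normalisation."""
    return model_answer.strip().lower() == reference.strip().lower()

benchmark = [
    {"question": "What is the capital of France?", "reference": "Paris"},
    {"question": "2 + 2 = ?", "reference": "4"},
]

def dummy_model(question: str) -> str:
    # Stand-in for an LLM call; always answers "Paris" to keep the sketch runnable.
    return "Paris"

def evaluate(model, items) -> float:
    correct = sum(grade(model(item["question"]), item["reference"]) for item in items)
    return correct / len(items)

print(evaluate(dummy_model, benchmark))  # 0.5
```

Anything whose success criterion can't be collapsed into a comparison like grade(), such as an original research idea or a sound judgement call in a messy real-world situation, doesn't fit into this loop.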

Of course, there are written exams and tests with multiple choice answers, but these are primarily tests of memorization. We want AI systems that go beyond just memorizing things from huge numbers of examples. We want AI systems that can solve completely novel problems that aren’t a close match for anything in their training dataset. That’s where LLMs are incredibly brittle and start just generating nonsense, saying plainly false (and often ridiculous) things, contradicting themselves, hallucinating, etc. 

François Chollet, an AI researcher formerly at Google, gives some great examples of LLM brittleness in a talk here. Chollet also explains how these holes in LLM reasoning get manually patched by paying large workforces to write new training examples specifically to fix them. This creates an impression of increased intelligence, but the improvement isn't from scaling in these cases or from increases in general intelligence, it's from large-scale special casing.

AI in the economy

I think the most robust tests of AI capabilities are tasks that have real world value. If AI systems are actually doing the same intellectual tasks as human beings, then we should see AI systems either automating labour or increasing worker productivity. But we don’t see that.

I’m aware of two studies that looked at the impact of AI assistance on human productivity. One study on customer support workers found mixed results, including a negative impact on productivity for the most experienced employees. Another study, by METR, found a 19% reduction in productivity when coders used an AI coding assistant.

Non-AI companies that have invested in applying AI to the work they do are not seeing that much payoff. There might be some modest benefits in some niches. I’m sure there are at least a few. But LLMs are mostly turning out to be a disappointment in terms of their economic usefulness.[3]

If you think an LLM scoring more than 100 on an IQ test means it's AGI, then we've had AGI for several years. But clearly there's a problem with that inference, right? Memorizing the answers to IQ tests, or memorizing similar answers to similar questions that you can interpolate, doesn't mean a system actually has the kind of intelligence to solve completely novel problems that have never appeared on any test, or in any text. The same general critique applies to the inference that LLMs are intelligent from their results on virtually any LLM benchmark. Memorization is not intelligence.

If we instead look at performance on practical, economically valuable tasks as the test for AI's competence at intellectual tasks, then its competence appears quite poor. People who make the flawed inference from benchmarks just described say that LLMs can do basically anything. If they instead derived their assessment from LLMs' economic usefulness, it would be closer to the truth to say LLMs can do almost nothing. 

AI vs. puzzles

There is also some research on non-real-world tasks that supports the idea that LLMs are mass-scale memorizers with a modicum of interpolation or generalization to examples similar to what they've been trained on, rather than genuinely intelligent (in the sense that humans are intelligent). The Apple paper on "reasoning" models found surprisingly mixed results on common puzzles. The finding that sticks out most in my mind is that the models' performance on the Tower of Hanoi puzzle did not improve even after they were given the algorithm for solving it. Is that real intelligence?
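For context on how simple the relevant algorithm is, here is the standard textbook recursive solution. This is just the well-known algorithm, not code from the Apple paper:

```python
# The standard recursive Tower of Hanoi algorithm: a few lines of textbook code.
def hanoi(n: int, source: str, target: str, spare: str, moves: list) -> None:
    """Move n disks from source to target, using spare as the auxiliary peg."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # move the top n-1 disks out of the way
    moves.append((source, target))              # move the largest disk
    hanoi(n - 1, spare, target, source, moves)  # move the n-1 disks back on top

moves: list = []
hanoi(3, "A", "C", "B", moves)
print(len(moves), moves)  # 7 moves; in general 2**n - 1, so long sequences expose unreliability
```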

Problems with scaling up LLMs

Predictions of near-term AGI typically rely on scaling up LLMs. However, there is evidence that scaling LLMs is running out of steam:

  • Toby Ord's interview on the 80,000 Hours Podcast in June covered this topic really well. I highly recommend it.
  • Renowned AI researcher Ilya Sutskever, formerly the chief scientist at OpenAI (prior to voting to fire Sam Altman), has said he thinks the benefits from scaling LLM pre-training have plateaued.
  • There have been reports that, internally, employees at AI labs like OpenAI are disappointed with their models' progress.
  • GPT-5 doesn't seem like that much of an improvement over previous models.
  • Epoch AI's median estimate of when LLMs will run out of data to train on is 2028.
  • Epoch AI also predicts that compute scaling will slow down mainly due to how expensive it is and how wasteful it would be to overbuild.

It seems like there is less juice to squeeze from further scaling at the same time that squeezing is getting harder. And there may be an absolute limit to data scaling.
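As a rough back-of-envelope on the data ceiling (my own arithmetic with round assumed numbers, not Epoch AI's actual analysis): if compute-optimal training needs roughly 20 tokens per parameter, and the stock of usable public human-generated text is on the order of a few hundred trillion tokens, then data alone caps how far the current recipe can be pushed.

```python
# Back-of-envelope sketch (my own, with rough assumed numbers, not Epoch AI's analysis):
# how big a model could the public text stock support at compute-optimal training?

TOKENS_PER_PARAM = 20        # rough "Chinchilla-style" compute-optimal ratio
TEXT_STOCK_TOKENS = 3e14     # assumed ~300 trillion tokens of usable public text

max_params = TEXT_STOCK_TOKENS / TOKENS_PER_PARAM
max_flop = 6 * max_params * TEXT_STOCK_TOKENS   # training compute ~ 6 * params * tokens

print(f"Largest compute-optimal model the data supports: ~{max_params:.0e} parameters")
print(f"Implied training-compute ceiling: ~{max_flop:.0e} FLOP")
# With frontier training runs already estimated in the 1e25-1e26 FLOP range, that
# leaves only around two or three orders of magnitude of headroom from data alone
# (less if much of the stock is low quality or already effectively used).
```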

Agentic AI is unsuccessful

If you expand the scope of LLM performance beyond written prompts and responses to "agentic" applications, I think LLMs' failures are more stark and the models do not seem to be gaining mastery of these tasks particularly quickly. Journalists generally say that companies' demos of agentic AI don't work. 

I don't expect that performance on agentic tasks will rapidly improve. To train on text-based tasks, AI labs can get data from millions of books and large-scale scrapes of the Internet. There aren't similarly sized datasets for agentic tasks. 

In principle, you can use pure reinforcement learning without bootstrapping from imitation learning. However, while this approach has succeeded in domains with smaller spaces of possible actions, like Go, it has failed in domains with larger spaces of possible actions, like StarCraft. There hasn’t been any recent progress in reinforcement learning that would change this, and if someone is expecting a breakthrough sometime soon, they should explain why. 
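To give a sense of why the size of the action space matters so much for pure reinforcement learning, here is a toy comparison. The branching-factor figures are the rough ones commonly quoted (roughly 250 legal moves per position in Go, and an action space per time step often cited on the order of 10^26 for StarCraft II), not exact counts:

```python
# Toy illustration: the space of possible play sequences grows as branching_factor ** depth,
# so random exploration stumbles onto useful reward far less often when the per-step
# action space is huge. Figures are rough, commonly quoted ones, not exact counts.

from math import log10

def log10_sequences(branching_factor: float, depth: int) -> float:
    # log10(branching_factor ** depth), computed in log space to avoid overflow
    return depth * log10(branching_factor)

print(f"Go (~250 moves/position, ~150-move game): ~10^{log10_sequences(250, 150):.0f} sequences")
print(f"StarCraft II (~1e26 actions/step, ~1000 steps): ~10^{log10_sequences(1e26, 1000):.0f} sequences")
```

With a search space that much larger, reward signals discovered by random exploration become vanishingly rare, which is why bootstrapping from human demonstrations was needed in the StarCraft case.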

The current discrepancy between LLM performance on text-based tasks and agentic tasks also tells us something about whether LLMs are genuinely intelligent. What kind of PhD student can't use a computer? 

Conclusion

So, to summarize the core points of this post:

  • LLM benchmarks don't really tell us how genuinely intelligent LLMs are. They are designed to be easy for LLMs and to be automatically graded, which limits what can be tested.
  • On economically valuable tasks in real world settings, which I believe are much better tests than benchmarks, LLMs do quite poorly. This makes near-term AGI seem very unlikely.
  • LLMs fail all the time at tasks we would not expect them to fail at if they were genuinely intelligent, as opposed to relying on mass-scale memorization.
  • Scaling isn't a solution to the fundamental flaws in LLMs and, in any case, the benefits of scaling are diminishing at the same time that LLM companies are encountering practical limits that may slow compute scaling and slow or even stop data scaling.
  • LLMs are terrible at agentic tasks and there isn't enough training data for them to improve, if training data is what it takes. If LLMs are genuinely intelligent, we should ask why they can't learn agentic tasks from a small number of examples, since this is what humans do.

Eventually, there will be some AI paradigm beyond LLMs that is better at generality or generalization. However, we don't know what that paradigm is yet and there's no telling how long it will take to be discovered. Even if, by chance, it were discovered soon, it's extremely unlikely it would make it all the way from conception to working AGI system within 7 years. 

This is particularly true if running an instance of AGI requires a comparable amount of computation to a human brain. We don't really know how much computation the brain uses or what the equivalent would be for computers. Estimates vary a lot. A common figure that gets thrown around is that a human brain does about 1 exaflop of computation, which is roughly what each of the top three supercomputers in the world does. While building a supercomputer or the equivalent in a datacentre is totally feasible for AI labs deploying commercial products, it's not feasible for most AI researchers who want to test and iterate on new, unproven ideas.
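As a very rough sketch of what the 1-exaflop figure would mean in hardware terms (my own arithmetic, using approximate assumed figures for accelerator throughput, power, and price):

```python
# Very rough arithmetic on what "the brain does ~1 exaflop" would imply in hardware.
# All figures are approximate assumptions (accelerator throughput, power, price),
# chosen for order-of-magnitude flavour rather than precision.

BRAIN_FLOPS = 1e18       # assumed ~1 exaflop/s for the brain (the commonly quoted figure)
GPU_FLOPS = 1e15         # assumed ~1 petaflop/s per high-end accelerator (low precision)
GPU_POWER_W = 700        # assumed power draw per accelerator
GPU_PRICE_USD = 30_000   # assumed rough price per accelerator

n_gpus = BRAIN_FLOPS / GPU_FLOPS
print(f"Accelerators needed: ~{n_gpus:,.0f}")
print(f"Power draw: ~{n_gpus * GPU_POWER_W / 1e3:,.0f} kW (a brain runs on ~20 W)")
print(f"Hardware cost: ~${n_gpus * GPU_PRICE_USD / 1e6:,.0f} million")
# Plausible for a lab running one flagship system; much less so for a research group
# that wants to iterate cheaply on many unproven ideas in parallel.
```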

A much more pessimistic estimate is that to match the computation of a human brain, we would need to run computers that consume somewhere between about 300 and 300 billion times as much electricity as the entire United States. I hope this isn't true because it seems like it would make progress on AI and brain simulation really hard, but I don't know.

There are more points I could touch on, some of which I briefly mentioned in a previous post. But I think, for now, this is enough.

  1. ^

    I'm focused on the 7-year time horizon because this is what seems most relevant for effective altruism, given that a lot of people in EA seem to believe AGI will be created in 2-7 years. 

  2. ^

    However, it would be easy to do so, especially if you're willing to do manual grading. For example, task an LLM with making stock picks that achieve alpha (that one you could even grade automatically). Try to coax LLMs into coming up with a novel scientific discovery or theoretical insight; despite trillions of tokens generated, it hasn't happened yet. Tasks related to computer use and "agentic" use cases are also sure to lead to failures. For example, have a model play a video game it's never seen before (e.g. because the game just came out). Or, if the game is slow-paced enough, have it simply give you instructions on how to play. You can abstract out the computer vision aspect of these tests if you want, although it's worth asking how we're going to have AGI if it can't see. 
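    To illustrate the stock-picking example, grading for alpha really can be mechanical once you pin down a definition. A minimal sketch, with made-up return series, might look like this:

```python
# Minimal sketch of automatically grading stock picks for "alpha": compare the picks'
# returns against a benchmark index. The return series below are made up.

portfolio_returns = [0.010, -0.020, 0.005, 0.030]   # hypothetical daily returns of the picks
benchmark_returns = [0.008, -0.010, 0.004, 0.020]   # hypothetical daily returns of an index

excess = [p - b for p, b in zip(portfolio_returns, benchmark_returns)]
mean_excess = sum(excess) / len(excess)
print(f"Average daily excess return: {mean_excess:.4%}")
# A pass/fail criterion could be a statistically significant positive excess return
# over a long enough window, which needs no human judgement to apply.
```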

  3. ^

    From a Reuters article published recently: 

    A BofA Global Research's monthly fund manager survey revealed that 54% of investors said they thought that AI stocks were in a bubble compared with 38% who do not believe that a bubble exists.

    However, you'd think if this accurately reflected the opinions of people in finance, the bubble would have popped already.

Comments (7)



Eventually, there will be some AI paradigm beyond LLMs that is better at generality or generalization. However, we don't know what that paradigm is yet and there's no telling how long it will take to be discovered. Even if, by chance, it were discovered soon, it's extremely unlikely it would make it all the way from conception to working AGI system within 7 years.

Suppose someone said to you in 2018:

There’s an AI paradigm that almost nobody today has heard of or takes seriously. In fact, it’s little more than an arxiv paper or two. But in seven years, people will have already put hundreds of billions of dollars and who knows how many gazillions of hours into optimizing and running the algorithms; indeed, there will be literally 40,000 papers about this paradigm already posted on arxiv. Oh and y’know how right now world experts deploying bleeding-edge AI technology cannot make an AI that can pass an 8th grade science test? Well y’know, in seven years, this new paradigm will lead to AIs that can nail not only PhD qualifying exams in every field at once, but basically every other written test too, including even the international math olympiad with never-before-seen essay-proof math questions. And in seven years, people won’t even be talking about the Turing test anymore, because it’s so obviously surpassed. And… [etc. etc.]

I think you would have read that paragraph in 2018, and described it as “extremely unlikely”, right? It just sounds completely absurd. How could all that happen in a mere seven years? No way.

But that’s what happened!

So I think you should have wider error bars around how long it takes to develop a new AI paradigm from obscurity to AGI. It can be long, it can be short, who knows.

(My actual opinion is that this kind of historical comparison understates how quickly a new AI paradigm could develop, because right now we have lots of resources that did not exist in 2018, like dramatically more compute, better tooling and frameworks like PyTorch and JAX, armies of experts on parallelization, and on and on. These were bottlenecks in 2018, without which we presumably would have gotten the LLMs of today years earlier.)

(My actual actual opinion is that superintelligence will seem to come almost out of nowhere, i.e. it will be just lots of obscure arxiv papers until superintelligence is imminent. See here. But if you don’t buy that strong take, fine, go with the weaker argument above.)

This is particularly true if running an instance of AGI requires a comparable amount of computation as a human brain.

My own controversial opinion is that the human brain requires much less compute than the LLMs of today. Details here. You don’t have to believe me, but you should at least have wide error bars around this parameter, which makes it harder to argue for a bottom line of “extremely unlikely”. See also Joe Carlsmith’s report which gives a super wide range.

At least an equal level of data efficiency

...

This is the only kind of AI system that could plausibly automate all human labour

Your bar is too high, you can automate all human labour with less data efficiency.

AI will hunt down the last remaining human, and with his last dying breath, humanity will end - not with a bang, but with a "you don't really count as AGI"

This apparently isn’t true for autonomous driving and it’s probably even less true in a lot of other domains. If an AI system can’t respond well to novelty, it can’t function in the world because novelty occurs all the time. For example, how can AI automate the labour of scientists, philosophers, and journalists if it can’t understand novel ideas?

For autonomous driving, current approaches which "can't deal with novelty" are already far safer than human drivers.

For example, how can AI automate the labour of scientists, philosophers, and journalists if it can’t understand novel ideas?

The bar is much lower because they are 100x faster and 1000x cheaper than me. They open up a bunch of brute-forceable techniques, in the same way that you can open up https://projecteuler.net/ and solve many of Euler's discoveries with little math knowledge but basic Python and for loops. 

Math -> re-read every arXiv paper -> translate them all into Lean -> aggregate every open, well-specified math problem -> use the database of all previous learnings to see if you can chain chunks of previous problems together to solve them. 

Clinical medicine -> re-read every RCT ever done and comprehensively rank intervention effectiveness by disease -> find cost data where available and rank the cost/QALY across the whole disease/intervention space.

Econometrics -> aggregate every natural experiment and instrumental variable ever used in an econometrics paper -> think about other use cases for these tools -> search if other use cases have available data -> reapply the general theory of the original paper with the new data. 

In March 2025, Dario Amodei, the CEO of Anthropic, predicted that 90% of code would be written by AI as early as June 2025 and no later than September 2025. This turned out to be dead wrong. 

Amodei claims that 90% of code at Anthropic (and some companies they work with) is being written by AI.
