Pronouns: she/her or they/them.
I got interested in effective altruism back before it was called effective altruism, back before Giving What We Can had a website. Later on, I got involved in my university EA group and helped run it for a few years. Now I’m trying to figure out where effective altruism can fit into my life these days and what it means to me.
I write on Substack, and used to write on Medium.
Tesla’s production fleet is constrained by the cost of its production hardware, but its internal test fleet or robotaxi fleet could easily use $100,000+ hardware if Tesla wanted. If that were enough for dramatically better performance, it would make for a flashy demo, which would probably be great for Tesla’s share price, so Tesla is incentivized to do this.
What’s your prediction about when AI will write 90% of commercial, production code? If you think it’s within a year from now, you can put me on the record as predicting that won’t happen.
It’s not just self-driving or coding where AI isn’t living up to the most optimistic expectations. There has been very little success in using LLMs and generative AI tools for commercial applications across the board. Demand for human translators has continued to increase since GPT-4 was released (although it may have grown less than it would have otherwise). You’d think that if generative AI were good at any commercially valuable task, it would be translation. (Customer support chat is another area with some applicability, but results are mixed, and LLMs are only an incremental improvement over the Software 1.0, pre-LLM chatbots that already existed.) This is why I say we’re most likely in an AI bubble. It’s not just that optimistic expectations in a few domains have gotten ahead of their skis; it’s the aggregate of all commercially relevant domains.
One more famous AI prediction I didn’t mention in this post is the Turing Award-winning AI researcher Geoffrey Hinton’s prediction in 2016 that deep learning would automate all radiology jobs by 2021. Even in 2026, he couldn’t be more wrong. Demand for radiologists and radiologists’ salaries have been on the rise. We should be skeptical of brazen predictions about what AI will soon be able to do, even from AI luminaries, given how wrong they’ve been before.
In footnote 2 on this post, I said I wouldn’t be surprised if, on January 1, 2026, the top score on ARC-AGI-2 was still below 60%. It did turn out to be below 60%, although only by 6%. (Elon Musk’s prediction of AGI in 2025 was wrong, obviously.)
The score the ARC Prize Foundation ascribes to human performance is 100%, rather than 60%. 60% is the average for individual humans, but 100% is the score for a "human panel", i.e. a group of at least two humans. Note the large discrepancy between the average individual score and the panel score. The human testers were random people off the street who got paid $115-150 to show up and an additional $5 per task they solved. I believe the ARC Prize Foundation’s explanation for the 40-point discrepancy is that many of the testers just didn’t feel that motivated to solve the tasks and gave up. (I vaguely remember this being mentioned in a talk or interview somewhere.)
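To make the gap concrete, here’s a minimal sketch with made-up numbers (not the actual ARC testing data), assuming the panel score counts a task as solved if at least two testers in the pool solved it, which is my reading of how the panel figure works:

```python
import numpy as np

rng = np.random.default_rng(0)

n_testers, n_tasks = 400, 120

# Hypothetical data: True means that tester solved that task.
# Skill varies a lot between testers (some persist, some give up early).
skill = rng.uniform(0.2, 0.95, size=n_testers)
solved = rng.random((n_testers, n_tasks)) < skill[:, None]

# Average individual score: each tester's fraction of tasks solved, averaged.
individual_avg = solved.mean(axis=1).mean()

# "Panel" score: fraction of tasks solved by at least two testers in the pool.
panel_score = (solved.sum(axis=0) >= 2).mean()

print(f"average individual score: {individual_avg:.0%}")  # roughly 55-60% with these made-up numbers
print(f"panel score:              {panel_score:.0%}")      # at or near 100%
```

The exact figures are invented; the point is just that aggregating over many testers hides a lot of individual variation (including testers who gave up), so a 60% individual average and a ~100% panel score aren’t contradictory.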
ARC’s Grand Prize requires scoring 85% (and abiding by certain cost/compute efficiency limits). They say the 85% target score is "somewhat arbitrary".
I decided to go with the 60% figure in this post to go easy on the LLMs.
If you haven’t already, I recommend looking at some examples of ARC-AGI-2 tasks. Notice how simple they are. These are just little puzzles. They aren’t that complex. Anyone can do one in a few minutes, even a kid. It helps to see what we’re actually measuring here.
The computer scientist Melanie Mitchell has a great recent talk on this. The whole talk is worth watching, but the part about ARC-AGI-1 and ARC-AGI-2 starts at 21:50. She gives examples of the sorts of mistakes LLMs (including o1-pro) make on ARC-AGI tasks and her team’s variations on them. These are really, really simple mistakes. It’s worth looking at the example tasks and the example mistakes to get a sense of how rudimentary LLMs’ capabilities are.
I’m interested to see ARC-AGI-3 when it launches. ARC-AGI-3 is interactive, and there is more variety in the tasks. Just as AI models themselves need to be iterated on, benchmarks need to be iterated on. It’s difficult to make a perfect product or technology on the first try. So, hopefully François Chollet and his colleagues will make better and better benchmarks with each new version of ARC-AGI.
Unfortunately, the AI researcher Andrej Karpathy has been saying some pretty discouraging things about benchmarks lately. From a tweet in November:
I usually urge caution with public benchmarks because imo they can be quite possible to game. It comes down to discipline and self-restraint of the team (who is meanwhile strongly incentivized otherwise) to not overfit test sets via elaborate gymnastics over test-set adjacent data in the document embedding space. Realistically, because everyone else is doing it, the pressure to do so is high.
I guess the most egregious publicly known example of an LLM company juicing its numbers on benchmarks was when Meta gamed (cheated on?) some benchmarks with Llama 4. Meta’s former chief AI scientist, Yann LeCun, said in a recent interview that Mark Zuckerberg "basically lost confidence in everyone who was involved in this" (which didn’t include LeCun, who worked in a different division), many of whom have since departed the company.
However, I don’t know where LLM companies draw the line between acceptable gaming (or cheating) and unacceptable gaming (or cheating). For instance, I don’t know if LLM companies are creating their own training datasets with their own versions of ARC-AGI-2 tasks and training on that. It may be that the more an LLM company pays attention to and cares about a benchmark, the less meaningful a measurement it is (and vice versa).
Karpathy again, this time in his December LLM year in review post:
Related to all this is my general apathy and loss of trust in benchmarks in 2025. The core issue is that benchmarks are almost by construction verifiable environments and are therefore immediately susceptible to RLVR and weaker forms of it via synthetic data generation. In the typical benchmaxxing process, teams in LLM labs inevitably construct environments adjacent to little pockets of the embedding space occupied by benchmarks and grow jaggies to cover them. Training on the test set is a new art form.
I think probably one of the best measures of AI capabilities is AI’s ability to do economically useful or valuable tasks, in real world scenarios, that can increase productivity or generate profit. This is a more robust test — it isn’t automatically gradable, and it would be very difficult to game or cheat on. To misuse the roboticist Rodney Brooks’ famous phrase, "The world is its own best model." Rather than test on some simplified, contrived proxy for real world tasks, why not just test on real world tasks?
Moreover, someone has to pay for people to create benchmarks, and to maintain, improve, and operate them. There isn’t a ton of money to do so, especially not for benchmarks like ARC-AGI-2. But there’s basically unlimited money incentivizing companies to measure productivity and profitability, and to try out allegedly labour-saving technologies. After the AI bubble pops (which it inevitably will, probably sometime within the next 5 years or so), this may become less true. But for now, companies are falling over themselves to try to implement and profit from LLMs and generative AI tools. So, funding to test AI performance in real world contexts is currently in abundant supply.
The economist Tyler Cowen linked to my post on self-driving cars, so it ended up getting a lot more readers than I ever expected. I hope that more people now realize, at the very least, that self-driving cars are not an uncontroversial, uncomplicated AI success story. In discussions around AGI, people often say things along the lines of: 'deep learning solved self-driving cars, so surely it will be able to solve many other problems'. In fact, the lesson to draw is the opposite: self-driving is too hard a problem for the current cutting edge in deep learning (and deep reinforcement learning), and this should make us think twice before cavalierly proclaiming that deep learning will soon be able to master even more complex, more difficult tasks than driving.
Thanks.
Unfortunately, patient philanthropy is the sort of topic where it seems like what a person thinks about it depends a lot on some combination of a) their intuitions about a few specific things and b) a few fundamental, worldview-level assumptions. I say "unfortunately" because this means disagreements are hard to meaningfully debate.
For instance, there are places where the argument either pro or con depends on what a particular number is, and since we don’t know what that number actually is and can’t find out, the best we can do is make something up. (For example, whether, in what way, and by how much foundations created today will decrease in efficacy over long timespans.)
Many people in the EA community are content to say, e.g., the chance of something is 0.5% rather than 0.05% or 0.005%, and rather than 5% or 50%, based simply on an intuitive judgment, and then make life-altering, aspirationally world-altering decisions based on that. My approach is more similar to the approach of mainstream academic publishing, in which if you can’t rigorously justify a number, you can’t use it in your argument; it isn’t admissible.
So, this is a deeper epistemological, philosophical, or methodological point.
One piece of evidence that supports my skepticism of numbers derived from intuition is a forecasting exercise where a minor difference in how the question was framed changed the number people gave by 5-6 orders of magnitude (750,000x). And that’s only one minor difference in framing. If different people disagree on multiple major, substantive considerations relevant to deriving a number, perhaps in some cases their numbers could differ by much more. If we can’t agree on whether a crucial number is a million times higher or lower, how constructive are such discussions going to be? Can we meaningfully say we are producing knowledge in such instances?
So, my preferred approach when an argument depends on an unknowable number is to stop the argument right there, or at least slow it down and proceed with caution. And the more of these numbers an argument depends on, the more I think the argument just can’t meaningfully support its conclusion, and, therefore, should not move us to think or act differently.
I’m only giving this topic a very cursory treatment, so I apologize for that.
I wrote this post quickly without much effort or research, and it’s just intended as a casual forum post, not anything approaching the level of an academic paper.
I’m not sure whether you’re content to make a narrow, technical, abstract point — that’s fine if so, but not what I intended to discuss here — or whether you’re trying to make a full argument that patient philanthropy is something we should actually do in practice. The latter sort of argument (which is what I wanted to address in this post) opens up a lot of considerations that the former does not.
There are many things that can’t be meaningfully modelled with real data, such as:
What’s the probability that patient philanthropy will be outlawed even in countries like England if patient philanthropic foundations try to use it to accumulate as much wealth and power as simple extrapolation implies? (My guess: ~100%.)
What’s the probability that patient philanthropy, if it’s not outlawed, would eventually contribute significantly to repugnant, evil outcomes like illiberalism, authoritarianism, plutocracy, oligarchy, and so on? (My guess: ~100%. So, patient philanthropy should be considered a catastrophic risk in any countries where it is adopted.)
What’s the risk that patient philanthropic foundations based in Western, developed countries like England, holding money on behalf of recipients in developing countries such as those in sub-Saharan Africa, would do a worse job than if those same foundations (or some equivalent, counterpart, or substitute institution or intervention) were based in the recipient countries, with majority control by people from the recipient countries? (My guess: the risk is high enough that it’s preferable to move the money from the donor countries to the recipient countries from the outset.)
How much do we value things like freedom, autonomy, equality, empowerment, democracy, non-paternalism, and so on? How much do we value them on consequentialist grounds? Do we value them at all on non-consequentialist grounds? How does the importance of these considerations compare to the importance of other measures of impact such as the cost per life saved or the cost per QALY or DALY or similar measures? (My opinion: even just on consequentialist grounds alone, there are incredibly strong reasons to value these things, such that narrow cost-effectiveness calculations of the GiveWell style can’t hope to capture the full picture of what’s important.)
Under what assumptions about the future does the case for patient philanthropy break down? E.g., what do you have to assume about AGI or transformative AI? What do you have to assume about economic development in poor countries? Etc. (And how should we handle the uncertainty around this?)
What difference do philosophical assumptions make, such as a more deterministic view of history versus a view that places much greater emphasis on the agency, responsibility, and power of individuals and organizations? (My hunch: the latter makes certain arguments one might make for doing patient philanthropy in practice less attractive.)
These questions might all be irrelevant to what you want to say about patient philanthropy, but I think they are the sort of questions we have to consider if we are wondering about whether to actually do patient philanthropy in practice.
I was more hopeful when I wrote this post that it would be possible to talk meaningfully about patient philanthropy in a more narrow, technical, abstract way. After discussing it with Jason and others, I realize the possibility space is far too large for that: we end up essentially discussing anything that anyone imagines might plausibly happen in the distant future, as well as fundamental differences in worldviews. It’s impossible to avoid messier, less elegant arguments, including highly uncertain speculation about future scenarios and arguments of a philosophical, moral, social, and political nature.
I want to clarify that I wasn’t trying to respond directly to your work or do it justice; rather, I was trying to address a more general question about whether we should actually do patient philanthropy in practice, all things considered. I cited you as the originator of patient philanthropy because it’s important to cite where ideas come from, but I hope I didn’t give readers the impression that I was trying to represent your work fully or give it a fair shake. I wasn’t really doing that; I was just using it as a jumping-off point for a broader discussion. I apologize if I didn’t make that clear enough in the post, and I could edit it if that needs to be made clearer.
That’s an important point of clarification, thanks. I always appreciate your comments, Mr. Denkenberger.
There’s the idea of economic stimulus. John Maynard Keynes said that it would be better to spend stimulus money on useful projects (e.g. building houses), but as an intellectual provocation to illustrate his point, he said that if there were no better option, the government should pay people to dig holes in the ground and fill them back up again. Stimulating the economy is a goal in its own right, distinct from whatever the money is directly spent to accomplish.
AI spending is an economic stimulus. Even if the data centres sit idle and never do anything economically valuable or useful (the equivalent of holes dug in the ground and then filled back up again), it could have a temporarily favourable effect on the economy and help prevent a recession. That seems like it’s probably been true so far. The U.S. economy looks recessionary if you subtract the AI numbers.
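For what it’s worth, the arithmetic behind "looks recessionary if you subtract the AI numbers" is just a growth decomposition. Here’s a minimal sketch with purely illustrative figures, not real GDP statistics:

```python
# Purely illustrative figures, not actual U.S. GDP statistics.
headline_growth = 2.0   # headline annualized real GDP growth, in percent
ai_contribution = 1.8   # hypothetical contribution of AI-related investment,
                        # in percentage points of that growth

ex_ai_growth = headline_growth - ai_contribution
print(f"growth excluding AI-related investment: {ex_ai_growth:.1f}%")  # 0.2%
```

If the second number is close to or larger than the first, the non-AI economy is roughly flat or shrinking, which is what people mean when they say the AI boom is masking a recession.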
However, we have to consider the counterfactual. If investors hadn’t put all this money into AI, what would have happened? Of course, it’s hard to say. Maybe they just would have sat on their money, in which case the stimulus wouldn’t have happened, and maybe a recession would have begun by now. That’s possible. Alternatively, investors might have found a better use for their money and put it into more productive investments.
Regardless of what happens in the future, I don’t know if we’ll ever be able to know for sure what would have happened if there hadn’t been this AI investment craze. So, who knows.
(I think there are many things to invest in that would have been better choices than AI, but the question is whether, in a counterfactual scenario without the current AI exuberance, investors actually would have gone for any of them. Would they have invested enough in other things to stimulate the economy enough to avoid a recession?)
The stronger point, in my opinion, is that I don’t think anyone would actually defend spending on data centres just as an economic stimulus, rather than as an investment with an ROI equal to or better than that of other investments. So, the general rule we all agree we want to follow is: invest in things with a good ROI, and don’t just dig and fill up holes for the sake of stimulus. Maybe there are cases where large investment bubbles prevent recessions, but no one would ever argue: hey, we should promote investment bubbles when growth is sluggish to prevent recessions! Even if there are one-off instances where that gambit pays off, statistically, overall, over the long term, that’s going to be a losing strategy.[1]
Only semi-relatedly, I’m fond of rule consequentialism as an alternative to act consequentialism. Leaving aside really technical and abstract considerations about which theory is better or more correct, I think, in practice, following the procedure 'follow the rule that will overall lead to the best consequences over the set of all acts' is a better idea than the procedure 'choose the act that will lead to the best consequences in this instance'. Given realistic ideas about how humans actually think, feel, and behave in real life situations, I think the 'follow the rule' procedure tends to lead to better outcomes than the 'choose the act' procedure. The 'choose the act' procedure all too easily opens the door to motivated reasoning or just sloppy reasoning, and sometimes gives people, in their minds, a moral license to embrace evil or madness.
The necessary caveat: of course, life is more complicated than either of these procedures allows, and there’s a lot of discernment that needs to be applied on a case-by-case basis. (E.g., just individuating acts and categories of acts and deciding which rules apply to the situation you find yourself in is complicated enough. And there are rare, exceptional circumstances in which the normal rules might not make sense anymore.)
Whenever someone tries to justify something that seems crazy or wrong (something deceptive, manipulative, or Machiavellian) on consequentialist grounds, I always see the same sorts of flaws in the reasoning. You mostly see this in fiction, but it also happens on rare occasions in real life (and unfortunately sometimes in mild forms in the EA community). The choice is typically presented as a false binary: e.g., spend $100 billion on AI data centres as an economic stimulus or do nothing.
This type of thinking overlooks that the number of possible options is almost always immensely large, and is mostly filled up by options you can’t currently imagine. People are creative and intelligent to the point of being unpredictable by you (or by anyone), so you simply can’t anticipate the alternative options that might arise if you don’t ram through your 'for the greater good' plan. But, anyway, that’s a big philosophical digression.
I typically don’t agree with much that Dwarkesh Patel, a popular podcaster, says about AI,[1] but his recent Substack post makes several incisive points, such as:
Somehow this automated researcher is going to figure out the algorithm for AGI - a problem humans have been banging their head against for the better part of a century - while not having the basic learning capabilities that children have? I find this super implausible.
Yes, exactly. The idea of a non-AGI AI researcher inventing AGI is a skyhook. It’s pulling yourself up by your bootstraps, a borderline supernatural idea. It’s retrocausal. It just doesn’t make sense.
There are more great points in the post besides that, such as:
Currently the labs are trying to bake in a bunch of skills into these models through “mid-training” - there’s an entire supply chain of companies building RL environments which teach the model how to navigate a web browser or use Excel to write financial models.
Either these models will soon learn on the job in a self directed way - making all this pre-baking pointless - or they won’t - which means AGI is not imminent. Humans don’t have to go through a special training phase where they need to rehearse every single piece of software they might ever need to use.
… You don’t need to pre-bake the consultant’s skills at crafting Powerpoint slides in order to automate Ilya [Sutskever, an AI researcher]. So clearly the labs’ actions hint at a world view where these models will continue to fare poorly at generalizing and on-the-job learning, thus making it necessary to build in the skills that they hope will be economically valuable.
And:
It is not possible to automate even a single job by just baking in some predefined set of skills, let alone all the jobs.
We are in an AI bubble, and AGI hype is totally misguided.
There are some important things I disagree with in Dwarkesh's post, too. For example, he says that AI has solved "general understanding, few shot learning, [and] reasoning", but AI has absolutely not solved any of those things.
Models lack general understanding, and the best way to see that is that they can’t do much that’s useful in complex, real-world contexts, which is one of the points Dwarkesh is making in the post. Few-shot learning only works well in situations where a model has already been trained on a huge number of similar examples. The "reasoning" in "reasoning models" is, in Melanie Mitchell's terminology, a wishful mnemonic. In other words, just naming an AI system something doesn't mean it can actually do the thing it's named after. If Meta renamed Llama 5 to Superintelligence 1, that wouldn't make Llama 5 a superintelligence.
I also think Dwarkesh is astronomically too optimistic about how economically impactful AI will be by 2030. And he's overfocusing on continual learning as the only research problem that needs to be solved, to the neglect of others.
Dwarkesh's point about the variance in the value of human labour and the O-ring theory in economics also doesn't seem to make sense, if I'm understanding his point correctly. If we had AI models that were genuinely as intelligent as the median human, the economic effects would be completely disruptive and transformative in much the way Dwarkesh describes earlier in the post. General intelligence at the level of the median human would be enough to automate a lot of knowledge work.
The idea that you need AI systems equivalent to the top percentile of humans in intelligence or skill or performance or whatever before you can start automating knowledge work doesn't make sense, since most knowledge workers aren't in the top percentile of humans. This is such an obvious point that I worry I'm just misunderstanding the point Dwarkesh was trying to make.