Yarrow Bouchard🔸

991 karma · Joined · Canada · medium.com/@strangecosmos

Bio

Pronouns: she/her or they/them. 

I got interested in effective altruism back before it was called effective altruism, back before Giving What We Can had a website. Later on, I got involved in my university EA group and helped run it for a few years. Now I’m trying to figure out where effective altruism can fit into my life these days and what it means to me.

Comments (264)

Topic contributions (1)

And, indeed, this seems to show that your accusation was false: there was no attempt to hide the post after you brought it up. An apology wouldn't hurt!

The other false accusation was that I didn't cite any sources, when in fact I did in the very first sentence of my quick take. Apart from that, I also directly linked to an EA Forum post in my quick take. So, however you slice it, that accusation is wrong. Here, too, an apology wouldn't hurt if you want to signal good faith.

My offer is still open to provide sources for any one factual claim in the quick take if you want to challenge one of them. (But, as I said, I don't want to be here all day, so please keep it to one.)

Incidentally, in my opinion, that post supports my argument about anti-LGBT attitudes on LessWrong. I don't think I could have much success persuading LessWrong users of that, however, and that was not the intention of this quick take.  

Hey, so you're 0 for 2 on your accusations! Want to try again?

The sources are cited in quite literally the first sentence of the quick take.

To my knowledge, every specific factual claim I made is true. If you want to challenge one specific factual claim, I would be willing to provide sources for that one claim. But I don’t want to be here all day.

Since your bio suggests you have access to LessWrong’s logs, are you able to check when and by whom that LessWrong post was moved to drafts? Specifically, whether it was moved to drafts after your comment rather than before, and, if so, whether it was moved by the user who posted it rather than by a site admin or moderator?

I’m not sure if you’re asking about the METR graph on task length or about the practical use of AI coding assistants, which the METR study found is currently negative.

If I understand it correctly, the METR graph doesn’t measure an exponentially decreasing failure rate; it just tracks the length of tasks models can complete at a fixed 50% failure rate. (There’s also a version of the graph with a 20% failure rate, but that’s not the one people typically cite.)
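(For concreteness, here’s a rough sketch of how a 50%-success time horizon of that kind can be estimated from per-task results. The data and the logistic-on-log-task-length model are just my illustrative assumptions, not METR’s actual code or pipeline.)

```python
# Illustrative only: estimating the task length at which a model succeeds
# 50% of the time, from hypothetical per-task pass/fail results.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: human completion time (minutes) and whether the model passed.
human_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
model_passed = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0])

# Model success probability as a logistic function of log task length.
X = np.log(human_minutes).reshape(-1, 1)
clf = LogisticRegression(C=1e6, max_iter=1000).fit(X, model_passed)

# The 50% horizon is the task length where the fitted logit crosses zero.
log_horizon = -clf.intercept_[0] / clf.coef_[0][0]
print(f"50% time horizon ≈ {np.exp(log_horizon):.0f} minutes")
```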

I also think the automatically graded tasks used in benchmarks don’t usually deserve to be called “software engineering,” or any other label that implies the tasks the LLM is doing are practically useful, economically valuable, or could actually substitute for work that humans get paid to do.

I think many of these LLM benchmarks measure such narrow things and such toy problems, which seem to be selected largely to make the benchmarks easier for LLMs, that they aren’t particularly meaningful.

Studies of real-world performance, like METR’s study of human coders using an AI coding assistant, are much more interesting and important. Although I find most LLM benchmarks practically meaningless for measuring AGI progress, I think practical performance in economically valuable contexts is much more meaningful.

My point in the above comment was just that an unambiguously useful AI coding assistant would not by itself be strong evidence for near-term AGI. AI systems mastering games like chess and Go is impressive and interesting and probably tells us something about AGI progress, but if someone had pointed to AlphaGo beating Lee Sedol as strong evidence that AGI would be created within 7 years of that point, they would have been wrong.

In other words, progress in AI probably tells us something about AGI progress, but just taking impressive results in AI and saying that implies AGI within 7 years isn’t correct, or at least it’s unsupported. Why 7 years and not 17 years or 77 years or 177 years?

If you assume whatever rate of progress you like, that will support any timeline you like based on any evidence you like, but, in my opinion, that’s no way to make an argument.

On the topic of betting and investing, it’s true that index funds have exposure to AI, and indeed I personally worry about how much exposure the S&P 500 has (global index funds that include small-cap stocks have less, but I don’t know how much less). My argument in the comment above is simply that if someone thought it was rational to bet some amount of money on AGI arriving within 7 years, then surely it would be rational to put that same amount of money into a 100% concentrated investment in AI and not, say, the S&P 500.

I'm not sure that Toby was wrong to work on this, but if he was, it's because if he hadn't, then someone else with more comparative advantage for working on this problem (due to lacking training or talent for philosophy) would have done so shortly afterwards.

How shortly? We're discussing this in October 2025. What's the newest piece of data that Toby's analysis is dependent on? Maybe the Grok 4 chart from July 2025? Or possibly qualitative impressions from the GPT-5 launch in August 2025? Who else is doing high-quality analysis of this kind and publishing it, even using older data? 

I guess I don't automatically buy the idea that even in a few months we'll see someone else independently go through the same reasoning steps as this post and come to the same conclusion. But there are plenty of people who could, in theory, do it, who are, in theory, motivated to do this kind of analysis, and who will probably not see this post (e.g. equity research analysts, journalists covering AI, and AI researchers and engineers independent of LLM companies).

I certainly don't buy the idea that if Toby hadn't done this analysis, then someone else in effective altruism would have done it. I don't see anybody else in effective altruism doing similar analysis. (I chalk that up largely to confirmation bias.)

Why do you think this work has less value than solving philosophical problems in AI safety? If LLM scaling is sputtering out, isn't that important to know? In fact, isn't it a strong contender for the most important fact about AI that could be known right now? 

I suppose you could ask why this work hasn't been done by somebody else already, and that's a really good question. For instance, why didn't anyone doing equity research or AI journalism notice this already?

Among people who are highly concerned about near-term AGI, I don't really expect such insights to be surfaced. There is strong confirmation bias. People tend to look for confirmation that AGI is coming soon and not for evidence against. So, I'm not surprised that Toby Ord is the first person within effective altruism to notice this. Most people aren't looking. But this doesn't explain why equity research analysts, AI journalists, or others who are interested in LLM scaling (such as AI researchers or engineers not working for one of the major LLM companies and not bound by an NDA) missed this. I am surprised an academic philosopher is the first person to notice this! And kudos to him for that!

Are you referring to the length of tasks that LLMs are able to complete with a 50% success rate? I don't see that as a meaningful indicator of AGI. Indeed, I would say it's practically meaningless. It just doesn't make sense as an indicator of progress toward AGI, and I find it strange that anyone thinks otherwise. Why should we see that metric as indicating AGI progress any more than, say, the length of LLMs' context windows?

I think a much more meaningful indicator from METR would be how much AI coding assistants speed up coding tasks for human coders. Currently, METR's finding is that they slow them down by 19%. But this evidence is asymmetric. Failing to clear a low bar like being an unambiguously useful coding assistant in such tests is strong evidence against models nearing human-level capabilities, but clearing a low bar is not strong evidence for models nearing human-level capabilities. By analogy, we might take an AI system being bad at chess as evidence that it has much less than human-level general intelligence. But we shouldn't take an AI system (such as Deep Blue or AlphaZero) being really good at chess as evidence that it has human-level or greater general intelligence.
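(To make the asymmetry concrete, here's a toy Bayes-factor calculation. The probabilities are made up purely for illustration; the point is just the shape of the update.)

```python
# Toy illustration with assumed probabilities: why failing a low bar is
# strong evidence against near-AGI, while clearing it is weak evidence for it.
p_clear_if_near_agi = 0.99  # a near-AGI system would almost surely speed coders up
p_clear_if_not = 0.60       # plenty of non-AGI tools could also clear this bar

# Bayes factor if the bar IS cleared (evidence for near-AGI): weak.
bf_cleared = p_clear_if_near_agi / p_clear_if_not             # ≈ 1.65
# Bayes factor if the bar is NOT cleared (evidence against near-AGI): strong.
bf_failed = (1 - p_clear_if_near_agi) / (1 - p_clear_if_not)  # ≈ 0.025

print(f"Bayes factor if cleared: {bf_cleared:.2f} (weak update toward near-AGI)")
print(f"Bayes factor if failed:  {bf_failed:.3f} (strong update against near-AGI)")
```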

If I wanted to settle for an indirect proxy for progress toward AGI, I could short companies like Nvidia, Microsoft, Google, or Meta (e.g. see my recent question about this), but, of course, those companies' stock prices don't directly measure AGI progress. Conversely, someone who wanted to take the other side of the bet could take a long position in those stocks. But then this isn't much of an improvement on the above. If LLMs became much more useful coding assistants, then this could help justify these companies' stock prices, but it wouldn't say much about progress toward AGI. Likewise for other repetitive, text-heavy tasks, like customer support via web chat.

It seems like the flip side should be different: if you do think AGI is very likely to be created within 7 years, shouldn't that imply a long position in stocks like Nvidia, Microsoft, Google, or Meta would be lucrative? In principle, you could believe that LLMs are some number of years away from being able to make a lot of money and at most 7 years away from progressing to AGI, and that the market will give up on LLMs making a lot of money just a few years too soon. But I would find this to be a strange and implausible view.

This is, of course, sensitive to your assumptions.

In principle, yes, but in a typical bet structure, there is no upside for the person taking the other side of that bet, so what would be the point of it for them? I would gladly accept a bet where someone has to pay me an amount of money on January 1, 2033 if AGI isn't created by then (and vice versa), but why would they accept that bet? There's only downside for them.

Sometimes these bets are structured as loans. As in, I would loan someone money and they would promise to pay me that money back plus a premium after 7 years. But I don’t want to give a stranger from another country a 7-year loan that I wouldn’t be able to compel them to repay once the time is up. From my point of view, that would just be me giving a cash gift to a stranger for no particularly good reason.
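(For illustration, with entirely made-up numbers, here's what such a loan-structured bet looks like from the lender's side, and why unenforceability dominates the math.)

```python
# Toy illustration with made-up numbers of a loan-structured bet and why
# counterparty risk dominates it for the lender.
principal = 1_000      # what I hand over today
repayment = 2_000      # what they promise to pay back in 7 years
years = 7

implied_annual_rate = (repayment / principal) ** (1 / years) - 1
print(f"Implied annual return if they actually repay: {implied_annual_rate:.1%}")  # ~10.4%

# If there's no way to compel repayment, the expected value depends entirely
# on the probability that a stranger voluntarily repays years from now.
for p_repay in (1.0, 0.5, 0.0):
    ev = p_repay * repayment - principal
    print(f"P(repay) = {p_repay:.0%}: expected value = ${ev:+,.0f}")
```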

There is Long Bets, which is a nice site, but since everything goes to charity, it’s largely symbolic. (Also, the money is paid up by both sides in advance, and the Long Now Foundation just holds onto it until the bet is resolved. So, it's a little bit wasteful in that respect. The money is tied up for the duration of the bet and there is a time value of money.)
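(As a rough illustration of that time-value cost, assuming the escrowed stake could otherwise earn 4% a year.)

```python
# Rough illustration with assumed numbers of the opportunity cost of
# escrowing a stake for the duration of a bet.
stake = 1_000          # each side's stake, held until the bet resolves
years = 7
assumed_rate = 0.04    # assumed annual return the money could earn elsewhere

forgone = stake * ((1 + assumed_rate) ** years - 1)
print(f"Forgone growth on a ${stake:,} stake over {years} years: ~${forgone:,.0f}")
# ~$316 at 4%/year: the 'time value of money' cost of tying the stake up.
```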
