Pronouns: she/her or they/them.
I got interested in effective altruism back before it was called effective altruism, back before Giving What We Can had a website. Later on, I got involved in my university EA group and helped run it for a few years. Now I’m trying to figure out where effective altruism can fit into my life these days and what it means to me.
The V-JEPA 2 abstract explains this:
A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.
Again, the caveat here is that this is Meta touting their own results, so I take it with a grain of salt.
I don't think higher scores on the benchmarks mentioned automatically imply progress on the underlying technical challenge. What matters more is the set of technical ideas behind V-JEPA 2 (Yann LeCun has explained the rationale for them) and where those ideas could ultimately go with further research.
I'm very skeptical of AI benchmarks in general because I tend to think they have poor construct validity, depending on how you interpret them: insofar as they are taken to measure cognitive abilities or aspects of general intelligence, they mostly fail to measure those things.
The clearest and crudest example to illustrate this point is LLM performance on IQ tests. The naive interpretation is that if an LLM scores above average on an IQ test, i.e., above 100, then it must have the cognitive properties a human has when they score above average on an IQ test; that is, such an LLM must be a general intelligence. But many LLMs, such as GPT-4 and Claude 3 Opus, score well above 100 on IQ tests. Are GPT-4 and Claude 3 Opus therefore AGIs? No, of course not. So IQ tests don't have construct validity when applied to LLMs, if you interpret them as measuring general intelligence in AI systems.
I don't think anybody really believes IQ tests actually prove LLMs are AGIs, which is why it's a useful example. But people often do use benchmarks to compare LLM intelligence to human intelligence based on similar reasoning. I don't think the reasoning is any more valid with those benchmarks than it is for IQ tests.
Benchmarks are useful for measuring certain things; I'm not trying to argue against narrow interpretations. I'm specifically arguing against the use of benchmarks to put general intelligence on a number line, such that a lower score on a benchmark means an AI system is further from general intelligence and a higher score means it is closer. This isn't valid with IQ tests and it isn't valid with most benchmarks.
Researchers can validly use benchmarks as measures of performance, but I want to warn against overbroad interpretations of benchmarks, as if they were scientific tests of cognitive ability or general intelligence, which they aren't.
Just one example of what I mean: if you show AI models an image of a 3D model of an object, such as a folding chair, in a typical pose, they will correctly classify the object 99.6% of the time. You might conclude that these AI models have a good visual understanding of these objects, of what they are and how they look. But if you simply rotate the 3D models into an atypical pose, such as showing the folding chair upside-down, object recognition accuracy drops to 67.1%. The error rate increases by a factor of about 82, from 0.4% to 32.9%. (Humans perform equally well regardless of whether the pose is typical or atypical — good robustness!)
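In case it helps to make this concrete, here is a rough sketch of the kind of perturbation check I have in mind. It is not the setup the study above used; `predict`, `test_set`, and the upside-down flip are placeholder stand-ins for whatever model, evaluation set, and perturbation you care about.

```python
# A rough sketch of a perturbation check (my illustration, not the study's actual
# setup): run the same evaluation twice, once on the original images and once on
# a perturbed version, then compare error rates rather than raw accuracy.

from PIL import Image

def error_rate(examples, predict, perturb=None):
    """Fraction of (image_path, true_label) pairs the model gets wrong."""
    wrong = 0
    for path, true_label in examples:
        image = Image.open(path)
        if perturb is not None:
            image = perturb(image)
        if predict(image) != true_label:
            wrong += 1
    return wrong / len(examples)

def upside_down(image):
    # Stand-in perturbation; the study rotated 3D models into atypical poses.
    return image.rotate(180)

# Hypothetical usage, with `test_set` and `predict` supplied by you:
# baseline = error_rate(test_set, predict)
# perturbed = error_rate(test_set, predict, perturb=upside_down)
# print("error-rate ratio:", perturbed / baseline)

# The arithmetic behind the roughly 82x figure quoted above:
print((1 - 0.671) / (1 - 0.996))  # ≈ 82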
Usually, when we measure AI performance on some dataset or some set of tasks, we don't do this kind of perturbation to test robustness. And this is just one way you can call the construct validity of benchmarks into question (at least when benchmarks are construed more broadly than their creators probably intend, as they often are).
Economic performance is a more robust test of AI capabilities than almost anything else. However, it's also a harsh and unforgiving test, which doesn't allow us to measure early progress.
Possibly something like V-JEPA 2, but in that case I'm just going off of Meta touting its own results, and I would want to hear opinions from independent experts.
I didn’t say that pixel-to-pixel prediction or other low-level techniques haven’t made incremental progress. I said that this approach is ultimately forlorn — if the goal is human-level computer vision for robotics applications or AGI that can see — and that LLMs haven’t made any progress on alternative approaches.
Unfortunately, some people use the karma downvote as a disagree vote, even though it’s supposed to be used to indicate the quality of a contribution, rather than whether you agree or disagree.
If you see a comment that you disagree with but that is civil and attempts to make a constructive contribution to the discussion, ideally you should give it a disagree vote and either not karma vote it or karma upvote it. But some people will karma downvote it anyway.
My instantaneous, knee-jerk reaction (so take it with a grain of salt) is that the Red Queen Bio co-founder’s responses are satisfactory and reassuring. Your concerns are based on an unsourced rumour and speculation, which are always in unlimited supply and don’t warrant a response from a company in every case.
You also don’t seem to be updating rationally on the responses you are receiving, but just doubling down on your original hunch, which by now seems like it’s probably false.
Not all tweets merit a response, so it doesn’t matter whether they continue to answer your questions or not.
Self-driving cars are not close to being solved. Don’t take my word for it. Listen to Andrej Karpathy, the lead AI researcher responsible for the development of Tesla’s Full Self-Driving software from 2017 to 2022. (Karpathy also did two stints as a researcher at OpenAI, taught a deep learning course at Stanford, and coined the term "vibe coding".)
From Karpathy’s October 17, 2025 interview with Dwarkesh Patel:
Dwarkesh Patel 01:42:55
You’ve talked about how you were at Tesla leading self-driving from 2017 to 2022. And you firsthand saw this progress from cool demos to now thousands of cars out there actually autonomously doing drives. Why did that take a decade? What was happening through that time?
Andrej Karpathy 01:43:11
One thing I will almost instantly push back on is that this is not even near done, in a bunch of ways that I’m going to get to. Self-driving is very interesting because it’s definitely where I get a lot of my intuitions because I spent five years on it. It has this entire history where the first demos of self-driving go all the way to the 1980s. You can see a demo from CMU in 1986. There’s a truck that’s driving itself on roads.
Fast forward. When I was joining Tesla, I had a very early demo of Waymo. It basically gave me a perfect drive in 2014 or something like that, so a perfect Waymo drive a decade ago. It took us around Palo Alto and so on because I had a friend who worked there. I thought it was very close and then it still took a long time.
For some kinds of tasks and jobs and so on, there’s a very large demo-to-product gap where the demo is very easy, but the product is very hard. It’s especially the case in cases like self-driving where the cost of failure is too high. Many industries, tasks, and jobs maybe don’t have that property, but when you do have that property, that definitely increases the timelines.
For example, in software engineering, I do think that property does exist. For a lot of vibe coding, it doesn’t. But if you’re writing actual production-grade code, that property should exist, because any kind of mistake leads to a security vulnerability or something like that. Millions and hundreds of millions of people’s personal Social Security numbers get leaked or something like that. So in software, people should be careful, kind of like in self-driving. In self-driving, if things go wrong, you might get injured. There are worse outcomes. But in software, it’s almost unbounded how terrible something could be.
I do think that they share that property. What takes the long amount of time and the way to think about it is that it’s a march of nines. Every single nine is a constant amount of work. Every single nine is the same amount of work. When you get a demo and something works 90% of the time, that’s just the first nine. Then you need the second nine, a third nine, a fourth nine, a fifth nine. While I was at Tesla for five years or so, we went through maybe three nines or two nines. I don’t know what it is, but multiple nines of iteration. There are still more nines to go.
That’s why these things take so long. It’s definitely formative for me, seeing something that was a demo. I’m very unimpressed by demos. Whenever I see demos of anything, I’m extremely unimpressed by that. If it’s a demo that someone cooked up just to show you, it’s worse. If you can interact with it, it’s a bit better. But even then, you’re not done. You need the actual product. It’s going to face all these challenges when it comes in contact with reality and all these different pockets of behavior that need patching.
We’re going to see all this stuff play out. It’s a march of nines. Each nine is constant. Demos are encouraging. It’s still a huge amount of work to do. It is a critical safety domain, unless you’re doing vibe coding, which is all nice and fun and so on. That’s why this also enforced my timelines from that perspective.
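To put rough numbers on the “march of nines” (my own toy illustration, not Karpathy’s figures): each additional nine of reliability may cost a comparable amount of engineering, but it only divides the residual failure rate by ten.

```python
import math

# Toy illustration of the "march of nines" (my numbers, not Karpathy's): going
# from 90% to 99% to 99.9% reliability means each step divides the residual
# failure rate by 10, even if each step takes a comparable amount of work.

def nines(reliability: float) -> float:
    """Count the nines in a reliability level, e.g. 0.999 -> 3."""
    return -math.log10(1 - reliability)

for r in [0.9, 0.99, 0.999, 0.9999, 0.99999]:
    failures = 100_000 * (1 - r)  # expected failures per 100,000 drives
    print(f"reliability {r}: {nines(r):.0f} nine(s), ~{failures:.0f} failures per 100,000 drives")
```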
Karpathy elaborated later in the interview:
The other aspect that I wanted to return to is that self-driving cars are nowhere near done still. The deployments are pretty minimal. Even Waymo and so on has very few cars. They’re doing that roughly speaking because they’re not economical. They’ve built something that lives in the future. They’ve had to pull back the future, but they had to make it uneconomical. There are all these costs, not just marginal costs for those cars and their operation and maintenance, but also the capex of the entire thing. Making it economical is still going to be a slog for them.
Also, when you look at these cars and there’s no one driving, I actually think it’s a little bit deceiving because there are very elaborate teleoperation centers of people kind of in a loop with these cars. I don’t have the full extent of it, but there’s more human-in-the-loop than you might expect. There are people somewhere out there beaming in from the sky. I don’t know if they’re fully in the loop with the driving. Some of the time they are, but they’re certainly involved and there are people. In some sense, we haven’t actually removed the person, we’ve moved them to somewhere where you can’t see them.
I still think there will be some work, as you mentioned, going from environment to environment. There are still challenges to make self-driving real. But I do agree that it’s definitely crossed a threshold where it kind of feels real, unless it’s really teleoperated. For example, Waymo can’t go to all the different parts of the city. My suspicion is that it’s parts of the city where you don’t get good signal. Anyway, I don’t know anything about the stack. I’m just making stuff up.
Dwarkesh Patel 01:50:23
You led self-driving for five years at Tesla.
Andrej Karpathy 01:50:27
Sorry, I don’t know anything about the specifics of Waymo. By the way, I love Waymo and I take it all the time. I just think that people are sometimes a little bit too naive about some of the progress and there’s still a huge amount of work. Tesla took in my mind a much more scalable approach and the team is doing extremely well. I’m kind of on the record for predicting how this thing will go. Waymo had an early start because you can package up so many sensors. But I do think Tesla is taking the more scalable strategy and it’s going to look a lot more like that. So this will still have to play out and hasn’t. But I don’t want to talk about self-driving as something that took a decade because it didn’t take it yet, if that makes sense.
Dwarkesh Patel 01:51:08
Because one, the start is at 1980 and not 10 years ago, and then two, the end is not here yet.
Andrej Karpathy 01:51:14
The end is not near yet because when we’re talking about self-driving, usually in my mind it’s self-driving at scale. People don’t have to get a driver’s license, etc.
I hope the implication for discussions around AGI timelines is clear.
For an optimistic take, I loved this video from Simon Clark, an atmospheric physics PhD and climate science educator.
I used to like Grammarly for checking spelling, grammar, punctuation, and other copy-editing issues, but it seems like it’s gone downhill since it switched to LLM-based software. Google Docs is decent for catching basic things like typos, accidentally missing a word, accidentally repeating a word, subject/verb agreement, etc.
I actually don’t agree with the LLM’s changes in the two examples you mentioned, and I think it made the writing worse in both cases. The LLM’s diction is staid and corporate; it lacks energy.