Pronouns: she/her or they/them.
I got interested in effective altruism back before it was called effective altruism, back before Giving What We Can had a website. Later on, I got involved in my university EA group and helped run it for a few years. Now I’m trying to figure out where effective altruism can fit into my life these days and what it means to me.
The V-JEPA 2 abstract explains this:
A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.
Again, the caveat here is that this is Meta touting their own results, so I take it with a grain of salt.
I don't think higher scores on the benchmarks mentioned automatically imply progress on the underlying technical challenge. What matters more is the set of technical ideas behind V-JEPA 2 (Yann LeCun has explained the rationale for them) and where those ideas could ultimately go with further research.
I'm very skeptical of AI benchmarks in general because I tend to think they have poor construct validity, depending on how you interpret them: insofar as they are taken to measure cognitive abilities or aspects of general intelligence, they mostly fail to measure those things.
The clearest and crudest example to illustrate this point is LLM performance on IQ tests. The naive interpretation is that if an LLM scores above average on an IQ test, i.e., above 100, then it must have the cognitive properties a human has when they score above average on an IQ test; that is, such an LLM must be a general intelligence. But many LLMs, such as GPT-4 and Claude 3 Opus, score well above 100 on IQ tests. Are GPT-4 and Claude 3 Opus therefore AGIs? No, of course not. So IQ tests don't have construct validity when applied to LLMs, if you interpret them as measuring general intelligence in AI systems.
I don't think anybody really believes IQ tests actually prove LLMs are AGIs, which is why it's a useful example. But people often do use benchmarks to compare LLM intelligence to human intelligence based on similar reasoning. I don't think the reasoning is any more valid with those benchmarks than it is for IQ tests.
Benchmarks are useful for measuring certain things; I'm not trying to argue against narrow interpretations. I'm specifically arguing against the use of benchmarks to put general intelligence on a number line, such that a lower score on a benchmark means an AI system is further from general intelligence and a higher score means it is closer. This isn't valid with IQ tests and it isn't valid with most benchmarks.
Researchers can validly use benchmarks as measures of performance, but I want to warn against overbroad interpretations of benchmarks, as if they were scientific tests of cognitive ability or general intelligence, which they aren't.
Just one example of what I mean: if you show AI models an image of a 3D model of an object, such as a folding chair, in a typical pose, they will correctly classify the object 99.6% of the time. You might conclude that these AI models have a good visual understanding of these objects, of what they are and how they look. But if you simply rotate the 3D models into an atypical pose, such as showing the folding chair upside-down, object recognition accuracy drops to 67.1%. The error rate increases by a factor of about 82, from 0.4% to 32.9%. (Humans perform equally well regardless of whether the pose is typical or atypical — good robustness!)
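In case it helps to make this concrete, here is a rough sketch of the kind of perturbation check I have in mind. It is not the setup the study above used; `predict`, `test_set`, and the upside-down flip are placeholder stand-ins for whatever model, evaluation set, and perturbation you care about.

```python
# A rough sketch of a perturbation check (my illustration, not the study's actual
# setup): run the same evaluation twice, once on the original images and once on
# a perturbed version, then compare error rates rather than raw accuracy.

from PIL import Image

def error_rate(examples, predict, perturb=None):
    """Fraction of (image_path, true_label) pairs the model gets wrong."""
    wrong = 0
    for path, true_label in examples:
        image = Image.open(path)
        if perturb is not None:
            image = perturb(image)
        if predict(image) != true_label:
            wrong += 1
    return wrong / len(examples)

def upside_down(image):
    # Stand-in perturbation; the study rotated 3D models into atypical poses.
    return image.rotate(180)

# Hypothetical usage, with `test_set` and `predict` supplied by you:
# baseline = error_rate(test_set, predict)
# perturbed = error_rate(test_set, predict, perturb=upside_down)
# print("error-rate ratio:", perturbed / baseline)

# The arithmetic behind the roughly 82x figure quoted above:
print((1 - 0.671) / (1 - 0.996))  # ≈ 82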
Usually, when we measure AI performance on some dataset or some set of tasks, we don't do this kind of perturbation to test robustness. And this is just one way you can call the construct validity of benchmarks into question (at least when benchmarks are construed more broadly than their creators probably intend, as they often are).
Economic performance is a more robust test of AI capabilities than almost anything else. However, it's also a harsh and unforgiving test, which doesn't allow us to measure early progress.
Possibly something like V-JEPA 2, but in that case I'm just going off of Meta touting its own results, and I would want to hear opinions from independent experts.
I didn’t say that pixel-to-pixel prediction or other low-level techniques haven’t made incremental progress. I said that this approach is ultimately forlorn — if the goal is human-level computer vision for robotics applications or AGI that can see — and that LLMs haven’t made any progress on alternative approaches.
Unfortunately, some people use the karma downvote as a disagree vote, even though it’s supposed to be used to indicate the quality of a contribution, rather than whether you agree or disagree.
If you see a comment that you disagree with but that is civil and attempts to make a constructive contribution to the discussion, ideally you should give it a disagree vote and either not karma vote it or karma upvote it. But some people will karma downvote it anyway.
My instantaneous, knee-jerk reaction (so take it with a grain of salt) is that the Red Queen Bio co-founder’s responses are satisfactory and reassuring. Your concerns are based on an unsourced rumour and speculation, which are always in unlimited supply and don’t warrant a response from a company in every case.
You also don’t seem to be updating rationally on the responses you are receiving, but just doubling down on your original hunch, which by now seems like it’s probably false.
Not all tweets merit a response, so it doesn’t matter whether they continue to answer your questions or not.
Self-driving cars are not close to being solved. Don’t take my word for it. Listen to Andrej Karpathy, the lead AI researcher responsible for the development of Tesla’s Full Self-Driving software from 2017 to 2022. (Karpathy also did two stints as a researcher at OpenAI, taught a deep learning course at Stanford, and coined the term "vibe coding".)
From Karpathy’s October 17, 2025 interview with Dwarkesh Patel:
Dwarkesh Patel 01:42:55
You’ve talked about how you were at Tesla leading self-driving from 2017 to 2022. And you firsthand saw this progress from cool demos to now thousands of cars out there actually autonomously doing drives. Why did that take a decade? What was happening through that time?
Andrej Karpathy 01:43:11
One thing I will almost instantly push back on is that this is not even near done, in a bunch of ways that I’m going to get to. Self-driving is very interesting because it’s definitely where I get a lot of my intuitions because I spent five years on it. It has this entire history where the first demos of self-driving go all the way to the 1980s. You can see a demo from CMU in 1986. There’s a truck that’s driving itself on roads.
Fast forward. When I was joining Tesla, I had a very early demo of Waymo. It basically gave me a perfect drive in 2014 or something like that, so a perfect Waymo drive a decade ago. It took us around Palo Alto and so on because I had a friend who worked there. I thought it was very close and then it still took a long time.
For some kinds of tasks and jobs and so on, there’s a very large demo-to-product gap where the demo is very easy, but the product is very hard. It’s especially the case in cases like self-driving where the cost of failure is too high. Many industries, tasks, and jobs maybe don’t have that property, but when you do have that property, that definitely increases the timelines.
For example, in software engineering, I do think that property does exist. For a lot of vibe coding, it doesn’t. But if you’re writing actual production-grade code, that property should exist, because any kind of mistake leads to a security vulnerability or something like that. Millions and hundreds of millions of people’s personal Social Security numbers get leaked or something like that. So in software, people should be careful, kind of like in self-driving. In self-driving, if things go wrong, you might get injured. There are worse outcomes. But in software, it’s almost unbounded how terrible something could be.
I do think that they share that property. What takes the long amount of time and the way to think about it is that it’s a march of nines. Every single nine is a constant amount of work. Every single nine is the same amount of work. When you get a demo and something works 90% of the time, that’s just the first nine. Then you need the second nine, a third nine, a fourth nine, a fifth nine. While I was at Tesla for five years or so, we went through maybe three nines or two nines. I don’t know what it is, but multiple nines of iteration. There are still more nines to go.
That’s why these things take so long. It’s definitely formative for me, seeing something that was a demo. I’m very unimpressed by demos. Whenever I see demos of anything, I’m extremely unimpressed by that. If it’s a demo that someone cooked up just to show you, it’s worse. If you can interact with it, it’s a bit better. But even then, you’re not done. You need the actual product. It’s going to face all these challenges when it comes in contact with reality and all these different pockets of behavior that need patching.
We’re going to see all this stuff play out. It’s a march of nines. Each nine is constant. Demos are encouraging. It’s still a huge amount of work to do. It is a critical safety domain, unless you’re doing vibe coding, which is all nice and fun and so on. That’s why this also enforced my timelines from that perspective.
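To put rough numbers on the “march of nines” (my own toy illustration, not Karpathy’s figures): each additional nine of reliability may cost a comparable amount of engineering, but it only divides the residual failure rate by ten.

```python
import math

# Toy illustration of the "march of nines" (my numbers, not Karpathy's): going
# from 90% to 99% to 99.9% reliability means each step divides the residual
# failure rate by 10, even if each step takes a comparable amount of work.

def nines(reliability: float) -> float:
    """Count the nines in a reliability level, e.g. 0.999 -> 3."""
    return -math.log10(1 - reliability)

for r in [0.9, 0.99, 0.999, 0.9999, 0.99999]:
    failures = 100_000 * (1 - r)  # expected failures per 100,000 drives
    print(f"reliability {r}: {nines(r):.0f} nine(s), ~{failures:.0f} failures per 100,000 drives")
```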
Karpathy elaborated later in the interview:
The other aspect that I wanted to return to is that self-driving cars are nowhere near done still. The deployments are pretty minimal. Even Waymo and so on has very few cars. They’re doing that roughly speaking because they’re not economical. They’ve built something that lives in the future. They’ve had to pull back the future, but they had to make it uneconomical. There are all these costs, not just marginal costs for those cars and their operation and maintenance, but also the capex of the entire thing. Making it economical is still going to be a slog for them.
Also, when you look at these cars and there’s no one driving, I actually think it’s a little bit deceiving because there are very elaborate teleoperation centers of people kind of in a loop with these cars. I don’t have the full extent of it, but there’s more human-in-the-loop than you might expect. There are people somewhere out there beaming in from the sky. I don’t know if they’re fully in the loop with the driving. Some of the time they are, but they’re certainly involved and there are people. In some sense, we haven’t actually removed the person, we’ve moved them to somewhere where you can’t see them.
I still think there will be some work, as you mentioned, going from environment to environment. There are still challenges to make self-driving real. But I do agree that it’s definitely crossed a threshold where it kind of feels real, unless it’s really teleoperated. For example, Waymo can’t go to all the different parts of the city. My suspicion is that it’s parts of the city where you don’t get good signal. Anyway, I don’t know anything about the stack. I’m just making stuff up.
Dwarkesh Patel 01:50:23
You led self-driving for five years at Tesla.
Andrej Karpathy 01:50:27
Sorry, I don’t know anything about the specifics of Waymo. By the way, I love Waymo and I take it all the time. I just think that people are sometimes a little bit too naive about some of the progress and there’s still a huge amount of work. Tesla took in my mind a much more scalable approach and the team is doing extremely well. I’m kind of on the record for predicting how this thing will go. Waymo had an early start because you can package up so many sensors. But I do think Tesla is taking the more scalable strategy and it’s going to look a lot more like that. So this will still have to play out and hasn’t. But I don’t want to talk about self-driving as something that took a decade because it didn’t take it yet, if that makes sense.
Dwarkesh Patel 01:51:08
Because one, the start is at 1980 and not 10 years ago, and then two, the end is not here yet.
Andrej Karpathy 01:51:14
The end is not near yet because when we’re talking about self-driving, usually in my mind it’s self-driving at scale. People don’t have to get a driver’s license, etc.
I hope the implication for discussions around AGI timelines is clear.
For an optimistic take, I loved this video from Simon Clark, an atmospheric physics PhD and climate science educator.
I used to like Grammarly for checking spelling, grammar, punctuation, and other copy-editing issues, but it seems like it’s gone downhill since it switched to LLM-based software. Google Docs is decent for catching basic things like typos, accidentally missing a word, accidentally repeating a word, subject/verb agreement, etc.
I actually don’t agree with the LLM’s changes in the two examples you mentioned, and I think it made the writing worse in both cases. The LLM’s diction is staid and corporate; it lacks energy.