This is François Chollet's keynote talk at the AGI-24 conference in Seattle, Washington in August 2024. 

Chollet is an AI researcher who may be best known for creating the deep learning library Keras. He did deep learning research and software development for Keras at Google for nine years before leaving recently to create his own AI startup. In September 2024, Time named him one of the 100 most influential people in AI. 

In this talk, Chollet describes what he sees as the fundamental weaknesses in large language models (LLMs), the flaws in the commonly used benchmarks for LLM performance, and argues why LLMs are incapable of scaling to artificial general intelligence (AGI). He also argues that apparent progress by LLMs in some of their weak areas is the result of superficial, brittle fixes by human annotators, which is a non-scalable and labour-intensive approach. 

Chollet's opinion that LLMs won't scale to AGI appears to be the view of a majority of AI experts. A March 2025 report from the Association for the Advancement of Artificial Intelligence (AAAI) found the following after surveying 475 AI experts (page 63): 

The majority of respondents (76%) assert that “scaling up current AI approaches” to yield AGI is “unlikely” or “very unlikely” to succeed, suggesting doubts about whether current machine learning paradigms are sufficient for achieving general intelligence.

The approach to AGI that Chollet favours is a combination of deep learning and program synthesis


Related post: ARC-AGI-2 Overview With François Chollet

6

0
0

Reactions

0
0
Comments2


Sorted by Click to highlight new comments since:

With Chollet acknowledging that o1/o3 (and ARC 1 getting beaten) was a significant breakthrough, how much is this talk now outdated vs still relevant?

I think it’s still very relevant! I don’t think this talk’s relevance has diminished. It’s just important to also have that more recent information about o3 in addition to what’s in this talk. (That’s why I linked the other talk at the bottom of this post.)

By the way, I think it’s just o3 and not o1 that achieves the breakthrough results on ARC-AGI-1. It looks like o1 only gets 32% on ARC-AGI-1, whereas the lower-compute version of o3 gets around 76% and the higher-compute version gets around 87%.

The lower-compute version of o3 only gets 4% on ARC-AGI-2 in partial testing (full testing has not yet been done) and the higher-compute version has not yet been tested.

Chollet speculates in this blog post about how o3 works (I don’t think OpenAI has said much about this) and how that fits in to his overall thinking about LLMs and AGI:

Why does o3 score so much higher than o1? And why did o1 score so much higher than GPT-4o in the first place? I think this series of results provides invaluable data points for the ongoing pursuit of AGI.

My mental model for LLMs is that they work as a repository of vector programs. When prompted, they will fetch the program that your prompt maps to and "execute" it on the input at hand. LLMs are a way to store and operationalize millions of useful mini-programs via passive exposure to human-generated content.

This "memorize, fetch, apply" paradigm can achieve arbitrary levels of skills at arbitrary tasks given appropriate training data, but it cannot adapt to novelty or pick up new skills on the fly (which is to say that there is no fluid intelligence at play here.) This has been exemplified by the low performance of LLMs on ARC-AGI, the only benchmark specifically designed to measure adaptability to novelty – GPT-3 scored 0, GPT-4 scored near 0, GPT-4o got to 5%. Scaling up these models to the limits of what's possible wasn't getting ARC-AGI numbers anywhere near what basic brute enumeration could achieve years ago (up to 50%).

To adapt to novelty, you need two things. First, you need knowledge – a set of reusable functions or programs to draw upon. LLMs have more than enough of that. Second, you need the ability to recombine these functions into a brand new program when facing a new task – a program that models the task at hand. Program synthesis. LLMs have long lacked this feature. The o series of models fixes that.

For now, we can only speculate about the exact specifics of how o3 works. But o3's core mechanism appears to be natural language program search and execution within token space – at test time, the model searches over the space of possible Chains of Thought (CoTs) describing the steps required to solve the task, in a fashion perhaps not too dissimilar to AlphaZero-style Monte-Carlo tree search. In the case of o3, the search is presumably guided by some kind of evaluator model. To note, Demis Hassabis hinted back in a June 2023 interview that DeepMind had been researching this very idea – this line of work has been a long time coming.

So while single-generation LLMs struggle with novelty, o3 overcomes this by generating and executing its own programs, where the program itself (the CoT) becomes the artifact of knowledge recombination. Although this is not the only viable approach to test-time knowledge recombination (you could also do test-time training, or search in latent space), it represents the current state-of-the-art as per these new ARC-AGI numbers.

Effectively, o3 represents a form of deep learning-guided program search. The model does test-time search over a space of "programs" (in this case, natural language programs – the space of CoTs that describe the steps to solve the task at hand), guided by a deep learning prior (the base LLM). The reason why solving a single ARC-AGI task can end up taking up tens of millions of tokens and cost thousands of dollars is because this search process has to explore an enormous number of paths through program space – including backtracking.

There are however two significant differences between what's happening here and what I meant when I previously described "deep learning-guided program search" as the best path to get to AGI. Crucially, the programs generated by o3 are natural language instructions (to be "executed" by a LLM) rather than executable symbolic programs. This means two things. First, that they cannot make contact with reality via execution and direct evaluation on the task – instead, they must be evaluated for fitness via another model, and the evaluation, lacking such grounding, might go wrong when operating out of distribution. Second, the system cannot autonomously acquire the ability to generate and evaluate these programs (the way a system like AlphaZero can learn to play a board game on its own.) Instead, it is reliant on expert-labeled, human-generated CoT data.

It's not yet clear what the exact limitations of the new system are and how far it might scale. We'll need further testing to find out. Regardless, the current performance represents a remarkable achievement, and a clear confirmation that intuition-guided test-time search over program space is a powerful paradigm to build AI systems that can adapt to arbitrary tasks.

Curated and popular this week
 ·  · 13m read
 · 
It’s to build the skills required to solve the problems that you want to solve in the world [I am a career advisor at 80,000 Hours. This post is adapted from a talk I gave on career capital to some ambitious altruistic students. If you prefer slides, you can access them here. These ideas are informed by my work at 80k but reflects my personal views.] I’m often asked how to have an impactful fulfilling career. My four word answer is “get good, be known.” My “fits on a postcard” answer is something like this: 1. Identify a problem with a vast scale of harm that is neglected at current margins and that is tractable to solve. Millions will die of preventable diseases this year, billions of animals will be tortured, severe tail risks like nuclear war and catastrophic pandemics still exist, and we might be on the cusp of a misaligned intelligence explosion. You should find an important problem to work on. 2. Obsessively improve at the rare and valuable skills to solve this problem and do so in a legible way for others to notice. Leverage this career capital to keep the flywheel going— skills allow you to solve more problems, which builds more skills. Rare and valuable roles require rare and valuable traits, so get so good they can’t ignore you. Unfortunately, some ambitious and altruistic young people that I speak to seem to have implicitly developed a model that looks more like this: 1. Identify a problem with a vast scale of harm, that is neglected at current margins, and that is tractable to solve. 2. Get a job from the 80,000 Hours job board at a capital-E capital-A Effective Altruist organization right out of college, as fast as possible, otherwise feel like a failure, oh god, oh god... I empathize with this feeling. Ambitious people who care about reducing risk and suffering in the world understandably think it’s the most important thing they can be doing, and often hold themselves to a high standard when trying to get there. Before properly entering the w
 ·  · 6m read
 · 
I really enjoyed reading the "why I donate" posts in the past week, so much so that I felt compelled to add my reflections, in case someone finds my reasons as interesting as I found theirs. 1. My money must be spent on something, might as well spend it on the most efficient things The core reason I give is something that I think is under-represented in the other posts: the money I have and earn will need to be spent on something, and it feels extremely inefficient and irrational to spend it on my future self when it can provide >100x as much to others. To me, it doesn't seem important whether I'm in the global top 10% or bottom 10%, or whether the money I have is due to my efforts or to the place I was born. If it can provide others 100x as much, it just seems inefficient/irrational to allocate it to myself. The post could end here, but there are other secondary reasons/perspectives on why I personally donate that I haven't seen commonly discussed. 2. Spending money is voting on how the global economy allocates its resources In 2017, I read Wealth: The Toxic Byproduct by Kevin Simler. Surprisingly, I don't think it has ever been posted on this forum. I disagree with some of it, but the core points really changed how I think about wealth, earning, and spending. The post is very well written and enjoyable, but it's 2400 words, so copy-pasting some snippets: > A thought experiment — the Congolese Trading Window: > > Suppose one day you wake up to find a large pile of Congolese francs. [...] A window [...] pushes open to reveal the unfamiliar sights of a Congolese outdoor market [...] a man approaches your window. [...] He's asking if you'd like to buy his grain for 500 francs. > > What should you do? [...] > Your plan is to buy grain whenever you think the price is poised to go up in the near future, and sell whenever you think the price is poised to go down. > [...] > Imagine a particular bag of grain that you bought at T1 for 200 francs, and then sold at T
 ·  · 6m read
 · 
Note: This post was crossposted from the Coefficient Giving Farm Animal Welfare Research Newsletter by the Forum team, with the author's permission. The author may not see or respond to comments on this post. ---------------------------------------- It can feel hard to help factory-farmed animals. We’re up against a trillion-dollar global industry and its army of lobbyists, marketeers, and apologists. This industry wields vast political influence in nearly every nation and sells its products to most people on earth. Against that, we are a movement of a few thousand full-time advocates operating on a shoestring. Our entire global movement — hundreds of groups combined — brings in less funds in a year than one meat company, JBS, makes in two days. And we have the bigger task. The meat industry just wants to preserve the status quo: virtually no regulation and ever-growing demand for factory farming. We want to upend it — and place humanity on a more humane path. Yet, somehow, we’re winning. After decades of installing battery cages, gestation crates, and chick macerators, the industry is now removing them. Once-dominant industries, like fur farming, are collapsing. And advocates are building momentum toward bigger reforms for all farmed animals. Here are my top ten wins from this year: 1. Liberté et Égalité, for Chickens. France’s largest chicken producer, the LDC Group, committed to adopting the European Chicken Commitment for its two flagship brands by 2028 — a shift that French advocacy group L214 estimates will cover 40% of the national chicken market, or up to 400 million birds each year. Across the Channel, British supermarket chain Waitrose transitioned all its own-brand chicken to comply with the parallel UK Better Chicken Commitment. 2. Guten Cluck! The Wurst Is Over for German Animals. Germany’s top retailer, Edeka, committed to making all of its own-brand chicken products compliant with Germany’s equivalent of the European Chicken Commitment by 2030