
The news:

ARC Prize published a blog post on April 22, 2025 that says OpenAI's o3 (Medium) scores 2.9% on the ARC-AGI-2 benchmark.[1] As of today, the leaderboard says that o3 (Medium) scores 3.0%. The blog post says o4-mini (Medium) scores 2.3% on ARC-AGI-2 and the leaderboard says it scores 2.4%. 

The high-compute versions of the models "failed to respond or timed out" for the large majority of tasks.

The average score for humans — typical humans off the street — is 60%. All of the ARC-AGI-2 tasks have been solved by at least two humans in no more than two attempts.

From the recent blog post:         

ARC Prize Foundation is a nonprofit committed to serving as the North Star for AGI by building open reasoning benchmarks that highlight the gap between what’s easy for humans and hard for AI. The ARC‑AGI benchmark family is our primary tool to do this. Every major model we evaluate adds new datapoints to the community’s understanding of where the frontier stands and how fast it is moving.

In this post we share the first public look at how OpenAI’s newest o‑series models, o3 and o4‑mini, perform on ARC‑AGI.

Our testing shows:

  • o3 performs well on ARC-AGI-1 - o3-low scored 41% on the ARC-AGI-1 Semi Private Eval set, and the o3-medium reached 53%. Neither surpassed 3% on ARC‑AGI‑2.
  • o4-mini shows promise - o4-mini-low scored 21% on ARC-AGI-1 Semi Private Eval, and o4-mini-medium scored 41% at state of the art levels of efficiency. Again, both low/med scored under 3% on the more difficult ARC-AGI-2 set.
  • Incomplete coverage with high reasoning - Both o3 and o4-mini frequently failed to return outputs when run at “high” reasoning. Partial high‑reasoning results appear below. However, these runs were excluded from the leaderboard due to insufficient coverage.

My analysis:

This is clear evidence that cutting-edge AI models have far less than human-level general intelligence. 

To be clear, scoring at human-level or higher on ARC-AGI-2 isn't evidence of human-level general intelligence and isn't intended to be. It's simply meant to be a challenging benchmark for AI models that attempts to measure models' ability to generalize to novel problems, rather than to rely on memorization to solve problems. 

By analogy, o4-mini's inability to play hangman is a sign that it's far from artificial general intelligence (AGI), but if o5-mini or a future version of o4-mini is able to play hangman, that wouldn't be a sign that it is AGI.

This is also conclusive disconfirmation (as if we needed it!) of the economist Tyler Cowen's declaration that o3 is AGI. (He followed up a day later and said, "I don’t mind if you don’t want to call it AGI." But he didn't say he was wrong to call it AGI.) 

It is inevitable that over the next 5 years, many people will realize their belief that AGI will be created within the next 5 years is wrong. (Though not necessarily all, since, as Tyler Cowen showed, it is possible to declare that an AI model is AGI when it is clearly not. To avoid admitting to being wrong, in 2027 or 2029 or 2030 or whenever they predicted AGI would happen, people can just declare the latest AI model from that year to be AGI.) ARC-AGI-2 and, later on, ARC-AGI-3 can serve as a clear reminder that frontier AI models are not AGI, are not close to AGI, and continue to struggle with relatively simple problems that are easy for humans. 

If you imagine fast enough progress, then no matter how far current AI systems are from AGI, it's possible to imagine them progressing from the current level of capabilities to AGI in incredibly small spans of time. But there is no reason to think progress will be fast enough to cover the ground from o3 (or any other frontier AI model) to AGI within 5 years. 

The models that exist today are somewhat better than the models that existed 2 years ago, but only somewhat. In 2 years, the models will probably be somewhat better than today, but only somewhat. 

It's hard to quantify general intelligence in a way that allows apples-to-apples comparisons between humans and machines. If we measure general intelligence by the ability to play grandmaster-level chess, well, IBM's Deep Blue could do that in 1996. If we give ChatGPT an IQ test, it will score well above 100, the average for humans. Large language models (LLMs) are good at taking written tests and exams, which is what a lot of popular benchmarks are. 

So, when I say today's AI models are somewhat better than AI models from 2 years ago, that's an informal, subjective evaluation based on casual observation and intuition. I don't have a way to quantify intelligence. Unfortunately, no one does. 

In lieu of quantifying intelligence, I think pointing to the kinds of problems frontier AI models can't solve — problems which are easy for humans — and pointing to slow (or non-existent) progress in those areas is strong enough evidence against very near-term AGI. For example, o3 only gets 3% on ARC-AGI-2, o4-mini can't play hangman, and, after the last 2 years of progress, models are still hallucinating a lot and still struggling to understand time, causality, and other simple concepts. They have very little capacity to do hierarchical planning. There's been a little bit of improvement on these things, but not much. 

Watch the ARC-AGI-2 leaderboard (and, later on, the ARC-AGI-3 leaderboard) over the coming years. It will be a better way to quantify progress toward AGI than any other benchmark or metric I'm currently aware of, basically all of which seem almost entirely unhelpful for measuring AGI progress. (Again, with the caveat that solving ARC-AGI-2 doesn't mean a system is AGI, but failure to solve it means a system isn't AGI.) I have no idea how long it will take to solve ARC-AGI-2 (or ARC-AGI-3), but I suspect we will roll past the deadline for at least one attention-grabbing prediction of very near-term AGI before it is solved.[2]

  1. ^

    For context, read ARC Prize's blog post from March 24, 2025 announcing and explaining the ARC-AGI-2 benchmark. I also liked this video explaining ARC-AGI-2.

  2. ^

    For example, Elon Musk has absurdly predicted that AGI will be created by the end of 2025, and I wouldn't be at all surprised if on January 1, 2026, the top score on ARC-AGI-2 is still below 60%. 

Comments (8)



By analogy, o4-mini's inability to play hangman is a sign that it's far from artificial general intelligence (AGI)

What is your source for this? I just tried and it played hangman just fine.

I played it the other way around, where I asked o4-mini to come up with a word that I would try to guess. I tried this twice and it made the same mistake both times.

The first word was "butterfly". I guessed "B" and it said, "The letter B is not in the word."

Then, when I lost the game and o4-mini revealed the word, it said, "Apologies—I mis-evaluated your B guess earlier."

The second time around, I tried to help it by saying: "Make a plan for how you would play hangman with me. Lay out the steps in your mind but don’t tell me anything. Tell me when you’re ready to play."

It made the same mistake again. I guessed the letters A, E, I, O, U, and Y, and it told me none of the letters were in the word. That exhausted the number of wrong guesses I was allowed, so it ended the game and revealed the word was "schmaltziness".

This time, it didn't catch its own mistake right away. I prompted it to review the context window and check for mistakes. At that point, it said that A, E, and I are actually in the word.[1]

Related to this: François Chollet has a great talk from August 2024, which I posted here, that includes a section on some of the weird, goofy mistakes that LLMs make. 

He argues that when a new mistake or category of mistake is discovered and becomes widely known, LLM companies fine-tune their models to avoid these mistakes in the future. But if you change up the prompt a bit, you can still elicit the same kind of mistake. 

So, the fine-tuning may give the impression that LLMs' overall reasoning ability is improving, but really this is a patchwork approach that can't possibly scale to cover the space of all human reasoning, which is impossibly vast and can only be mastered through better generalization. 

  1. ^

    I edited my comment to add this footnote on 2025-05-03 at 16:33 UTC. I just checked and o4-mini got the details on this completely wrong. It said:

     

    But the final word SCHMALTZINESS actually contains an A (in position 5), an I (in positions 10 and 13), and two E’s (in positions 11 and 14).

    What it said about the A is correct. It said that one letter, I, was in two positions, and neither of the positions it gave contains an I. It said there are two Es, but there is only one E. It gets the position of that E right, but says there is a second E in position 14, which doesn't exist.
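    (For anyone who wants to double-check this, here is a small Python snippet, added purely for illustration, that prints the 1-indexed positions of A, I, and E in "SCHMALTZINESS".)

```python
# Check o4-mini's claims about letter positions in "SCHMALTZINESS"
# (1-indexed, to match the wording above).
word = "SCHMALTZINESS"

for letter in "AIE":
    positions = [i + 1 for i, ch in enumerate(word) if ch == letter]
    print(letter, positions)

print(len(word))

# Output:
# A [5]    -> the claim about A is correct
# I [9]    -> not positions 10 and 13
# E [11]   -> one E, not two; there is no position 14
# 13       -> the word has only 13 letters
```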

Huh interesting, I just tried that direction and it worked fine as well. This isn't super important but if you wanted to share the conversation I'd be interested to see the prompt you used.

I got an error trying to look at your link:

Unable to load conversation

For the first attempt at hangman, when the word was "butterfly", the prompt I gave was just: 

Let’s play hangman. Pick a word and I’ll guess.

After o4-mini picked a word, I added: 

Also, give me a vague hint or a general category.

It said the word was an animal. 

I guessed B, it said there was no B, and at the end said the word was "butterfly".

The second time, when the word was "schmaltziness", the prompt was:

Make a plan for how you would play hangman with me. Lay out the steps in your mind but don’t tell me anything. Tell me when you’re ready to play.

o4-mini responded:

I’m ready to play Hangman!

I said:

Give me a clue or hint to the word and then start the game.

There were three words where the clue was so obvious I guessed the word on the first try. 

Clue: "This animal 'never forgets.'"
Answer: Elephant

Clue: "A hopping marsupial native to Australia."
Answer: Kangaroo

After kangaroo, I said:

Next time, make the word harder and the clue more vague

Clue: "A tactic hidden beneath the surface."
Answer: Subterfuge. 

A little better, but I still guessed the word right away. 

I prompted again:

Harder word, much vaguer clue

o4-mini gave the clue "A character descriptor" and this began the disastrous attempt where it said the word "schmaltziness" had no vowels. 

Fixed the link. I also tried your original prompt and it worked for me.

But interesting! The "Harder word, much vaguer clue" seems to prompt it to not actually play hangman and instead antagonistically try to post hoc create a word after each guess which makes your guess wrong. I asked "Did you come up with a word when you first told me the number of letters or are you changing it after each guess?" And it said "I picked the word up front when I told you it was 10 letters long, and I haven’t changed it since. You’re playing against that same secret word the whole time." (Despite me being able to see its reasoning trace that this is not what it's doing.) When I say I give up it says "I’m sorry—I actually lost track of the word I’d originally picked and can’t accurately reveal it now." (Because it realized that there was no word consistent with its clues, as you noted.)

So I don't think it's correct to say that it doesn't know how to play hangman. (It knows, as you noted yourself.) It just wants so badly to make you lose that it lies about the word.

There is some ambiguity in claims about whether an LLM knows how to do something. The spectrum of knowing how to do things ranges all the way from “Can it do it at least once, ever?” to “Does it do it reliably, every time, without fail?”.

My experience was that I tried to play hangman with o4-mini twice and it failed both times in the same really goofy way, where it counted my guesses wrong when I guessed a letter that was in the word it later said I was supposed to be guessing.
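To make that failure mode concrete, here is a minimal sketch (in Python, purely illustrative, and not anything o4-mini actually runs) of the bookkeeping a hangman host has to do: commit to a secret word up front and evaluate every guess against that same word.

```python
# Minimal hangman "host" sketch: the secret word is fixed at the start,
# and every guess is checked against that same word.
class HangmanHost:
    def __init__(self, secret: str, max_misses: int = 6):
        self.secret = secret.upper()
        self.guessed: set[str] = set()
        self.misses = 0
        self.max_misses = max_misses

    def guess(self, letter: str) -> str:
        letter = letter.upper()
        self.guessed.add(letter)
        if letter in self.secret:
            return f"Yes, {letter} is in the word."
        self.misses += 1
        return f"No, {letter} is not in the word ({self.misses}/{self.max_misses} misses)."

    def board(self) -> str:
        # Reveal guessed letters, hide the rest.
        return " ".join(ch if ch in self.guessed else "_" for ch in self.secret)

host = HangmanHost("BUTTERFLY")
print(host.guess("B"))  # Yes, B is in the word.
print(host.board())     # B _ _ _ _ _ _ _ _
```

The part o4-mini got wrong is exactly the part this sketch makes trivial: keeping the evaluation of each guess consistent with the word it eventually reveals.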

When I played the game with o4-mini where it said the word was “butterfly” (and also said there was no “B” in the word when I guessed “B”), I didn’t prompt it to make the word hard. I just said, after it claimed to have picked the word:

"E. Also, give me a vague hint or a general category."

o4-mini said:

"It’s an animal."

So, maybe asking for a hint or a category is the thing that causes it to fail. I don’t know.

Even if I accepted the idea that the LLM “wants me to lose” (which sounds dubious to me), then it doesn’t know how to do that properly, either. In the “butterfly” example, it could, in theory, have chosen a word retroactively that filled in the blanks but didn’t conflict with any guesses it said were wrong. But it didn’t do that.

In the attempt where the word was “schmaltziness”, o4-mini’s response about which letters were where in the word (which I pasted in a footnote to my previous comment) was borderline incoherent. I could hypothesize that this was part of a secret strategy on its part to follow my directives, but much more likely, I think, is that it just lacks the capability to execute the task reliably.

Fortunately, we don’t have to dwell on hangman too much, since there are rigorous benchmarks like ARC-AGI-2 that show more conclusively the reasoning abilities of o3 and o4-mini are poor compared to typical humans.

Note that the old[1] o3-high that was tested on ARC-AGI-1:

  1. ^

    OpenAI have stated that the newly-released o3 is not the same one as was evaluated on ARC-AGI-1 in December

Good Lord! Thanks for this information!

The Twitter thread by Toby Ord is great. Thanks for linking that. This tweet helps put things in perspective:

For reference, these are simple puzzles that my 10-year-old child can solve in about 4 minutes.
