
The news:

ARC Prize published a blog post on April 22, 2025 that says OpenAI's o3 (Medium) scores 2.9% on the ARC-AGI-2 benchmark.[1] As of today, the leaderboard says that o3 (Medium) scores 3.0%. The blog post says o4-mini (Medium) scores 2.3% on ARC-AGI-2 and the leaderboard says it scores 2.4%. 

The high-compute versions of the models "failed to respond or timed out" for the large majority of tasks.

The average score for humans — typical humans off the street — is 60%. All of the ARC-AGI-2 tasks have been solved by at least two humans in no more than two attempts.
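To make the "no more than two attempts" grading concrete, here is a minimal sketch of pass-at-two, exact-match scoring in Python. It is only an illustration under stated assumptions (the helper names and the simplified "fraction of test outputs solved" aggregation are mine), not the official ARC Prize evaluation harness.

```python
# Illustrative sketch of ARC-AGI-style grading (assumed details, not the
# official harness). A test output counts as solved only if one of at most
# two attempted grids matches the ground-truth grid exactly.

Grid = list[list[int]]  # ARC grids are small 2-D arrays of integer "colors"


def solved(attempts: list[Grid], truth: Grid) -> bool:
    """True if any of up to two attempts reproduces the target grid exactly."""
    return any(attempt == truth for attempt in attempts[:2])


def benchmark_score(results: list[bool]) -> float:
    """Simplified aggregation: fraction of test outputs solved."""
    return sum(results) / len(results) if results else 0.0


# Toy example: one exact match, one miss on both attempts -> 50%.
truth = [[1, 0], [0, 1]]
results = [
    solved([[[1, 0], [0, 1]]], truth),                      # exact match
    solved([[[1, 0], [1, 1]], [[0, 0], [0, 1]]], truth),    # both attempts wrong
]
print(f"{benchmark_score(results):.0%}")  # 50%
```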

From the recent blog post:         

ARC Prize Foundation is a nonprofit committed to serving as the North Star for AGI by building open reasoning benchmarks that highlight the gap between what’s easy for humans and hard for AI. The ARC‑AGI benchmark family is our primary tool to do this. Every major model we evaluate adds new datapoints to the community’s understanding of where the frontier stands and how fast it is moving.

In this post we share the first public look at how OpenAI’s newest o‑series models, o3 and o4‑mini, perform on ARC‑AGI.

Our testing shows:

  • o3 performs well on ARC-AGI-1 - o3-low scored 41% on the ARC-AGI-1 Semi Private Eval set, and the o3-medium reached 53%. Neither surpassed 3% on ARC‑AGI‑2.
  • o4-mini shows promise - o4-mini-low scored 21% on ARC-AGI-1 Semi Private Eval, and o4-mini-medium scored 41% at state of the art levels of efficiency. Again, both low/med scored under 3% on the more difficult ARC-AGI-2 set.
  • Incomplete coverage with high reasoning - Both o3 and o4-mini frequently failed to return outputs when run at “high” reasoning. Partial high‑reasoning results appear below. However, these runs were excluded from the leaderboard due to insufficient coverage.

My analysis:

This is clear evidence that cutting-edge AI models have far less than human-level general intelligence. 

To be clear, scoring at human-level or higher on ARC-AGI-2 isn't evidence of human-level general intelligence and isn't intended to be. It's simply meant to be a challenging benchmark for AI models that attempts to measure models' ability to generalize to novel problems, rather than to rely on memorization to solve problems. 

By analogy, o4-mini's inability to play hangman is a sign that it's far from artificial general intelligence (AGI), but if o5-mini or a future version of o4-mini is able to play hangman, that wouldn't be a sign that it is AGI.

This is also conclusive disconfirmation (as if we needed it!) of the economist Tyler Cowen's declaration that o3 is AGI. (He followed up a day later and said, "I don’t mind if you don’t want to call it AGI." But he didn't say he was wrong to call it AGI.) 

It is inevitable that over the next 5 years, many people will realize their belief that AGI will be created within the next 5 years is wrong. (Though not necessarily all of them, since, as Tyler Cowen showed, it is possible to declare that an AI model is AGI when it clearly is not. To avoid admitting they were wrong, in 2027 or 2029 or 2030 or whenever they predicted AGI would arrive, people can simply declare the latest AI model from that year to be AGI.) ARC-AGI-2 and, later on, ARC-AGI-3 can serve as a clear reminder that frontier AI models are not AGI, are not close to AGI, and continue to struggle with relatively simple problems that are easy for humans.

If you imagine fast enough progress, then no matter how far current AI systems are from AGI, it's possible to imagine them progressing from the current level of capabilities to AGI in incredibly small spans of time. But there is no reason to think progress will be fast enough to cover the ground from o3 (or any other frontier AI model) to AGI within 5 years. 

The models that exist today are somewhat better than the models that existed 2 years ago, but only somewhat. In 2 years, the models will probably be somewhat better than today, but only somewhat. 

It's hard to quantify general intelligence in a way that allows apples-to-apples comparisons between humans and machines. If we measure general intelligence by the ability to play grandmaster-level chess, well, IBM's Deep Blue could do that in 1996. If we give ChatGPT an IQ test, it will score well above 100, the average for humans. Large language models (LLMs) are good at taking written tests and exams, which is what a lot of popular benchmarks are.

So, when I say today's AI models are somewhat better than AI models from 2 years ago, that's an informal, subjective evaluation based on casual observation and intuition. I don't have a way to quantify intelligence. Unfortunately, no one does. 

In lieu of quantifying intelligence, I think pointing to the kinds of problems frontier AI models can't solve — problems that are easy for humans — and to the slow (or non-existent) progress in those areas is strong enough evidence against very near-term AGI. For example, o3 only gets 3% on ARC-AGI-2, o4-mini can't play hangman, and, after the last 2 years of progress, models are still hallucinating a lot and still struggling to understand time, causality, and other simple concepts. They have very little capacity for hierarchical planning. There's been a little bit of improvement on these things, but not much.

Watch the ARC-AGI-2 leaderboard (and, later on, the ARC-AGI-3 leaderboard) over the coming years. It will be a better way to quantify progress toward AGI than any other benchmark or metric I'm currently aware of, basically all of which seem almost entirely unhelpful for measuring AGI progress. (Again, with the caveat that solving ARC-AGI-2 doesn't mean a system is AGI, but failure to solve it means the system isn't AGI.) I have no idea how long it will take to solve ARC-AGI-2 (or ARC-AGI-3), but I suspect we will roll past the deadline for at least one attention-grabbing prediction of very near-term AGI before it is solved.

  1. For context, read ARC Prize's blog post from March 24, 2025 announcing and explaining the ARC-AGI-2 benchmark. I also liked this video explaining ARC-AGI-2.
