Toby Ord
@tobyordoxford
When I posted this thread about how o3's extreme costs make it less impressive than it first appears, many people told me that this wasn't an issue as the price would quickly come down. I checked in on it today, and the price has gone *up* by 10x. 1/n

Inference Scaling and the Log-x Chart: 2024 saw a switch in focus from scaling up the compute used to train frontier AI models to scaling up the compute used to run them. How well is this inference scaling going? 1/
Here is the revised ARC-AGI plot. They've increased their cost-estimate of the original o3 low from $20 per task to $200 per task. Presumably o3 high has gone from $3,000 to $30,000 per task, which is why it breaks their $10,000 per task limit and is no longer included. 2/n
The ARC-AGI team found that their o3 price estimate was only a tenth of what OpenAI were charging for the inferior o1-pro model, so they updated the estimate to use the o1-pro price until o3 is actually released and its true price is known. 3/n
Most importantly, in my earlier thread I noted that the o3 models looked like they had broken the existing trend of the o1 models (i.e. still showing only logarithmic returns, but with a better constant). On the updated chart, however, o3 is barely above the o1 trend: 4/n
And that could easily be explained by the fact that o3 was explicitly trained on 75% of the public test set for ARC-AGI (OpenAI hasn't released any ablation results to show how much of the gain came from that). 5/n
So I'm less impressed than I was before about the step from o1 to o3. But I am more impressed by o3-mini, which uses 1,000x less compute and has results which really do break the trend. Kudos to the @ARC Prize team for highlighting AI efficiency in this way. 6/n
Finally, I want to note how preposterous the o3-high attempt was. It took 1,024 attempts at each task, writing about 137 pages of text for each attempt, or about 43 million words total. That's writing an Encyclopaedia Britannica (44 million words) per task! 7/n
And costing about $30,000 for each task. For reference, these are simple puzzles that my 10-year-old child can solve in about 4 minutes. That's *something* but not how intelligence solves the puzzle. 8/8
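The word-count arithmetic above can be checked with a quick sketch. The words-per-page figure is my own assumption (not stated in the thread), chosen as a typical density for prose:

```python
# Sanity check of the o3-high figures quoted in the thread.
attempts_per_task = 1024     # attempts per ARC-AGI task (from the thread)
pages_per_attempt = 137      # pages of text per attempt (from the thread)
words_per_page = 307         # assumed; a typical dense-prose page

total_words = attempts_per_task * pages_per_attempt * words_per_page
print(f"{total_words / 1e6:.1f} million words per task")  # ≈ 43 million
```

At roughly 300 words per page, the quoted figures do indeed come to about 43 million words per task, in the same ballpark as the full Encyclopaedia Britannica.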
This is not 'most of the qualitative new capability'. o1 was getting 30.7% and o3-preview-low got 76% — a boost of 45 percentage points. But o3-released-med gives a boost of only 22 percentage points over o1, which is not most of the gains.
I'd say it is getting somewhere around half the performance boost of o1 -> o3-preview, but what's impressive is that it is much cheaper (even cheaper than OpenAI's cost estimate for o3-preview).
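The percentage-point comparison works out as follows (scores are the ones quoted above; this is just the arithmetic, not new data):

```python
# Scores on ARC-AGI, in percent (from the thread).
o1_score = 30.7
o3_preview_low_score = 76.0
o3_released_med_boost = 22.0  # percentage points over o1 (from the thread)

preview_boost = o3_preview_low_score - o1_score   # 45.3 percentage points
share = o3_released_med_boost / preview_boost
print(f"o3-released-med captures {share:.0%} of the preview boost")  # roughly half
```

So the released model recovers just under half of the o1 -> o3-preview jump, which is why "around half the performance boost" is the right characterisation rather than "most of the gains".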
Also worth noting that 4 months have passed since the o3-preview release, so these costs for o3 now aren't necessarily a great estimate for the cost of o3-preview back then.
Senior Researcher at Oxford University. Author of The Precipice: Existential Risk and the Future of Humanity.