o3 scores 87.5% on ARC-AGI
arcprize.org/blog/oai-o3-pub-breakthrough
Landmark announcement by ARC Prize documenting o3's surprising performance on ARC-AGI-1, widely cited in AI safety and capabilities discussions as evidence of a qualitative shift in AI reasoning abilities as of late 2024.
Metadata
Importance: 82/100 · blog post · primary source
Summary
François Chollet reports that OpenAI's o3 model scored 87.5% on the ARC-AGI-1 Semi-Private Evaluation set using high compute (1024 samples), and 75.7% under the $10k budget constraint, representing a dramatic step-function improvement over previous AI systems. This result challenges prior intuitions about AI capabilities, as ARC-AGI-1 took four years to progress from 0% with GPT-3 to only 5% with GPT-4o. The post also announces ARC-AGI-2 and ARC Prize 2025 as next-generation benchmarks targeting AGI progress.
Key Points
- o3 scored 75.7% on the ARC-AGI Semi-Private Eval within the $10k compute budget and 87.5% at 172x higher compute (~$456k).
- This is a massive jump from GPT-4o's ~5%, representing a qualitative shift in novel-task adaptation ability not seen before in GPT-family models.
- o3 was trained on 75% of the ARC-AGI Public Training set; the impact of this fine-tuning on the results has not been fully disentangled.
- The high-compute configuration cost ~$4,560 per task, highlighting that compute efficiency is now a critical metric alongside raw benchmark scores.
- ARC-AGI-2 and ARC Prize 2025 were announced to maintain a rigorous AGI benchmark as o3 approaches saturation of ARC-AGI-1.
Cited by 5 pages
| Page | Type | Quality |
|---|---|---|
| Large Language Models | Concept | 62.0 |
| Reasoning and Planning | Capability | 65.0 |
| Self-Improvement and Recursive Enhancement | Capability | 69.0 |
| Is Scaling All You Need? | Crux | 42.0 |
| Emergent Capabilities | Risk | 61.0 |
Cached Content Preview
HTTP 200 · Fetched Mar 20, 2026 · 15 KB

By [François Chollet](https://fchollet.com/)
Published 20 Dec 2024
# OpenAI o3 Breakthrough High Score on ARC-AGI-Pub
OpenAI has released a new version of o3. [Read our analysis](https://arcprize.org/blog/analyzing-o3-with-arc-agi) to learn how it differs from the preview below.
Updated (April 16, 2025): OpenAI has [officially released o3](https://openai.com/index/o3-o4-mini-system-card/). OpenAI has confirmed that this version is not the same as the one we tested in this original post. See [more information](https://x.com/arcprize/status/1912567067024453926) on this. We will publish updated results for released o3 shortly.
OpenAI's new o3 system - trained on the ARC-AGI-1 Public Training set - has scored a breakthrough **75.7%** on the Semi-Private Evaluation set at our stated public leaderboard $10k compute limit. A high-compute (172x) o3 configuration scored **87.5%**.

This is a surprising and important step-function increase in AI capabilities, showing novel task adaptation ability never seen before in the GPT-family models. For context, ARC-AGI-1 took 4 years to go from 0% with GPT-3 in 2020 to 5% in 2024 with GPT-4o. All intuition about AI capabilities will need to get updated for o3.
The mission of ARC Prize goes beyond our first benchmark: to be a North Star towards AGI. And we're excited to be working with the OpenAI team and others next year to continue to design next-gen, enduring AGI benchmarks.
ARC-AGI-2 (same format - verified easy for humans, harder for AI) will launch alongside ARC Prize 2025. We're committed to running the Grand Prize competition until a high-efficiency, open-source solution scoring 85% is created.
Read on for the full testing report.
* * *
## OpenAI o3 ARC-AGI Results
> Update 12/20/2024: ARC Prize presented o3's performance results in person with OpenAI's Sam Altman (CEO) and Mark Chen (SVP Research) during the final "12 Days of OpenAI" event. Watch the recording [here](https://www.youtube.com/watch?v=SKBG1sqdyIU&t=302s).
We tested o3 against two ARC-AGI datasets:
- **Semi-Private Eval**: 100 private tasks used to assess overfitting
- **Public Eval**: 400 public tasks
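For context, each ARC-AGI task is a small JSON object: a handful of "train" input/output grid pairs demonstrating a transformation, plus "test" inputs to solve, scored by exact match on the predicted output grid (the public leaderboard allows two attempts per test input). A minimal sketch of loading and scoring a task; the file path and helper names are illustrative, not the official harness:

```python
import json

def load_task(path: str) -> dict:
    # Each ARC-AGI task file holds a few "train" demonstration
    # pairs and "test" pairs whose outputs are hidden at eval time.
    with open(path) as f:
        return json.load(f)

def task_solved(attempts: list, expected: list) -> bool:
    # Credit requires an exact match of the output grid; the
    # public leaderboard allows two attempts per test input.
    return any(a == expected for a in attempts)

task = load_task("data/evaluation/0a1b2c3d.json")  # hypothetical path
for pair in task["train"]:
    print(pair["input"], "->", pair["output"])
```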
At OpenAI's direction, we tested at two levels of compute with variable sample sizes: 6 (high-efficiency) and 1024 (low-efficiency, 172x compute).
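The post does not say how the 6 or 1024 samples per task are aggregated into a final answer; one plausible scheme, shown below purely as an assumption, is majority voting over the sampled output grids:

```python
from collections import Counter

def majority_vote(sampled_grids):
    # Pick the most frequently produced output grid among N samples.
    # Grids are lists of lists (unhashable), so key on nested tuples.
    counts = Counter(tuple(map(tuple, g)) for g in sampled_grids)
    (best_key, _), = counts.most_common(1)
    return [list(row) for row in best_key]

# e.g. 6 samples per task in the high-efficiency run, 1024 in low
samples = [
    [[1, 0], [0, 1]],
    [[1, 0], [0, 1]],
    [[0, 0], [0, 0]],
]
print(majority_vote(samples))  # -> [[1, 0], [0, 1]]
```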
Here are the results.
| Set | Tasks | Efficiency | Score | Retail Cost\* | Samples | Tokens | Cost/Task\* | Time/Task (mins) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Semi-Private | 100 | High | 75.7% | $2,680 | 6 | 33.5M | $26 | 1.3 |
| Semi-Private | 100 | Low | 87.5% | $456,000 | 1024 | 5.7B | $4,560 | 13.8 |
| Public | 400 | High | 82.8% | $66,772 | 6 | 111M | $167 | N/A |
| Public | 400 | Low | 91.5% | $760,000 | 1024 | 9.5B | $1,900 | N/A |
\* Note 3/24/2025: o3 pricing costs have been updated to use o1 pricing.
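As a quick sanity check, the per-task costs follow from dividing the retail totals by the 100 Semi-Private tasks, and the sample and token ratios land near the stated 172x compute multiplier. A short sketch using only the table's values:

```python
# Sanity-check the Semi-Private rows above (values copied from the table).
retail = {"high_efficiency": 2_680, "low_efficiency": 456_000}
tasks = 100

for name, total in retail.items():
    print(f"{name}: ${total / tasks:,.2f} per task")
# high_efficiency: $26.80 per task  (table rounds to $26)
# low_efficiency: $4,560.00 per task

# The stated "172x" is roughly consistent with both ratios:
print(1024 / 6)         # ~170.7x by sample count
print(5.7e9 / 33.5e6)   # ~170.1x by token count
```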
... (truncated, 15 KB total)