o3 scores 87.5% on ARC-AGI
arcprize.org/blog/oai-o3-pub-breakthrough
Landmark announcement by ARC Prize documenting o3's surprising performance on ARC-AGI-1, widely cited in AI safety and capabilities discussions as evidence of a qualitative shift in AI reasoning abilities as of late 2024.
Metadata
Importance: 82/100 · blog post · primary source
Summary
François Chollet reports that OpenAI's o3 model scored 87.5% on the ARC-AGI-1 Semi-Private Evaluation set using high compute (1024 samples), and 75.7% under the $10k budget constraint, representing a dramatic step-function improvement over previous AI systems. This result challenges prior intuitions about AI capabilities, as ARC-AGI-1 took four years to progress from 0% with GPT-3 to only 5% with GPT-4o. The post also announces ARC-AGI-2 and ARC Prize 2025 as next-generation benchmarks targeting AGI progress.
Key Points
- o3 scored 75.7% on the ARC-AGI Semi-Private Eval within the $10k compute budget and 87.5% at 172x higher compute (~$456k).
- This is a massive jump from GPT-4o's ~5%, representing a qualitative shift in novel-task adaptation ability not seen before in GPT-family models.
- o3 was trained on 75% of the ARC-AGI Public Training set; the impact of this fine-tuning on the results has not been fully disentangled.
- The high-compute configuration cost ~$4,560 per task, highlighting that compute efficiency is now a critical metric alongside raw benchmark scores.
- ARC-AGI-2 and ARC Prize 2025 were announced to maintain a rigorous AGI benchmark as o3 approaches saturation of ARC-AGI-1.
Cited by 5 pages
| Page | Type | Quality |
|---|---|---|
| Large Language Models | Concept | 62.0 |
| Reasoning and Planning | Capability | 65.0 |
| Self-Improvement and Recursive Enhancement | Capability | 69.0 |
| Is Scaling All You Need? | Crux | 42.0 |
| Emergent Capabilities | Risk | 61.0 |
Cached Content Preview
HTTP 200 · Fetched Mar 20, 2026 · 15 KB

By [François Chollet](https://fchollet.com/)
Published 20 Dec 2024
# OpenAI o3 Breakthrough High Score on ARC-AGI-Pub
OpenAI has released a new version of o3. [Read our analysis](https://arcprize.org/blog/analyzing-o3-with-arc-agi) to learn how it differs from the preview below.
Updated (April 16, 2025): OpenAI has [officially released o3](https://openai.com/index/o3-o4-mini-system-card/). OpenAI has confirmed that this version is not the same as the one we tested in this original post. See [more information](https://x.com/arcprize/status/1912567067024453926) on this. We will publish updated results for released o3 shortly.
OpenAI's new o3 system - trained on the ARC-AGI-1 Public Training set - has scored a breakthrough **75.7%** on the Semi-Private Evaluation set at our stated public leaderboard $10k compute limit. A high-compute (172x) o3 configuration scored **87.5%**.

This is a surprising and important step-function increase in AI capabilities, showing novel task adaptation ability never seen before in the GPT-family models. For context, ARC-AGI-1 took 4 years to go from 0% with GPT-3 in 2020 to 5% in 2024 with GPT-4o. All intuition about AI capabilities will need to get updated for o3.
The mission of ARC Prize goes beyond our first benchmark: to be a North Star towards AGI. And we're excited to be working with the OpenAI team and others next year to continue to design next-gen, enduring AGI benchmarks.
ARC-AGI-2 (same format - verified easy for humans, harder for AI) will launch alongside ARC Prize 2025. We're committed to running the Grand Prize competition until a high-efficiency, open-source solution scoring 85% is created.
Read on for the full testing report.
* * *
## OpenAI o3 ARC-AGI Results
> Update 12/20/2024: ARC Prize presented o3's performance results in person with OpenAI's Sam Altman (CEO) and Mark Chen (SVP Research) during the final "12 Days of OpenAI" event. Watch the recording [here](https://www.youtube.com/watch?v=SKBG1sqdyIU&t=302s).
We tested o3 against two ARC-AGI datasets:
- **Semi-Private Eval**: 100 private tasks used to assess overfitting
- **Public Eval**: 400 public tasks
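For context, each ARC-AGI task is a small JSON object: a handful of "train" input/output grid pairs demonstrating a transformation, plus "test" inputs to solve, scored by exact match on the predicted output grid (the public leaderboard allows two attempts per test input). A minimal sketch of loading and scoring a task; the file path and helper names are illustrative, not the official harness:

```python
import json

def load_task(path: str) -> dict:
    # Each ARC-AGI task file holds a few "train" demonstration
    # pairs and "test" pairs whose outputs are hidden at eval time.
    with open(path) as f:
        return json.load(f)

def task_solved(attempts: list, expected: list) -> bool:
    # Credit requires an exact match of the output grid; the
    # public leaderboard allows two attempts per test input.
    return any(a == expected for a in attempts)

task = load_task("data/evaluation/0a1b2c3d.json")  # hypothetical path
for pair in task["train"]:
    print(pair["input"], "->", pair["output"])
```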
At OpenAI's direction, we tested at two levels of compute with variable sample sizes: 6 (high-efficiency) and 1024 (low-efficiency, 172x compute).
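The post does not say how the 6 or 1024 samples per task are aggregated into a final answer; one plausible scheme, shown below purely as an assumption, is majority voting over the sampled output grids:

```python
from collections import Counter

def majority_vote(sampled_grids):
    # Pick the most frequently produced output grid among N samples.
    # Grids are lists of lists (unhashable), so key on nested tuples.
    counts = Counter(tuple(map(tuple, g)) for g in sampled_grids)
    (best_key, _), = counts.most_common(1)
    return [list(row) for row in best_key]

# e.g. 6 samples per task in the high-efficiency run, 1024 in low
samples = [
    [[1, 0], [0, 1]],
    [[1, 0], [0, 1]],
    [[0, 0], [0, 0]],
]
print(majority_vote(samples))  # -> [[1, 0], [0, 1]]
```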
Here are the results.
| Set | Tasks | Efficiency | Score | Retail Cost\* | Samples | Tokens | Cost/Task\* | Time/Task (mins) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Semi-Private | 100 | High | 75.7% | $2,680 | 6 | 33.5M | $26 | 1.3 |
| Semi-Private | 100 | Low | 87.5% | $456,000 | 1024 | 5.7B | $4,560 | 13.8 |
| Public | 400 | High | 82.8% | $66,772 | 6 | 111M | $167 | N/A |
| Public | 400 | Low | 91.5% | $760,000 | 1024 | 9.5B | $1,900 | N/A |
\* Note 3/24/2025: o3 pricing costs have been updated to use o1 pricing.
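As a quick sanity check, the per-task costs follow from dividing the retail totals by the 100 Semi-Private tasks, and the sample and token ratios land near the stated 172x compute multiplier. A short sketch using only the table's values:

```python
# Sanity-check the Semi-Private rows above (values copied from the table).
retail = {"high_efficiency": 2_680, "low_efficiency": 456_000}
tasks = 100

for name, total in retail.items():
    print(f"{name}: ${total / tasks:,.2f} per task")
# high_efficiency: $26.80 per task  (table rounds to $26)
# low_efficiency: $4,560.00 per task

# The stated "172x" is roughly consistent with both ratios:
print(1024 / 6)         # ~170.7x by sample count
print(5.7e9 / 33.5e6)   # ~170.1x by token count
```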
... (truncated, 15 KB total)