Back
ARC-AGI-2 Benchmark
arcprize.org/arc-agi/2/
ARC-AGI-2 is a leading benchmark for measuring progress toward general reasoning and AGI; relevant for AI safety researchers tracking the gap between human and AI reasoning capabilities and evaluating whether scaling laws are sufficient.
Metadata
Importance: 72/100 · tool page · dataset
Summary
ARC-AGI-2 is a 2025 benchmark designed to stress-test AI reasoning systems, where pure LLMs score 0% and frontier reasoning systems achieve only single-digit percentages despite humans solving all tasks. It targets three core capability gaps—symbolic interpretation, compositional reasoning, and contextual rule application—demonstrating that scaling alone is insufficient and new architectures or test-time adaptation methods are required.
Key Points
- Pure LLMs score 0% and the best AI reasoning systems score only single-digit percentages; all tasks are solvable by humans, highlighting a major capability gap.
- Benchmark targets three specific weaknesses: symbolic interpretation, compositional reasoning, and contextual rule application.
- Demonstrates log-linear scaling is insufficient; new test-time adaptation algorithms or novel AI architectures are required.
- Dataset includes 1,000 training tasks and 360 calibrated evaluation tasks split across public, semi-private, and private sets; challenge goal is 85% accuracy.
- Successor to ARC-AGI-1 (2019), which saw little progress for 5 years until test-time adaptation methods emerged in late 2024.
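The pass@2 criterion used for the evaluation sets (a task counts as solved if either of two attempted output grids exactly matches the expected grid) can be sketched in Python. This is an illustrative sketch, not the official scoring harness; the function names and the grid-as-nested-lists representation are assumptions:

```python
def solved_pass_at_2(attempts, expected):
    """A task counts as solved under pass@2 if either of the
    (at most) two attempted grids exactly matches the expected grid."""
    return any(attempt == expected for attempt in attempts[:2])

def score(predictions, solutions):
    """Fraction of tasks solved. Each entry in `predictions` is a list
    of up to two candidate grids (lists of rows of ints); `solutions`
    holds the corresponding expected grids."""
    solved = sum(
        solved_pass_at_2(attempts, expected)
        for attempts, expected in zip(predictions, solutions)
    )
    return solved / len(solutions)
```

For example, a system whose second attempt on a task matches the expected grid still gets full credit for that task: `score([[[[1]], [[2]]]], [[[2]]])` returns `1.0`.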
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| AI Capability Threshold Model | Analysis | 72.0 |
Cached Content Preview
HTTP 200 · Fetched Mar 15, 2026 · 6 KB
# ARC-AGI-2
## 2025 - Challenges Reasoning Models
#### Links
#### Versions
## Introducing ARC-AGI-2
A new benchmark that challenges frontier AI reasoning systems.
ARC-AGI-1 was created in 2019 (before the modern LLM era). It endured 5 years of global competitions and over a 50,000x scale-up of AI, yet saw little progress until late 2024, when test-time adaptation methods were pioneered by [ARC Prize 2024](https://arcprize.org/2024-results) and [OpenAI](https://arcprize.org/arc-agi/2/blog/oai-o3-pub-breakthrough).
ARC-AGI-2 - the next iteration of the benchmark - is designed to stress test the **efficiency** and **capability** of state-of-the-art AI reasoning systems, provide useful signal towards AGI, and re-inspire researchers to work on new ideas.
Pure LLMs score 0%, AI reasoning systems score only single-digit percentages, yet extensive testing shows that humans can solve every task.
Can you create a system that can reach 85% accuracy?
[Learn more](https://arcprize.org/blog/announcing-arc-agi-2-and-arc-prize-2025)
Efficiency Test
#### ARC-AGI-2: Scale is Not Enough
Log-linear scaling is insufficient to beat ARC-AGI-2. New test-time adaptation algorithms or novel AI systems are needed to bring AI efficiency in line with human performance.

Capability Test
#### ARC-AGI-2: Symbolic Interpretation
Tasks requiring symbols to be interpreted as having meaning beyond their visual patterns.
Current systems attempt to check symmetry, mirroring, and other transformations, and can even recognize connecting elements, but they fail to assign semantic significance to the symbols themselves.
[Try this task](https://arcprize.org/play?task=e3721c99)
Capability Test
#### ARC-AGI-2: Compositional Reasoning
Tasks requiring the simultaneous application of multiple rules, or the application of rules that interact with each other.
By contrast, when a task has only a few global rules, current systems can consistently discover and apply them.
[Try this task](https://arcprize.org/play?task=cbebaa4b)
Capability Test
#### ARC-AGI-2: Contextual Rule Application
Tasks where rules must be applied differently based on context.
Systems tend to fixate on superficial patterns rather than understanding the underlying selection principles.
[Try this task](https://arcprize.org/play?task=b5ca7ac4)
### Dataset Structure
| Dataset | Tasks | Description |
| --- | --- | --- |
| [Training Set](https://github.com/arcprize/ARC-AGI-2/tree/main/data/training) | 1,000 tasks | Uncalibrated, public; a spectrum of difficulty ranging from very easy to very difficult for both humans and AI, designed to expose and teach Core Knowledge Priors; use to train your systems. |
| [Public Eval Set](https://github.com/arcprize/ARC-AGI-2/tree/main/data/evaluation) | 120 tasks | Calibrated, public; all tasks solved pass@2 by at least two humans; use to test your
... (truncated, 6 KB total)
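For readers pulling the linked GitHub datasets: ARC tasks are stored as one JSON file per task, with `train` and `test` lists of input/output grid pairs, each grid a list of rows of small integers (colors). A minimal loader sketch under that assumption (the example path is illustrative, not a guaranteed file in the repo):

```python
import json

def load_task(path):
    """Load one ARC task file.

    Returns the task's 'train' and 'test' lists; each entry is an
    {'input': grid, 'output': grid} pair, where a grid is a list of
    rows of integers.
    """
    with open(path) as f:
        task = json.load(f)
    return task["train"], task["test"]

# Hypothetical usage: inspect the demonstration pairs of one task.
# train_pairs, test_pairs = load_task("data/training/some_task.json")
# for pair in train_pairs:
#     rows, cols = len(pair["input"]), len(pair["input"][0])
#     print(f"demo input grid: {rows} x {cols}")
```

The `train` pairs are the demonstrations a solver may study; the `test` pairs hold the inputs it must answer (with outputs included in the public sets for self-evaluation).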