
The upcoming ARC-AGI-2 benchmark


ARC-AGI-2 is a key benchmark in the AI capabilities landscape, relevant to AI safety researchers tracking progress toward AGI and assessing when transformative AI milestones might be reached.

Metadata

Importance: 62/100

Summary

ARC-AGI-2 is an updated benchmark designed to test general fluid intelligence in AI systems, building on the original ARC-AGI challenge. It aims to create harder, more meaningful tests that current AI systems struggle with, providing a more rigorous evaluation of progress toward human-level general reasoning. The benchmark is intended to resist dataset contamination and measure genuine generalization rather than pattern-matching.

Key Points

  • ARC-AGI-2 is a successor to the original ARC-AGI benchmark, designed to be more challenging and resistant to current AI approaches
  • The benchmark tests for fluid intelligence and novel problem-solving rather than memorized patterns, making it harder to game with large-scale training
  • It serves as a measurable proxy for progress toward AGI by targeting skills current systems lack despite broad capabilities
  • The prize competition incentivizes the AI research community to develop genuinely general reasoning systems
  • ARC-AGI benchmarks are seen as important calibration tools for understanding how close AI systems are to human-level general intelligence

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Reasoning and Planning | Capability | 65.0 |

Cached Content Preview

HTTP 200 | Fetched Mar 20, 2026 | 4 KB
# ARC-AGI-1

## 2019 - Challenges Deep Learning

### About

The **Abstraction and Reasoning Corpus (ARC-AGI-1)** was introduced in 2019 by François Chollet in his paper [On the Measure of Intelligence](https://arxiv.org/abs/1911.01547). Chollet, a prominent Google AI researcher and creator of the deep learning library Keras, designed ARC-AGI-1 as a novel benchmark to test machine reasoning and general problem-solving skills.

![ARC-AGI-1 Task](https://arcprize.org/media/images/arc-task-grids.jpg)ARC-AGI-1 Task [(#3aa6fb7a)](https://arcprize.org/play?task=3aa6fb7a)

At the time of its launch, there was growing recognition that deep learning methods excelled at narrow, specialized tasks but fell short of human-like generalization. ARC-AGI-1 was a direct response to this gap, aimed at evaluating an AI's capability to handle novel, unforeseen problems: situations it had not been explicitly trained on. For further reading, see the [ARC Prize 2024 Technical Report](https://arxiv.org/html/2412.04604v1#:~:text=Fran%C3%A7ois%20Chollet%20first%20wrote%20about,19).

Motivated by the need for a true measure of AGI, ARC-AGI-1 functions as an "AGI yardstick," focusing on benchmarking the skill-acquisition capability (the fundamental core of intelligence) rather than performance on any single, predefined task. It specifically assesses how efficiently an AI can learn and generalize from minimal information, reflecting a fundamental characteristic of human intelligence.

ARC-AGI-1 consists of 800 puzzle-like tasks designed as grid-based visual reasoning problems. These tasks, trivial for humans but challenging for machines, typically provide only a [small number of example input-output pairs](https://pgpbpadilla.github.io/chollet-arc-challenge#:~:text=Learning%20from%20a%20few%20examples) (usually around three). The test taker (human or AI) must deduce the underlying rule through abstraction, inference, and prior knowledge rather than brute force or extensive training.
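To make the task format concrete, here is a minimal sketch in Python of loading and inspecting a single task. The JSON layout (top-level `train` and `test` lists of input/output grid pairs, with cells encoded as integers 0-9) follows the public [fchollet/ARC-AGI](https://github.com/fchollet/ARC-AGI) repository; the local file path is an assumption for illustration.

```python
import json

# Each ARC-AGI-1 task is a JSON file with two top-level keys:
#   "train": a few demonstration input/output pairs (usually around three)
#   "test":  one or more inputs whose outputs the solver must predict
# Grids are lists of lists of integers 0-9, each integer denoting a color.
with open("data/training/3aa6fb7a.json") as f:  # path assumes a local clone
    task = json.load(f)

for i, pair in enumerate(task["train"]):
    in_rows, in_cols = len(pair["input"]), len(pair["input"][0])
    out_rows, out_cols = len(pair["output"]), len(pair["output"][0])
    print(f"demo {i}: input {in_rows}x{in_cols} -> output {out_rows}x{out_cols}")

# The solver must infer the transformation rule from the demos alone
# and apply it to each test input.
test_input = task["test"][0]["input"]
```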

| **Dataset** | **Tasks** | **Description** |
| --- | --- | --- |
| [Training Set](https://github.com/fchollet/ARC-AGI/tree/master/data/training) | 400 tasks | Public tasks intended as a playground for training your system. |
| [Public Eval Set](https://github.com/fchollet/ARC-AGI/tree/master/data/evaluation) | 400 tasks | Used to evaluate your final algorithm. |
| [Semi-Private Eval Set](https://arcprize.org/guide#semi-private) | 100 tasks | Introduced in mid-2024; a hand-selected holdout set used when testing closed-source models. |
| [Private Eval Set](https://arcprize.org/guide#private) | 100 tasks | The basis of the ARC Prize competition; determined the final leaderboard in 2020, 2022, 2023, and 2024. |
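
As a sketch of how evaluation against these sets works, the snippet below assumes exact-match scoring: a test item counts as solved only if the predicted output grid matches the expected grid cell for cell (the ARC Prize rules additionally allow a small number of attempts per test input). The `solve` function here is a hypothetical placeholder for an actual system.

```python
import json
from pathlib import Path

def solve(train_pairs, test_input):
    """Hypothetical solver: infer the rule from train_pairs and apply it.
    Shown here as an identity baseline placeholder."""
    return test_input

def task_solved(task):
    # A test item is correct only on an exact, cell-for-cell grid match.
    return all(
        solve(task["train"], item["input"]) == item["output"]
        for item in task["test"]
    )

# Path assumes a local clone of the public evaluation set.
tasks = [json.loads(p.read_text()) for p in Path("data/evaluation").glob("*.json")]
solved = sum(task_solved(t) for t in tasks)
print(f"solved {solved}/{len(tasks)} tasks")
```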

From its introduction in 2019 until late 2024, ARC-AGI-1 remained unsolved by AI systems, maintaining its reputation as one of the [tough

... (truncated, 4 KB total)