
[2311.12983] GAIA: a benchmark for General AI Assistants

paper

Authors

Grégoire Mialon·Clémentine Fourrier·Craig Swift·Thomas Wolf·Yann LeCun·Thomas Scialom

Credibility Rating

Good (3/5)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

GAIA is widely cited as evidence that current AI systems lack robust general-purpose capabilities despite excelling on narrow professional benchmarks, making it relevant to both capabilities assessment and realistic AI safety evaluations.

Paper Details

Citations: 609 (128 influential)
Year: 2023

Metadata

Importance: 72/100 · arXiv preprint · primary source

Abstract

We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in e.g. law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks suggesting to target tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's capability to exhibit similar robustness as the average human does on such questions. Using GAIA's methodology, we devise 466 questions and their answers. We release our questions while retaining answers to 300 of them to power a leader-board available at https://huggingface.co/gaia-benchmark.

Summary

GAIA introduces a benchmark of 466 real-world questions requiring reasoning, multimodal handling, web browsing, and tool use, revealing a stark performance gap: humans achieve 92% accuracy while GPT-4 with plugins achieves only 15%. The paper argues that AGI development should prioritize human-level robustness on practical everyday tasks rather than superhuman performance on narrow professional domains.

Key Points

  • Humans score 92% vs GPT-4 with plugins at 15%, despite questions being conceptually simple for humans but requiring multi-step tool use and reasoning.
  • Questions are designed to be unambiguous, unlikely to appear in training data verbatim, and span three difficulty levels requiring increasing capability integration.
  • Challenges the trend of benchmarking AI on ever-harder specialized tasks (law, chemistry), arguing practical robustness is the better AGI milestone.
  • Released as a public leaderboard on HuggingFace: all 466 questions are public, with answers provided for 166 and withheld for the remaining 300 to preserve evaluation integrity (a minimal scoring sketch follows this list).
  • Highlights a critical gap in current AI assistant capabilities relevant to real-world deployment safety and reliability assessments.
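
Because each GAIA answer is a short string, a number, or a comma-separated list, the leaderboard can score submissions automatically by quasi-exact match against the held-out ground truth. The sketch below is not the official GAIA scorer, only a minimal illustration of that idea; the normalization rules and the `quasi_exact_match` helper are assumptions for this example.

```python
import re

def _norm(s: str) -> str:
    # Lowercase and drop punctuation/whitespace so trivially different
    # spellings of the same string answer still match.
    return re.sub(r"[^a-z0-9]", "", s.lower())

def _as_float(s: str):
    # Parse "90", "+4.6", or "5,876" as numbers; return None otherwise.
    try:
        return float(s.strip().replace(",", ""))
    except ValueError:
        return None

def quasi_exact_match(pred: str, truth: str) -> bool:
    """Score one answer: numeric equality when the ground truth is a
    number, element-wise comparison when it is a comma-separated list,
    normalized string equality otherwise."""
    t_num = _as_float(truth)
    if t_num is not None:  # numeric ground truth
        p_num = _as_float(pred)
        return p_num is not None and p_num == t_num
    if "," in truth:  # list ground truth (naive split; a real scorer must
        p_items = pred.split(",")  # also handle commas inside numbers)
        t_items = truth.split(",")
        return len(p_items) == len(t_items) and all(
            quasi_exact_match(p, t) for p, t in zip(p_items, t_items)
        )
    return _norm(pred) == _norm(truth)  # plain string ground truth

# Ground truths from the sample questions in this page's preview:
assert quasi_exact_match("90", "90")
assert quasi_exact_match("+4.6", "+4.6")
assert quasi_exact_match("white; 5876", "White; 5876")
```

Exact-match scoring of this kind is what keeps the 92% vs. 15% comparison cheap to compute and hard to game, which is why the benchmark insists on unambiguous answers.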

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Tool Use and Computer Use | Capability | 67.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 95 KB
Affiliations: 1. FAIR, Meta · 2. HuggingFace · 3. AutoGPT · 4. GenAI, Meta

# GAIA: A Benchmark for General AI Assistants

Grégoire Mialon
Clémentine Fourrier
Craig Swift
Thomas Wolf
Yann LeCun
Thomas Scialom
Correspondence: [{gmialon,tscialom}@meta.com](mailto:%7Bgmialon,tscialom%7D@meta.com), [clementine@huggingface.co](mailto:clementine@huggingface.co)
###### Abstract

We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in e.g. law or chemistry. GAIA’s philosophy departs from the current trend in AI benchmarks suggesting to target tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system’s capability to exhibit similar robustness as the average human does on such questions. Using GAIA’s methodology, we devise 466 questions and their answers. We release our questions while retaining answers to 300 of them to power a leader-board [hereby accessible](https://huggingface.co/gaia-benchmark).
\
Code: [https://huggingface.co/gaia-benchmark](https://huggingface.co/gaia-benchmark)
\
## 1 Introduction
\
Level 1 question: What was the actual enrollment count of the clinical trial on H. pylori in acne vulgaris patients from Jan-May 2018 as listed on the NIH website?
Ground truth: 90

Level 2 question: ![Refer to caption](https://ar5iv.labs.arxiv.org/html/2311.12983/assets/figures/ice_cream.jpg)
If this whole pint is made up of ice cream, how many percent above or below the US federal standards for butterfat content is it when using the standards as reported by Wikipedia in 2020? Answer as + or - a number rounded to one decimal place.
Ground truth: +4.6

Level 3 question: In NASA’s Astronomy Picture of the Day on 2006 January 21, two astronauts are visible, with one appearing much smaller than the other. As of August 2023, out of the astronauts in the NASA Astronaut Group that the smaller astronaut was a member of, which one spent the least time in space, and how many minutes did he spend in space, rounded to the nearest minute? Exclude any astronauts who did not spend any time in space. Give the last name of the astronaut, separated from the number of minutes by a semicolon. Use commas as thousands separators in the number of minutes.
Ground truth: White; 5876

Figure 1: Sample GAIA questions. Completing the tasks requires fundamental abilities such as reasoning, multi-modality handling, or tool use proficiency. Answers are unambiguous
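
The Level 2 example compresses a typical GAIA workflow into a final arithmetic step: read the nutrition label from the image, look up the US federal butterfat standard on Wikipedia (a 10% minimum for ice cream), and report the signed difference in the requested format. A minimal sketch of that last step, using hypothetical label values since the actual figures come from the pictured pint and are not in this preview:

```python
# Hypothetical stand-ins for values an assistant would read off the
# pictured pint's nutrition label; only the final arithmetic is real.
fat_per_serving_g = 14.6    # hypothetical grams of fat per serving
serving_size_g = 100.0      # hypothetical serving size in grams

butterfat_pct = 100.0 * fat_per_serving_g / serving_size_g
federal_minimum_pct = 10.0  # US federal standard as reported by Wikipedia

difference = butterfat_pct - federal_minimum_pct
print(f"{difference:+.1f}")  # -> +4.6, matching the ground truth
```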

... (truncated, 95 KB total)
Resource ID: 1c294c3f51d7bc1f | Stable ID: ZjMwZmIwMD