
[2311.12983] GAIA: a benchmark for General AI Assistants

paper

Authors

Grégoire Mialon·Clémentine Fourrier·Craig Swift·Thomas Wolf·Yann LeCun·Thomas Scialom

Credibility Rating

Good (3/5)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

GAIA is widely cited as evidence that current AI systems lack robust general-purpose capabilities despite excelling on narrow professional benchmarks, making it relevant to both capabilities assessment and realistic AI safety evaluations.

Paper Details

Citations: 609 (128 influential)
Year: 2023

Metadata

Importance: 72/100 · arXiv preprint · primary source

Abstract

We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in e.g. law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks suggesting to target tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's capability to exhibit similar robustness as the average human does on such questions. Using GAIA's methodology, we devise 466 questions and their answers. We release our questions while retaining answers to 300 of them to power a leader-board available at https://huggingface.co/gaia-benchmark.

Summary

GAIA introduces a benchmark of 466 real-world questions requiring reasoning, multimodal handling, web browsing, and tool use, revealing a stark performance gap: humans achieve 92% accuracy while GPT-4 with plugins achieves only 15%. The paper argues that AGI development should prioritize human-level robustness on practical everyday tasks rather than superhuman performance on narrow professional domains.

Key Points

  • Humans score 92% vs GPT-4 with plugins at 15%, despite questions being conceptually simple for humans but requiring multi-step tool use and reasoning.
  • Questions are designed to be unambiguous, unlikely to appear in training data verbatim, and span three difficulty levels requiring increasing capability integration.
  • Challenges the trend of benchmarking AI on ever-harder specialized tasks (law, chemistry), arguing practical robustness is the better AGI milestone.
  • Released as a public leaderboard on HuggingFace: all 466 questions are public, with answers provided for 166 and withheld for the remaining 300 to preserve evaluation integrity (a minimal scoring sketch follows this list).
  • Highlights a critical gap in current AI assistant capabilities relevant to real-world deployment safety and reliability assessments.
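
Because each GAIA answer is a short string, a number, or a comma-separated list, the leaderboard can score submissions automatically by quasi-exact match against the held-out ground truth. The sketch below is not the official GAIA scorer, only a minimal illustration of that idea; the normalization rules and the `quasi_exact_match` helper are assumptions for this example.

```python
import re

def _norm(s: str) -> str:
    # Lowercase and drop punctuation/whitespace so trivially different
    # spellings of the same string answer still match.
    return re.sub(r"[^a-z0-9]", "", s.lower())

def _as_float(s: str):
    # Parse "90", "+4.6", or "5,876" as numbers; return None otherwise.
    try:
        return float(s.strip().replace(",", ""))
    except ValueError:
        return None

def quasi_exact_match(pred: str, truth: str) -> bool:
    """Score one answer: numeric equality when the ground truth is a
    number, element-wise comparison when it is a comma-separated list,
    normalized string equality otherwise."""
    t_num = _as_float(truth)
    if t_num is not None:  # numeric ground truth
        p_num = _as_float(pred)
        return p_num is not None and p_num == t_num
    if "," in truth:  # list ground truth (naive split; a real scorer must
        p_items = pred.split(",")  # also handle commas inside numbers)
        t_items = truth.split(",")
        return len(p_items) == len(t_items) and all(
            quasi_exact_match(p, t) for p, t in zip(p_items, t_items)
        )
    return _norm(pred) == _norm(truth)  # plain string ground truth

# Ground truths from the sample questions in this page's preview:
assert quasi_exact_match("90", "90")
assert quasi_exact_match("+4.6", "+4.6")
assert quasi_exact_match("white; 5876", "White; 5876")
```

Exact-match scoring of this kind is what keeps the 92% vs. 15% comparison cheap to compute and hard to game, which is why the benchmark insists on unambiguous answers.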

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Tool Use and Computer Use | Capability | 67.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 95 KB
Affiliations: 1. FAIR, Meta · 2. HuggingFace · 3. AutoGPT · 4. GenAI, Meta

# GAIA: A Benchmark for General AI Assistants

Grégoire Mialon
Clémentine Fourrier
Craig Swift
Thomas Wolf
Yann LeCun
Thomas Scialom
Correspondence: [{gmialon,tscialom}@meta.com](mailto:%7Bgmialon,tscialom%7D@meta.com), [clementine@huggingface.co](mailto:clementine@huggingface.co)
###### Abstract

We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in e.g. law or chemistry. GAIA’s philosophy departs from the current trend in AI benchmarks suggesting to target tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system’s capability to exhibit similar robustness as the average human does on such questions. Using GAIA’s methodology, we devise 466 questions and their answers. We release our questions while retaining answers to 300 of them to power a leader-board [hereby accessible](https://huggingface.co/gaia-benchmark).
\
Code: [https://huggingface.co/gaia-benchmark](https://huggingface.co/gaia-benchmark)
\
## 1 Introduction
\
Level 1 question: What was the actual enrollment count of the clinical trial on H. pylori in acne vulgaris patients from Jan-May 2018 as listed on the NIH website?
Ground truth: 90

Level 2 question: ![Refer to caption](https://ar5iv.labs.arxiv.org/html/2311.12983/assets/figures/ice_cream.jpg)
If this whole pint is made up of ice cream, how many percent above or below the US federal standards for butterfat content is it when using the standards as reported by Wikipedia in 2020? Answer as + or - a number rounded to one decimal place.
Ground truth: +4.6

Level 3 question: In NASA’s Astronomy Picture of the Day on 2006 January 21, two astronauts are visible, with one appearing much smaller than the other. As of August 2023, out of the astronauts in the NASA Astronaut Group that the smaller astronaut was a member of, which one spent the least time in space, and how many minutes did he spend in space, rounded to the nearest minute? Exclude any astronauts who did not spend any time in space. Give the last name of the astronaut, separated from the number of minutes by a semicolon. Use commas as thousands separators in the number of minutes.
Ground truth: White; 5876

Figure 1: Sample GAIA questions. Completing the tasks requires fundamental abilities such as reasoning, multi-modality handling, or tool use proficiency. Answers are unambiguous
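
The Level 2 example compresses a typical GAIA workflow into a final arithmetic step: read the nutrition label from the image, look up the US federal butterfat standard on Wikipedia (a 10% minimum for ice cream), and report the signed difference in the requested format. A minimal sketch of that last step, using hypothetical label values since the actual figures come from the pictured pint and are not in this preview:

```python
# Hypothetical stand-ins for values an assistant would read off the
# pictured pint's nutrition label; only the final arithmetic is real.
fat_per_serving_g = 14.6    # hypothetical grams of fat per serving
serving_size_g = 100.0      # hypothetical serving size in grams

butterfat_pct = 100.0 * fat_per_serving_g / serving_size_g
federal_minimum_pct = 10.0  # US federal standard as reported by Wikipedia

difference = butterfat_pct - federal_minimum_pct
print(f"{difference:+.1f}")  # -> +4.6, matching the ground truth
```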

... (truncated, 95 KB total)
Resource ID: 1c294c3f51d7bc1f | Stable ID: ZjMwZmIwMD