Longterm Wiki

RE-Bench: Evaluating frontier AI R&D capabilities

web

Credibility Rating

4/5 (High)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: METR

Published by METR (Model Evaluation and Threat Research) in November 2024, RE-Bench is a key tool for tracking when AI systems might be able to substantially accelerate their own development—a critical threshold for AI safety planning.

Metadata

Importance: 72/100 | blog post | primary source

Summary

METR introduces RE-Bench, a benchmark designed to evaluate the ability of frontier AI models to autonomously conduct machine learning research and development tasks. The benchmark tests models on realistic AI R&D tasks and finds that current top models achieve meaningful but limited performance compared to human experts, with the comparison depending strongly on the time budget: agents do well on short attempts but fall behind humans as the allotted time grows. This work is relevant to assessing how close AI systems are to being able to accelerate their own development.

Key Points

  • RE-Bench contains 7 challenging AI R&D environments requiring skills like experiment design, debugging, and optimization over extended time horizons.
  • Top frontier models (e.g., o1) achieve ~34% of expert human performance on these tasks, showing meaningful but far-from-human-level R&D capability (see the normalization sketch after this list for how such comparisons can be put on a common scale).
  • Agents' relative performance declines at longer time budgets: they can match or beat humans on short (2-hour) attempts, but humans make far better use of additional time, suggesting current agents struggle with multi-step autonomous research workflows.
  • The benchmark is designed to track progress toward AI systems capable of meaningfully accelerating AI research itself—a key threshold for safety.
  • Results suggest we have some runway before AI can fully automate AI R&D, but the trajectory warrants close monitoring.
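
The "% of expert human performance" comparison above presumes a way to put scores from very different environments (kernel runtimes, training losses, accuracies) on a common scale. Below is a minimal sketch of one such normalization, assuming each environment defines a starting solution and a reference solution; the function name and example numbers are illustrative, not METR's exact scheme.

```python
def normalize(raw_score: float, starting_score: float, reference_score: float) -> float:
    """Map a raw environment score onto a 0-1 scale where the provided
    starting solution scores 0 and the reference solution scores 1."""
    return (raw_score - starting_score) / (reference_score - starting_score)

# Hypothetical example: an agent's GPU kernel runs in 120 ms, the starting
# kernel in 200 ms, and the reference kernel in 80 ms. Lower runtime is
# better, so runtimes are negated to get a "higher is better" raw score.
agent_fraction = normalize(-120.0, starting_score=-200.0, reference_score=-80.0)
print(f"Agent achieves {agent_fraction:.0%} of the reference improvement")  # ~67%
```

With this convention, 0 means no improvement over the starting solution and 1 means matching the reference solution, so fractions are comparable across otherwise incommensurable tasks.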

Cited by 2 pages

Cached Content Preview

HTTP 200 | Fetched Mar 15, 2026 | 30 KB
Evaluating frontier AI R&D capabilities of language model agents against human experts - METR 
 Evaluating frontier AI R&D capabilities of language model agents against human experts 
DATE: November 22, 2024
BibTeX Citation:

@misc{evaluating-frontier-ai-r-d-capabilities-of-language-model-agents-against-human-experts,
  title        = {Evaluating frontier AI R&D capabilities of language model agents against human experts},
  author       = {METR},
  howpublished = {\url{https://metr.org/blog/2024-11-22-evaluating-r-d-capabilities-of-llms/}},
  year         = {2024},
  month        = {11},
}
 We’re releasing RE-Bench, a new benchmark for measuring the performance of humans and frontier model agents on ML research engineering tasks. We also share data from 71 human expert attempts and results for Anthropic’s Claude 3.5 Sonnet and OpenAI’s o1-preview, including full transcripts of all runs.

Full paper | GitHub repo

Each of the 7 environments in the benchmark is centered around a research engineering task, such as fitting a scaling law or optimizing a GPU kernel. The environments were selected in consultation with ML researchers in academia and top industry labs for realism and coverage. In each environment, the agent, which can be a model or a human, is given access to a computer (often with several GPUs), a scoring function (e.g., maximizing accuracy on a dataset or making a training loop run faster), and any other necessary resources. The agent is then instructed to score as high as possible within a fixed time limit.
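
As a rough illustration of this setup, the sketch below shows an agent loop that repeatedly proposes solutions, gets them scored, and keeps its best result until a wall-clock budget runs out. The interface and names are hypothetical; METR's actual harness and per-environment scoring functions differ.

```python
import random
import time
from typing import Callable

def run_attempt(agent_step: Callable[[float], object],
                score: Callable[[object], float],
                time_budget_s: float) -> float:
    """Let an agent iterate on candidate solutions until the time budget is
    spent, keeping the best score achieved (higher is better)."""
    deadline = time.monotonic() + time_budget_s
    best = float("-inf")
    while time.monotonic() < deadline:
        remaining = deadline - time.monotonic()
        candidate = agent_step(remaining)   # agent proposes a solution
        best = max(best, score(candidate))  # environment scores it
    return best

# Toy usage: the "agent" just guesses numbers; the scoring function rewards
# guesses close to a hidden target. Real environments involve code, GPUs,
# and training runs instead.
best = run_attempt(agent_step=lambda _remaining: random.uniform(0, 10),
                   score=lambda x: -abs(x - 7.3),
                   time_budget_s=0.01)
print(f"Best score within the budget: {best:.3f}")
```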

 In general, we find that AI agents perform better than humans at AI research engineering tasks when both are given 2 hours, but they perform worse at higher time budgets (see the section under Results for more detail on what we mean by “time budget”). Compared to our tasks, real-world ML research often involves much larger projects using more compute over longer periods of time, so we think that the high-budget results are the most important. 

 The median AI agent attempt (when we don’t do “best-of-k”) makes very little progress in most environments, and we often observe the agents failing to react appropriately to novel information or struggling to build on their progress over time. 
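
For readers unfamiliar with the term, "best-of-k" means taking the best score across k independent attempts rather than a typical (median) attempt, and the two can look very different when the median run stalls. A small synthetic illustration (made-up score distribution, not the paper's data):

```python
import random
from statistics import median

random.seed(0)

def sample_attempt_score() -> float:
    # Synthetic stand-in: most attempts stall near zero, a few make real progress.
    return max(0.0, random.gauss(0.05, 0.3))

attempts = [sample_attempt_score() for _ in range(1_000)]

k = 8
best_of_k_scores = [max(random.sample(attempts, k)) for _ in range(200)]

print(f"median single attempt: {median(attempts):.2f}")
print(f"average best-of-{k}:    {sum(best_of_k_scores) / len(best_of_k_scores):.2f}")
```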

 On the other hand, the AI agents appear to have significant expertise in the relevant ML topics. They generate and te

... (truncated, 30 KB total)
Resource ID: 056e0ff33675b825 | Stable ID: OGExZTMwN2