RE-Bench: Evaluating frontier AI R&D capabilities
Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: METR
Published by METR (Model Evaluation and Threat Research) in November 2024, RE-Bench is a key tool for tracking when AI systems might be able to substantially accelerate their own development—a critical threshold for AI safety planning.
Metadata
Summary
METR introduces RE-Bench, a benchmark designed to evaluate the ability of frontier AI models to autonomously conduct machine learning research and development tasks. The benchmark tests models on realistic AI R&D tasks and finds that current top models achieve meaningful but limited performance compared to human experts, with performance scaling with time horizon. This work is relevant to assessing how close AI systems are to being able to accelerate their own development.
Key Points
- RE-Bench contains 7 challenging AI R&D environments requiring skills like experiment design, debugging, and optimization over extended time horizons.
- Top frontier models (e.g., o1) achieve ~34% of expert human performance on these tasks, showing meaningful but far-from-human-level R&D capability.
- Model performance degrades significantly on longer time horizons, suggesting current agents struggle with multi-step autonomous research workflows.
- The benchmark is designed to track progress toward AI systems capable of meaningfully accelerating AI research itself, a key threshold for safety.
- Results suggest we have some runway before AI can fully automate AI R&D, but the trajectory warrants close monitoring.
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Self-Improvement and Recursive Enhancement | Capability | 69.0 |
| Capability Elicitation | Approach | 91.0 |
Cached Content Preview
Evaluating frontier AI R&D capabilities of language model agents against human experts - METR
Evaluating frontier AI R&D capabilities of language model agents against human experts
Date: November 22, 2024
BibTeX Citation
@misc{evaluating-frontier-ai-r-d-capabilities-of-language-model-agents-against-human-experts,
  title        = {Evaluating frontier AI R&D capabilities of language model agents against human experts},
  author       = {METR},
  howpublished = {\url{https://metr.org/blog/2024-11-22-evaluating-r-d-capabilities-of-llms/}},
  year         = {2024},
  month        = {11},
}
We’re releasing RE-Bench, a new benchmark for measuring the performance of humans and frontier model agents on ML research engineering tasks. We also share data from 71 human expert attempts and results for Anthropic’s Claude 3.5 Sonnet and OpenAI’s o1-preview, including full transcripts of all runs.
Full paper | GitHub repo
Each of the 7 environments in the benchmark is centered around a research engineering task, such as fitting a scaling law or optimizing a GPU kernel. The environments were selected in consultation with ML researchers in academia and top industry labs for realism and coverage. In each environment, the agent, which can be a model or a human, is given access to a computer (often with several GPUs), a scoring function (e.g., maximizing accuracy on a dataset or making a training loop run faster), and any other necessary resources. The agent is then instructed to score as high as possible within a fixed time limit.
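To make that setup concrete, here is a minimal sketch of the environment contract in Python. The names (`Environment`, `run_attempt`, `agent_step`) are illustrative assumptions, not the actual RE-Bench API from the GitHub repo, and the scoring rule used here (keeping the best score seen during the run) is also an assumption that may differ per environment.

```python
# Minimal sketch of the environment contract described above. All names are
# illustrative; the real task definitions and scoring harness live in the
# RE-Bench GitHub repo.
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Environment:
    instructions: str              # task prompt given to the agent
    score: Callable[[str], float]  # e.g. held-out accuracy, or speedup of a training loop
    time_limit_seconds: float      # fixed wall-clock budget for the attempt


def run_attempt(env: Environment, agent_step: Callable[[str, float], str]) -> float:
    """Run one attempt: the agent repeatedly revises a solution until the budget
    is spent. Here we keep the best score seen during the run; the actual
    per-environment scoring rule may differ."""
    deadline = time.time() + env.time_limit_seconds
    best = float("-inf")
    while time.time() < deadline:
        remaining = deadline - time.time()
        solution = agent_step(env.instructions, remaining)  # model or human proposes a candidate
        best = max(best, env.score(solution))               # agent can check its score and iterate
    return best
```

The key design point this sketch is meant to capture is that the scoring function is part of the environment, so both human and model agents can evaluate intermediate work and decide how to spend their remaining budget.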
In general, we find that AI agents perform better than humans at AI research engineering tasks when both are given 2 hours, but they perform worse at higher time budgets (see the section under Results for more detail on what we mean by “time budget”). Compared to our tasks, real-world ML research often involves much larger projects using more compute over longer periods of time, so we think that the high-budget results are the most important.
The median AI agent attempt (when we don’t do “best-of-k”) makes very little progress in most environments, and we often observe the agents failing to react appropriately to novel information or struggling to build on their progress over time.
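For context on the "best-of-k" aggregation mentioned above, here is a hedged sketch of one common way to estimate it: average the maximum score over random draws of k attempts. The exact estimator METR uses is described in the paper; sampling with replacement below is an assumption made for simplicity, and the attempt scores are made up for illustration.

```python
# Hedged sketch of a best-of-k estimate over repeated attempt scores.
# Sampling with replacement is a simplifying assumption; the paper's exact
# estimator may differ.
import random
from statistics import mean


def best_of_k(scores: list[float], k: int, n_samples: int = 10_000, seed: int = 0) -> float:
    """Estimate the expected best score when drawing k attempts at random."""
    rng = random.Random(seed)
    return mean(max(rng.choices(scores, k=k)) for _ in range(n_samples))


# Illustration with made-up scores: single attempts often stall near zero,
# but the best of several attempts can be much higher.
attempt_scores = [0.0, 0.0, 0.05, 0.1, 0.6]
print(best_of_k(attempt_scores, k=1))  # roughly the mean of a single attempt
print(best_of_k(attempt_scores, k=5))  # closer to the best attempt
```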
On the other hand, the AI agents appear to have significant expertise in the relevant ML topics. They generate and te
... (truncated, 30 KB total)