Longterm Wiki

[2411.15114] RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts

paper

Authors

Hjalmar Wijk·Tao Lin·Joel Becker·Sami Jawhar·Neev Parikh·Thomas Broadley·Lawrence Chan·Michael Chen·Josh Clymer·Jai Dhyani·Elena Ericheva·Katharyn Garcia·Brian Goodrich·Nikola Jurkovic·Holden Karnofsky·Megan Kinniment·Aron Lajko·Seraphina Nix·Lucas Sato·William Saunders·Maksym Taran·Ben West·Elizabeth Barnes

Credibility Rating

3/5
Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Introduces RE-Bench, a benchmark for evaluating AI agents' R&D capabilities against human experts, directly addressing concerns about AI-driven automation of AI research highlighted in frontier AI safety policy discussions.

Paper Details

Citations
0
3 influential
Year
2025
Methodology
peer-reviewed
Categories
PLOS One

Metadata

arxiv preprint · primary source

Abstract

Frontier AI safety policies highlight automation of AI research and development (R&D) by AI agents as an important capability to anticipate. However, there exist few evaluations for AI R&D capabilities, and none that are highly realistic and have a direct comparison to human performance. We introduce RE-Bench (Research Engineering Benchmark, v1), which consists of 7 challenging, open-ended ML research engineering environments and data from 71 8-hour attempts by 61 distinct human experts. We confirm that our experts make progress in the environments given 8 hours, with 82% of expert attempts achieving a non-zero score and 24% matching or exceeding our strong reference solutions. We compare humans to several public frontier models through best-of-k with varying time budgets and agent designs, and find that the best AI agents achieve a score 4x higher than human experts when both are given a total time budget of 2 hours per environment. However, humans currently display better returns to increasing time budgets, narrowly exceeding the top AI agent scores given an 8-hour budget, and achieving 2x the score of the top AI agent when both are given 32 total hours (across different attempts). Qualitatively, we find that modern AI agents possess significant expertise in many ML topics -- e.g. an agent wrote a faster custom Triton kernel than any of our human experts' -- and can generate and test solutions over ten times faster than humans, at much lower cost. We open-source the evaluation environments, human expert data, analysis code and agent trajectories to facilitate future research.

Summary

RE-Bench is a new benchmark for evaluating AI agents' capabilities in research engineering tasks, consisting of 7 open-ended ML research environments with baseline data from 71 human expert attempts. The study finds that frontier AI models achieve 4x higher scores than human experts on a 2-hour time budget, but humans maintain an advantage with longer time budgets, exceeding top AI agents by 2x when given 32 hours. While AI agents demonstrate strong technical expertise and can generate solutions 10x faster than humans at lower cost, the benchmark reveals important differences in how humans and AI systems scale with additional time and resources.
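
The best-of-k comparison described above can be made concrete with a small sketch. The following Python snippet is illustrative only, not the paper's released analysis code: the function name, score samples, and attempt durations are assumptions. It estimates the expected best score when a fixed time budget is split across k independent attempts, which is the aggregation the paper uses to compare agents and humans at matched budgets.

```python
import numpy as np

def best_of_k_score(attempt_scores, attempt_hours, total_budget_hours, seed=None):
    """Estimate the expected best-of-k score when a total time budget is
    divided into independent attempts of `attempt_hours` each.

    Illustrative sketch: samples k attempts (with replacement) from the
    observed per-attempt scores and averages the maximum over many trials.
    """
    rng = np.random.default_rng(seed)
    k = max(1, int(total_budget_hours // attempt_hours))
    trials = [
        rng.choice(attempt_scores, size=k, replace=True).max()
        for _ in range(10_000)
    ]
    return float(np.mean(trials))

# Example (made-up numbers): an agent budget of 2 hours split into 30-minute
# attempts versus a single 2-hour human attempt.
agent_scores = [0.0, 0.1, 0.4, 0.7]   # hypothetical 30-minute agent run scores
human_scores = [0.0, 0.05, 0.1, 0.2]  # hypothetical 2-hour human attempt scores
print(best_of_k_score(agent_scores, attempt_hours=0.5, total_budget_hours=2, seed=0))
print(best_of_k_score(human_scores, attempt_hours=2.0, total_budget_hours=2, seed=0))
```

Because the agent gets more independent draws per budget, its best-of-k score can exceed the human's even with a lower per-attempt mean; the paper's finding that humans show better returns to longer budgets is the reverse effect at 8 and 32 hours.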

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| METR | Organization | 66.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 98 KB

# RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts

Hjalmar Wijk
Tao Lin
Joel Becker
Sami Jawhar
Neev Parikh
Thomas Broadley
Lawrence Chan
Michael Chen
Josh Clymer
Jai Dhyani
Elena Ericheva
Katharyn Garcia
Brian Goodrich
Nikola Jurkovic
Megan Kinniment
Aron Lajko
Seraphina Nix
Lucas Sato
William Saunders
Maksym Taran
Ben West
Elizabeth Barnes

Qally’s. Work done in collaboration with METR. Ordered alphabetically. Redwood Research. Work done while at METR. Independent. Work done in collaboration with METR. Harvard University. Work done while at METR. Model Evaluation and Threat Research (METR)

(November 20th, 2024)

###### Abstract

Frontier AI safety policies highlight automation of AI research and development (R&D) by AI agents as an important capability to anticipate.
However, there exist few evaluations for AI R&D capabilities, and none that are highly realistic and have a direct comparison to human performance.
We introduce RE-Bench (Research Engineering Benchmark, v1), which consists of 7 challenging, open-ended ML research engineering environments and data from 71 8-hour attempts by 61 distinct human experts.
We confirm that our experts make progress in the environments given 8 hours, with 82% of expert attempts achieving a non-zero score and 24% matching or exceeding our strong reference solutions.
We compare humans to several public frontier models through best-of-k with varying time budgets and agent designs, and find that the best AI agents achieve a score 4× higher than human experts when both are given a total time budget of 2 hours per environment.
However, humans currently display better returns to increasing time budgets, narrowly exceeding the top AI agent scores given an 8-hour budget, and achieving 2× the score of the top AI agent when both are given 32 total hours (across different attempts).
Qualitatively, we find that modern AI agents possess significant expertise in many ML topics—e.g. an agent wrote a faster custom Triton kernel than any of our human experts’—and can generate and test solutions over ten times faster than humans, at much lower cost.
We open-source the evaluation environments, human expert data, analysis code and agent trajectories to facilitate future research.¹

¹ Environments can be found at [github.com/METR/ai-rd-tasks](https://github.com/METR/ai-rd-tasks/tree/main) and agent trajectories can be found at [transcripts.metr.org](https://transcripts.metr.org/). Analysis code and anonymized human expert data coming soon.

## 1 Introduction

Large language models are increasingly capable of complex programming tasks and are already being used to acce

... (truncated, 98 KB total)