RE-Bench
Agentic Research Engineering Benchmark from METR: evaluates AI agents on 7 challenging ML research engineering tasks requiring multi-step problem solving over extended time horizons.
Models Tested: 0
Scoring: percentage
Introduced: 2024-11
Maintainer: METR
No model scores recorded for this benchmark yet.