METR Time Horizons - Epoch AI

web

Epoch AI·epoch.ai/benchmarks/metr-time-horizons

Credibility Rating

4/5

High(4)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Epoch AI

Metadata

Cited by 1 page

Page	Type	Quality
Eval Saturation & The Evals Gap	Approach	65.0

Cached Content Preview

HTTP 200 · Fetched May 30, 2026 · 2 KB

METR Time Horizons | Epoch AI

METR Time Horizons


The tasks considered include

RE-Bench, a set of machine learning research engineering tasks,

HCAST, a more general set of challenging software engineering tasks, including ML engineering, and

SWAA, a set of smaller tasks that involve operating computer software.

Each model’s reported duration is based on the longest task that the model can complete. For this purpose, a task’s duration is the time it takes a human to complete it.

Methodology

We source time horizons directly from METR’s own analysis. The live leaderboard is available at metr.org/time-horizons.

The methodology to estimate a model’s time horizon is as follows:

Collect performance data: For each of HCAST, RE-Bench, and SWAA, for each task, evaluate the model’s performance. Each model was run around 8 times on each task. Most models are evaluated with METR’s modular-public agent scaffold, although the scaffold was slightly modified for o1 and o1-preview. Then, run human baselining experiments to obtain human completion times for these tasks.

Estimate time horizon: For each AI model, fit a logistic regression curve that predicts the probability of task success based on the logarithm of the human completion time for that task. The model’s 50% time horizon is the human task completion time at which the fitted logistic curve for a given model intersects the 50% success probability threshold. This metric represents the estimated time (in minutes or hours) that a human expert would typically take to complete tasks which the AI model can complete with a 50% success rate.
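The curve-fitting step above can be sketched in a few lines of Python. This is a minimal illustration, not METR's actual code: the task times and outcomes are made-up data, the function names (`fit_logistic`, `time_horizon`) are hypothetical, the logarithm base (2) is an assumption, and the fit uses a plain Newton's method with no weighting or regularization, which may differ from what METR does. Each row stands for one (task, run) pair: the human completion time in minutes and whether the model succeeded.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(xs, ys, iters=100, tol=1e-12):
    """Fit P(success) = sigmoid(a + b * x) by Newton's method (maximum likelihood)."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        g_a = g_b = 0.0            # gradient of the log-likelihood
        h_aa = h_ab = h_bb = 0.0   # entries of the negative Hessian
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g_a += y - p
            g_b += (y - p) * x
            h_aa += w
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 system H * delta = g.
        da = (h_bb * g_a - h_ab * g_b) / det
        db = (h_aa * g_b - h_ab * g_a) / det
        a += da
        b += db
        if abs(da) + abs(db) < tol:
            break
    return a, b

def time_horizon(a, b, p=0.5):
    """Human completion time (minutes) at which the fitted curve crosses probability p."""
    logit = math.log(p / (1.0 - p))
    return 2.0 ** ((logit - a) / b)

# Hypothetical data: human completion times in minutes, and model success (1) or failure (0).
human_minutes = [1, 2, 4, 8, 16, 32, 64, 128]
successes = [1, 1, 1, 0, 1, 0, 0, 0]

log_times = [math.log2(t) for t in human_minutes]
a, b = fit_logistic(log_times, successes)
horizon_minutes = time_horizon(a, b)
print(f"slope = {b:.3f}, 50% time horizon = {horizon_minutes:.1f} minutes")
```

The slope `b` comes out negative because success becomes less likely as tasks get longer. Setting `a + b * log2(t) = 0` and solving for `t` gives the 50% horizon; passing a different `p` to `time_horizon` reads off other thresholds from the same fitted curve.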

For full details on the task suites, human baselining, agent setups, and curve fitting, please refer to METR’s papers Measuring AI Ability to Complete Long Tasks, RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts, and HCAST: Human-Calibrated Autonomy Software Tasks.


Durations of the longest task that models can complete correctly more often than not, across a set of software engineering and related tasks.

Resource ID: 5205868f6f7f3d48 | Stable ID: sid_V0nB4P0CTQ