METR (Model Evaluation & Threat Research)
evals.alignment.org/
METR (formerly ARC Evals) is a leading independent organization conducting pre-deployment capability evaluations for frontier AI labs; their work directly informs safety policies at OpenAI, Anthropic, and others.
Metadata
Importance: 82/100 · tool page · homepage
Summary
METR (formerly ARC Evals) conducts research and evaluations to assess the capabilities and risks of frontier AI systems, focusing on autonomous capabilities, AI R&D acceleration potential, and evaluation integrity. They are notable for developing the 'time horizon' metric measuring how long AI agents can complete tasks, and for conducting pre-deployment evaluations for major AI labs.
Key Points
- Develops and applies the 'task-completion time horizon' metric, showing exponential growth in AI agent capabilities over 6 years
- Conducts pre-deployment safety evaluations for frontier AI models (e.g., GPT-5.1) assessing catastrophic risk vectors like self-improvement and rogue replication
- Researches evaluation integrity threats including sandbagging and generalized reward hacking, publishing datasets like MALT
- Analyzes frontier AI safety policies across major labs and publishes policy guidance on risk transparency and capability thresholds
- Produces resources on measuring autonomous AI capabilities and monitorability evaluations for AI agents
Cited by 3 pages
| Page | Type | Quality |
|---|---|---|
| AI Proliferation Risk Model | Analysis | 65.0 |
| Scheming Likelihood Assessment | Analysis | 61.0 |
| AI Evaluation | Approach | 72.0 |
Cached Content Preview
HTTP 200 · Fetched Mar 15, 2026 · 8 KB
METR
Model Evaluation & Threat Research
METR conducts research and evaluations to improve public understanding of the capabilities and risks of frontier AI systems.
Time Horizon 1.1 (Current): follows the same methodology described in the initial paper, but with a larger task suite. See the release announcement.
Time Horizon 1.0 (Mar 2025): original time horizon computations, calculated for models from 2019 through Nov 2025 following the methods described in the original time horizon paper.
[Interactive chart: task-completion time horizon over time; log/linear scale; 50% and 80% success thresholds]
Task-Completion Time Horizons of Frontier AI Models
We propose measuring AI performance in terms of the length of software tasks AI agents can complete. We show an exponential increase in this time horizon metric over the past 6 years.
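The core of this metric is a log-linear trend: if the task-length horizon grows exponentially, a straight-line fit to log2(horizon) against release date yields a doubling time. The sketch below illustrates that calculation with made-up (release date, time horizon) points; neither the data nor the code is METR's.

```python
import math
from datetime import date

# Hypothetical (release date, 50%-success time horizon in minutes) points,
# for illustration only -- NOT METR's actual measurements.
observations = [
    (date(2019, 6, 1), 0.05),
    (date(2021, 6, 1), 0.5),
    (date(2023, 6, 1), 5.0),
    (date(2025, 6, 1), 60.0),
]

def doubling_time_months(points):
    """Fit log2(horizon) = a + b*t by ordinary least squares (t in years)
    and return the implied doubling time, 12/b, in months."""
    t0 = points[0][0]
    xs = [(d - t0).days / 365.25 for d, _ in points]
    ys = [math.log2(h) for _, h in points]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return 12.0 / b

print(f"doubling time ~ {doubling_time_months(observations):.1f} months")
```

With these illustrative points the fit implies a doubling time of roughly seven months; the shape of the calculation, not the number, is the point.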
Featured research
Our AI evaluations research focuses on assessing broad autonomous capabilities and the ability of AI systems to accelerate AI R&D. We also study potential AI behavior that threatens the integrity of evaluations and mitigations for such behavior.
GPT-5.1 Evaluation Results
We evaluate whether GPT-5.1 poses significant catastrophic risks via AI self-improvement, rogue replication, or sabotage of AI labs.
Measuring AI Ability to Complete Long Tasks
We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has been consistently exponentially increasing.
Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity
We find that when developers use AI tools, they take 19% longer than without—AI makes them slower.
MALT
A dataset of natural and prompted examples of behaviors that threaten evaluation integrity, such as generalized reward hacking or sandbagging.
Measuring autonomous AI capabilities — resource collection
An index of our research and guidance on how to measure AI systems' ability to autonomously complete a wide range of multi-hour tasks
Early
... (truncated, 8 KB total)
Resource ID: 1648010fd1ff0370 | Stable ID: ZTIzNGJjZT