Longterm Wiki

METR (Model Evaluation & Threat Research)

web
evals.alignment.org/

METR (formerly ARC Evals) is a leading independent organization conducting pre-deployment capability evaluations for frontier AI labs; their work directly informs safety policies at OpenAI, Anthropic, and others.

Metadata

Importance: 82/100 · tool page · homepage

Summary

METR (formerly ARC Evals) conducts research and evaluations to assess the capabilities and risks of frontier AI systems, focusing on autonomous capabilities, AI R&D acceleration potential, and evaluation integrity. They are notable for developing the 'time horizon' metric, which measures the length of tasks AI agents can complete, and for conducting pre-deployment evaluations for major AI labs.

Key Points

  • Develops and applies the 'task-completion time horizon' metric, which shows exponential growth in AI agent capabilities over the past six years (see the sketch after this list)
  • Conducts pre-deployment safety evaluations for frontier AI models (e.g., GPT-5.1) assessing catastrophic risk vectors like self-improvement and rogue replication
  • Researches evaluation integrity threats including sandbagging and generalized reward hacking, publishing datasets like MALT
  • Analyzes frontier AI safety policies across major labs and publishes policy guidance on risk transparency and capability thresholds
  • Produces resources on measuring autonomous AI capabilities and monitorability evaluations for AI agents
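
To make the metric concrete, here is a minimal sketch of how a 50% time horizon can be estimated: fit a logistic curve of agent success against log task length (task length measured as how long the task takes a human), then read off where predicted success crosses 50%. The toy data and exact fitting choices below are illustrative assumptions, not METR's actual task suite or pipeline.

```python
# Minimal sketch: estimate a 50% "time horizon" from agent task attempts.
# Task durations and outcomes below are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# (human completion time in minutes, did the agent succeed?)
attempts = [
    (2, 1), (4, 1), (8, 1), (15, 1), (30, 1),
    (30, 0), (60, 1), (60, 0), (120, 0), (240, 0),
]
X = np.log2([[minutes] for minutes, _ in attempts])  # success ~ log task length
y = np.array([succeeded for _, succeeded in attempts])

model = LogisticRegression().fit(X, y)

# P(success) = 0.5 where the logit w * log2(t) + b crosses zero, i.e. t = 2^(-b/w).
w, b = model.coef_[0][0], model.intercept_[0]
print(f"50% time horizon ≈ {2 ** (-b / w):.0f} minutes")
```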

Cited by 3 pages

Page | Type | Quality
AI Proliferation Risk Model | Analysis | 65.0
Scheming Likelihood Assessment | Analysis | 61.0
AI Evaluation | Approach | 72.0

Cached Content Preview

HTTP 200 · Fetched Mar 15, 2026 · 8 KB
Model Evaluation & Threat Research

METR conducts research and evaluations to improve public understanding of the capabilities and risks of frontier AI systems.
 
We’ve worked with [partner logos]

[Interactive chart: task-completion time horizons of frontier AI models, with log/linear scale and 50%/80% success-rate toggles]

Time Horizon 1.1 (Current): follows the same methodology described in the initial paper, but with a larger task suite. See the release announcement.

Time Horizon 1.0 (Mar 2025): the original time horizon computations, calculated for models from 2019 through Nov 2025, following the methods described in the original time horizon paper.

Task-Completion Time Horizons of Frontier AI Models

We propose measuring AI performance in terms of the length of software tasks AI agents can complete. We show an exponential increase in this time horizon metric over the past six years.

Read paper · View repo
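
One way to make the exponential claim concrete: if the horizon grows exponentially, then log2(horizon) is linear in calendar time, and the slope of a least-squares fit gives the number of doublings per year. The data points in this sketch are invented placeholders, not METR's published measurements.

```python
# Minimal sketch: estimate a doubling time from (date, horizon) measurements.
# The points below are fabricated; see METR's paper for the real data.
import numpy as np

years    = np.array([2019.5, 2021.0, 2022.5, 2024.0, 2025.0])
horizons = np.array([0.1, 0.8, 4.0, 25.0, 90.0])  # 50% horizon, in minutes

# Exponential growth <=> log2(horizon) is linear in time; slope = doublings/year.
slope, _ = np.polyfit(years, np.log2(horizons), 1)
print(f"≈ {12 / slope:.1f} months per doubling")
```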

Featured research

Our AI evaluations research focuses on assessing broad autonomous capabilities and the ability of AI systems to accelerate AI R&D. We also study potential AI behavior that threatens the integrity of evaluations, and mitigations for such behavior.

View all research

GPT-5.1 Evaluation Results

We evaluate whether GPT-5.1 poses significant catastrophic risks via AI self-improvement, rogue replication, or sabotage of AI labs.

Read more

Measuring AI Ability to Complete Long Tasks

We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has been increasing at a consistent exponential rate.

Read more

Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity

We find that when developers use AI tools, they take 19% longer than without: AI makes them slower.

Read more
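
As a worked illustration of how a slowdown figure like "19% longer" can be estimated in a randomized design, the sketch below compares geometric-mean completion times between tasks where AI was allowed and tasks where it was not. The numbers are fabricated, and the analysis is deliberately simpler than the study's actual methodology.

```python
# Toy slowdown estimate: ratio of geometric-mean task times, AI vs. no AI.
# All numbers are fabricated for illustration.
import numpy as np

ai_allowed_hours    = np.array([2.5, 4.1, 1.9, 6.1, 3.3])
ai_disallowed_hours = np.array([2.2, 3.2, 1.7, 4.8, 2.9])

# Ratio of geometric means; > 1 means the AI-allowed tasks took longer.
ratio = np.exp(np.log(ai_allowed_hours).mean() - np.log(ai_disallowed_hours).mean())
print(f"slowdown: {100 * (ratio - 1):.0f}% longer with AI")
```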

MALT

A dataset of natural and prompted examples of behaviors that threaten evaluation integrity, such as generalized reward hacking and sandbagging.

Read more
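
If MALT follows the common JSONL convention for transcript datasets, filtering it by behavior label might look like the sketch below. The file name and every field name here are assumptions for illustration only; consult the actual MALT release for its real schema.

```python
# Hypothetical MALT-style loader; schema and file name are assumed, not real.
import json

def iter_examples(path, behavior, source=None):
    """Yield examples whose (assumed) labels match, e.g. natural sandbagging."""
    with open(path) as f:
        for line in f:
            example = json.loads(line)
            if example["behavior"] == behavior and source in (None, example["source"]):
                yield example

for example in iter_examples("malt.jsonl", behavior="sandbagging", source="natural"):
    print(example["transcript"][:200])
```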

Measuring autonomous AI capabilities — resource collection

An index of our research and guidance on how to measure AI systems' ability to autonomously complete a wide range of multi-hour tasks.

Read more

Early
... (truncated, 8 KB total)
Resource ID: 1648010fd1ff0370 | Stable ID: ZTIzNGJjZT