Longterm Wiki

Evaluation Methodology

Source type: web

Credibility Rating

High (4/5)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: METR

METR is a key organization in AI safety evaluation; its methodologies are used by Anthropic, OpenAI, and others as part of their responsible scaling policies, making this a key reference point for anyone studying AI evaluation frameworks.

Metadata

Importance: 72/100 · Tags: tool page, homepage

Summary

METR (Model Evaluation & Threat Research) develops rigorous methodologies for evaluating AI systems, focusing on assessing autonomous capabilities and potential risks from advanced AI models. Their work establishes frameworks for measuring dangerous capabilities including deception, autonomous replication, and other safety-relevant behaviors. METR's evaluations inform deployment decisions and safety thresholds for frontier AI labs.

Key Points

  • Develops standardized evaluation frameworks for assessing dangerous or safety-relevant capabilities in frontier AI models
  • Focuses on autonomous task completion, deceptive alignment indicators, and potential for self-replication or resource acquisition
  • Conducts third-party evaluations used by major AI labs to inform deployment and safety decisions
  • Research informs policy discussions around responsible scaling policies and capability thresholds
  • Methodology bridges technical AI safety research and practical governance requirements for frontier models

Cited by 3 pages

| Page | Type | Quality |
|------|------|---------|
| Mesa-Optimization Risk Analysis | Analysis | 61.0 |
| METR | Organization | 66.0 |
| Third-Party Model Auditing | Approach | 64.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 15 KB
[![METR Logo](https://metr.org/assets/images/logo/logo.svg)](https://metr.org/)

- [Research](https://metr.org/research)
- [Notes](https://metr.org/notes)
- [Updates](https://metr.org/blog)
- [About](https://metr.org/about)
- [Donate](https://metr.org/donate)
- [Careers](https://metr.org/careers)


**[We are Changing our Developer Productivity Experiment Design](https://metr.org/blog/2026-02-24-uplift-update/)** (24 February 2026)

Our second developer productivity study faces selection effects from wider AI adoption, prompting us to redesign our approach.

**[Time Horizon 1.1](https://metr.org/blog/2026-1-29-time-horizon-1-1/)** (29 January 2026)

We’re releasing a new version of our time horizon estimates (TH1.1), using more tasks and a new eval infrastructure.

**[Early work on monitorability evaluations](https://metr.org/blog/2026-01-19-early-work-on-monitorability-evaluations/)** (22 January 2026)

We show preliminary results on a prototype evaluation that tests monitors' ability to catch AI agents doing side tasks, and AI agents' ability to bypass this monitoring.

**[GPT-5.1-Codex-Max Evaluation Results](https://evaluations.metr.org/gpt-5-1-codex-max-report/)** (19 November 2025)

We evaluate whether GPT-5.1-Codex-Max poses significant catastrophic risks via AI self-improvement or rogue replication. We conclude that this seems unlikely.

**[MALT: A Dataset of Natural and Prompted Behaviors That Threaten Eval Integrity](https://metr.org/blog/2025-10-14-malt-dataset-of-natural-and-prompted-behaviors/)** (14 October 2025)

MALT (Manually-reviewed Agentic Labeled Transcripts) is a dataset of natural and prompted examples of behaviors that threaten evaluation integrity (like generalized reward hacking or sandbagging).

**Forecasting the Impacts of AI R&D Acceleration: Results of a Pilot Study** (20 August 2025)

AI agents are improving rapidly at autonomous software development and machine learning tasks, and, if recent trends hold, may match human researchers at challenging months-long research projects in

... (truncated, 15 KB total)
Resource ID: a4652ab64ea54b52 | Stable ID: MGVjNmZjNG