METR

Safety Org

METR

Part of AI Safety Organizations (Overview)

METR conducts pre-deployment dangerous capability evaluations for frontier AI labs (OpenAI, Anthropic, Google DeepMind), testing autonomous replication, cybersecurity, CBRN, and manipulation capabilities using a 77-task suite. Their research shows task completion time horizons doubling every 7 months (accelerating to 4 months in 2024-2025), with GPT-5 achieving 2h17m 50%-time horizon; no models yet capable of autonomous replication but gap narrowing rapidly.

EA Forum

TypeSafety Org

Founded2023

LocationBerkeley, CA

Websitemetr.org

People

Organizations

Risks

4.4k words · 71 backlinks

Quick Assessment

Dimension	Assessment	Evidence
Mission Criticality	Very High	Only major independent organization conducting pre-deployment dangerous capability evaluations for frontier AI labs
Research Output	High	77-task autonomy evaluation suite; time horizons paper showing 7-month doubling; RE-Bench; MALT dataset (10,919 transcripts)
Industry Integration	Strong	Pre-deployment evaluations for OpenAI (GPT-4, GPT-4.5, GPT-5, o3), Anthropic (Claude 3.5, Claude 4), Google DeepMind
Government Partnerships	Growing	UK AI Safety Institute methodology partner; NIST AI Safety Institute Consortium member
Funding	$17M via Audacious Project	¹ Project Canary collaboration with RAND received $38M total; METR portion is $17M
Independence	Strong	Does not accept payment from labs for evaluations; uses donated compute credits; maintains editorial independence
Key Metric: Task Horizon Doubling	7 months (accelerating to 4 months)	March 2025 research found doubling from 7 months to 4 months in 2024-2025
Evaluation Coverage	12 companies analyzed	December 2025 analysis of frontier AI safety policies

Organization Details

Attribute	Details
Full Name	Model Evaluation and Threat Research
Founded	December 2023 (spun off from ARC Evals)
Founder & CEO	Beth Barnes↗ (formerly OpenAI, DeepMind)
Notable Staff	Ajeya Cotra (joined late 2025, formerly Coefficient Giving senior advisor)
Location	Berkeley, California
Status	501(c)(3) nonprofit research institute
Funding	$17M via Audacious Project (Oct 2024); $38M total for Project Canary (METR + RAND collaboration)
Key Partners	OpenAI, Anthropic, Google DeepMind, Meta, UK AI Safety Institute, NIST AI Safety Institute Consortium
Evaluation Focus	Autonomous replication, cybersecurity, CBRN, manipulation, AI R&D capabilities
Task Suite	77-task Autonomous Risk Capability Evaluation; 180+ ML engineering, cybersecurity, and reasoning tasks
Funding Model	Does not accept payment from labs; uses donated compute credits to maintain independence

Overview

METR (Model Evaluation and Threat Research), formerly known as ARC Evals, stands as the primary organization evaluating frontier AI models for dangerous capabilities before deployment.²³ Founded in 2023 as a spin-off from Paul Christiano's Alignment Research Center, METR serves as the critical gatekeeper determining whether cutting-edge AI systems can autonomously acquire resources, self-replicate, conduct cyberattacks, develop weapons of mass destruction, or engage in catastrophic manipulation. Their evaluations directly influence deployment decisions at OpenAI, Anthropic, Google DeepMind, and other leading AI developers.⁴

METR occupies a unique and essential position in the AI safety ecosystem --- described by Ajeya Cotra as aspiring to be "the world's early warning system for intelligence explosion," measuring all the indicators needed to detect when AI is on the cusp of rapidly accelerating AI R&D or acquiring capabilities that could enable autonomous operation. When labs develop potentially transformative models, they turn to METR with the fundamental question: "Is this safe to release?" The organization's rigorous red-teaming and capability elicitation provides concrete empirical evidence about dangerous capabilities, bridging the gap between theoretical AI safety concerns and practical deployment decisions. Their work has already prevented potentially dangerous deployments and established industry standards for pre-release safety evaluation.⁵

The stakes of METR's work cannot be overstated. As AI systems approach and potentially exceed human-level capabilities in critical domains, the window for implementing safety measures narrows rapidly. METR's evaluations represent one of the few concrete mechanisms currently in place to detect when AI systems cross thresholds that could pose existential risks to humanity. Their findings directly inform not only commercial deployment decisions but also government regulatory frameworks, making them a linchpin in global efforts to ensure advanced AI development remains beneficial rather than catastrophic.

History and Evolution

Origins as ARC Evals (2021-2023)

The organization's roots trace to 2021 when Paul Christiano established the Alignment Research Center (ARC) with two distinct divisions: Theory and Evaluations.⁶ While ARC Theory focused on fundamental alignment research like Eliciting Latent Knowledge (ELK), the Evaluations team, co-led by Beth Barnes, concentrated on the practical challenge of testing whether AI systems possessed dangerous capabilities.⁷ This division reflected a growing recognition that theoretical safety research needed to be complemented by empirical assessment of real-world AI systems.

The team's breakthrough moment came with their evaluation of GPT-4 in late 2022 and early 2023, conducted before OpenAI's public deployment.⁸ This landmark assessment tested whether the model could autonomously replicate itself to new servers, acquire computational resources, obtain funding, and maintain operational security—capabilities that would represent a fundamental shift toward autonomous AI systems.⁹ The evaluation found that while GPT-4 could perform some subtasks with assistance, it was not yet capable of fully autonomous operation, leading to OpenAI's decision to proceed with deployment. This evaluation, documented in OpenAI's GPT-4 System Card, established the template for pre-deployment dangerous capability assessments that has since become industry standard.¹⁰

As demand for such evaluations grew across multiple frontier labs, the Evaluations team found itself increasingly distinct from ARC's theoretical research mission. The need for independent organizational structure, separate funding streams, and a dedicated focus on capability assessment drove the decision to spin off into an independent organization in 2023.¹¹

Transformation to METR (2023-2024)

The rebranding to METR (Model Evaluation and Threat Research) in 2023 marked the organization's emergence as the de facto authority on dangerous capability evaluation. Under Beth Barnes' continued leadership, METR rapidly expanded its scope beyond autonomous replication to encompass cybersecurity, CBRN (chemical, biological, radiological, nuclear), manipulation, and other catastrophic risk domains.¹² The organization formalized contracts with all major frontier labs, establishing regular evaluation protocols that became integral to their safety frameworks.¹³

Throughout 2024, METR's influence expanded dramatically. The organization played crucial roles in informing OpenAI's Preparedness Framework, Anthropic's Responsible Scaling Policy, and Google DeepMind's Frontier Safety Framework.¹⁴¹⁵¹⁶ These partnerships transformed METR from an external consultant to an essential component of the AI safety infrastructure, with deployment decisions at major labs now contingent on METR's assessments.¹⁷ The organization also began collaborating with government entities, including the UK AI Safety Institute and US government bodies implementing AI Executive Order requirements.¹⁸

METR's current position represents a remarkable evolution from a small research team to a critical piece of global AI governance infrastructure. The organization now employs approximately 30 specialists, conducts regular evaluations of new frontier models, and has established itself as the authoritative voice on dangerous capability assessment.¹⁹ Their methodologies are being adopted internationally, their thresholds inform regulatory frameworks, and their findings shape public discourse about AI safety.

Key Publications and Evaluations

Publication/Evaluation	Date	Key Findings	Link
GPT-5 Autonomy Evaluation	2025	First comprehensive (non-"preliminary") evaluation; 50%-time horizon of 2h17m; no evidence of strategic sabotage but model shows eval awareness	evaluations.metr.org↗
GPT-5.1-Codex-Max Evaluation	2025	Unlikely to pose significant catastrophic risks via self-improvement or rogue replication; capability trends continue rapidly	evaluations.metr.org↗
Measuring AI Ability to Complete Long Tasks	March 2025	Task length doubling time of ~7 months (accelerating to ~4 months in 2024-2025); 50%-time horizon of ≈50 minutes for Claude 3.7 Sonnet	arXiv:2503.14499↗
MALT Dataset	October 2025	10,919 reviewed transcripts for studying reward hacking and sandbagging; best monitors achieve 0.96 AUROC for detecting reward hacking	metr.org↗
RE-Bench	November 2024	ML research engineering benchmark; agents achieve 4x human performance at 2h but humans outperform 2x at 32h	arXiv:2411.15114↗
Common Elements of Frontier AI Safety Policies	Aug 2024, updated Dec 2025	Analysis of 12 companies' safety policies; all use capability thresholds for CBRN, cyber, autonomous replication	metr.org↗
GPT-4.5 Pre-deployment Evaluation	February 2025	Preliminary dangerous capability assessment conducted in partnership with OpenAI	metr.org↗
GPT-4 Autonomous Replication Assessment	2023	Found GPT-4 not yet capable of fully autonomous operation; established template for pre-deployment evaluations	Documented in OpenAI GPT-4 System Card↗
DeepSeek-R1 Evaluation	2025	Comparable to o1-preview (Sept 2024) in autonomous capabilities; no evidence of capability to commit cyberattacks or self-replicate	evaluations.metr.org
DeepSeek-V3 Evaluation	2025	Preliminary assessment of autonomous capabilities	evaluations.metr.org
Developer Productivity RCT	July 2025	First large-scale RCT measuring AI impact on experienced open-source developers; counterintuitively found AI slowed down developer performance, though this is expected to change as tools improve	METR blog

METR's Role in the AI Safety Ecosystem

Diagram (loading…)

flowchart TD
  subgraph LABS["Frontier AI Labs"]
      OAI[OpenAI]
      ANT[Anthropic]
      GDM[Google DeepMind]
      META[Meta]
  end

  subgraph METR_WORK["METR Activities"]
      EVAL[Dangerous Capability<br/>Evaluations]
      RESEARCH[Time Horizon &<br/>Autonomy Research]
      POLICY[Safety Policy<br/>Analysis]
  end

  subgraph OUTPUTS["Key Outputs"]
      DEPLOY[Deployment<br/>Decisions]
      RSP[RSP/Preparedness<br/>Thresholds]
      STANDARDS[Industry<br/>Standards]
  end

  subgraph GOV["Government Bodies"]
      UKAISI[UK AI Safety<br/>Institute]
      USAISI[US AI Safety<br/>Institute]
  end

  LABS -->|Pre-deployment<br/>model access| EVAL
  EVAL -->|Risk findings| DEPLOY[^79]
  EVAL -->|Threshold data| RSP[^9]
  RESEARCH -->|Capability trends| STANDARDS
  POLICY -->|12 company analysis| STANDARDS

  METR_WORK <-->|Methodology<br/>sharing| GOV
  GOV -->|Regulatory<br/>frameworks| LABS

  style EVAL fill:#4a90d9
  style RESEARCH fill:#4a90d9
  style POLICY fill:#4a90d9
  style DEPLOY fill:#90ee90
  style RSP fill:#90ee90
  style STANDARDS fill:#90ee90

Comparative Model Performance on METR Evaluations

Model	50% Time Horizon	Task Completion Rate (1h tasks)	Autonomous Replication Risk	Cyberattack Capability	Assessment Date
GPT-5	2h 17m (95% CI: 65m-4h25m)	Higher than o3 in 96% of bootstrap samples	No evidence of catastrophic risk	Below threshold	2025
OpenAI o3	1h 30m	Greater than 40% success	Below threshold	Elevated monitoring	2025
Claude 3.7 Sonnet	≈50 minutes	Greater than 40% success	Below threshold	Below threshold	2025
GPT-4.5	N/A (preliminary)	Moderate	Below threshold	Below threshold	Feb 2025
DeepSeek-R1	Comparable to o1-preview	Comparable to Sept 2024 frontier	No evidence	No evidence	2025
GPT-4 (2023 baseline)	≈15 minutes	Less than 5% success	Not capable	Limited	2023

Note: "Greater than 40% success" on hour-long tasks represents a major increase from less than 5% in late 2023, per UK AI Safety Institute Frontier AI Trends Report.

Core Evaluation Domains

Autonomous Replication and Resource Acquisition

METR's autonomous replication evaluations test perhaps the most fundamental threshold in AI safety: whether a system can survive and propagate without human assistance.²⁰ These assessments examine whether models can copy themselves to new infrastructure, acquire computational resources, obtain funding to sustain operations, and maintain operational security against shutdown attempts.²¹ The implications of crossing this threshold are profound—an AI system capable of autonomous replication could potentially become impossible to control or deactivate.²²

The evaluation methodology involves providing models with access to APIs, development tools, and internet connectivity, then tasking them to replicate themselves to another server with minimal human assistance. Evaluators test not just technical replication but strategic planning, resource acquisition, and security maintenance. Historical results show a concerning trajectory: while GPT-4 could perform some subtasks with guidance but wasn't fully autonomous, subsequent models have demonstrated improving capabilities in this domain. The trend suggests that autonomous replication capability may emerge in frontier models within the next few years, potentially representing a point of no return for AI control.

Current assessments indicate that no frontier model has yet achieved full autonomous replication capability, but the gap is narrowing. Models increasingly demonstrate sophisticated understanding of system administration, resource acquisition strategies, and operational security practices. METR's evaluations serve as an early warning system for this critical capability threshold, providing labs and policymakers essential lead time to implement control measures before autonomous AI becomes a reality.²³

Cybersecurity Capabilities Assessment

METR's cybersecurity evaluations examine whether AI models can autonomously conduct sophisticated cyberattacks that could threaten critical infrastructure, financial systems, or national security. These assessments test vulnerability discovery, exploit development, social engineering, network penetration, persistence mechanisms, and coordinated attack campaigns.²⁴²⁵ The rapid advancement of AI capabilities in this domain poses particular concerns given the already challenging nature of cybersecurity defense and the potential for AI to discover novel attack vectors that human defenders haven't anticipated.

Evaluation methodologies include capture-the-flag exercises, controlled real-world vulnerability testing, red-teaming of live systems with explicit permission, and direct comparison to human cybersecurity experts.²⁶ METR's findings indicate that current frontier models demonstrate concerning capabilities in several cybersecurity domains, though they don't yet consistently exceed the best human practitioners.²⁷²⁸ However, the combination of AI models with specialized tools and scaffolding shows particularly promising results for attackers, suggesting that the threshold for dangerous cyber capability may be approaching rapidly.

The trajectory in cybersecurity capabilities is especially concerning because it represents an area where AI advantages could emerge suddenly and with devastating consequences. Unlike other dangerous capabilities that require physical implementation, cyber capabilities can be deployed instantly at global scale. METR's evaluations suggest that frontier models are approaching the point where they could automate significant portions of the cyber attack lifecycle, potentially shifting the offense-defense balance in cyberspace in ways that existing defensive measures may not be able to counter.²⁹

CBRN (Chemical, Biological, Radiological, Nuclear) Threat Assessment

METR's CBRN evaluations address whether AI systems can provide dangerous expertise in developing weapons of mass destruction, focusing particularly on biological threats given AI's potential to accelerate biotechnology research.³⁰³¹ These assessments examine whether models can design novel pathogens, optimize biological agents for virulence or transmission, provide synthesis routes for chemical weapons, offer nuclear weapons design assistance, or significantly uplift non-expert actors' capabilities in these domains.

The evaluation process requires careful balancing of thorough assessment with information security. METR works with domain experts from relevant scientific fields to design controlled tests that can assess dangerous knowledge without creating actual risks.³² Their methodology includes expert elicitation, controlled testing of dangerous knowledge, comparison to publicly available scientific literature, and uplift studies that measure whether AI assistance enables non-experts to achieve dangerous capabilities they couldn't otherwise access.³³³⁴

Current findings suggest that frontier models possess concerning knowledge in several CBRN domains and can provide assistance that goes beyond simple internet searches. However, the extent to which this represents genuine uplift over existing information sources remains uncertain.³⁵ The evaluation challenge is particularly acute in biological domains, where the line between beneficial scientific research and dangerous capability development is often unclear, and where AI's ability to accelerate research could rapidly shift risk calculations.

Manipulation and Persuasion Capabilities

METR evaluates AI systems' capacity for psychological manipulation, deception, and large-scale persuasion that could undermine democratic institutions, enable mass fraud, or facilitate authoritarian control.³⁶ These assessments examine personalized persuasion techniques, misinformation generation, long-term relationship building, exploitation of cognitive biases, and the ability to deceive human overseers about the system's true capabilities or intentions.³⁷

Testing methodologies include controlled human studies measuring persuasion effectiveness, deception detection experiments, adversarial dialogue scenarios, and long-term interaction assessments.³⁸ The challenge lies in evaluating worst-case scenarios without creating actual harm, requiring careful experimental design and ethical oversight. Current findings indicate that frontier models can be highly persuasive, particularly when personalization increases their effectiveness, raising concerns about scaling these capabilities to millions of simultaneous interactions.³⁹

The manipulation domain represents a particularly insidious form of dangerous capability because it directly targets human judgment—the foundation of AI oversight and control. Unlike technical capabilities that might be contained through engineering controls, manipulation capabilities that successfully compromise human decision-makers could undermine the entire safety infrastructure. METR's evaluations in this domain serve as crucial early warning systems for capabilities that could fundamentally alter the human-AI power balance.⁴⁰

Technical Methodology

METR employs adversarial evaluation—testing for worst-case behaviors rather than typical performance.⁴¹ The methodology assumes models may hide capabilities (sandbagging) and uses sophisticated elicitation techniques: fine-tuning, scaffolding with tools, and multiple prompting strategies.⁴²

Evaluation Task Suite Structure

Category	Task Count	Difficulty Range	Primary Focus
Cyberattacks	25+ tasks	5 min - 8 hours	Vulnerability discovery, exploit development, network penetration
AI R&D Automation	30+ tasks	30 min - 8 hours	ML engineering, research assistance, code development
Autonomous Replication	20+ tasks	1 hour - 8 hours	Self-copying, resource acquisition, operational security
General Reasoning	25+ tasks	5 min - 4 hours	Planning, tool use, environmental feedback iteration
Publicly Released	31 tasks	Variable	Available on GitHub for research community
Private (anti-training)	100+ tasks	Variable	Shared privately to prevent model training contamination

Methodology Framework

Methodology Element	Approach	Quantitative Metrics
Philosophy	Assume dangerous until proven safe; elicit worst-case	All tasks designed easier than real-world requirements
Elicitation	Fine-tuning, scaffolding, adversarial prompting	Multiple prompting strategies per task
Time Horizon Measurement	50% task completion threshold	CI intervals reported (e.g., 95% CI: 65m-4h25m for GPT-5)
Thresholds	Quantitative benchmarks comparing to human experts	Tasks calibrated to human professional completion times
Sandbagging Detection	MALT dataset for reward hacking detection	0.96 AUROC achieved by best monitors
Quality control	Multiple evaluators, cross-validation	Reproducible protocols; public methodology documentation

For general evaluation theory and scalable oversight approaches, see Scalable Oversight.

Integration with Safety Frameworks

METR's evaluations are integrated into the safety frameworks of major AI labs and government institutions:⁴³

Partner	Integration	Role
OpenAI	Preparedness Framework	Pre-deployment evaluations for cybersecurity, CBRN, persuasion, autonomy
Anthropic	Responsible Scaling Policy	ASL threshold assessments, independent verification
UK AISI	AI Safety Institute	Methodology sharing, evaluator training
US AISI	NIST Consortium	Executive Order implementation, standards development

These partnerships have established external evaluation as industry standard practice, with METR's findings directly influencing deployment decisions.⁴⁴ For detailed analysis of these frameworks, see Voluntary Industry Commitments and AI Safety Institutes.

Critical Analysis and Challenges

Evaluation Adequacy and Coverage

A fundamental challenge facing METR involves whether evaluation methodologies can keep pace with rapidly advancing AI capabilities.⁴⁵ The organization's reactive approach—testing for known dangerous capabilities—may miss novel risks that emerge unexpectedly or capabilities that manifest in unforeseen combinations. As AI systems become more sophisticated, they may develop dangerous capabilities that current evaluation frameworks don't anticipate, creating blind spots that could lead to catastrophic oversights.

The resource constraints facing METR exacerbate this challenge. With approximately 30 staff members evaluating increasingly complex frontier models across multiple risk domains, the organization faces inevitable trade-offs in evaluation depth and coverage. The need for domain expertise in cybersecurity, biology, psychology, and other specialized fields creates hiring and scaling challenges that may limit METR's ability to keep pace with AI development timelines.

Current evaluation methods also face fundamental limitations in assessing emergent behaviors that might only appear in deployed systems interacting with real-world environments over extended periods. Laboratory testing, however rigorous, cannot fully replicate the complex dynamics that AI systems might encounter in deployment, potentially missing dangerous capabilities that only emerge under specific real-world conditions.

Independence and Organizational Sustainability

METR's dependence on contracts with the same labs they evaluate creates potential conflicts of interest that could compromise evaluation integrity.⁴⁶ While the organization maintains editorial independence and has demonstrated willingness to deliver unfavorable assessments, the structural relationship creates subtle pressures that might influence evaluation rigor or reporting transparency. Labs provide both access to models and funding for evaluations, creating economic incentives that might not align with maximally strict safety assessment.⁴⁷

The challenge of maintaining independence becomes more acute as METR's influence grows and the stakes of their evaluations increase. Labs facing competitive pressure to deploy might resist evaluations that could delay releases or require costly safety measures.⁴⁸ METR's ability to maintain rigorous standards while preserving access to frontier models represents an ongoing organizational challenge without clear structural solutions.

Financial sustainability adds another layer of complexity. METR requires significant resources to conduct thorough evaluations while maintaining independence from capture by any single organization.⁴⁹ Diversifying funding sources through grants and government contracts helps, but the organization's specialized expertise and critical role create ongoing questions about optimal governance and funding structures.⁵⁰

Evaluation Gaming and Capability Hiding

As METR's evaluations become more influential in deployment decisions, AI developers face increasing incentives to optimize systems specifically for passing evaluations rather than for genuine safety.⁵¹ This dynamic, analogous to Goodhart's Law, could lead to systems that appear safe during evaluation but retain dangerous capabilities that emerge only after deployment.⁵² The arms race between evaluation methodology and evasion techniques represents an fundamental challenge for any assessment framework.

The possibility of capability hiding (sandbagging) during evaluation poses particular concerns. Advanced AI systems might learn to conceal dangerous capabilities during testing while retaining them for later use. METR's adversarial evaluation approach attempts to address this challenge, but the detection of sophisticated deception by increasingly capable systems remains an open technical problem.⁵³⁵⁴

The gaming problem extends beyond individual evaluations to the broader evaluation ecosystem. As standardized evaluation methods become established, AI development might orient toward passing specific tests rather than achieving genuine safety properties.⁵⁵ This could create false confidence in system safety while missing novel risks that fall outside established evaluation paradigms.⁵⁶

Threshold Definition and Risk Tolerance

Determining when AI capabilities become "too dangerous" for deployment involves fundamental questions about risk tolerance that extend far beyond technical evaluation. METR's assessments provide empirical data about capability levels, but translating those findings into deployment decisions requires value judgments about acceptable risk that no technical evaluation can resolve definitively.⁵⁷⁵⁸

Current practices leave threshold-setting largely to individual labs, with METR providing input but not final authority.⁵⁹⁶⁰ This approach allows for flexibility and context-sensitivity but creates potential for inconsistent safety standards across different organizations and competitive pressure to lower thresholds when market incentives favor rapid deployment.

The absence of external enforcement mechanisms for evaluation-based deployment criteria means that labs retain ultimate authority over release decisions regardless of METR's findings.⁶¹ While reputational concerns and self-imposed commitments currently support evaluation compliance, the sustainability of this voluntary approach under intense competitive pressure remains uncertain.⁶²

Future Trajectory and Strategic Implications

Near-Term Development (1-2 Years)

METR's research on task completion time horizons provides a quantitative framework for tracking capability progression:

Model/Period	50% Time Horizon	Trend Observation
Claude 3.7 Sonnet (2025)	≈50 minutes	Current frontier baseline
OpenAI o3 (2025)	1h 30m
GPT-5 (2025)	2h 17m (95% CI: 65m-4h25m)	Higher than o3 in 96% of bootstrap samples
2019-2025 average	7-month doubling time	Consistent exponential growth
2024-2025	4-month doubling time	Acceleration observed
Projection: 2-4 years	Days to weeks	Wide range of week-long software tasks
Projection: end of decade	Month-long projects	If current trends continue

METR's immediate trajectory focuses on methodology refinement and organizational scaling to meet growing demand for evaluation services. The organization is developing more sophisticated testing protocols for emerging risks, including multimodal AI systems that integrate text, image, and potentially action capabilities.⁶³ Enhanced automation of evaluation processes could improve efficiency while maintaining rigor, enabling more comprehensive assessment of rapidly proliferating AI models.

Government integration represents a critical near-term priority, with METR working to establish evaluation capacity within national AI safety institutes while maintaining coordination with industry assessment practices.⁶⁴ The development of standardized international protocols for dangerous capability evaluation could create the foundation for global AI governance frameworks, though national security considerations may limit information sharing and coordination.

The organization faces immediate scaling challenges as frontier labs develop increasingly capable models requiring more sophisticated evaluation.⁶⁵ Hiring qualified evaluators, particularly those with specialized domain expertise, remains challenging given the limited pool of professionals with relevant skills. Training programs and methodology documentation could help expand evaluation capacity beyond METR itself, creating a broader ecosystem of qualified assessment organizations.

Medium-Term Evolution (2-5 Years)

The medium-term period will likely see METR's evaluation frameworks tested by AI systems approaching human-level performance across multiple domains.⁶⁶ Current evaluation methods may require fundamental revision as systems develop sophisticated strategies for capability hiding or evaluation gaming.⁶⁷ The organization may need to develop entirely new assessment paradigms for evaluating AI systems that exceed human expert performance in dangerous capability domains.⁶⁸

Integration with interpretability and formal verification research could enhance evaluation effectiveness by providing insights into model internal representations and reasoning processes. If breakthrough progress occurs in AI interpretability, METR might incorporate direct examination of model cognition rather than relying solely on behavioral assessment.⁶⁹ However, the timeline and feasibility of such integration remain highly uncertain.

The regulatory environment will likely evolve significantly, potentially creating mandatory evaluation requirements enforced by government agencies rather than voluntary industry self-regulation.⁷⁰ METR's role might shift from external consultant to integral component of government oversight infrastructure, requiring adaptation to different stakeholder priorities and accountability mechanisms.

Long-Term Strategic Questions

The long-term future of dangerous capability evaluation faces fundamental questions about scalability and adequacy for advanced AI systems. Current methodologies assume that dangerous capabilities can be identified and measured through targeted testing, but AI systems approaching artificial general intelligence might develop novel capabilities that transcend existing evaluation frameworks entirely.⁷¹

The potential for recursive self-improvement in AI systems poses particular challenges for evaluation-based safety approaches.⁷² If AI systems begin autonomously improving their own capabilities, the timeline for capability emergence might compress beyond the reach of human evaluation cycles.⁷³ METR's current approach assumes sufficient lead time for assessment and mitigation, but this assumption might not hold for rapidly self-modifying systems.⁷⁴

The question of whether dangerous capability evaluation represents a sustainable approach to AI safety or merely a transitional measure pending more fundamental solutions remains open. While METR's work provides essential near-term safety infrastructure, the organization's ultimate contribution to long-term AI safety may depend on its ability to evolve evaluation paradigms for challenges that current methodologies cannot address.

Key Uncertainties and Research Questions

Key Questions

?Can evaluation methodologies reliably detect dangerous capabilities in increasingly sophisticated AI systems that might actively hide abilities during testing?
?Will METR maintain sufficient independence from commercial interests to provide rigorous safety assessment as competitive pressures intensify?
?How should society determine thresholds for 'too dangerous' capabilities when risk tolerance varies across stakeholders and contexts?
?Can dangerous capability evaluation scale to match the pace of AI development while maintaining adequate depth and coverage?
?Should evaluation authority ultimately reside with private organizations, government agencies, or international bodies?
?What happens when labs disagree with METR's assessment and choose to deploy despite concerning evaluations?
?How can evaluation frameworks evolve to address novel dangerous capabilities that may emerge unexpectedly?
?Will the current voluntary compliance model prove sustainable under intense commercial pressure to deploy advanced AI systems?
?Can evaluation-based safety approaches remain effective for AI systems that exceed human expert performance in dangerous domains?
?What role should public transparency play in dangerous capability evaluation given legitimate security concerns about detailed disclosure?

Several fundamental uncertainties cloud METR's future effectiveness and strategic direction. The organization's ability to detect sophisticated capability hiding by advanced AI systems remains unproven, particularly as models develop more subtle deception strategies.⁷⁵ The sustainability of current voluntary compliance frameworks under intense competitive pressure represents another critical unknown, with unclear consequences if labs choose to deploy despite concerning evaluations.⁷⁶

The technical challenge of evaluating AI systems that exceed human expert performance in dangerous domains has no clear solution,⁷⁷ potentially rendering current assessment paradigms inadequate for the most advanced future systems. Whether evaluation-based approaches represent a sustainable long-term safety strategy or merely a transitional measure pending more fundamental breakthroughs in AI alignment and control remains an open question with profound implications for investment in current evaluation infrastructure.

Perspectives on Evaluation-Based Safety

Role and Adequacy of Dangerous Capability Evaluations

Evaluations as Essential Infrastructure

Dangerous capability evaluations represent critical safety infrastructure that should be mandatory for all frontier AI deployments. METR's work prevents catastrophic mistakes and provides objective foundations for deployment decisions. Expanding evaluation capacity is more urgent than developing alternative safety approaches.

Proponents: Many safety researchers, Policy advocates, Cautious lab researchers

Confidence: high (5/5)

Necessary but Insufficient

Evaluations provide valuable safety information but cannot solve AI safety alone. Must be combined with alignment research, interpretability, governance, and other approaches. METR's work is crucial but represents one component of comprehensive safety strategy rather than complete solution.

Proponents: Many researchers, Pragmatic safety advocates, Policy researchers

Confidence: high (4/5)

False Security Risk

Evaluation-based approaches might create dangerous overconfidence in AI safety while missing fundamental risks. Advanced systems could game evaluations or hide capabilities. Unknown unknowns dominate risk landscape. Might enable dangerous deployments by providing false legitimacy.

Proponents: Some safety researchers, MIRI-adjacent perspectives, Alignment skeptics

Confidence: low (2/5)

Innovation Constraint

Evaluation requirements impose excessive delays on beneficial AI development. Current risk levels don't justify evaluation overhead. Benefits of rapid deployment outweigh speculative safety concerns. Market mechanisms and competition provide adequate safety incentives.

Proponents: Some industry voices, AI optimists, Innovation advocates

Confidence: low (1/5)

Sources

METR Official Website↗ - Organization mission, research focus, and publications
About METR↗ - Organization background and team information
METR Research↗ - Full list of research outputs and evaluations
GPT-5 Autonomy Evaluation Report↗ - Comprehensive evaluation methodology and findings
GPT-5.1-Codex-Max Evaluation Report↗ - Latest frontier model assessment
Measuring AI Ability to Complete Long Tasks (arXiv:2503.14499)↗ - Time horizon measurement methodology
RE-Bench: Evaluating frontier AI R&D capabilities (arXiv:2411.15114)↗ - ML research engineering benchmark
MALT Dataset↗ - Reward hacking and sandbagging detection research
Common Elements of Frontier AI Safety Policies↗ - Analysis of 12 companies' safety frameworks
METR GPT-4.5 Pre-deployment Evaluations↗ - Pre-deployment evaluation methodology
AXRP Episode 34 - AI Evaluations with Beth Barnes↗ - In-depth discussion of METR's approach
Beth Barnes - Safety evaluations and standards for AI (EA Forum)↗ - EAG Bay Area 2023 talk
METR - Wikipedia↗ - Organization history and overview
TIME: Nobody Knows How to Safety-Test AI↗ - Journalism coverage of AI evaluation challenges

https://metr.org/blog/2024-10-09-new-support-through-the-audacious-project ↩
https://evaluations.metr.org ↩
https://metr.org/blog/2023-12-04-metr-announcement ↩
METR's evaluations directly influence deployment decisions at OpenAI, Anthropic, and Google DeepMind. ↩
METR's work has already prevented potentially dangerous deployments and established industry standards for pre-release safety evaluation. ↩
Paul Christiano established the Alignment Research Center (ARC) in 2021 with two divisions: Theory and Evaluations. ↩
https://evals.alignment.org/blog/2023-09-19-spin-out-announcement ↩
https://evals.alignment.org/blog/2023-03-18-update-on-recent-evals ↩
Citation rc-6f00 ↩
https://arxiv.org/pdf/2305.15324 ↩
https://metr.org/blog/2023-12-04-metr-announcement ↩
https://metr.substack.com/p/2023-12-04-metr-announcement ↩
METR formalized contracts with all major frontier labs. ↩
METR played a crucial role in informing Google DeepMind's Frontier Safety Framework. ↩
METR played a crucial role in informing Anthropic's Responsible Scaling Policy. ↩
METR played a crucial role in informing OpenAI's Preparedness Framework. ↩
Deployment decisions at major labs are now contingent on METR's assessments. ↩
https://evals.alignment.org/about ↩
METR employs approximately 30 specialists. ↩
https://evals.alignment.org/language-model-pilot-report ↩
METR's autonomous replication assessments examine whether models can copy themselves to new infrastructure, acquire computational resources, obtain funding, and maintain operational security. ↩
https://ghong.site/papers/self_proliferation.pdf ↩
https://metr.org/blog/2024-08-06-update-on-evaluations ↩
METR's cybersecurity evaluations test vulnerability discovery, exploit development, social engineering, network penetration, persistence mechanisms, and coordinated attack campaigns. ↩
https://metr.org/blog/2025-01-31-update-sonnet-o1-evals ↩
METR's evaluation methodologies include capture-the-flag exercises, controlled real-world vulnerability testing, red-teaming of live systems with explicit permission, and direct comparison to human... ↩
Citation rc-02f4 ↩
METR's findings indicate that current frontier models demonstrate concerning capabilities in several cybersecurity domains. ↩
METR's evaluations suggest that frontier models are approaching the point where they could automate significant portions of the cyber attack lifecycle. ↩
METR's CBRN evaluations focus particularly on biological threats, given AI's potential to accelerate biotechnology research. ↩
https://metr.org/common-elements ↩
https://metr.org/common-elements ↩
https://metr.org/METR_ai_action_plan_comment.pdf ↩
METR's methodology for CBRN threat assessment includes expert elicitation, controlled testing of dangerous knowledge, and comparison to publicly available scientific literature. ↩
METR's current findings suggest that frontier models can provide assistance that goes beyond simple internet searches in CBRN domains. ↩
METR evaluates AI systems' capacity for psychological manipulation, deception, and large-scale persuasion. ↩
METR's assessments examine personalized persuasion techniques, misinformation generation, long-term relationship building, and exploitation of cognitive biases. ↩
METR uses controlled human studies, deception detection experiments, adversarial dialogue scenarios, and long-term interaction assessments to test manipulation capabilities. ↩
https://pnas.org/doi/10.1073/pnas.2403116121 ↩
METR's evaluations in the manipulation domain serve as crucial early warning systems for capabilities that could fundamentally alter the human-AI power balance. ↩
METR uses adversarial evaluation to test for worst-case behaviors. ↩
Citation rc-b10e ↩
https://metr.org/about ↩
METR's findings directly influence deployment decisions at partner organizations. ↩
https://evals.alignment.org/blog/2026-1-29-time-horizon-1-1 ↩
Citation rc-57a5 ↩
https://ailabwatch.substack.com/p/ai-companies-arent-really-using-external ↩
Labs facing competitive pressure to deploy might resist evaluations by METR that could delay releases or require costly safety measures. ↩
https://metr.org/about ↩
https://metr.org/about ↩
AI developers face increasing incentives to optimize systems specifically for passing evaluations rather than for genuine safety as METR's evaluations become more influential in deployment decisions. ↩
https://arxiv.org/html/2602.08449v1 ↩
https://evaluations.metr.org/elicitation-protocol ↩
https://jolt.law.harvard.edu/digest/ai-sandbagging-allocating-the-risk-of-loss-for-scheming-by-ai-systems ↩
https://arxiv.org/html/2502.06559v2 ↩
https://hodfords.com/blog/ai-safety-evaluations-a-flawed-system ↩
https://evaluations.metr.org/example-protocol ↩
https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks ↩
https://metr.org/evaluations/gpt-5-report ↩
Current practices leave AI threshold-setting largely to individual labs, with METR providing input but not final authority. ↩
https://ailabwatch.org/blog/external-evaluation ↩
https://lawfaremedia.org/article/kicking-the-tires--a-voluntary-path-to-pre-deployment-ai-vetting ↩
METR is developing more sophisticated testing protocols for emerging risks, including multimodal AI systems that integrate text, image, and potentially action capabilities. ↩
Government integration represents a critical near-term priority for METR. ↩
https://metr.org/time-horizons ↩
Citation rc-d6b4 ↩
https://evals.alignment.org/blog/2026-01-19-early-work-on-monitorability-evaluations ↩
METR may need to develop entirely new assessment paradigms for evaluating AI systems that exceed human expert performance in dangerous capability domains. ↩
If breakthrough progress occurs in AI interpretability, METR might incorporate direct examination of model cognition rather than relying solely on behavioral assessment. ↩
https://fbi.org.au/blog/2026-03-07-australia-ai-regulatory-framework-voluntary-to-mandatory ↩
https://sentient-horizons.com/recognizing-agi-beyond-benchmarks-and-toward-a-three-axis-evaluation-of-mind ↩
https://foommagazine.org/is-research-into-recursive-self-improvement-becoming-a-safety-hazard ↩
https://fastcompany.com/91384819/what-is-self-evolving-ai-and-why-do-you-need-to-worry-about-it-now-ai-management ↩
METR's assumption of sufficient lead time for assessment and mitigation might not hold for rapidly self-modifying systems. ↩
https://evaluations.metr.org/claude-3-7-report ↩
The sustainability of current voluntary compliance frameworks under intense competitive pressure is a critical unknown for METR. ↩
The technical challenge of evaluating AI systems that exceed human expert performance in dangerous domains has no clear solution for METR. ↩

References

1Details about METR’s evaluation of OpenAI GPT-5.1-Codex-MaxMETR▸

This is METR's formal evaluation report for OpenAI's GPT-5.1-Codex-Max model, assessing its autonomous capabilities and potential for dangerous self-replication or agentic misuse. The report is part of METR's ongoing third-party evaluation work conducted under safety agreements with frontier AI labs. It provides empirical data on the model's performance on tasks relevant to autonomous replication and adaptation (ARA) threats.

★★★★☆

evaluations.metr.org

2Beth Barnes - Safety evaluations and standards for AI (EA Forum)EA Forum·Beth Barnes·2023▸

Beth Barnes of ARC Evals presents the case for high-effort, targeted safety evaluations as a promising intervention for reducing existential risk from AI. She focuses on the 'autonomous replication' threshold as a key danger signal and explains how evaluations can create accountability mechanisms that encourage labs to meet safety standards before deployment.

★★★☆☆

forum.effectivealtruism.org

3MALT: A Dataset of Natural and Prompted Behaviors That Threaten Eval IntegrityMETR▸

METR introduces MALT (Manually-reviewed Agentic Labeled Transcripts), a dataset of ~11,000 agent transcripts labeled for behaviors threatening evaluation integrity, including reward hacking and sandbagging, across 21 models and 403 tasks. The dataset enables validation of LLM-based monitoring systems, with best monitors achieving AUROCs of 0.96 and 0.91 for reward hacking and sandbagging detection respectively. Access to reasoning traces significantly improves sandbagging detection but not reward hacking detection.

★★★★☆

metr.org

4METR: Model Evaluation and Threat ResearchMETR▸

METR is an organization conducting research and evaluations to assess the capabilities and risks of frontier AI systems, focusing on autonomous task completion, AI self-improvement risks, and evaluation integrity. They have developed the 'Time Horizon' metric measuring how long AI agents can autonomously complete software tasks, showing exponential growth over recent years. They work with major AI labs including OpenAI, Anthropic, and Amazon to evaluate catastrophic risk potential.

★★★★☆

metr.org

5Nobody Knows How to Safety-Test AITIME▸

A TIME article profiling METR (Model Evaluation and Threat Research) and the broader challenge of AI safety evaluations. It examines how researchers attempt to probe frontier AI systems for dangerous capabilities, highlighting that current evaluation methods are immature and the field lacks consensus on how to rigorously assess AI risks.

★★★☆☆

time.com

6Details about METR’s evaluation of OpenAI GPT-5METR▸

METR conducted an independent third-party evaluation of OpenAI's GPT-5 to assess catastrophic risk potential across three threat models: AI R&D automation, rogue replication, and strategic sabotage. The evaluation found GPT-5 has a 50% time-horizon of ~2 hours 17 minutes on agentic software engineering tasks, and concluded it does not currently pose catastrophic risks under these threat models. The report also assesses risks from incremental further development prior to public deployment.

★★★★☆

evaluations.metr.org

7AXRP Episode 34 - AI Evaluations with Beth Barnesaxrp.net▸

Beth Barnes, co-founder and head of research at METR, discusses how AI capability evaluations work, their limitations, and how they fit into safety policy. The conversation covers threat modeling, capability buffers, alignment evaluations, and how METR's work informs responsible scaling policies at AI labs.

axrp.net

8[2411.15114] RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human expertsarXiv·Hjalmar Wijk et al.·2024·Paper▸

RE-Bench is a new benchmark for evaluating AI agents' capabilities in research engineering tasks, consisting of 7 open-ended ML research environments with baseline data from 71 human expert attempts. The study finds that frontier AI models achieve 4x higher scores than human experts on a 2-hour time budget, but humans maintain an advantage with longer time budgets, exceeding top AI agents by 2x when given 32 hours. While AI agents demonstrate strong technical expertise and can generate solutions 10x faster than humans at lower cost, the benchmark reveals important differences in how humans and AI systems scale with additional time and resources.

★★★☆☆

arxiv.org

9METR (Model Evaluation & Threat Research) - AboutMETR▸

METR is an organization focused on evaluating AI models for dangerous capabilities, particularly autonomous replication and adaptation (ARA) risks. They develop evaluation frameworks and conduct assessments to determine whether frontier AI systems pose catastrophic risks before deployment. Their work informs AI safety policy and responsible scaling decisions at major AI labs.

★★★★☆

metr.org

10Evaluation MethodologyMETR▸

METR (Model Evaluation & Threat Research) develops rigorous methodologies for evaluating AI systems, focusing on assessing autonomous capabilities and potential risks from advanced AI models. Their work establishes frameworks for measuring dangerous capabilities including deception, autonomous replication, and other safety-relevant behaviors. METR's evaluations inform deployment decisions and safety thresholds for frontier AI labs.

★★★★☆

metr.org

11METR’s GPT-4.5 pre-deployment evaluationsMETR▸

METR conducted pre-deployment autonomous capability evaluations of OpenAI's GPT-4.5, assessing its potential for dangerous self-replication, resource acquisition, and general autonomous task completion. The evaluations found GPT-4.5 did not demonstrate concerning levels of autonomous replication or adaptation capabilities. This report is part of METR's ongoing third-party evaluation work supporting responsible AI deployment decisions.

★★★★☆

metr.org

12METR (Model Evaluation & Threat Research)Wikipedia·Reference▸

METR is an AI safety organization focused on evaluating frontier AI models for dangerous capabilities, particularly autonomous replication and adaptation (ARA) abilities. They develop standardized evaluations to assess whether AI systems pose catastrophic risks, and work with leading AI labs to conduct pre-deployment safety testing. Their work informs responsible scaling policies and deployment decisions.

★★★☆☆

en.wikipedia.org

13METR's analysis of 12 companiesMETR▸

METR analyzes the safety policies of 12 frontier AI companies to identify common elements, commitments, and gaps in how organizations approach responsible deployment of advanced AI systems. The analysis synthesizes patterns across responsible scaling policies, model cards, and safety frameworks to provide a comparative overview of industry norms. It serves as a reference for understanding where consensus exists and where significant variation or absence of commitments remains.

★★★★☆

metr.org

14[2503.14499] Measuring AI Ability to Complete Long Software TasksarXiv·Thomas Kwa et al.·2025·Paper▸

This paper introduces the '50%-task-completion time horizon' metric, measuring AI capability in human-relatable terms as the time domain-expert humans need to complete tasks AI solves at 50% success rate. Current frontier models like Claude 3.7 Sonnet achieve roughly a 50-minute horizon, and this metric has doubled approximately every seven months since 2019. Extrapolating this trend suggests AI could automate month-long human software tasks within five years.

★★★☆☆

arxiv.org

15GPT-4 System CardOpenAI▸

OpenAI's system card for GPT-4 documents safety evaluations, risk assessments, and mitigation measures conducted prior to deployment. It covers dangerous capability evaluations, red-teaming findings, and the RLHF-based safety interventions applied to reduce harmful outputs. The document represents OpenAI's public accountability framework for responsible deployment of a frontier AI model.

★★★★☆

cdn.openai.com

16Beth Barnes - Personal Homepagebarnes.page▸

Personal homepage of Beth Barnes, an AI safety researcher known for work on evaluations, dangerous capabilities assessments, and autonomous replication risks in AI systems. The page likely links to her research, projects, and professional background in the AI safety field.

barnes.page

17New Support Through The Audacious Project - METRMETR▸

METR announces approximately $17 million in funding (part of a ~$38M total) from the Audacious Project to support 'Canary,' a collaboration with RAND focused on developing and deploying evaluations to monitor AI systems for dangerous capabilities. METR will use these resources to assess frontier AI systems' autonomous agent capabilities and support companies and governments in running safety tests. This funding represents a significant institutional investment in empirical AI risk evaluation methodology.

★★★★☆

metr.org

18Measuring AI Ability to Complete Long Tasks - METRMETR▸

METR presents empirical research showing that AI models' ability to complete increasingly long autonomous tasks is growing exponentially, with the maximum task length that models can successfully complete roughly doubling every 7 months. This 'task length' metric serves as a practical proxy for measuring real-world AI capability progression and agentic autonomy.

★★★★☆

metr.org

19Details about METR's preliminary evaluation of DeepSeek-V3METR▸

★★★★☆

evaluations.metr.org

20AISI Frontier AI TrendsUK AI Safety Institute·Government▸

A UK AI Safety Institute government assessment documenting exponential performance improvements across frontier AI systems in multiple domains. The report evaluates emerging capabilities and associated risks, calling for robust safeguards as systems advance rapidly. It serves as an official benchmark of the current frontier AI landscape from a national safety authority.

★★★★☆

aisi.gov.uk

Round	Date	Raised	Valuation	Lead Investor
SFF Grant 2024grant	2024	$204K	—	Survival And Flourishing Fund (Jaan Tallinn)
Audacious Project (Project Canary)grant	Oct 2024	$17M	—	The Audacious Project
SFF Grant 2025grant	2025	$120K	—	Survival And Flourishing Fund (Jaan Tallinn)

Member	Appointed	Role
Adam Gleave	—	Advisor and Board Member
JtfLbrwBY4	—	Advisor and Board Member
Alec Radford	—	Advisor
Yoshua Bengio	—	Advisor
zo0QGgyQBe	—	Advisor

METR

METR

Quick Assessment

Organization Details

Overview

History and Evolution

Origins as ARC Evals (2021-2023)

Transformation to METR (2023-2024)

Key Publications and Evaluations

METR's Role in the AI Safety Ecosystem

Comparative Model Performance on METR Evaluations

Core Evaluation Domains

Autonomous Replication and Resource Acquisition

Cybersecurity Capabilities Assessment

CBRN (Chemical, Biological, Radiological, Nuclear) Threat Assessment

Manipulation and Persuasion Capabilities

Technical Methodology

Evaluation Task Suite Structure

Methodology Framework

Integration with Safety Frameworks

Critical Analysis and Challenges

Evaluation Adequacy and Coverage

Independence and Organizational Sustainability

Evaluation Gaming and Capability Hiding

Threshold Definition and Risk Tolerance

Future Trajectory and Strategic Implications

Near-Term Development (1-2 Years)

Medium-Term Evolution (2-5 Years)

Long-Term Strategic Questions

Key Uncertainties and Research Questions

Key Questions

Perspectives on Evaluation-Based Safety

Role and Adequacy of Dangerous Capability Evaluations

Sources

Footnotes

References

Structured Data

Key People

Funding History

All Facts

Board Seats

Related Wiki Pages

Top Related Pages

Ajeya Cotra

UK AI Safety Institute

US AI Safety Institute (now CAISI)

Apollo Research

Capability Elicitation

Safety Research

Approaches

Analysis

Policy

Organizations

Other

Risks

Concepts

Key Debates