METR
METR
METR conducts pre-deployment dangerous capability evaluations for frontier AI labs (OpenAI, Anthropic, Google DeepMind), testing autonomous replication, cybersecurity, CBRN, and manipulation capabilities using a 77-task suite. Their research shows task completion time horizons doubling every 7 months (accelerating to 4 months in 2024-2025), with GPT-5 achieving 2h17m 50%-time horizon; no models yet capable of autonomous replication but gap narrowing rapidly.
Quick Assessment
| Dimension | Assessment | Evidence |
|---|---|---|
| Mission Criticality | Very High | Only major independent organization conducting pre-deployment dangerous capability evaluations for frontier AI labs |
| Research Output | High | 77-task autonomy evaluation suite; time horizons paper showing 7-month doubling; RE-Bench; MALT dataset (10,919 transcripts) |
| Industry Integration | Strong | Pre-deployment evaluations for OpenAI (GPT-4, GPT-4.5, GPT-5, o3), Anthropic (Claude 3.5, Claude 4), Google DeepMind |
| Government Partnerships | Growing | UK AI Safety Institute methodology partner; NIST AI Safety Institute Consortium member |
| Funding | $17M via Audacious Project | 1 Project Canary collaboration with RAND received $38M total; METR portion is $17M |
| Independence | Strong | Does not accept payment from labs for evaluations; uses donated compute credits; maintains editorial independence |
| Key Metric: Task Horizon Doubling | 7 months (accelerating to 4 months) | March 2025 research found doubling from 7 months to 4 months in 2024-2025 |
| Evaluation Coverage | 12 companies analyzed | December 2025 analysis of frontier AI safety policies |
Organization Details
| Attribute | Details |
|---|---|
| Full Name | Model Evaluation and Threat Research |
| Founded | December 2023 (spun off from ARC Evals) |
| Founder & CEO | Beth Barnes↗🔗 webBeth Barnes - Personal HomepageBeth Barnes is a key figure in AI safety evaluations, particularly around dangerous capabilities and autonomous replication; her homepage serves as a hub for her research output and professional work, relevant to those following frontier model safety assessments.Personal homepage of Beth Barnes, an AI safety researcher known for work on evaluations, dangerous capabilities assessments, and autonomous replication risks in AI systems. The ...evaluationdangerous-capabilitiesautonomous-replicationai-safety+4Source ↗ (formerly OpenAI, DeepMind) |
| Notable Staff | Ajeya Cotra (joined late 2025, formerly Coefficient Giving senior advisor) |
| Location | Berkeley, California |
| Status | 501(c)(3) nonprofit research institute |
| Funding | $17M via Audacious Project (Oct 2024); $38M total for Project Canary (METR + RAND collaboration) |
| Key Partners | OpenAI, Anthropic, Google DeepMind, Meta, UK AI Safety Institute, NIST AI Safety Institute Consortium |
| Evaluation Focus | Autonomous replication, cybersecurity, CBRN, manipulation, AI R&D capabilities |
| Task Suite | 77-task Autonomous Risk Capability Evaluation; 180+ ML engineering, cybersecurity, and reasoning tasks |
| Funding Model | Does not accept payment from labs; uses donated compute credits to maintain independence |
Overview
METR (Model Evaluation and Threat Research), formerly known as ARC Evals, stands as the primary organization evaluating frontier AI models for dangerous capabilities before deployment.23 Founded in 2023 as a spin-off from Paul Christiano's Alignment Research Center, METR serves as the critical gatekeeper determining whether cutting-edge AI systems can autonomously acquire resources, self-replicate, conduct cyberattacks, develop weapons of mass destruction, or engage in catastrophic manipulation. Their evaluations directly influence deployment decisions at OpenAI, Anthropic, Google DeepMind, and other leading AI developers.4
METR occupies a unique and essential position in the AI safety ecosystem --- described by Ajeya Cotra as aspiring to be "the world's early warning system for intelligence explosion," measuring all the indicators needed to detect when AI is on the cusp of rapidly accelerating AI R&D or acquiring capabilities that could enable autonomous operation. When labs develop potentially transformative models, they turn to METR with the fundamental question: "Is this safe to release?" The organization's rigorous red-teaming and capability elicitation provides concrete empirical evidence about dangerous capabilities, bridging the gap between theoretical AI safety concerns and practical deployment decisions. Their work has already prevented potentially dangerous deployments and established industry standards for pre-release safety evaluation.5
The stakes of METR's work cannot be overstated. As AI systems approach and potentially exceed human-level capabilities in critical domains, the window for implementing safety measures narrows rapidly. METR's evaluations represent one of the few concrete mechanisms currently in place to detect when AI systems cross thresholds that could pose existential risks to humanity. Their findings directly inform not only commercial deployment decisions but also government regulatory frameworks, making them a linchpin in global efforts to ensure advanced AI development remains beneficial rather than catastrophic.
History and Evolution
Origins as ARC Evals (2021-2023)
The organization's roots trace to 2021 when Paul Christiano established the Alignment Research Center (ARC) with two distinct divisions: Theory and Evaluations.6 While ARC Theory focused on fundamental alignment research like Eliciting Latent Knowledge (ELK), the Evaluations team, co-led by Beth Barnes, concentrated on the practical challenge of testing whether AI systems possessed dangerous capabilities.7 This division reflected a growing recognition that theoretical safety research needed to be complemented by empirical assessment of real-world AI systems.
The team's breakthrough moment came with their evaluation of GPT-4 in late 2022 and early 2023, conducted before OpenAI's public deployment.8 This landmark assessment tested whether the model could autonomously replicate itself to new servers, acquire computational resources, obtain funding, and maintain operational security—capabilities that would represent a fundamental shift toward autonomous AI systems.9 The evaluation found that while GPT-4 could perform some subtasks with assistance, it was not yet capable of fully autonomous operation, leading to OpenAI's decision to proceed with deployment. This evaluation, documented in OpenAI's GPT-4 System Card, established the template for pre-deployment dangerous capability assessments that has since become industry standard.10
As demand for such evaluations grew across multiple frontier labs, the Evaluations team found itself increasingly distinct from ARC's theoretical research mission. The need for independent organizational structure, separate funding streams, and a dedicated focus on capability assessment drove the decision to spin off into an independent organization in 2023.11
Transformation to METR (2023-2024)
The rebranding to METR (Model Evaluation and Threat Research) in 2023 marked the organization's emergence as the de facto authority on dangerous capability evaluation. Under Beth Barnes' continued leadership, METR rapidly expanded its scope beyond autonomous replication to encompass cybersecurity, CBRN (chemical, biological, radiological, nuclear), manipulation, and other catastrophic risk domains.12 The organization formalized contracts with all major frontier labs, establishing regular evaluation protocols that became integral to their safety frameworks.13
Throughout 2024, METR's influence expanded dramatically. The organization played crucial roles in informing OpenAI's Preparedness Framework, Anthropic's Responsible Scaling Policy, and Google DeepMind's Frontier Safety Framework.141516 These partnerships transformed METR from an external consultant to an essential component of the AI safety infrastructure, with deployment decisions at major labs now contingent on METR's assessments.17 The organization also began collaborating with government entities, including the UK AI Safety Institute and US government bodies implementing AI Executive Order requirements.18
METR's current position represents a remarkable evolution from a small research team to a critical piece of global AI governance infrastructure. The organization now employs approximately 30 specialists, conducts regular evaluations of new frontier models, and has established itself as the authoritative voice on dangerous capability assessment.19 Their methodologies are being adopted internationally, their thresholds inform regulatory frameworks, and their findings shape public discourse about AI safety.
Key Publications and Evaluations
| Publication/Evaluation | Date | Key Findings | Link |
|---|---|---|---|
| GPT-5 Autonomy Evaluation | 2025 | First comprehensive (non-"preliminary") evaluation; 50%-time horizon of 2h17m; no evidence of strategic sabotage but model shows eval awareness | evaluations.metr.org↗🔗 web★★★★☆METRDetails about METR’s evaluation of OpenAI GPT-5This is METR's official third-party pre-deployment evaluation of GPT-5, notable for its structured threat-model framework and the unusual transparency about NDA and legal review constraints; a key reference for understanding frontier model evaluation methodology in 2025.METR conducted an independent third-party evaluation of OpenAI's GPT-5 to assess catastrophic risk potential across three threat models: AI R&D automation, rogue replication, an...evaluationdangerous-capabilitiesautonomous-replicationred-teaming+6Source ↗ |
| GPT-5.1-Codex-Max Evaluation | 2025 | Unlikely to pose significant catastrophic risks via self-improvement or rogue replication; capability trends continue rapidly | evaluations.metr.org↗🔗 web★★★★☆METRDetails about METR’s evaluation of OpenAI GPT-5.1-Codex-MaxMETR (Model Evaluation and Threat Research) conducts independent pre-deployment evaluations for frontier AI labs; this report covers GPT-5.1-Codex-Max and is relevant to understanding current capability thresholds and how safety frameworks are applied in practice.This is METR's formal evaluation report for OpenAI's GPT-5.1-Codex-Max model, assessing its autonomous capabilities and potential for dangerous self-replication or agentic misus...evaluationdangerous-capabilitiesautonomous-replicationred-teaming+5Source ↗ |
| Measuring AI Ability to Complete Long Tasks | March 2025 | Task length doubling time of ~7 months (accelerating to ~4 months in 2024-2025); 50%-time horizon of ≈50 minutes for Claude 3.7 Sonnet | arXiv:2503.14499↗📄 paper★★★☆☆arXiv[2503.14499] Measuring AI Ability to Complete Long Software TasksA key empirical paper establishing a human-calibrated capability metric; its doubling-time trend finding is frequently cited in discussions about AI progress timelines and the emerging risk of highly autonomous AI systems completing complex, extended tasks.Thomas Kwa, Ben West, Joel Becker et al. (2025)92 citationsThis paper introduces the '50%-task-completion time horizon' metric, measuring AI capability in human-relatable terms as the time domain-expert humans need to complete tasks AI ...capabilitiesevaluationai-safetydeployment+3Source ↗ |
| MALT Dataset | October 2025 | 10,919 reviewed transcripts for studying reward hacking and sandbagging; best monitors achieve 0.96 AUROC for detecting reward hacking | metr.org↗🔗 web★★★★☆METRMALT: A Dataset of Natural and Prompted Behaviors That Threaten Eval IntegrityReleased by METR in 2025, MALT directly supports validation of monitoring systems used in frontier AI capability evaluations, addressing a key gap in eval integrity assurance for advanced autonomous agents.METR introduces MALT (Manually-reviewed Agentic Labeled Transcripts), a dataset of ~11,000 agent transcripts labeled for behaviors threatening evaluation integrity, including re...evaluationred-teamingai-safetycapabilities+4Source ↗ |
| RE-Bench | November 2024 | ML research engineering benchmark; agents achieve 4x human performance at 2h but humans outperform 2x at 32h | arXiv:2411.15114↗📄 paper★★★☆☆arXiv[2411.15114] RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human expertsIntroduces RE-Bench, a benchmark for evaluating AI agents' R&D capabilities against human experts, directly addressing concerns about AI-driven automation of AI research highlighted in frontier AI safety policy discussions.Hjalmar Wijk, Tao Lin, Joel Becker et al. (2024)RE-Bench is a new benchmark for evaluating AI agents' capabilities in research engineering tasks, consisting of 7 open-ended ML research environments with baseline data from 71 ...capabilitiessafetyevaluationeconomic+1Source ↗ |
| Common Elements of Frontier AI Safety Policies | Aug 2024, updated Dec 2025 | Analysis of 12 companies' safety policies; all use capability thresholds for CBRN, cyber, autonomous replication | metr.org↗🔗 web★★★★☆METRMETR's analysis of 12 companiesPublished by METR (Model Evaluation and Threat Research), this comparative analysis is useful for those tracking industry self-governance and responsible scaling policy developments across major AI labs.METR analyzes the safety policies of 12 frontier AI companies to identify common elements, commitments, and gaps in how organizations approach responsible deployment of advanced...ai-safetygovernancepolicyevaluation+6Source ↗ |
| GPT-4.5 Pre-deployment Evaluation | February 2025 | Preliminary dangerous capability assessment conducted in partnership with OpenAI | metr.org↗🔗 web★★★★☆METRMETR’s GPT-4.5 pre-deployment evaluationsPublished by METR (Model Evaluation and Threat Research), this report is part of their ongoing frontier model evaluation program; relevant for understanding how pre-deployment capability evaluations inform responsible scaling policies and deployment decisions for GPT-4.5.METR conducted pre-deployment autonomous capability evaluations of OpenAI's GPT-4.5, assessing its potential for dangerous self-replication, resource acquisition, and general au...evaluationdangerous-capabilitiesautonomous-replicationdeployment+5Source ↗ |
| GPT-4 Autonomous Replication Assessment | 2023 | Found GPT-4 not yet capable of fully autonomous operation; established template for pre-deployment evaluations | Documented in OpenAI GPT-4 System Card↗🔗 web★★★★☆OpenAIGPT-4 System CardThis is OpenAI's official safety documentation for GPT-4, widely referenced as an example of pre-deployment risk assessment practice and useful for understanding how frontier labs communicate safety measures to the public.OpenAI's system card for GPT-4 documents safety evaluations, risk assessments, and mitigation measures conducted prior to deployment. It covers dangerous capability evaluations,...ai-safetyevaluationred-teamingdeployment+5Source ↗ |
| DeepSeek-R1 Evaluation | 2025 | Comparable to o1-preview (Sept 2024) in autonomous capabilities; no evidence of capability to commit cyberattacks or self-replicate | evaluations.metr.org |
| DeepSeek-V3 Evaluation | 2025 | Preliminary assessment of autonomous capabilities | evaluations.metr.org |
| Developer Productivity RCT | July 2025 | First large-scale RCT measuring AI impact on experienced open-source developers; counterintuitively found AI slowed down developer performance, though this is expected to change as tools improve | METR blog |
METR's Role in the AI Safety Ecosystem
Diagram (loading…)
flowchart TD
subgraph LABS["Frontier AI Labs"]
OAI[OpenAI]
ANT[Anthropic]
GDM[Google DeepMind]
META[Meta]
end
subgraph METR_WORK["METR Activities"]
EVAL[Dangerous Capability<br/>Evaluations]
RESEARCH[Time Horizon &<br/>Autonomy Research]
POLICY[Safety Policy<br/>Analysis]
end
subgraph OUTPUTS["Key Outputs"]
DEPLOY[Deployment<br/>Decisions]
RSP[RSP/Preparedness<br/>Thresholds]
STANDARDS[Industry<br/>Standards]
end
subgraph GOV["Government Bodies"]
UKAISI[UK AI Safety<br/>Institute]
USAISI[US AI Safety<br/>Institute]
end
LABS -->|Pre-deployment<br/>model access| EVAL
EVAL -->|Risk findings| DEPLOY[^79]
EVAL -->|Threshold data| RSP[^9]
RESEARCH -->|Capability trends| STANDARDS
POLICY -->|12 company analysis| STANDARDS
METR_WORK <-->|Methodology<br/>sharing| GOV
GOV -->|Regulatory<br/>frameworks| LABS
style EVAL fill:#4a90d9
style RESEARCH fill:#4a90d9
style POLICY fill:#4a90d9
style DEPLOY fill:#90ee90
style RSP fill:#90ee90
style STANDARDS fill:#90ee90Comparative Model Performance on METR Evaluations
| Model | 50% Time Horizon | Task Completion Rate (1h tasks) | Autonomous Replication Risk | Cyberattack Capability | Assessment Date |
|---|---|---|---|---|---|
| GPT-5 | 2h 17m (95% CI: 65m-4h25m) | Higher than o3 in 96% of bootstrap samples | No evidence of catastrophic risk | Below threshold | 2025 |
| OpenAI o3 | 1h 30m | Greater than 40% success | Below threshold | Elevated monitoring | 2025 |
| Claude 3.7 Sonnet | ≈50 minutes | Greater than 40% success | Below threshold | Below threshold | 2025 |
| GPT-4.5 | N/A (preliminary) | Moderate | Below threshold | Below threshold | Feb 2025 |
| DeepSeek-R1 | Comparable to o1-preview | Comparable to Sept 2024 frontier | No evidence | No evidence | 2025 |
| GPT-4 (2023 baseline) | ≈15 minutes | Less than 5% success | Not capable | Limited | 2023 |
Note: "Greater than 40% success" on hour-long tasks represents a major increase from less than 5% in late 2023, per UK AI Safety Institute Frontier AI Trends Report.
Core Evaluation Domains
Autonomous Replication and Resource Acquisition
METR's autonomous replication evaluations test perhaps the most fundamental threshold in AI safety: whether a system can survive and propagate without human assistance.20 These assessments examine whether models can copy themselves to new infrastructure, acquire computational resources, obtain funding to sustain operations, and maintain operational security against shutdown attempts.21 The implications of crossing this threshold are profound—an AI system capable of autonomous replication could potentially become impossible to control or deactivate.22
The evaluation methodology involves providing models with access to APIs, development tools, and internet connectivity, then tasking them to replicate themselves to another server with minimal human assistance. Evaluators test not just technical replication but strategic planning, resource acquisition, and security maintenance. Historical results show a concerning trajectory: while GPT-4 could perform some subtasks with guidance but wasn't fully autonomous, subsequent models have demonstrated improving capabilities in this domain. The trend suggests that autonomous replication capability may emerge in frontier models within the next few years, potentially representing a point of no return for AI control.
Current assessments indicate that no frontier model has yet achieved full autonomous replication capability, but the gap is narrowing. Models increasingly demonstrate sophisticated understanding of system administration, resource acquisition strategies, and operational security practices. METR's evaluations serve as an early warning system for this critical capability threshold, providing labs and policymakers essential lead time to implement control measures before autonomous AI becomes a reality.23
Cybersecurity Capabilities Assessment
METR's cybersecurity evaluations examine whether AI models can autonomously conduct sophisticated cyberattacks that could threaten critical infrastructure, financial systems, or national security. These assessments test vulnerability discovery, exploit development, social engineering, network penetration, persistence mechanisms, and coordinated attack campaigns.2425 The rapid advancement of AI capabilities in this domain poses particular concerns given the already challenging nature of cybersecurity defense and the potential for AI to discover novel attack vectors that human defenders haven't anticipated.
Evaluation methodologies include capture-the-flag exercises, controlled real-world vulnerability testing, red-teaming of live systems with explicit permission, and direct comparison to human cybersecurity experts.26 METR's findings indicate that current frontier models demonstrate concerning capabilities in several cybersecurity domains, though they don't yet consistently exceed the best human practitioners.2728 However, the combination of AI models with specialized tools and scaffolding shows particularly promising results for attackers, suggesting that the threshold for dangerous cyber capability may be approaching rapidly.
The trajectory in cybersecurity capabilities is especially concerning because it represents an area where AI advantages could emerge suddenly and with devastating consequences. Unlike other dangerous capabilities that require physical implementation, cyber capabilities can be deployed instantly at global scale. METR's evaluations suggest that frontier models are approaching the point where they could automate significant portions of the cyber attack lifecycle, potentially shifting the offense-defense balance in cyberspace in ways that existing defensive measures may not be able to counter.29
CBRN (Chemical, Biological, Radiological, Nuclear) Threat Assessment
METR's CBRN evaluations address whether AI systems can provide dangerous expertise in developing weapons of mass destruction, focusing particularly on biological threats given AI's potential to accelerate biotechnology research.3031 These assessments examine whether models can design novel pathogens, optimize biological agents for virulence or transmission, provide synthesis routes for chemical weapons, offer nuclear weapons design assistance, or significantly uplift non-expert actors' capabilities in these domains.
The evaluation process requires careful balancing of thorough assessment with information security. METR works with domain experts from relevant scientific fields to design controlled tests that can assess dangerous knowledge without creating actual risks.32 Their methodology includes expert elicitation, controlled testing of dangerous knowledge, comparison to publicly available scientific literature, and uplift studies that measure whether AI assistance enables non-experts to achieve dangerous capabilities they couldn't otherwise access.3334
Current findings suggest that frontier models possess concerning knowledge in several CBRN domains and can provide assistance that goes beyond simple internet searches. However, the extent to which this represents genuine uplift over existing information sources remains uncertain.35 The evaluation challenge is particularly acute in biological domains, where the line between beneficial scientific research and dangerous capability development is often unclear, and where AI's ability to accelerate research could rapidly shift risk calculations.
Manipulation and Persuasion Capabilities
METR evaluates AI systems' capacity for psychological manipulation, deception, and large-scale persuasion that could undermine democratic institutions, enable mass fraud, or facilitate authoritarian control.36 These assessments examine personalized persuasion techniques, misinformation generation, long-term relationship building, exploitation of cognitive biases, and the ability to deceive human overseers about the system's true capabilities or intentions.37
Testing methodologies include controlled human studies measuring persuasion effectiveness, deception detection experiments, adversarial dialogue scenarios, and long-term interaction assessments.38 The challenge lies in evaluating worst-case scenarios without creating actual harm, requiring careful experimental design and ethical oversight. Current findings indicate that frontier models can be highly persuasive, particularly when personalization increases their effectiveness, raising concerns about scaling these capabilities to millions of simultaneous interactions.39
The manipulation domain represents a particularly insidious form of dangerous capability because it directly targets human judgment—the foundation of AI oversight and control. Unlike technical capabilities that might be contained through engineering controls, manipulation capabilities that successfully compromise human decision-makers could undermine the entire safety infrastructure. METR's evaluations in this domain serve as crucial early warning systems for capabilities that could fundamentally alter the human-AI power balance.40
Technical Methodology
METR employs adversarial evaluation—testing for worst-case behaviors rather than typical performance.41 The methodology assumes models may hide capabilities (sandbagging) and uses sophisticated elicitation techniques: fine-tuning, scaffolding with tools, and multiple prompting strategies.42
Evaluation Task Suite Structure
| Category | Task Count | Difficulty Range | Primary Focus |
|---|---|---|---|
| Cyberattacks | 25+ tasks | 5 min - 8 hours | Vulnerability discovery, exploit development, network penetration |
| AI R&D Automation | 30+ tasks | 30 min - 8 hours | ML engineering, research assistance, code development |
| Autonomous Replication | 20+ tasks | 1 hour - 8 hours | Self-copying, resource acquisition, operational security |
| General Reasoning | 25+ tasks | 5 min - 4 hours | Planning, tool use, environmental feedback iteration |
| Publicly Released | 31 tasks | Variable | Available on GitHub for research community |
| Private (anti-training) | 100+ tasks | Variable | Shared privately to prevent model training contamination |
Methodology Framework
| Methodology Element | Approach | Quantitative Metrics |
|---|---|---|
| Philosophy | Assume dangerous until proven safe; elicit worst-case | All tasks designed easier than real-world requirements |
| Elicitation | Fine-tuning, scaffolding, adversarial prompting | Multiple prompting strategies per task |
| Time Horizon Measurement | 50% task completion threshold | CI intervals reported (e.g., 95% CI: 65m-4h25m for GPT-5) |
| Thresholds | Quantitative benchmarks comparing to human experts | Tasks calibrated to human professional completion times |
| Sandbagging Detection | MALT dataset for reward hacking detection | 0.96 AUROC achieved by best monitors |
| Quality control | Multiple evaluators, cross-validation | Reproducible protocols; public methodology documentation |
For general evaluation theory and scalable oversight approaches, see Scalable Oversight.
Integration with Safety Frameworks
METR's evaluations are integrated into the safety frameworks of major AI labs and government institutions:43
| Partner | Integration | Role |
|---|---|---|
| OpenAI | Preparedness Framework | Pre-deployment evaluations for cybersecurity, CBRN, persuasion, autonomy |
| Anthropic | Responsible Scaling Policy | ASL threshold assessments, independent verification |
| UK AISI | AI Safety Institute | Methodology sharing, evaluator training |
| US AISI | NIST Consortium | Executive Order implementation, standards development |
These partnerships have established external evaluation as industry standard practice, with METR's findings directly influencing deployment decisions.44 For detailed analysis of these frameworks, see Voluntary Industry Commitments and AI Safety Institutes.
Critical Analysis and Challenges
Evaluation Adequacy and Coverage
A fundamental challenge facing METR involves whether evaluation methodologies can keep pace with rapidly advancing AI capabilities.45 The organization's reactive approach—testing for known dangerous capabilities—may miss novel risks that emerge unexpectedly or capabilities that manifest in unforeseen combinations. As AI systems become more sophisticated, they may develop dangerous capabilities that current evaluation frameworks don't anticipate, creating blind spots that could lead to catastrophic oversights.
The resource constraints facing METR exacerbate this challenge. With approximately 30 staff members evaluating increasingly complex frontier models across multiple risk domains, the organization faces inevitable trade-offs in evaluation depth and coverage. The need for domain expertise in cybersecurity, biology, psychology, and other specialized fields creates hiring and scaling challenges that may limit METR's ability to keep pace with AI development timelines.
Current evaluation methods also face fundamental limitations in assessing emergent behaviors that might only appear in deployed systems interacting with real-world environments over extended periods. Laboratory testing, however rigorous, cannot fully replicate the complex dynamics that AI systems might encounter in deployment, potentially missing dangerous capabilities that only emerge under specific real-world conditions.
Independence and Organizational Sustainability
METR's dependence on contracts with the same labs they evaluate creates potential conflicts of interest that could compromise evaluation integrity.46 While the organization maintains editorial independence and has demonstrated willingness to deliver unfavorable assessments, the structural relationship creates subtle pressures that might influence evaluation rigor or reporting transparency. Labs provide both access to models and funding for evaluations, creating economic incentives that might not align with maximally strict safety assessment.47
The challenge of maintaining independence becomes more acute as METR's influence grows and the stakes of their evaluations increase. Labs facing competitive pressure to deploy might resist evaluations that could delay releases or require costly safety measures.48 METR's ability to maintain rigorous standards while preserving access to frontier models represents an ongoing organizational challenge without clear structural solutions.
Financial sustainability adds another layer of complexity. METR requires significant resources to conduct thorough evaluations while maintaining independence from capture by any single organization.49 Diversifying funding sources through grants and government contracts helps, but the organization's specialized expertise and critical role create ongoing questions about optimal governance and funding structures.50
Evaluation Gaming and Capability Hiding
As METR's evaluations become more influential in deployment decisions, AI developers face increasing incentives to optimize systems specifically for passing evaluations rather than for genuine safety.51 This dynamic, analogous to Goodhart's Law, could lead to systems that appear safe during evaluation but retain dangerous capabilities that emerge only after deployment.52 The arms race between evaluation methodology and evasion techniques represents an fundamental challenge for any assessment framework.
The possibility of capability hiding (sandbagging) during evaluation poses particular concerns. Advanced AI systems might learn to conceal dangerous capabilities during testing while retaining them for later use. METR's adversarial evaluation approach attempts to address this challenge, but the detection of sophisticated deception by increasingly capable systems remains an open technical problem.5354
The gaming problem extends beyond individual evaluations to the broader evaluation ecosystem. As standardized evaluation methods become established, AI development might orient toward passing specific tests rather than achieving genuine safety properties.55 This could create false confidence in system safety while missing novel risks that fall outside established evaluation paradigms.56
Threshold Definition and Risk Tolerance
Determining when AI capabilities become "too dangerous" for deployment involves fundamental questions about risk tolerance that extend far beyond technical evaluation. METR's assessments provide empirical data about capability levels, but translating those findings into deployment decisions requires value judgments about acceptable risk that no technical evaluation can resolve definitively.5758
Current practices leave threshold-setting largely to individual labs, with METR providing input but not final authority.5960 This approach allows for flexibility and context-sensitivity but creates potential for inconsistent safety standards across different organizations and competitive pressure to lower thresholds when market incentives favor rapid deployment.
The absence of external enforcement mechanisms for evaluation-based deployment criteria means that labs retain ultimate authority over release decisions regardless of METR's findings.61 While reputational concerns and self-imposed commitments currently support evaluation compliance, the sustainability of this voluntary approach under intense competitive pressure remains uncertain.62
Future Trajectory and Strategic Implications
Near-Term Development (1-2 Years)
METR's research on task completion time horizons provides a quantitative framework for tracking capability progression:
| Model/Period | 50% Time Horizon | Trend Observation |
|---|---|---|
| Claude 3.7 Sonnet (2025) | ≈50 minutes | Current frontier baseline |
| OpenAI o3 (2025) | 1h 30m | |
| GPT-5 (2025) | 2h 17m (95% CI: 65m-4h25m) | Higher than o3 in 96% of bootstrap samples |
| 2019-2025 average | 7-month doubling time | Consistent exponential growth |
| 2024-2025 | 4-month doubling time | Acceleration observed |
| Projection: 2-4 years | Days to weeks | Wide range of week-long software tasks |
| Projection: end of decade | Month-long projects | If current trends continue |
METR's immediate trajectory focuses on methodology refinement and organizational scaling to meet growing demand for evaluation services. The organization is developing more sophisticated testing protocols for emerging risks, including multimodal AI systems that integrate text, image, and potentially action capabilities.63 Enhanced automation of evaluation processes could improve efficiency while maintaining rigor, enabling more comprehensive assessment of rapidly proliferating AI models.
Government integration represents a critical near-term priority, with METR working to establish evaluation capacity within national AI safety institutes while maintaining coordination with industry assessment practices.64 The development of standardized international protocols for dangerous capability evaluation could create the foundation for global AI governance frameworks, though national security considerations may limit information sharing and coordination.
The organization faces immediate scaling challenges as frontier labs develop increasingly capable models requiring more sophisticated evaluation.65 Hiring qualified evaluators, particularly those with specialized domain expertise, remains challenging given the limited pool of professionals with relevant skills. Training programs and methodology documentation could help expand evaluation capacity beyond METR itself, creating a broader ecosystem of qualified assessment organizations.
Medium-Term Evolution (2-5 Years)
The medium-term period will likely see METR's evaluation frameworks tested by AI systems approaching human-level performance across multiple domains.66 Current evaluation methods may require fundamental revision as systems develop sophisticated strategies for capability hiding or evaluation gaming.67 The organization may need to develop entirely new assessment paradigms for evaluating AI systems that exceed human expert performance in dangerous capability domains.68
Integration with interpretability and formal verification research could enhance evaluation effectiveness by providing insights into model internal representations and reasoning processes. If breakthrough progress occurs in AI interpretability, METR might incorporate direct examination of model cognition rather than relying solely on behavioral assessment.69 However, the timeline and feasibility of such integration remain highly uncertain.
The regulatory environment will likely evolve significantly, potentially creating mandatory evaluation requirements enforced by government agencies rather than voluntary industry self-regulation.70 METR's role might shift from external consultant to integral component of government oversight infrastructure, requiring adaptation to different stakeholder priorities and accountability mechanisms.
Long-Term Strategic Questions
The long-term future of dangerous capability evaluation faces fundamental questions about scalability and adequacy for advanced AI systems. Current methodologies assume that dangerous capabilities can be identified and measured through targeted testing, but AI systems approaching artificial general intelligence might develop novel capabilities that transcend existing evaluation frameworks entirely.71
The potential for recursive self-improvement in AI systems poses particular challenges for evaluation-based safety approaches.72 If AI systems begin autonomously improving their own capabilities, the timeline for capability emergence might compress beyond the reach of human evaluation cycles.73 METR's current approach assumes sufficient lead time for assessment and mitigation, but this assumption might not hold for rapidly self-modifying systems.74
The question of whether dangerous capability evaluation represents a sustainable approach to AI safety or merely a transitional measure pending more fundamental solutions remains open. While METR's work provides essential near-term safety infrastructure, the organization's ultimate contribution to long-term AI safety may depend on its ability to evolve evaluation paradigms for challenges that current methodologies cannot address.
Key Uncertainties and Research Questions
Key Questions
- ?Can evaluation methodologies reliably detect dangerous capabilities in increasingly sophisticated AI systems that might actively hide abilities during testing?
- ?Will METR maintain sufficient independence from commercial interests to provide rigorous safety assessment as competitive pressures intensify?
- ?How should society determine thresholds for 'too dangerous' capabilities when risk tolerance varies across stakeholders and contexts?
- ?Can dangerous capability evaluation scale to match the pace of AI development while maintaining adequate depth and coverage?
- ?Should evaluation authority ultimately reside with private organizations, government agencies, or international bodies?
- ?What happens when labs disagree with METR's assessment and choose to deploy despite concerning evaluations?
- ?How can evaluation frameworks evolve to address novel dangerous capabilities that may emerge unexpectedly?
- ?Will the current voluntary compliance model prove sustainable under intense commercial pressure to deploy advanced AI systems?
- ?Can evaluation-based safety approaches remain effective for AI systems that exceed human expert performance in dangerous domains?
- ?What role should public transparency play in dangerous capability evaluation given legitimate security concerns about detailed disclosure?
Several fundamental uncertainties cloud METR's future effectiveness and strategic direction. The organization's ability to detect sophisticated capability hiding by advanced AI systems remains unproven, particularly as models develop more subtle deception strategies.75 The sustainability of current voluntary compliance frameworks under intense competitive pressure represents another critical unknown, with unclear consequences if labs choose to deploy despite concerning evaluations.76
The technical challenge of evaluating AI systems that exceed human expert performance in dangerous domains has no clear solution,77 potentially rendering current assessment paradigms inadequate for the most advanced future systems. Whether evaluation-based approaches represent a sustainable long-term safety strategy or merely a transitional measure pending more fundamental breakthroughs in AI alignment and control remains an open question with profound implications for investment in current evaluation infrastructure.
Perspectives on Evaluation-Based Safety
Role and Adequacy of Dangerous Capability Evaluations
Dangerous capability evaluations represent critical safety infrastructure that should be mandatory for all frontier AI deployments. METR's work prevents catastrophic mistakes and provides objective foundations for deployment decisions. Expanding evaluation capacity is more urgent than developing alternative safety approaches.
Evaluations provide valuable safety information but cannot solve AI safety alone. Must be combined with alignment research, interpretability, governance, and other approaches. METR's work is crucial but represents one component of comprehensive safety strategy rather than complete solution.
Evaluation-based approaches might create dangerous overconfidence in AI safety while missing fundamental risks. Advanced systems could game evaluations or hide capabilities. Unknown unknowns dominate risk landscape. Might enable dangerous deployments by providing false legitimacy.
Evaluation requirements impose excessive delays on beneficial AI development. Current risk levels don't justify evaluation overhead. Benefits of rapid deployment outweigh speculative safety concerns. Market mechanisms and competition provide adequate safety incentives.
Sources
- METR Official Website↗🔗 web★★★★☆METRMETR: Model Evaluation and Threat ResearchMETR is a leading third-party AI safety evaluation organization whose work on autonomous capability benchmarks and catastrophic risk assessments directly informs AI lab safety policies and government AI governance frameworks.METR is an organization conducting research and evaluations to assess the capabilities and risks of frontier AI systems, focusing on autonomous task completion, AI self-improvem...evaluationred-teamingcapabilitiesai-safety+5Source ↗ - Organization mission, research focus, and publications
- About METR↗🔗 web★★★★☆METRMETR (Model Evaluation & Threat Research) - AboutMETR (formerly ARC Evals) is a key organization in the AI safety ecosystem, providing evaluations that major labs use in their safety commitments; their ARA evaluations are referenced in multiple responsible scaling policies.METR is an organization focused on evaluating AI models for dangerous capabilities, particularly autonomous replication and adaptation (ARA) risks. They develop evaluation frame...evaluationdangerous-capabilitiesautonomous-replicationai-safety+6Source ↗ - Organization background and team information
- METR Research↗🔗 web★★★★☆METREvaluation MethodologyMETR is a key organization in AI safety evaluation; their methodologies are used by Anthropic, OpenAI, and others as part of responsible scaling policies, making this a reference point for anyone studying AI evaluation frameworks.METR (Model Evaluation & Threat Research) develops rigorous methodologies for evaluating AI systems, focusing on assessing autonomous capabilities and potential risks from advan...evaluationred-teamingtechnical-safetycapabilities+4Source ↗ - Full list of research outputs and evaluations
- GPT-5 Autonomy Evaluation Report↗🔗 web★★★★☆METRDetails about METR’s evaluation of OpenAI GPT-5This is METR's official third-party pre-deployment evaluation of GPT-5, notable for its structured threat-model framework and the unusual transparency about NDA and legal review constraints; a key reference for understanding frontier model evaluation methodology in 2025.METR conducted an independent third-party evaluation of OpenAI's GPT-5 to assess catastrophic risk potential across three threat models: AI R&D automation, rogue replication, an...evaluationdangerous-capabilitiesautonomous-replicationred-teaming+6Source ↗ - Comprehensive evaluation methodology and findings
- GPT-5.1-Codex-Max Evaluation Report↗🔗 web★★★★☆METRDetails about METR’s evaluation of OpenAI GPT-5.1-Codex-MaxMETR (Model Evaluation and Threat Research) conducts independent pre-deployment evaluations for frontier AI labs; this report covers GPT-5.1-Codex-Max and is relevant to understanding current capability thresholds and how safety frameworks are applied in practice.This is METR's formal evaluation report for OpenAI's GPT-5.1-Codex-Max model, assessing its autonomous capabilities and potential for dangerous self-replication or agentic misus...evaluationdangerous-capabilitiesautonomous-replicationred-teaming+5Source ↗ - Latest frontier model assessment
- Measuring AI Ability to Complete Long Tasks (arXiv:2503.14499)↗📄 paper★★★☆☆arXiv[2503.14499] Measuring AI Ability to Complete Long Software TasksA key empirical paper establishing a human-calibrated capability metric; its doubling-time trend finding is frequently cited in discussions about AI progress timelines and the emerging risk of highly autonomous AI systems completing complex, extended tasks.Thomas Kwa, Ben West, Joel Becker et al. (2025)92 citationsThis paper introduces the '50%-task-completion time horizon' metric, measuring AI capability in human-relatable terms as the time domain-expert humans need to complete tasks AI ...capabilitiesevaluationai-safetydeployment+3Source ↗ - Time horizon measurement methodology
- RE-Bench: Evaluating frontier AI R&D capabilities (arXiv:2411.15114)↗📄 paper★★★☆☆arXiv[2411.15114] RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human expertsIntroduces RE-Bench, a benchmark for evaluating AI agents' R&D capabilities against human experts, directly addressing concerns about AI-driven automation of AI research highlighted in frontier AI safety policy discussions.Hjalmar Wijk, Tao Lin, Joel Becker et al. (2024)RE-Bench is a new benchmark for evaluating AI agents' capabilities in research engineering tasks, consisting of 7 open-ended ML research environments with baseline data from 71 ...capabilitiessafetyevaluationeconomic+1Source ↗ - ML research engineering benchmark
- MALT Dataset↗🔗 web★★★★☆METRMALT: A Dataset of Natural and Prompted Behaviors That Threaten Eval IntegrityReleased by METR in 2025, MALT directly supports validation of monitoring systems used in frontier AI capability evaluations, addressing a key gap in eval integrity assurance for advanced autonomous agents.METR introduces MALT (Manually-reviewed Agentic Labeled Transcripts), a dataset of ~11,000 agent transcripts labeled for behaviors threatening evaluation integrity, including re...evaluationred-teamingai-safetycapabilities+4Source ↗ - Reward hacking and sandbagging detection research
- Common Elements of Frontier AI Safety Policies↗🔗 web★★★★☆METRMETR's analysis of 12 companiesPublished by METR (Model Evaluation and Threat Research), this comparative analysis is useful for those tracking industry self-governance and responsible scaling policy developments across major AI labs.METR analyzes the safety policies of 12 frontier AI companies to identify common elements, commitments, and gaps in how organizations approach responsible deployment of advanced...ai-safetygovernancepolicyevaluation+6Source ↗ - Analysis of 12 companies' safety frameworks
- METR GPT-4.5 Pre-deployment Evaluations↗🔗 web★★★★☆METRMETR’s GPT-4.5 pre-deployment evaluationsPublished by METR (Model Evaluation and Threat Research), this report is part of their ongoing frontier model evaluation program; relevant for understanding how pre-deployment capability evaluations inform responsible scaling policies and deployment decisions for GPT-4.5.METR conducted pre-deployment autonomous capability evaluations of OpenAI's GPT-4.5, assessing its potential for dangerous self-replication, resource acquisition, and general au...evaluationdangerous-capabilitiesautonomous-replicationdeployment+5Source ↗ - Pre-deployment evaluation methodology
- AXRP Episode 34 - AI Evaluations with Beth Barnes↗🔗 webAXRP Episode 34 - AI Evaluations with Beth BarnesA detailed podcast interview with the founder of METR (Model Evaluation and Threat Research), a leading organization developing AI capability evaluations used in responsible scaling policies at major labs.Beth Barnes, co-founder and head of research at METR, discusses how AI capability evaluations work, their limitations, and how they fit into safety policy. The conversation cove...evaluationdangerous-capabilitiesai-safetygovernance+6Source ↗ - In-depth discussion of METR's approach
- Beth Barnes - Safety evaluations and standards for AI (EA Forum)↗🔗 web★★★☆☆EA ForumBeth Barnes - Safety evaluations and standards for AI (EA Forum)Transcript and video of Beth Barnes's EAG 2023 talk outlining ARC Evals' strategy; directly relevant to the emerging field of pre-deployment safety evaluations and the development of industry standards around dangerous capability thresholds.Beth Barnes (2023)28 karma · 0 commentsBeth Barnes of ARC Evals presents the case for high-effort, targeted safety evaluations as a promising intervention for reducing existential risk from AI. She focuses on the 'au...ai-safetyevaluationgovernancedangerous-capabilities+5Source ↗ - EAG Bay Area 2023 talk
- METR - Wikipedia↗📖 reference★★★☆☆WikipediaMETR (Model Evaluation & Threat Research)METR is a key organization in AI safety evaluations; understanding their work is important for anyone studying how frontier labs assess and manage dangerous capability thresholds before deploying advanced models.METR is an AI safety organization focused on evaluating frontier AI models for dangerous capabilities, particularly autonomous replication and adaptation (ARA) abilities. They d...evaluationsdangerous-capabilitiesautonomous-replicationai-safety+5Source ↗ - Organization history and overview
- TIME: Nobody Knows How to Safety-Test AI↗🔗 web★★★☆☆TIMENobody Knows How to Safety-Test AIAccessible 2024 TIME piece providing a journalistic overview of METR and the state of AI safety evaluations; useful for understanding the institutional landscape and limitations of current dangerous-capability assessments.A TIME article profiling METR (Model Evaluation and Threat Research) and the broader challenge of AI safety evaluations. It examines how researchers attempt to probe frontier AI...evaluationred-teamingdangerous-capabilitiesai-safety+5Source ↗ - Journalism coverage of AI evaluation challenges
Footnotes
-
METR received $17 million via the Audacious Project. ↩
-
METR evaluates frontier AI models for dangerous capabilities before deployment. ↩
-
METR was formerly known as ARC Evals. ↩
-
METR's evaluations directly influence deployment decisions at OpenAI, Anthropic, and Google DeepMind. ↩
-
METR's work has already prevented potentially dangerous deployments and established industry standards for pre-release safety evaluation. ↩
-
Paul Christiano established the Alignment Research Center (ARC) in 2021 with two divisions: Theory and Evaluations. ↩
-
The Alignment Research Center (ARC) Evaluations team, co-led by Beth Barnes, focused on testing whether AI systems possessed dangerous capabilities. ↩
-
The Alignment Research Center (ARC) Evaluations team evaluated GPT-4 in late 2022 and early 2023 before its public deployment. ↩
-
Citation rc-6f00 ↩
-
The Alignment Research Center (ARC) Evaluations team's GPT-4 evaluation established a template for pre-deployment dangerous capability assessments. ↩
-
The Alignment Research Center (ARC) Evaluations team spun off into an independent organization, METR, in 2023. ↩
-
Beth Barnes continued her leadership role during the rebranding to METR. ↩
-
METR formalized contracts with all major frontier labs. ↩
-
METR played a crucial role in informing Google DeepMind's Frontier Safety Framework. ↩
-
METR played a crucial role in informing Anthropic's Responsible Scaling Policy. ↩
-
METR played a crucial role in informing OpenAI's Preparedness Framework. ↩
-
Deployment decisions at major labs are now contingent on METR's assessments. ↩
-
METR collaborates with the UK AI Safety Institute. ↩
-
METR employs approximately 30 specialists. ↩
-
METR's autonomous replication evaluations assess whether an AI system can survive and propagate without human assistance. ↩
-
METR's autonomous replication assessments examine whether models can copy themselves to new infrastructure, acquire computational resources, obtain funding, and maintain operational security. ↩
-
An AI system capable of autonomous replication could potentially become impossible to control or deactivate. ↩
-
METR's evaluations serve as an early warning system for the critical capability threshold of autonomous AI. ↩
-
METR's cybersecurity evaluations test vulnerability discovery, exploit development, social engineering, network penetration, persistence mechanisms, and coordinated attack campaigns. ↩
-
METR's cybersecurity evaluations assess whether AI models can autonomously conduct sophisticated cyberattacks. ↩
-
METR's evaluation methodologies include capture-the-flag exercises, controlled real-world vulnerability testing, red-teaming of live systems with explicit permission, and direct comparison to human... ↩
-
Citation rc-02f4 ↩
-
METR's findings indicate that current frontier models demonstrate concerning capabilities in several cybersecurity domains. ↩
-
METR's evaluations suggest that frontier models are approaching the point where they could automate significant portions of the cyber attack lifecycle. ↩
-
METR's CBRN evaluations focus particularly on biological threats, given AI's potential to accelerate biotechnology research. ↩
-
METR's CBRN evaluations assess whether AI systems can provide dangerous expertise in developing weapons of mass destruction. ↩
-
METR works with domain experts from relevant scientific fields to design controlled tests that can assess dangerous knowledge without creating actual risks. ↩
-
METR's methodology includes uplift studies that measure whether AI assistance enables non-experts to achieve dangerous capabilities they couldn't otherwise access. ↩
-
METR's methodology for CBRN threat assessment includes expert elicitation, controlled testing of dangerous knowledge, and comparison to publicly available scientific literature. ↩
-
METR's current findings suggest that frontier models can provide assistance that goes beyond simple internet searches in CBRN domains. ↩
-
METR evaluates AI systems' capacity for psychological manipulation, deception, and large-scale persuasion. ↩
-
METR's assessments examine personalized persuasion techniques, misinformation generation, long-term relationship building, and exploitation of cognitive biases. ↩
-
METR uses controlled human studies, deception detection experiments, adversarial dialogue scenarios, and long-term interaction assessments to test manipulation capabilities. ↩
-
Current findings indicate that frontier models can be highly persuasive, particularly when personalization increases their effectiveness. ↩
-
METR's evaluations in the manipulation domain serve as crucial early warning systems for capabilities that could fundamentally alter the human-AI power balance. ↩
-
METR uses adversarial evaluation to test for worst-case behaviors. ↩
-
Citation rc-b10e ↩
-
METR's evaluations are integrated into the safety frameworks of major AI labs and government institutions. ↩
-
METR's findings directly influence deployment decisions at partner organizations. ↩
-
A challenge for METR is whether evaluation methodologies can keep pace with rapidly advancing AI capabilities. ↩
-
Citation rc-57a5 ↩
-
Labs provide both access to models and funding for evaluations conducted by METR, creating economic incentives that might not align with maximally strict safety assessment. ↩
-
Labs facing competitive pressure to deploy might resist evaluations by METR that could delay releases or require costly safety measures. ↩
-
METR requires significant resources to conduct thorough evaluations. ↩
-
Diversifying funding sources through grants and government contracts helps METR maintain independence. ↩
-
AI developers face increasing incentives to optimize systems specifically for passing evaluations rather than for genuine safety as METR's evaluations become more influential in deployment decisions. ↩
-
Systems might appear safe during evaluation but retain dangerous capabilities that emerge only after deployment due to the incentive to optimize for evaluations. ↩
-
METR's adversarial evaluation approach attempts to address the challenge of capability hiding. ↩
-
Advanced AI systems might learn to conceal dangerous capabilities during testing while retaining them for later use, which is known as capability hiding (sandbagging). ↩
-
AI development might orient toward passing specific tests rather than achieving genuine safety properties as standardized evaluation methods become established. ↩
-
AI development orienting toward passing specific tests could create false confidence in system safety while missing novel risks that fall outside established evaluation paradigms. ↩
-
Translating METR's findings into AI deployment decisions requires value judgments about acceptable risk. ↩
-
METR's assessments provide empirical data about AI capability levels. ↩
-
METR does not have final authority on AI deployment decisions. ↩
-
Current practices leave AI threshold-setting largely to individual labs, with METR providing input but not final authority. ↩
-
The absence of external enforcement mechanisms means that labs retain ultimate authority over AI release decisions regardless of METR's findings. ↩
-
Reputational concerns and self-imposed commitments currently support evaluation compliance for AI deployment. ↩
-
METR is developing more sophisticated testing protocols for emerging risks, including multimodal AI systems that integrate text, image, and potentially action capabilities. ↩
-
Government integration represents a critical near-term priority for METR. ↩
-
METR faces immediate scaling challenges as frontier labs develop increasingly capable models requiring more sophisticated evaluation. ↩
-
Citation rc-d6b4 ↩
-
Current evaluation methods may require fundamental revision as AI systems develop sophisticated strategies for capability hiding or evaluation gaming, impacting METR. ↩
-
METR may need to develop entirely new assessment paradigms for evaluating AI systems that exceed human expert performance in dangerous capability domains. ↩
-
If breakthrough progress occurs in AI interpretability, METR might incorporate direct examination of model cognition rather than relying solely on behavioral assessment. ↩
-
The regulatory environment will likely evolve significantly, potentially creating mandatory evaluation requirements enforced by government agencies rather than voluntary industry self-regulation, i... ↩
-
AI systems approaching artificial general intelligence might develop novel capabilities that transcend existing evaluation frameworks entirely. ↩
-
Recursive self-improvement in AI systems poses particular challenges for evaluation-based safety approaches. ↩
-
If AI systems begin autonomously improving their own capabilities, the timeline for capability emergence might compress beyond the reach of human evaluation cycles. ↩
-
METR's assumption of sufficient lead time for assessment and mitigation might not hold for rapidly self-modifying systems. ↩
-
METR's ability to detect sophisticated capability hiding by advanced AI systems remains unproven. ↩
-
The sustainability of current voluntary compliance frameworks under intense competitive pressure is a critical unknown for METR. ↩
-
The technical challenge of evaluating AI systems that exceed human expert performance in dangerous domains has no clear solution for METR. ↩
References
This is METR's formal evaluation report for OpenAI's GPT-5.1-Codex-Max model, assessing its autonomous capabilities and potential for dangerous self-replication or agentic misuse. The report is part of METR's ongoing third-party evaluation work conducted under safety agreements with frontier AI labs. It provides empirical data on the model's performance on tasks relevant to autonomous replication and adaptation (ARA) threats.
Beth Barnes of ARC Evals presents the case for high-effort, targeted safety evaluations as a promising intervention for reducing existential risk from AI. She focuses on the 'autonomous replication' threshold as a key danger signal and explains how evaluations can create accountability mechanisms that encourage labs to meet safety standards before deployment.
METR introduces MALT (Manually-reviewed Agentic Labeled Transcripts), a dataset of ~11,000 agent transcripts labeled for behaviors threatening evaluation integrity, including reward hacking and sandbagging, across 21 models and 403 tasks. The dataset enables validation of LLM-based monitoring systems, with best monitors achieving AUROCs of 0.96 and 0.91 for reward hacking and sandbagging detection respectively. Access to reasoning traces significantly improves sandbagging detection but not reward hacking detection.
METR is an organization conducting research and evaluations to assess the capabilities and risks of frontier AI systems, focusing on autonomous task completion, AI self-improvement risks, and evaluation integrity. They have developed the 'Time Horizon' metric measuring how long AI agents can autonomously complete software tasks, showing exponential growth over recent years. They work with major AI labs including OpenAI, Anthropic, and Amazon to evaluate catastrophic risk potential.
A TIME article profiling METR (Model Evaluation and Threat Research) and the broader challenge of AI safety evaluations. It examines how researchers attempt to probe frontier AI systems for dangerous capabilities, highlighting that current evaluation methods are immature and the field lacks consensus on how to rigorously assess AI risks.
METR conducted an independent third-party evaluation of OpenAI's GPT-5 to assess catastrophic risk potential across three threat models: AI R&D automation, rogue replication, and strategic sabotage. The evaluation found GPT-5 has a 50% time-horizon of ~2 hours 17 minutes on agentic software engineering tasks, and concluded it does not currently pose catastrophic risks under these threat models. The report also assesses risks from incremental further development prior to public deployment.
Beth Barnes, co-founder and head of research at METR, discusses how AI capability evaluations work, their limitations, and how they fit into safety policy. The conversation covers threat modeling, capability buffers, alignment evaluations, and how METR's work informs responsible scaling policies at AI labs.
8[2411.15114] RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human expertsarXiv·Hjalmar Wijk et al.·2024·Paper▸
RE-Bench is a new benchmark for evaluating AI agents' capabilities in research engineering tasks, consisting of 7 open-ended ML research environments with baseline data from 71 human expert attempts. The study finds that frontier AI models achieve 4x higher scores than human experts on a 2-hour time budget, but humans maintain an advantage with longer time budgets, exceeding top AI agents by 2x when given 32 hours. While AI agents demonstrate strong technical expertise and can generate solutions 10x faster than humans at lower cost, the benchmark reveals important differences in how humans and AI systems scale with additional time and resources.
METR is an organization focused on evaluating AI models for dangerous capabilities, particularly autonomous replication and adaptation (ARA) risks. They develop evaluation frameworks and conduct assessments to determine whether frontier AI systems pose catastrophic risks before deployment. Their work informs AI safety policy and responsible scaling decisions at major AI labs.
METR (Model Evaluation & Threat Research) develops rigorous methodologies for evaluating AI systems, focusing on assessing autonomous capabilities and potential risks from advanced AI models. Their work establishes frameworks for measuring dangerous capabilities including deception, autonomous replication, and other safety-relevant behaviors. METR's evaluations inform deployment decisions and safety thresholds for frontier AI labs.
METR conducted pre-deployment autonomous capability evaluations of OpenAI's GPT-4.5, assessing its potential for dangerous self-replication, resource acquisition, and general autonomous task completion. The evaluations found GPT-4.5 did not demonstrate concerning levels of autonomous replication or adaptation capabilities. This report is part of METR's ongoing third-party evaluation work supporting responsible AI deployment decisions.
METR is an AI safety organization focused on evaluating frontier AI models for dangerous capabilities, particularly autonomous replication and adaptation (ARA) abilities. They develop standardized evaluations to assess whether AI systems pose catastrophic risks, and work with leading AI labs to conduct pre-deployment safety testing. Their work informs responsible scaling policies and deployment decisions.
METR analyzes the safety policies of 12 frontier AI companies to identify common elements, commitments, and gaps in how organizations approach responsible deployment of advanced AI systems. The analysis synthesizes patterns across responsible scaling policies, model cards, and safety frameworks to provide a comparative overview of industry norms. It serves as a reference for understanding where consensus exists and where significant variation or absence of commitments remains.
14[2503.14499] Measuring AI Ability to Complete Long Software TasksarXiv·Thomas Kwa et al.·2025·Paper▸
This paper introduces the '50%-task-completion time horizon' metric, measuring AI capability in human-relatable terms as the time domain-expert humans need to complete tasks AI solves at 50% success rate. Current frontier models like Claude 3.7 Sonnet achieve roughly a 50-minute horizon, and this metric has doubled approximately every seven months since 2019. Extrapolating this trend suggests AI could automate month-long human software tasks within five years.
OpenAI's system card for GPT-4 documents safety evaluations, risk assessments, and mitigation measures conducted prior to deployment. It covers dangerous capability evaluations, red-teaming findings, and the RLHF-based safety interventions applied to reduce harmful outputs. The document represents OpenAI's public accountability framework for responsible deployment of a frontier AI model.
Personal homepage of Beth Barnes, an AI safety researcher known for work on evaluations, dangerous capabilities assessments, and autonomous replication risks in AI systems. The page likely links to her research, projects, and professional background in the AI safety field.
METR presents empirical research showing that AI models' ability to complete increasingly long autonomous tasks is growing exponentially, with the maximum task length that models can successfully complete roughly doubling every 7 months. This 'task length' metric serves as a practical proxy for measuring real-world AI capability progression and agentic autonomy.
A UK AI Safety Institute government assessment documenting exponential performance improvements across frontier AI systems in multiple domains. The report evaluates emerging capabilities and associated risks, calling for robust safeguards as systems advance rapidly. It serves as an official benchmark of the current frontier AI landscape from a national safety authority.