Longterm Wiki

Details about METR’s evaluation of OpenAI GPT-5.1-Codex-Max

Source type: web

Credibility Rating

High (4/5)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: METR

METR (Model Evaluation and Threat Research) conducts independent pre-deployment evaluations for frontier AI labs; this report covers GPT-5.1-Codex-Max and is relevant to understanding current capability thresholds and how safety frameworks are applied in practice.

Metadata

Importance: 72/100 | organizational report | primary source

Summary

This is METR's formal evaluation report for OpenAI's GPT-5.1-Codex-Max model, assessing its autonomous capabilities and potential for dangerous self-replication or agentic misuse. The report is part of METR's ongoing third-party evaluation work conducted under safety agreements with frontier AI labs. It provides empirical data on the model's performance on tasks relevant to autonomous replication and adaptation (ARA) threats.

Key Points

  • METR conducted an independent third-party evaluation of OpenAI's GPT-5.1-Codex-Max for dangerous autonomous capabilities.
  • The evaluation focuses on autonomous replication and adaptation (ARA) tasks, a key threat vector in frontier AI safety assessments.
  • Results inform OpenAI's deployment decisions under its Preparedness Framework and related safety commitments.
  • This is part of METR's broader program of pre-deployment evaluations for frontier AI models across major labs.
  • Findings contribute empirical evidence to debates about capability thresholds and deployment gating for advanced AI systems.

Cited by 1 page

Page | Type         | Quality
METR | Organization | 66.0

Cached Content Preview

HTTP 200 | Fetched Mar 15, 2026 | 36 KB
 Details about METR’s evaluation of OpenAI GPT-5.1-Codex-Max

 Note on independence: This evaluation was conducted under a standard NDA. Due to the sensitive information shared with METR as part of this evaluation, OpenAI’s comms and legal team required review and approval of this post. 1 

 Summary

 METR previously found that further incremental development 2 of models similar to GPT-5-Thinking was unlikely to pose risks through AI R&D automation or rogue replication (see our GPT-5 evaluation report). We believe that OpenAI GPT-5.1-Codex-Max indeed represents a low-risk incremental improvement for these threat models, with no risk-critical changes to affordances, architectures, or training incentives, and a 50% time-horizon between 75m and 350m (point estimate 2h42m). This represents an on-trend improvement over GPT-5-Thinking, and the result broadly matched our qualitative impressions of the model, results on manually-scored tasks, and improvements on other benchmark scores. In light of this, we again find that, if trends hold, further development would pose low risk for these threat models, based on an aggressive extrapolation of our time-horizon metric over the next 6 months. However, we cannot rule out significant trend-breaks (e.g. because of an algorithmic breakthrough or an especially large compute scaleup).
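
 To make the reported numbers and the "aggressive extrapolation" concrete, the sketch below shows the arithmetic of projecting a 50% time horizon forward under an exponential trend. Everything here is illustrative rather than from the report itself: the function is hypothetical, the 2h42m point estimate is converted to 162 minutes, and the roughly 7-month doubling time is an assumption drawn from METR's earlier published trend analysis, not a parameter stated in this evaluation.

```python
# Minimal sketch of time-horizon extrapolation under exponential growth.
# Assumptions (not from this report): growth is exponential with a fixed
# doubling time, and the ~7-month doubling time is illustrative, drawn
# from METR's earlier published trend analysis.

def extrapolate_horizon(current_minutes: float,
                        months_ahead: float,
                        doubling_months: float = 7.0) -> float:
    """Project a 50% time horizon forward, assuming exponential growth."""
    return current_minutes * 2.0 ** (months_ahead / doubling_months)

if __name__ == "__main__":
    # Reported 50% time-horizon range: 75m to 350m, point estimate 2h42m.
    horizons = {"low": 75.0, "point": 162.0, "high": 350.0}  # minutes
    for label, minutes in horizons.items():
        projected = extrapolate_horizon(minutes, months_ahead=6.0)
        print(f"{label}: {minutes:.0f} min -> {projected:.0f} min after 6 months")
```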

 As with our GPT-5-Thinking evaluation, we base this assessment not only on our evaluation results, but also on analysis of GPT-5.1-Codex-Max reasoning traces and on assertions shared by OpenAI regarding expected capabilities, training pressures, elicitation and model affordances (see Table 1).

 Introduction

 METR has focused on two key threat models which we see as precursors to significant AI takeover risk 3:

 
 • AI R&D Automation: AI systems speeding up AI researchers by >10X (as compared to researchers with no AI assistance), or otherwise causing a rapid intelligence explosion, which could cause or amplify a variety of risks if stolen or handled without care.

 • Rogue replication: AI systems having the ability to acquire, maintain, and evade shutdown of the resources they need to operate independently from humans (see the rogue replication threat model).

 

 If AI systems lack the capabilities for either of these precursors, we think AI takeover risk is likely to be low overall, though there may be other paths to risk that we have not covered.

 Since risks from AI R&D automation and rogue replication can occur before a model is publicly deployed, we focus our assessment primarily on upper-bounding risks from further development, beyond the studied models, based on extrapolating capability improvement trends.

 METR previously assessed whether further incremental development (defined in footnote 1) beyond GPT-5 might enable these threat models. We found that

... (truncated, 36 KB total)
Resource ID: 1060f486990f5014 | Stable ID: ZWE0ZTIxYj