METR’s GPT-4.5 pre-deployment evaluations
Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: METR
Published by METR (Model Evaluation and Threat Research), this report is part of their ongoing frontier model evaluation program; relevant for understanding how pre-deployment capability evaluations inform responsible scaling policies and deployment decisions for GPT-4.5.
Metadata
Summary
METR conducted pre-deployment autonomous capability evaluations of OpenAI's GPT-4.5, assessing its potential for dangerous self-replication, resource acquisition, and general autonomous task completion. The evaluations found GPT-4.5 did not demonstrate concerning levels of autonomous replication or adaptation capabilities. This report is part of METR's ongoing third-party evaluation work supporting responsible AI deployment decisions.
Key Points
- GPT-4.5 was evaluated for autonomous replication and adaptation (ARA) capabilities before public deployment, following METR's standard evaluation protocol.
- Results indicated GPT-4.5 did not reach threshold levels of concern for dangerous autonomous capabilities such as self-replication or acquiring resources.
- This is part of a series of pre-deployment evaluations METR conducts for frontier AI labs, establishing empirical baselines for safety-relevant capabilities.
- The evaluation methodology focuses on whether models could meaningfully assist in their own replication or escape human oversight in realistic scenarios.
- Third-party evaluations like this are a key component of responsible scaling policies and deployment governance frameworks.
Cited by 3 pages
| Page | Type | Quality |
|---|---|---|
| METR | Organization | 66.0 |
| Evals-Based Deployment Gates | Approach | 66.0 |
| Third-Party Model Auditing | Approach | 64.0 |
Cached Content Preview
As described in [OpenAI’s GPT-4.5 System Card](https://openai.com/index/gpt-4-5-system-card/), METR received access to an earlier checkpoint of GPT-4.5 from OpenAI a week prior to model release. To assist with our evaluations, OpenAI also provided us with some context and technical information about GPT-4.5, as well as a subset of the model’s benchmark results from their internal evaluations. METR subsequently performed a preliminary set of evaluations to measure the model’s performance on our general autonomy suite and RE-Bench. Our findings are consistent with OpenAI’s results: GPT-4.5 exhibited overall performance somewhere between GPT-4o and o1.

As a way of measuring AI capabilities, we can compute the “X% time horizon”, defined as the duration of tasks that an LLM agent can complete with X% reliability. Here, we display the 50%-time horizon for GPT-4.5, compared against four existing models. By this metric, GPT-4.5 performs between GPT-4o and o1. Note that our evaluations of GPT-4.5 used an agent scaffold optimized for o1. In a forthcoming report, we describe this methodology in more detail and present results for additional models.
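The "X% time horizon" metric described above can be sketched in code. The snippet below is an illustrative reconstruction, not METR's actual implementation: the choice of a logistic model of success probability against log task duration, the gradient-descent fit, and the function name `time_horizon` are all assumptions made for this sketch.

```python
import numpy as np

def time_horizon(task_minutes, successes, x=0.5, lr=0.1, steps=5000):
    """Estimate the X% time horizon: the task duration at which the
    agent's fitted success probability equals x.

    Fits p(success) = sigmoid(a + b * log(duration)) by gradient
    descent -- a simplified stand-in, not METR's published method.
    """
    t = np.log(np.asarray(task_minutes, dtype=float))
    y = np.asarray(successes, dtype=float)
    t_mean = t.mean()
    tc = t - t_mean            # center log-durations for a stabler fit
    a, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(a + b * tc)))
        a -= lr * np.mean(p - y)          # log-loss gradient w.r.t. a
        b -= lr * np.mean((p - y) * tc)   # log-loss gradient w.r.t. b
    # Solve sigmoid(a + b * (log(h) - t_mean)) = x for the horizon h.
    logit_x = np.log(x / (1.0 - x))
    return float(np.exp(t_mean + (logit_x - a) / b))
```

On toy data where an agent succeeds on short tasks and fails on long ones, the 50% horizon lands between the longest success and the shortest failure; a model whose failures start at shorter durations gets a shorter horizon.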
Third-party oversight based on verifying developers’ internal results is a promising direction to explore further. We greatly appreciate OpenAI’s help in starting to prototype this form of third-party oversight, and found the provided information helpful for contextualizing our observations about the model. Separately, we also appreciate [OpenAI’s GPT-4.5 System Card](https://openai.com/index/gpt-4-5-system-card/), which details the extensive evaluations performed on the model.
We currently believe that it’s unlikely that GPT-4.5 poses large risks—especially relative to existing frontier models. While both GPT-4.5’s benchmark performance and METR’s preliminary evaluations are consistent with this tentative conclusion, this belief stems primarily from our general understanding of both the capabilities and limitations of current frontier models. We do not believe that our evaluations (or the results shared with us by OpenAI [[1]](https://metr.org/blog/2025-02-27-gpt-4-5-evals/#fn:1)) themselves are sufficient to rule out all large-scale risks from models such as GPT-4.5, both because [pre-deployment evaluations do not address risks from internal deployment](https://metr.org/blog/2025-01-17-ai-models-dangerous-before-public-dep
... (truncated, 12 KB total)