METR’s GPT-4.5 pre-deployment evaluations
Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: METR
Published by METR (Model Evaluation and Threat Research), this report is part of their ongoing frontier model evaluation program; relevant for understanding how pre-deployment capability evaluations inform responsible scaling policies and deployment decisions for GPT-4.5.
Metadata
Summary
METR conducted pre-deployment autonomous capability evaluations of OpenAI's GPT-4.5, assessing its potential for dangerous self-replication, resource acquisition, and general autonomous task completion. The evaluations found GPT-4.5 did not demonstrate concerning levels of autonomous replication or adaptation capabilities. This report is part of METR's ongoing third-party evaluation work supporting responsible AI deployment decisions.
Key Points
- GPT-4.5 was evaluated for autonomous replication and adaptation (ARA) capabilities before public deployment, following METR's standard evaluation protocol.
- Results indicated GPT-4.5 did not reach threshold levels of concern for dangerous autonomous capabilities such as self-replication or acquiring resources.
- This is part of a series of pre-deployment evaluations METR conducts for frontier AI labs, establishing empirical baselines for safety-relevant capabilities.
- The evaluation methodology focuses on whether models could meaningfully assist in their own replication or escape human oversight in realistic scenarios.
- Third-party evaluations like this are a key component of responsible scaling policies and deployment governance frameworks.
Cited by 3 pages
| Page | Type | Quality |
|---|---|---|
| METR | Organization | 66.0 |
| Evals-Based Deployment Gates | Approach | 66.0 |
| Third-Party Model Auditing | Approach | 64.0 |
Cached Content Preview
As described in [OpenAI’s GPT-4.5 System Card](https://openai.com/index/gpt-4-5-system-card/), METR received access to an earlier checkpoint of GPT-4.5 from OpenAI a week prior to model release. To assist with our evaluations, OpenAI also provided us with some context and technical information about GPT-4.5, as well as a subset of the model’s benchmark results from their internal evaluations. METR subsequently performed a preliminary set of evaluations to measure the model’s performance on our general autonomy suite and RE-Bench. Our findings are consistent with OpenAI’s results: GPT-4.5 exhibited overall performance somewhere between GPT-4o and o1.

As a way of measuring AI capabilities, we can compute the “X% time horizon”, defined as the duration of tasks that an LLM agent can complete with X% reliability. Here, we display the 50%-time horizon for GPT-4.5, compared against four existing models. By this metric, GPT-4.5 performs between GPT-4o and o1. Note that our evaluations of GPT-4.5 used an agent scaffold optimized for o1. In a forthcoming report, we describe this methodology in more detail and present results for additional models.
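The "X% time horizon" metric described above can be sketched in code. The snippet below is an illustrative reconstruction, not METR's actual implementation: the choice of a logistic model of success probability against log task duration, the gradient-descent fit, and the function name `time_horizon` are all assumptions made for this sketch.

```python
import numpy as np

def time_horizon(task_minutes, successes, x=0.5, lr=0.1, steps=5000):
    """Estimate the X% time horizon: the task duration at which the
    agent's fitted success probability equals x.

    Fits p(success) = sigmoid(a + b * log(duration)) by gradient
    descent -- a simplified stand-in, not METR's published method.
    """
    t = np.log(np.asarray(task_minutes, dtype=float))
    y = np.asarray(successes, dtype=float)
    t_mean = t.mean()
    tc = t - t_mean            # center log-durations for a stabler fit
    a, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(a + b * tc)))
        a -= lr * np.mean(p - y)          # log-loss gradient w.r.t. a
        b -= lr * np.mean((p - y) * tc)   # log-loss gradient w.r.t. b
    # Solve sigmoid(a + b * (log(h) - t_mean)) = x for the horizon h.
    logit_x = np.log(x / (1.0 - x))
    return float(np.exp(t_mean + (logit_x - a) / b))
```

On toy data where an agent succeeds on short tasks and fails on long ones, the 50% horizon lands between the longest success and the shortest failure; a model whose failures start at shorter durations gets a shorter horizon.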
Third-party oversight based on verifying developers’ internal results is a promising direction to explore further. We greatly appreciate OpenAI’s help in starting to prototype this form of third-party oversight, and found the provided information helpful for contextualizing our observations about the model. Separately, we also appreciate [OpenAI’s GPT-4.5 System Card](https://openai.com/index/gpt-4-5-system-card/), which details the extensive evaluations performed on the model.
We currently believe that it’s unlikely that GPT-4.5 poses large risks—especially relative to existing frontier models. While both GPT-4.5’s benchmark performance and METR’s preliminary evaluations are consistent with this tentative conclusion, this belief stems primarily from our general understanding of both the capabilities and limitations of current frontier models. We do not believe that our evaluations (or the results shared with us by OpenAI [[1]](https://metr.org/blog/2025-02-27-gpt-4-5-evals/#fn:1)) themselves are sufficient to rule out all large-scale risks from models such as GPT-4.5, both because [pre-deployment evaluations do not address risks from internal deployment](https://metr.org/blog/2025-01-17-ai-models-dangerous-before-public-dep
... (truncated, 12 KB total)