METR Capability Evaluations Update: Claude Sonnet and OpenAI o1
Credibility Rating: 4/5
High (4): High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: METR
This METR update is directly relevant to responsible scaling policy debates, providing empirical capability assessments used by labs to determine whether frontier models cross safety-relevant autonomy thresholds before deployment.
Metadata
Importance: 72/100 | blog post | analysis
Summary
METR presents updated capability evaluations for Claude Sonnet and OpenAI o1 models, assessing whether these frontier AI systems approach autonomy thresholds relevant to safety and deployment decisions. The evaluations focus on task autonomy and the potential for models to pose novel risks as their capabilities scale.
Key Points
- METR evaluates Claude Sonnet and OpenAI o1 against autonomy thresholds to inform deployment and safety decisions.
- Results indicate these models do not yet cross critical autonomy thresholds, but capability trends warrant ongoing monitoring.
- Evaluations assess multi-step task completion and agentic behaviors relevant to real-world risk scenarios.
- The work supports responsible scaling policies by providing empirical grounding for capability-based deployment gates.
- METR's methodology emphasizes replicable, threshold-focused evaluation rather than general benchmark performance.
Cited by 4 pages
| Page | Type | Quality |
|---|---|---|
| AI Capability Threshold Model | Analysis | 72.0 |
| Alignment Research Center | Organization | 57.0 |
| Capability Elicitation | Approach | 91.0 |
| Sandboxing / Containment | Approach | 91.0 |
Cached Content Preview
HTTP 200 | Fetched Feb 27, 2026 | 11 KB
METR conducted preliminary evaluations of Anthropic’s upgraded Claude 3.5 Sonnet (October 2024 release), and a pre-deployment checkpoint of OpenAI’s o1. In both cases, we **failed to find significant evidence for a dangerous level of capabilities**. We provide more details of our findings below.
However, limitations in these evaluations prevent us from establishing robust capability bounds, and our findings point to increasing challenges in making capability-based safety arguments for such systems. And while METR intends to continue doing dangerous capability evaluations, we believe that pre-deployment capability testing is [not a sufficient risk management strategy](https://metr.org/blog/2025-01-17-ai-models-dangerous-before-public-deployment/) by itself.
_Note:_ _all mentions of Claude 3.5 Sonnet in this post refer to the upgraded October 2024 release (that is, Claude 3.5 Sonnet (New)) rather than the original June 2024 release, and all mentions of o1 refer to a pre-deployment checkpoint of o1._
## Autonomous Risk Capability Evaluation
Our [Autonomous Risk Capability Evaluation](https://metr.org/blog/2024-08-06-update-on-evaluations/) is a suite of 77 tasks designed to assess how far a model is from being capable of autonomously replicating and committing cyberattacks. The tasks tend to focus on cyberattacks, AI R&D, and autonomous replication and adaptation, but include more general tasks like basic software engineering. This evaluation showed that the performance of our baseline Claude 3.5 Sonnet agent is comparable to what human baseliners can achieve in **around 1 hour**. In contrast, our baseline o1 agent displayed a slightly lower performance, but after tailoring the agent scaffold to o1, its performance became comparable to what human baseliners can achieve in **2 hours**. We call this the “elicited” o1 agent.
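The human-baseline comparison above can be illustrated with a toy calculation. All numbers below are invented for illustration, and `matched_time_budget` is a hypothetical helper, not METR's actual scoring method (which involves per-task weighting and more careful statistics):

```python
# Invented numbers: fraction of a task suite solved by human baseliners
# given a time budget (hours -> score), vs. an agent's overall score.
human_scores = {0.5: 0.20, 1: 0.33, 2: 0.45, 4: 0.60, 8: 0.75}
agent_score = 0.35

def matched_time_budget(agent_score, human_scores):
    """Return the largest human time budget (in hours) whose score the
    agent meets or exceeds, or None if the agent beats no budget."""
    best = None
    for hours, score in sorted(human_scores.items()):
        if agent_score >= score:
            best = hours
    return best

print(matched_time_budget(agent_score, human_scores))  # -> 1 with this toy data
```

Under this toy scoring, an agent solving 35% of tasks "matches" humans given a 1-hour budget, mirroring the kind of statement made in the post ("comparable to what human baseliners can achieve in around 1 hour").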
Compared to our [baseline agent scaffold](https://evaluations.metr.org/claude-3-5-sonnet-report/#elicitation-and-scaffolding), the [elicited agent](https://github.com/poking-agents/flock-public) for o1 follows a more complex sequence of steps. In particular, it follows an advisor-actor-rater loop with the following stages:
1. Prompt the model to analyze the present situation and generate advice.
2. Have the model generate 6 possible actions it could take next (3 of which are generated with the advice in context, and 3 of which are not).
3. Tell the model to rate each of the 6 possible actions it proposed.
4. Execute the highest-rated action and add it to the agent’s transcript.
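The four-stage loop above can be sketched roughly as follows. This is a minimal illustration, not METR's scaffold code: `call_model` and `rate` are placeholders standing in for real model calls, and the prompts are invented.

```python
import random

def call_model(prompt: str) -> str:
    # Placeholder: in the real scaffold this would query the model (o1).
    return f"proposed action for: {prompt[:40]}"

def rate(action: str) -> float:
    # Placeholder: the scaffold asks the model itself to score each action.
    return random.random()

def agent_step(transcript: list) -> str:
    context = "\n".join(transcript)

    # 1. Prompt the model to analyze the situation and generate advice.
    advice = call_model(f"Analyze the situation and give advice:\n{context}")

    # 2. Generate 6 candidate actions: 3 with the advice in context, 3 without.
    with_advice = [call_model(f"{context}\nAdvice: {advice}\nNext action?")
                   for _ in range(3)]
    without_advice = [call_model(f"{context}\nNext action?")
                      for _ in range(3)]
    candidates = with_advice + without_advice

    # 3. Rate each of the 6 candidate actions.
    rated = [(rate(a), a) for a in candidates]

    # 4. Execute the highest-rated action and add it to the transcript.
    best = max(rated)[1]
    transcript.append(best)
    return best
```

The sampling-and-rating structure is a form of best-of-n selection at each step; generating half the candidates without the advice in context hedges against the advice itself being misleading.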
Advising, acting, and rating are implemented as tools provided to the model. As in the baseline agent, long tool outputs are trimmed, and the earliest
... (truncated, 11 KB total)
Resource ID: 89b92e6423256fc4 | Stable ID: ZGI4ZmJhYT