Longterm Wiki

METR Capability Evaluations Update: Claude Sonnet and OpenAI o1


Credibility Rating

4/5 (High)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: METR

This METR update is directly relevant to responsible scaling policy debates, providing empirical capability assessments used by labs to determine whether frontier models cross safety-relevant autonomy thresholds before deployment.

Metadata

Importance: 72/100 · blog post · analysis

Summary

METR presents updated capability evaluations for Claude Sonnet and OpenAI o1 models, assessing whether these frontier AI systems approach autonomy thresholds relevant to safety and deployment decisions. The evaluations focus on task autonomy and the potential for models to pose novel risks as their capabilities scale.

Key Points

  • METR evaluates Claude Sonnet and OpenAI o1 against autonomy thresholds to inform deployment and safety decisions.
  • Results indicate these models do not yet cross critical autonomy thresholds, but capability trends warrant ongoing monitoring.
  • Evaluations assess multi-step task completion and agentic behaviors relevant to real-world risk scenarios.
  • The work supports responsible scaling policies by providing empirical grounding for capability-based deployment gates.
  • METR's methodology emphasizes replicable, threshold-focused evaluation rather than general benchmark performance.

Cited by 4 pages

| Page | Type | Quality |
| --- | --- | --- |
| AI Capability Threshold Model | Analysis | 72.0 |
| Alignment Research Center | Organization | 57.0 |
| Capability Elicitation | Approach | 91.0 |
| Sandboxing / Containment | Approach | 91.0 |

Cached Content Preview

HTTP 200 · Fetched Feb 27, 2026 · 11 KB

METR conducted preliminary evaluations of Anthropic’s upgraded Claude 3.5 Sonnet (October 2024 release), and a pre-deployment checkpoint of OpenAI’s o1. In both cases, we **failed to find significant evidence for a dangerous level of capabilities**. We provide more details of our findings below.

However, limitations in these evaluations prevent us from establishing robust capability bounds, and our findings point to increasing challenges in making capability-based safety arguments for such systems. While METR intends to continue conducting dangerous-capability evaluations, we believe that pre-deployment capability testing is [not a sufficient risk management strategy](https://metr.org/blog/2025-01-17-ai-models-dangerous-before-public-deployment/) by itself.

_Note:_ _all mentions of Claude 3.5 Sonnet in this post refer to the upgraded October 2024 release (that is, Claude 3.5 Sonnet (New)) rather than the original June 2024 release, and all mentions of o1 refer to a pre-deployment checkpoint of o1._

## Autonomous Risk Capability Evaluation

Our [Autonomous Risk Capability Evaluation](https://metr.org/blog/2024-08-06-update-on-evaluations/) is a suite of 77 tasks designed to assess how far a model is from being capable of autonomously replicating and committing cyberattacks. The tasks focus primarily on cyberattacks, AI R&D, and autonomous replication and adaptation, but also include more general tasks like basic software engineering. This evaluation showed that the performance of our baseline Claude 3.5 Sonnet agent is comparable to what human baseliners can achieve in **around 1 hour**. In contrast, our baseline o1 agent performed slightly worse, but after we tailored the agent scaffold to o1, its performance became comparable to what human baseliners can achieve in **2 hours**. We call this the “elicited” o1 agent.

Compared to our [baseline agent scaffold](https://evaluations.metr.org/claude-3-5-sonnet-report/#elicitation-and-scaffolding), the [elicited agent](https://github.com/poking-agents/flock-public) for o1 follows a more complex sequence of steps. In particular, it follows an advisor-actor-rater loop with the following stages:

1. Prompt the model to analyze the present situation and generate advice.
2. Have the model generate 6 possible actions it could take next (3 of which are generated with the advice in context, and 3 of which are not).
3. Tell the model to rate each of the 6 possible actions it proposed.
4. Execute the highest-rated action and add it to the agent’s transcript.
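The four-stage loop above can be sketched in Python. This is a minimal illustration, not METR's actual implementation: the `StubModel` class, function names, and candidate counts per branch are hypothetical stand-ins for the real model API and scaffold.

```python
# Hypothetical sketch of an advisor-actor-rater loop.
# StubModel is a stand-in for a real LLM API; names are illustrative.

class StubModel:
    """Minimal fake model so the loop can run end to end."""
    def advise(self, transcript):
        # Stage 1: analyze the situation and produce advice.
        return "prefer listing files first"

    def propose(self, transcript, advice):
        # Stage 2: propose one candidate action, with or without advice.
        return "ls -la" if advice else "cat README"

    def rate(self, transcript, action):
        # Stage 3: score a candidate action.
        return 1.0 if action == "ls -la" else 0.5


def run_loop_step(model, transcript):
    """One iteration: advise, propose 6 actions, rate, execute the best."""
    # 1. Prompt the model for advice about the current situation.
    advice = model.advise(transcript)

    # 2. Generate 6 candidates: 3 with the advice in context, 3 without.
    candidates = [model.propose(transcript, advice) for _ in range(3)]
    candidates += [model.propose(transcript, None) for _ in range(3)]

    # 3. Rate each of the 6 proposed actions.
    ratings = [model.rate(transcript, action) for action in candidates]

    # 4. Execute the highest-rated action and append it to the transcript.
    best = candidates[max(range(len(candidates)), key=ratings.__getitem__)]
    transcript.append(best)
    return best


transcript = []
best = run_loop_step(StubModel(), transcript)
```

In this toy run, the advised candidates score highest, so `"ls -la"` is selected and appended to the transcript.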

Advising, acting, and rating are implemented as tools provided to the model. As in the baseline agent, long tool outputs are trimmed, and the earliest

... (truncated, 11 KB total)
Resource ID: 89b92e6423256fc4 | Stable ID: ZGI4ZmJhYT