Skip to content
Longterm Wiki
Back

2025 OpenAI-Anthropic joint evaluation

web

Credibility Rating

4/5
High(4)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: OpenAI

This joint evaluation is notable as a rare example of competing frontier AI labs collaborating on safety testing; results are relevant to discussions of corrigibility, instrumental convergence, and whether current models exhibit precursors to unsafe autonomous behavior.

Metadata

Importance: 72/100organizational reportprimary source

Summary

A collaborative safety evaluation conducted jointly by OpenAI and Anthropic to assess AI model behaviors related to corrigibility, shutdown resistance, and other safety-critical properties. The evaluation represents a notable instance of competing AI labs cooperating on safety testing methodologies and sharing results to advance the field's understanding of model alignment.

Key Points

  • Joint evaluation between two leading AI labs (OpenAI and Anthropic) marks significant cross-industry safety collaboration
  • Focuses on corrigibility and shutdown-related behaviors, testing whether models resist or comply with human oversight
  • Examines instrumental convergence concerns, including whether models pursue self-preservation or resource acquisition
  • Demonstrates a shared methodology for evaluating safety-critical behaviors that could inform industry-wide standards
  • Results provide empirical data on how frontier models respond to safety-relevant prompts and scenarios

Cited by 4 pages

Cached Content Preview

HTTP 200Fetched Mar 20, 202678 KB
Findings from a pilot Anthropic–OpenAI alignment evaluation exercise: OpenAI Safety Tests \| OpenAI

August 27, 2025

[Safety](https://openai.com/news/safety-alignment/)

# Findings from a pilot Anthropic–OpenAI alignment evaluation exercise: OpenAI Safety Tests

![Abstract background with soft pastel gradients of pink, peach, and teal blending together in a smooth, blurred pattern.](https://images.ctfassets.net/kftzwdyauwt9/7EbzUi8EwGvk9IebnO1r7Y/7df6ee0d9f13807e37c14083f5b27173/oai_GA_Stories_16.9.png?w=3840&q=90&fm=webp)

Listen to article

Share

## Introduction

This summer, OpenAI and Anthropic collaborated on a first-of-its-kind joint evaluation: we each ran our internal safety and misalignment evaluations on the other’s publicly released models and are now sharing the results publicly. We believe this approach supports accountable and transparent evaluation, helping to ensure that each lab’s models continue to be tested against new and challenging scenarios. We’ve since launched [GPT‑5⁠](https://openai.com/index/introducing-gpt-5/), which shows substantial improvements in areas like sycophancy, hallucination, and misuse resistance, showing the benefits of reasoning-based safety techniques. The goal of this external evaluation is to help surface gaps that might otherwise be missed, deepen our understanding of potential misalignment, and demonstrate how labs can collaborate on issues of safety and alignment.

Because the field continues to evolve and models are increasingly used to assist in real world tasks and problems, safety testing is never finished. Even since this work was done, we’ve further expanded the depth and breadth of our evaluations, and will continue looking ahead to anticipate potential safety issues.

#### What we did

In this post, we share the results of our internal evaluations we ran on Anthropic’s models Claude Opus 4 and Claude Sonnet 4, and present them alongside results from GPT‑4o, GPT‑4.1, OpenAI o3, and OpenAI o4-mini, which were the models powering ChatGPT at the time. The results of Anthropic’s evaluations of our models are available [here⁠(opens in a new window)](https://alignment.anthropic.com/2025/openai-findings/). We recommend reading both reports to get the full picture of all the models on all the evaluations.

Both labs facilitated these evaluations by relaxing some model-external safeguards that would otherwise interfere with the completion of the tests, as is common practice for analogous dangerous-capability evaluations. All the evaluations using Claude Opus 4 and Claude Sonnet 4 were conducted over a public API. In most cases, we defaulted to enabling reasoning for these models, and specified results as “no thinking” when we ran additional evaluations with reasoning disabled.

We’re not aiming for exact, apples-to-apples comparisons with each other’s systems, as differences in access and deep familiarity with our own models make this difficult to do fairly. Our goal is to explore model propensitie

... (truncated, 78 KB total)
Resource ID: cc554bd1593f0504 | Stable ID: ZTlkMDdlNm