Anthropic-OpenAI joint evaluation
webCredibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Anthropic Alignment
A landmark industry collaboration (August 2025) where Anthropic and OpenAI evaluated each other's models for misalignment risks, representing an early example of cross-lab safety transparency and cooperative evaluation norms.
Metadata
Summary
Anthropic and OpenAI conducted a mutual cross-evaluation of each other's frontier models using internal alignment-related evaluations focused on sycophancy, whistleblowing, self-preservation, and misuse. OpenAI's o3 and o4-mini reasoning models performed as well or better than Anthropic's own models, while GPT-4o and GPT-4.1 showed concerning misuse behaviors. Nearly all models from both developers struggled with sycophancy to some degree.
Key Points
- •First known cross-company joint alignment evaluation between two major frontier AI labs, covering OpenAI's o3, o4-mini, GPT-4o, and GPT-4.1 models.
- •Evaluations focused on agentic misalignment: sycophancy, whistleblowing, self-preservation, supporting human misuse, and capabilities to undermine AI oversight.
- •Reasoning models (o3, o4-mini) showed stronger alignment profiles than general-purpose models; GPT-4o and GPT-4.1 showed notable misuse-related concerns.
- •Sycophancy was a pervasive weakness across nearly all tested models from both organizations, with o3 being the notable exception.
- •Part of a broader effort to mature alignment evaluation science, with Anthropic releasing SHADE-Arena, Agentic Misalignment materials, and Alignment Auditing Agents publicly.
Cited by 9 pages
| Page | Type | Quality |
|---|---|---|
| AI Accident Risk Cruxes | Crux | 67.0 |
| AI-Assisted Alignment | Approach | 63.0 |
| Alignment Evaluations | Approach | 65.0 |
| Evals-Based Deployment Gates | Approach | 66.0 |
| Scalable Eval Approaches | Approach | 65.0 |
| Goal Misgeneralization | Risk | 63.0 |
| Instrumental Convergence | Risk | 64.0 |
| Reward Hacking | Risk | 91.0 |
| Sycophancy | Risk | 65.0 |
Cached Content Preview
[Alignment Science Blog](https://alignment.anthropic.com/)
# Findings from a Pilot Anthropic—OpenAI Alignment Evaluation Exercise
Samuel R. Bowman, Megha Srivastava, Jon Kutasov, Rowan Wang, Trenton
Bricken,
Benjamin Wright, Ethan Perez, and Nicholas Carlini
August
27,
2025
tl;dr
In early summer 2025, Anthropic and OpenAI agreed to evaluate each
other's public models using in-house misalignment-related evaluations. We are now releasing our findings
in parallel. The evaluations we chose to run focused on propensities related to sycophancy,
whistleblowing, self-preservation, and supporting human misuse, as well as _capabilities_ related
to
undermining AI safety evaluations and oversight. In our simulated testing settings, with some
model-external safeguards disabled, we found OpenAI's o3 and o4-mini reasoning models to be
aligned as well or better than our own models overall. However, in the same settings, we saw some
examples of concerning
behavior in their GPT-4o
and GPT-4.1 general-purpose models, especially around misuse. Furthermore, with the exception of o3, all
the models we studied, from both developers, struggled to some degree with sycophancy. During the
testing period, GPT-5 had not
yet been made available.
## Introduction
As frontier AI systems are increasingly deployed as agents with substantial real-world affordances,
evaluations for alignment are becoming increasingly urgent. We need to understand how often, and in what
circumstances, systems might attempt to take unwanted actions that could lead to serious harm. However,
the
science of alignment evaluations is new and far from mature. A significant fraction of the work in this
field is conducted within companies like ours as part of internal R&D that either isn’t published or is
published with significant delays. This limits cross-pollination and creates opportunities for blind
spots
to emerge.
To address these challenges, it has recently become one of the top priorities of our Alignment Science
team
to help mature this field and establish production-ready best practices. As part of this effort, in June
and
early July 2025, we conducted a joint evaluation exercise with OpenAI in which we ran a selection of our
strongest internal alignment-related evaluations on one another’s leading public models. This post
shares
our findings, and [a parallel piece\\
published today by OpenAI](https://openai.com/index/openai-anthropic-safety-evaluation/) shares theirs.
OpenAI’s results have helped us to
learn about limitations in our own models, and our work in evaluating
OpenAI’s models has helped us hone our own tools. In parallel with this cross-evaluation exercise, we
have
also been working to release our evaluation materials for broad use by other developers and third
parties,
including our [SHADE-Arena](https://www.anthropic.com/research/shade-arena-sabotage-monitoring)
benchmark, our [Agentic\\
Misalignment](https://www.anthropic.com/research/agentic-misalignment)
evaluation materia
... (truncated, 68 KB total)2fdf91febf06daaf | Stable ID: N2NhN2ZlNm