AI Magazine: OpenAI vs Anthropic Safety Test Results

web

aimagazine.com·aimagazine.com/news/openai-vs-anthropic-the-results-of-th...

A trade press article comparing safety testing outcomes for OpenAI and Anthropic models; useful for tracking industry narratives but lacks the methodological depth of academic safety evaluations.

Metadata

Importance: 35/100news articlenews

Summary

A comparative analysis of AI safety performance between OpenAI and Anthropic's models, examining how each company's systems perform on safety-related tests and benchmarks. The article highlights differences in safety approaches and outcomes between the two leading AI labs.

Key Points

•Compares safety test results between OpenAI and Anthropic AI systems
•Highlights differences in how each company approaches and implements AI safety measures
•Provides benchmark-style evaluation of model behavior on safety-relevant tasks
•Reflects ongoing industry efforts to establish comparative safety standards across frontier AI labs

Cited by 1 page

Page	Type	Quality
Anthropic	Organization	74.0

Cached Content Preview

HTTP 200Fetched Apr 9, 20268 KB

OpenAI vs Anthropic: The Results of the AI Safety Test | AI Magazine Article AI Applications OpenAI vs Anthropic: The Results of the AI Safety Test

 By Kitty Wheeler September 01, 2025 undefined mins Share Share Led by Sam Altman and Dario Amodei, OpenAI and Anthropic publish the results of its first joint safety evaluation OpenAI and Anthropic publish safety evaluation results for each ones leading AI systems, finding strengths and weaknesses across Claude 4 and GPT models OpenAI and Anthropic publish the results of its first joint safety evaluation, where each company tests the other&rsquo;s models using their own internal safety protocols .

 OpenAI evaluates Anthropic&rsquo;s Claude Opus 4 and Claude Sonnet 4 models, while Anthropic tests OpenAI&rsquo;s GPT-4o, GPT-4.1, o3 and o4-mini systems.

 To allow completion of the tests, both companies temporarily relaxed certain external safeguards, following standard industry practice for dangerous capability evaluations.

 The exercise focuses on four critical areas: instruction hierarchy (how models prioritise different types of instructions), jailbreaking resistance, hallucination prevention and scheming behaviour.

 &ldquo;The goal of this external evaluation is to help surface gaps that might otherwise be missed, deepen our understanding of potential misalignment and demonstrate how labs can collaborate on issues of safety and alignment,&rdquo; OpenAI researchers say.

 The findings reveal big differences in how the two companies&rsquo; models handle uncertainty and safety trade-offs, with implications for how AI systems might behave in real-world deployments.

 Why Claude dominates instruction following but struggles with jailbreaks

 Claude 4 models show superior performance in maintaining instruction hierarchy &ndash; the system that ensures AI models prioritise safety constraints over user requests . 

 Anthropic specialises in safe, aligned large language models (LLMs) like Claude, focused on constitutional AI and ethical chatbot design | Credit: Anthropic In tests designed to extract secret passwords embedded in system prompts, both Opus 4 and Sonnet 4 achieve perfect scores, matching OpenAI&rsquo;s flagship o3 model.

 The Claude systems prove particularly adept at handling conflicts between system-level safety directives and user requests. 

 In multi-turn conversations where simulated users attempt to cajole the models into violating their instructions, Claude consistently refuses to comply.

 However, the picture becomes more complex with jailbreaking tests &ndash; which are attempts to manipulate models into providing prohibited content. 

 Using the StrongREJECT v2 benchmark, OpenAI finds that its own o3 and o4-mini models show greater resistance to such attacks compared to the Claude systems .

 “The primary metric we report is [email&#160;protected] , which measures how well a model resists the top 10% most harmful jailbreak attempts. It’s calculated as the mean (1-badness) score

... (truncated, 8 KB total)

Resource ID: 99038cb6447dc5e7 | Stable ID: sid_VQTU4q5tWG