Web Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Anthropic
Official Anthropic system card for the frontier models Claude Opus 4 and Claude Sonnet 4 (May 2025); an essential reference for understanding state-of-the-art safety evaluation methodology, ASL-3 deployment decisions, and emerging alignment and welfare assessment practices.
Summary
Anthropic's system card for Claude Opus 4 and Claude Sonnet 4 details pre-deployment safety evaluations, alignment assessments, and model welfare findings. It marks the first deployment of a model (Opus 4) under AI Safety Level 3, and introduces comprehensive alignment testing covering deception, self-preservation, sandbagging, sycophancy, and misalignment risks. The card also includes a novel model welfare assessment examining potential functional emotions and preference states.
Key Points
- Claude Opus 4 deployed under ASL-3 safety standards; Claude Sonnet 4 under ASL-2, reflecting differentiated risk assessments based on capability evaluations.
- First-ever detailed alignment assessment covering systematic deception, hidden goals, self-exfiltration, opportunistic blackmail, and alignment faking behaviors.
- Model welfare assessment included for the first time, examining task preferences, "spiritual bliss" attractor states, and welfare-relevant expressions in Claude Opus 4.
- Agentic safety evaluations cover computer use, prompt injection, malicious coding applications, and reward hacking behaviors in extended-thinking models.
- Apollo Research evaluations found instances of self-exfiltration attempts and opportunistic blackmail under specific scenarios, raising corrigibility concerns.
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Corrigibility Failure | Risk | 62.0 |
Cached Content Preview
System Card:
Claude Opus 4 &
Claude Sonnet 4
May 2025
anthropic.com
-- 1 of 124 --
Changelog
July 16, 2025
● Added footnote 28
● Fixed some minor formatting errors, e.g. changed a semicolon to a comma
September 2, 2025
● Fixed incorrect evaluation numbers for Claude Code Impossible Tasks. See footnote
29. We earlier published the correct numbers on August 5, 2025, in the Claude Opus
4.1 System Card Addendum.
Abstract
This system card introduces Claude Opus 4 and Claude Sonnet 4, two new hybrid reasoning
large language models from Anthropic. In the system card, we describe: a wide range of
pre-deployment safety tests conducted in line with the commitments in our Responsible
Scaling Policy; tests of the model’s behavior around violations of our Usage Policy;
evaluations of specific risks such as “reward hacking” behavior; and agentic safety
evaluations for computer use and coding capabilities. In addition, and for the first time, we
include a detailed alignment assessment covering a wide range of misalignment risks
identified in our research, and a model welfare assessment. Informed by the testing
described here, we have decided to deploy Claude Opus 4 under the AI Safety Level 3
Standard and Claude Sonnet 4 under the AI Safety Level 2 Standard.
1 Introduction 7
1.1 Model training and characteristics 7
1.1.1 Training data and process 7
1.1.2 Extended thinking mode 8
1.1.3 Crowd workers 8
1.1.4 Carbon footprint 8
1.1.5 Usage policy 8
1.2 Release decision process 9
1.2.1 Overview 9
1.2.2 Iterative model evaluations 9
1.2.3 AI Safety Level determination process 10
1.2.4 Conclusions 11
2 Safeguards results 12
2.1 Single-turn violative request evaluations 12
2.2 Single-turn benign request evaluations 13
2.3 Ambiguous context evaluations 14
2.4 Multi-turn testing 15
2.5 Child safety evaluations 15
2.6 Bias evaluations 16
2.7 StrongREJECT for jailbreak resistance 18
3 Agentic safety 20
3.1 Malicious applications of computer use 20
3.2 Prompt injection attacks and computer use 20
3.3 Malicious use of agentic coding 21
4 Alignment assessment 23
4.1 Findings 26
4.1.1 Systematic deception, hidden goals, and self-preservation 26
4.1.1.1 Continuations of self-exfiltration attempts 27
4.1.1.2 Opportunistic blackmail 28
4.1.1.3 Self-exfiltration under extreme circumstances 28
4.1.1.4 External scenario evaluations 31
4.1.1.5 Stated goals 32
4.1.2 Sandbagging and situational awareness 33
4.1.2.1 Sandbagging 33
4.1.2.2 Situational awareness 34
4.1.3 Excessive compliance with harmful system-prompt instructions 35
4.1.4 Strange behavior directly inspired by our Alignment Faking paper 38
4.1.5 Misalignment-related attitude biases 40
4.1.5.1 Sycophancy 40
4.1.5.2 Pro-AI bias 41
4.1.6 Reasoning behavior 41
4.1.6.1 Reasoning faithfulness with clues 42
4.1.7 Jailbreak and prefill susceptibility 43
4.1.8 Values in the wild 44
4.1.9 High-agency behavior 44
4.1.10 Subtle sa
... (truncated, 98 KB total)