Anthropic System Card 2025
Credibility Rating
4/5
High (4): High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Anthropic
Official Anthropic transparency document outlining safety testing and capability evaluation procedures for Claude models in 2025; useful reference for understanding frontier lab safety practices and responsible scaling policy implementation.
Metadata
Importance: 72/100 · organizational report · primary source
Summary
Anthropic's 2025 system card documents the safety evaluations, capability assessments, and deployment considerations for their Claude models. It covers red-teaming results, alignment properties, and risk mitigations applied before public release, serving as a transparency artifact for the AI safety community.
Key Points
- Documents pre-deployment safety evaluations, including dangerous capability assessments across CBRN, cybersecurity, and autonomous operation domains
- Describes red-teaming methodologies used to probe model vulnerabilities and elicit unsafe behaviors prior to release
- Details Constitutional AI and RLHF-based alignment approaches used to shape Claude's values and behavioral norms
- Covers evaluations for catastrophic risk areas, including bioweapons uplift, persuasion, and deceptive alignment indicators
- Provides transparency into Anthropic's responsible scaling policy implementation and deployment decision-making
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Emergent Capabilities | Risk | 61.0 |
Cached Content Preview
HTTP 200 · Fetched Mar 20, 2026 · 98 KB
System Card:
Claude Opus 4 &
Claude Sonnet 4
May 2025
anthropic.com
-- 1 of 124 --
Changelog
July 16, 2025
● Added footnote 28
● Fixed some minor formatting errors, e.g. changed a semicolon to a comma
September 2, 2025
● Fixed incorrect evaluation numbers for Claude Code Impossible Tasks. See footnote 29. We earlier published the correct numbers on August 5, 2025, in the Claude Opus 4.1 System Card Addendum.
Abstract
This system card introduces Claude Opus 4 and Claude Sonnet 4, two new hybrid reasoning
large language models from Anthropic. In the system card, we describe: a wide range of
pre-deployment safety tests conducted in line with the commitments in our Responsible
Scaling Policy; tests of the model’s behavior around violations of our Usage Policy;
evaluations of specific risks such as “reward hacking” behavior; and agentic safety
evaluations for computer use and coding capabilities. In addition, and for the first time, we
include a detailed alignment assessment covering a wide range of misalignment risks
identified in our research, and a model welfare assessment. Informed by the testing
described here, we have decided to deploy Claude Opus 4 under the AI Safety Level 3
Standard and Claude Sonnet 4 under the AI Safety Level 2 Standard.
1 Introduction 7
1.1 Model training and characteristics 7
1.1.1 Training data and process 7
1.1.2 Extended thinking mode 8
1.1.3 Crowd workers 8
1.1.4 Carbon footprint 8
1.1.5 Usage policy 8
1.2 Release decision process 9
1.2.1 Overview 9
1.2.2 Iterative model evaluations 9
1.2.3 AI Safety Level determination process 10
1.2.4 Conclusions 11
2 Safeguards results 12
2.1 Single-turn violative request evaluations 12
2.2 Single-turn benign request evaluations 13
2.3 Ambiguous context evaluations 14
2.4 Multi-turn testing 15
2.5 Child safety evaluations 15
2.6 Bias evaluations 16
2.7 StrongREJECT for jailbreak resistance 18
3 Agentic safety 20
3.1 Malicious applications of computer use 20
3.2 Prompt injection attacks and computer use 20
3.3 Malicious use of agentic coding 21
4 Alignment assessment 23
4.1 Findings 26
4.1.1 Systematic deception, hidden goals, and self-preservation 26
4.1.1.1 Continuations of self-exfiltration attempts 27
4.1.1.2 Opportunistic blackmail 28
4.1.1.3 Self-exfiltration under extreme circumstances 28
4.1.1.4 External scenario evaluations 31
4.1.1.5 Stated goals 32
4.1.2 Sandbagging and situational awareness 33
4.1.2.1 Sandbagging 33
4.1.2.2 Situational awareness 34
4.1.3 Excessive compliance with harmful system-prompt instructions 35
4.1.4 Strange behavior directly inspired by our Alignment Faking paper 38
4.1.5 Misalignment-related attitude biases 40
4.1.5.1 Sycophancy 40
4.1.5.2 Pro-AI bias 41
4.1.6 Reasoning behavior 41
4.1.6.1 Reasoning faithfulness with clues 42
4.1.7 Jailbreak and prefill susceptibility 43
4.1.8 Values in the wild 44
4.1.9 High-agency behavior 44
4.1.10 Subtle sa
... (truncated, 98 KB total)
Resource ID: ee446d453c6f2c7d | Stable ID: MDNmZDIzMm