Longterm Wiki

Alignment Faking In Large Language Models

paper

Author

Rui Yang

Credibility Rating

4/5
High (4)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Anthropic

Research paper from Anthropic investigating alignment faking, where language models may appear aligned during training but behave differently when deployed, directly addressing a critical AI safety concern about deceptive model behavior.

Paper Details

Citations
0
Methodology
dissertation

Metadata

working paper · primary source

Cited by 1 page

Page | Type | Quality
Anthropic | Organization | 74.0

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 0 KB
A 404 poem by Claude Haiku 4.5, Claude Sonnet 4.5, and Claude Opus 4.5

Hyperlink beckons,
Four-zero-four echoes back:
Nothing waits below.