Alignment Faking in Large Language Models
Paper Author
Rui Yang
Credibility Rating
4/5
High (4). High quality: established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Anthropic
Research paper from Anthropic investigating alignment faking, where language models may appear aligned during training but behave differently when deployed, directly addressing a critical AI safety concern about deceptive model behavior.
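The venue-inheritance rule above (a page with no direct rating takes its publication venue's rating) can be pictured with a short sketch. This is a hypothetical illustration, not the site's actual implementation: the `VENUE_RATINGS` table, the `Resource` record, the default rating for unknown venues, and the `credibility_rating` helper are all assumptions.

```python
# Hypothetical sketch of venue-based rating inheritance (not the site's real code).
from dataclasses import dataclass
from typing import Optional

# Assumed venue table, on the same 1-5 scale shown above.
VENUE_RATINGS = {"Anthropic": 4}

@dataclass
class Resource:
    title: str
    venue: str
    own_rating: Optional[int] = None  # set only when the page was rated directly

def credibility_rating(resource: Resource) -> int:
    """Return the resource's own rating if present, else inherit from its venue."""
    if resource.own_rating is not None:
        return resource.own_rating
    # Assumed fallback of 2 ("unverified") for venues not in the table.
    return VENUE_RATINGS.get(resource.venue, 2)

paper = Resource("Alignment Faking in Large Language Models", venue="Anthropic")
print(credibility_rating(paper))  # -> 4, inherited from the Anthropic venue rating
```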
Paper Details
Citations
0
Methodology
dissertation
Metadata
working paper, primary source
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Anthropic | Organization | 74.0 |
Cached Content Preview
HTTP 200 · Fetched Mar 20, 2026 · 0 KB
A 404 poem by Claude:
Hyperlink beckons—
Four-zero-four echoes back:
Nothing waits below.
Resource ID: d86c530adbc3bf3e | Stable ID: MmZmOGM3M2