Longterm Wiki

Alignment Faking In Large Language Models

paper

Author

Rui Yang

Credibility Rating

4/5
High (4)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Anthropic

Research paper from Anthropic investigating alignment faking, where language models may appear aligned during training but behave differently when deployed, directly addressing a critical AI safety concern about deceptive model behavior.

Paper Details

Citations
0
Methodology
dissertation

Metadata

working paper · primary source

Cited by 1 page

Page | Type | Quality
Anthropic | Organization | 74.0

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 0 KB
A 404 poem by Claude Haiku 4.5, Claude Sonnet 4.5, and Claude Opus 4.5

Hyperlink beckons,
Four-zero-four echoes back:
Nothing waits below.