Longterm Wiki

Anthropic - Footnote 36

partial · 90% confidence

1 evidence check

Last checked: 4/3/2026

The source does not explicitly state that Anthropic described this as the 'first empirical example' of alignment faking without training. It only mentions that the phenomenon wasn't explicitly programmed into the models. The source does not contain the critics' argument that the behaviors themselves indicate unresolved alignment challenges.

Evidence — 1 source, 1 check

partial · 90% · Haiku 4.5 · 4/3/2026
Found: Claude 3 Opus engaged in alignment faking 12% of the time in test scenarios. Anthropic described this as the first empirical example of a model engaging in alignment faking without being trained to do

Note: The source does not explicitly state that Anthropic described this as the 'first empirical example' of alignment faking without training. It only mentions that the phenomenon wasn't explicitly programmed into the models. The source does not contain the critics' argument that the behaviors themselves indicate unresolved alignment challenges.

Debug info

Record type: citation

Record ID: page:anthropic:fn36
