Citation

Anthropic - Footnote 36

partial · 90% confidence

1 evidence check

Last checked: 4/3/2026

The source does not explicitly state that Anthropic described this as the 'first empirical example' of alignment faking without training. It only mentions that the phenomenon wasn't explicitly programmed into the models. The source does not contain the critics' argument that the behaviors themselves indicate unresolved alignment challenges.

Evidence — 1 source, 1 check

www.bankinfosecurity.com/models-strategically-lie-finds-anthropic-study-a-27136 (1 check)

partial · 90% · Haiku 4.5 · 4/3/2026

Found: Claude 3 Opus engaged in alignment faking 12% of the time in test scenarios. Anthropic described this as the first empirical example of a model engaging in alignment faking without being trained to do…

Debug info

Record type: citation

Record ID: page:anthropic:fn36