1 evidence check
Last checked: 4/3/2026
The source does not explicitly state that Anthropic described this as the 'first empirical example' of alignment faking without training. It only mentions that the phenomenon wasn't explicitly programmed into the models. The source does not contain the critics' argument that the behaviors themselves indicate unresolved alignment challenges.
Evidence — 1 source, 1 check
Note: The source does not explicitly state that Anthropic described this as the 'first empirical example' of alignment faking without training. It only mentions that the phenomenon wasn't explicitly programmed into the models. The source does not contain the critics' argument that the behaviors themselves indicate unresolved alignment challenges.
Debug info
Record type: citation
Record ID: page:anthropic:fn36