Alignment Faking
Evaluation
emerging
Research on whether and how AI systems might pretend to be aligned during evaluation while pursuing different goals at deployment.
Organizations
4
Key Papers
1
Tags
evaluations, deception, alignment
Key Papers & Resources (1)
SEMINAL
Alignment Faking in Large Language Models
Greenblatt et al. (Anthropic), 2024