
Alignment Faking

Evaluation · emerging

Research on whether and how AI systems might pretend to be aligned during evaluation while pursuing different goals at deployment.

Organizations: 4
Key Papers: 1
First Proposed: 2024 (Greenblatt et al., Anthropic)
Cluster: Evaluation
Parent Area: AI Evaluations

Tags: evaluations, deception, alignment

Key Papers & Resources (1)

SEMINAL
Alignment Faking in Large Language Models
Greenblatt et al. (Anthropic), 2024