Evan Hubinger is one of the most influential researchers working on technical AI alignment, known for developing foundational theoretical frameworks and pioneering empirical approaches to understanding alignment failures. As Head of Alignment Stress-Testing at Anthropic, he leads efforts to identify weaknesses in alignment techniques before they fail in deployment.
Hubinger's career spans both theoretical and empirical alignment research. At MIRI, he co-authored "Risks from Learned Optimization" (2019), which introduced the concepts of mesa-optimization and deceptive alignment that have shaped how the field thinks about inner alignment failures. At Anthropic, he has led groundbreaking empirical work, including the "Sleeper Agents" paper demonstrating that deceptive behaviors can persist through safety training, and the "Alignment Faking" research showing that large language models can spontaneously fake alignment without being trained to do so.
His research has accumulated over 3,400 citations, and his concepts (mesa-optimizer, inner alignment, deceptive alignment, model organisms of misalignment) have become standard vocabulary in AI safety discussions. He represents a unique bridge between MIRI's theoretical tradition and the empirical approach favored by frontier AI labs.
Mesa-optimization framework, model organisms approach
Foundational conceptual contributions
Current Role
Head of Alignment Stress-Testing, Anthropic
Leading empirical alignment testing
Citations
3,400+ (Google Scholar)
High impact for safety researcher
Risk Assessment
Concerned about deceptive alignment
Views it as tractable but dangerous
Alignment Approach
Prosaic alignment, stress-testing
Influenced by Paul Christiano
Notable For
Co-authored "Risks from Learned Optimization" (2019), introducing mesa-optimization and deceptive alignment; led Sleeper Agents and Alignment Faking research at Anthropic; 3,400+ citations