Longterm Wiki

Expert Positions (1 topic)

Topic	View	Estimate	Confidence	Date	Source	Source check
Likelihood of Deceptive Alignment	Possible	40%	medium	2019	Risks from Learned Optimization

Education

B.S. Mathematics and Computer Science, Harvey Mudd College (2019), High Distinction, GPA 3.912

From wiki article

Overview

Evan Hubinger is one of the most influential researchers working on technical AI alignment, known for developing foundational theoretical frameworks and pioneering empirical approaches to understanding alignment failures. As Head of Alignment Stress-Testing at Anthropic, he leads efforts to identify weaknesses in alignment techniques before they fail in deployment.

Hubinger's career spans both theoretical and empirical alignment research. At MIRI, he co-authored "Risks from Learned Optimization" (2019), which introduced the concepts of mesa-optimization and deceptive alignment that have shaped how the field thinks about inner alignment failures. At Anthropic, he has led groundbreaking empirical work including the "Sleeper Agents" paper demonstrating that deceptive behaviors can persist through safety training, and the "Alignment Faking" research showing that large language models can spontaneously fake alignment without being trained to do so.

His research has accumulated over 3,400 citations, and his concepts (mesa-optimizer, inner alignment, deceptive alignment, and model organisms of misalignment) have become standard vocabulary in AI safety discussions. He bridges MIRI's theoretical tradition and the empirical approach favored by frontier AI labs.

Quick Assessment

Dimension	Assessment	Evidence
Research Focus	Inner alignment, deceptive AI, empirical stress-testing	Bridging theory and experiment
Key Innovation	Mesa-optimization framework, model organisms approach	Foundational conceptual contributions
Current Role	Head of Alignment Stress-Testing, Anthropic	Leading empirical alignment testing
Citations	3,400+ (Google Scholar)	High impact for a safety researcher
Risk Assessment	Concerned about deceptive alignment	Views it as tractable but dangerous
Alignment Approach	Prosaic alignment, stress-testing	Influenced by Paul Christiano
