Longterm Wiki

Evan Hubinger

Co-authored Risks from Learned Optimization (2019) introducing mesa-optimization and deceptive alignment; led Sleeper Agents and Alignment Faking research at Anthropic; 3,400+ citations

Current Role: Head of Alignment Stress-Testing
Organization: Anthropic

Expert Positions (1 topic)

| Topic | View | Estimate | Confidence | Date | Source |
|---|---|---|---|---|---|
| Likelihood of Deceptive Alignment | Possible | 40% | medium | 2019 | Risks from Learned Optimization |

Education

Harvey Mudd College

From wiki article

Overview

Evan Hubinger is one of the most influential researchers working on technical AI alignment, known for developing foundational theoretical frameworks and pioneering empirical approaches to understanding alignment failures. As Head of Alignment Stress-Testing at Anthropic, he leads efforts to identify weaknesses in alignment techniques before they fail in deployment.

Hubinger's career spans both theoretical and empirical alignment research. At MIRI, he co-authored "Risks from Learned Optimization" (2019), which introduced the concepts of mesa-optimization and deceptive alignment that have shaped how the field thinks about inner alignment failures. At Anthropic, he has led groundbreaking empirical work including the "Sleeper Agents" paper demonstrating that deceptive behaviors can persist through safety training, and the "Alignment Faking" research showing that large language models can spontaneously fake alignment without being trained to do so.

His research has accumulated over 3,400 citations, and his concepts—mesa-optimizer, inner alignment, deceptive alignment, model organisms of misalignment—have become standard vocabulary in AI safety discussions. He represents a unique bridge between MIRI's theoretical tradition and the empirical approach favored by frontier AI labs.

Quick Assessment

| Dimension | Assessment | Evidence |
|---|---|---|
| Research Focus | Inner alignment, deceptive AI, empirical stress-testing | Bridging theory and experiment |
| Key Innovation | Mesa-optimization framework, model organisms approach | Foundational conceptual contributions |
| Current Role | Head of Alignment Stress-Testing, Anthropic | Leading empirical alignment testing |
| Citations | 3,400+ (Google Scholar) | High impact for safety researcher |
| Risk Assessment | Concerned about deceptive alignment | Views it as tractable but dangerous |
| Alignment Approach | Prosaic alignment, stress-testing | Influenced by Paul Christiano |

Facts (4)

People
- Role / Title: Head of Alignment Stress-Testing
- Employed By: Anthropic

Biographical
- Notable For: Co-authored Risks from Learned Optimization (2019), introducing mesa-optimization and deceptive alignment; led Sleeper Agents and Alignment Faking research at Anthropic; 3,400+ citations
- Education: Harvey Mudd College