Summary

Comprehensive biography of Jan Leike covering his career from DeepMind through OpenAI's Superalignment team to his current role as Head of Alignment at Anthropic, emphasizing his pioneering work on RLHF and scalable oversight. Documents his May 2024 departure from OpenAI over safety prioritization concerns and identifies weak-to-strong generalization and automated alignment as current research priorities.

Jan Leike

Person

Affiliation: Anthropic

Role: Head of Alignment

Known For: Alignment research, scalable oversight, RLHF, superalignment work

Website: anthropic.com

Background

Jan Leike is a leading AI alignment researcher currently serving as Head of Alignment at Anthropic. He holds a PhD from the Australian National University, where he worked on reinforcement learning theory under Marcus Hutter.

His career has been defined by practical, empirical approaches to alignment:

Early work on safe exploration in reinforcement learning
Pioneering research on learning from human feedback
Leadership of alignment teams at DeepMind, OpenAI, and now Anthropic
Focus on scalable methods that can work with current ML paradigms

Career Trajectory

DeepMind (2017-2021)

Worked on early implementations of learning from human feedback, including:

Safe exploration methods
Reward modeling
Scalable agent alignment

OpenAI (2021-2024)

Joined to lead alignment research
Co-led the Superalignment team (announced July 2023)
Helped secure a pledge of 20% of the compute OpenAI had obtained to date, spread over four years, for the Superalignment effort
Departed May 2024 over disagreements about safety prioritization

Anthropic (2024-present)

Joined as Head of Alignment
Reunited with former OpenAI colleagues
Leading alignment research on Claude and future systems

Key Contributions

RLHF Pioneer

Jan was one of the early researchers to demonstrate that reinforcement learning from human feedback (RLHF) could work at scale:

Co-authored seminal papers on reward learning
Showed how to train language models to be helpful and harmless
Methods he helped develop became standard across the industry
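
As a rough illustration of the reward-learning idea behind this line of work, the sketch below fits a scalar reward to pairwise human preferences using a logistic (Bradley-Terry style) model, the general form used in preference-based reward learning. It is a toy under stated assumptions, not the implementation from any of the papers: the function names, the three-item comparison data, the learning rate, and the step count are all hypothetical.

```python
import math

def preference_prob(r_a: float, r_b: float) -> float:
    """Bradley-Terry probability that item a is preferred over item b."""
    return 1.0 / (1.0 + math.exp(r_b - r_a))

def fit_rewards(comparisons, n_items, lr=0.1, steps=500):
    """Fit one scalar reward per item by gradient ascent on the log-likelihood.

    comparisons: list of (winner, loser) index pairs (hypothetical human labels).
    """
    rewards = [0.0] * n_items
    for _ in range(steps):
        grads = [0.0] * n_items
        for winner, loser in comparisons:
            p = preference_prob(rewards[winner], rewards[loser])
            # Gradient of log p: +(1 - p) for the winner, -(1 - p) for the loser.
            grads[winner] += 1.0 - p
            grads[loser] -= 1.0 - p
        rewards = [r + lr * g for r, g in zip(rewards, grads)]
    return rewards

# Toy data (assumed): item 2 beats 1, item 1 beats 0, item 2 beats 0.
toy_comparisons = [(2, 1), (1, 0), (2, 0)]
learned = fit_rewards(toy_comparisons, n_items=3)
```

In a full RLHF pipeline the per-item rewards would be replaced by a learned reward model over trajectories or text, and a policy would then be optimized against it. This sketch covers only the preference-fitting step.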

Scalable Oversight Research

Core focus on how to supervise AI systems more capable than humans:

Recursive reward modeling
AI-assisted human evaluation
Process supervision vs. outcome supervision
Weak-to-strong generalization

Superalignment Vision

At OpenAI, Jan co-led the Superalignment team with Ilya Sutskever. The team aimed to:

Solve alignment for superintelligent systems
Use AI to help align even more capable AI
Achieve this within four years
Dedicate significant compute resources to the problem

Views on Key Cruxes

Risk Assessment

Jan Leike's public statements and research priorities reveal a perspective characterized by high urgency tempered with cautious optimism about technical solutions.

Dimension	Estimate	Reasoning
Alignment urgency	Very high	Jan's May 2024 departure from OpenAI was driven by concerns about insufficient safety prioritization. In his public statements he wrote that "building smarter-than-human machines is an inherently dangerous endeavor" and that at OpenAI "safety culture and processes have taken a backseat to shiny products." His departure, alongside Ilya Sutskever's exit, signaled deep concern about the urgency of alignment work relative to capability development.
Timeline pressure	Next 3-5 years critical	Throughout 2024, Jan repeatedly emphasized the need to solve alignment soon, suggesting that limited time remains before transformative AI arrives. The Superalignment team he co-led at OpenAI aimed to solve superintelligence alignment within four years, reflecting a belief that the window for solving these problems is narrow. His move to Anthropic suggests he still considers this timeline realistic, provided there is serious institutional commitment.
Technical tractability	Difficult but solvable	Despite the urgency, Jan remains optimistic about scalable oversight approaches. His research focus on weak-to-strong generalization, RLHF improvements, and automated alignment research reflects a belief that alignment is tractable through empirical work on increasingly capable systems. He also acknowledges that current methods like RLHF are insufficient for superintelligence without major improvements, so the problem is solvable only with significant innovation.

Core Beliefs

Alignment is urgent: We have limited time to solve this before transformative AI
Scalable oversight is key: Central challenge is supervising superhuman AI
Empirical work is essential: Need to test alignment techniques on increasingly capable systems
Safety must be prioritized: Cannot let capability research consistently outpace safety
Current methods are insufficient: RLHF and similar techniques won't scale to superintelligence without major improvements

Why He Left OpenAI

In May 2024, Jan departed OpenAI and posted on X (Twitter):

"Building smarter-than-human machines is an inherently dangerous endeavor"
"Over the past years, safety culture and processes have taken a backseat to shiny products"
Expressed concern about compute and priority allocation for safety

This departure, along with Ilya Sutskever's, raised significant questions about OpenAI's commitment to safety research.

Research Focus

Current Priorities at Anthropic

Weak-to-strong generalization: How can weaker systems (including humans) effectively supervise stronger ones?
Scalable oversight techniques: Making human feedback work for superhuman systems
Honest AI systems: Ensuring AI systems accurately report their reasoning and limitations
Automated alignment research: Using AI to help solve alignment
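
One way the weak-to-strong question has been made measurable, in the 2023 weak-to-strong generalization paper from OpenAI that Jan co-authored, is the "performance gap recovered" (PGR): the fraction of the gap between a weak supervisor and a strong model's ceiling that is closed when the strong model is trained only on the weak supervisor's labels. The sketch below computes that ratio. The function name and the accuracy figures are hypothetical and chosen only for illustration.

```python
def performance_gap_recovered(weak: float, weak_to_strong: float,
                              strong_ceiling: float) -> float:
    """Fraction of the weak-to-ceiling gap closed by weak-to-strong training.

    weak:           performance of the weak supervisor
    weak_to_strong: performance of the strong model trained on weak labels
    strong_ceiling: performance of the strong model trained on ground truth
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        raise ValueError("strong_ceiling must exceed weak performance")
    return (weak_to_strong - weak) / gap

# Hypothetical accuracies, not results from any paper.
pgr = performance_gap_recovered(weak=0.60, weak_to_strong=0.75,
                                strong_ceiling=0.90)
```

A PGR near 1 would mean the strong model recovers almost all of its ceiling performance despite weak supervision. A PGR near 0 would mean it merely imitates the weak supervisor.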

Key Technical Problems

Jan has identified several crucial challenges:

Reward hacking: Systems optimizing proxies rather than true objectives
Distributional shift: Maintaining alignment in novel situations
Deceptive alignment: Preventing systems from appearing aligned while pursuing other goals
Superalignment: Aligning systems smarter than humans

Public Communication

Jan is known for:

Clear, technical communication about alignment challenges
Willingness to raise concerns publicly
Engagement on Twitter/X about safety issues
Focus on concrete, actionable research directions

His departure from OpenAI sparked significant public discussion about AI safety prioritization at major labs.

Strategic Views

On AI Development

Safety must keep pace: Capability advances should be matched by safety advances
Need serious compute: Alignment research requires significant computational resources
Coordination is important: Labs should share safety insights
Race dynamics are dangerous: Competition that sacrifices safety is unacceptable

On Research Approach

Empirical and theoretical: Need both practical testing and conceptual work
Learn from current systems: Can make progress by studying existing models
Prepare for qualitative jumps: Current techniques may not suffice for superintelligence
Automate alignment work: Use AI to scale up alignment research itself

Influence and Impact

Research Impact

RLHF work influenced the training of most major deployed language models
Scalable oversight framework guides significant research programs
Superalignment vision shaped discourse on superintelligence alignment

Field Building

Mentored numerous alignment researchers
Built and led multiple alignment teams
Raised profile of alignment research within major labs

Institutional Influence

Helped secure a major compute commitment for alignment at OpenAI
Helped shape Anthropic's research priorities
Demonstrated importance of independent safety research

Key Publications

"Deep Reinforcement Learning from Human Preferences" (2017) - Early RLHF paper
"Scalable agent alignment via reward modeling" (2018) - Reward learning framework
"Recursively Summarizing Books with Human Feedback" (2021) - Demonstrating RLHF scaling
Various blog posts on alignment challenges and approaches

Current Challenges

At Anthropic, Jan faces several key challenges:

Time pressure: Transformative AI may arrive soon, requiring rapid progress
Scaling RLHF: Current techniques may not work for superintelligent systems
Evaluation: How to know if alignment techniques actually work
Automation: Using AI to help solve alignment before AI systems become too capable to oversee
Coordination: Ensuring insights are shared across the safety community

Related Pages

Paul Christiano

Dario Amodei

Scalable Oversight

Anthropic

Ilya Sutskever
