Jan Leike
jan-leike (E182)
Path: /knowledge-base/people/jan-leike/
Page Metadata
{
"id": "jan-leike",
"numericId": null,
"path": "/knowledge-base/people/jan-leike/",
"filePath": "knowledge-base/people/jan-leike.mdx",
"title": "Jan Leike",
"quality": 27,
"importance": 22,
"contentFormat": "article",
"tractability": null,
"neglectedness": null,
"uncertainty": null,
"causalLevel": null,
"lastUpdated": "2026-01-29",
"llmSummary": "Comprehensive biography of Jan Leike covering his career from DeepMind through OpenAI's Superalignment team to current role as Head of Alignment at Anthropic, emphasizing his pioneering work on RLHF and scalable oversight. Documents his May 2024 departure from OpenAI over safety prioritization concerns and identifies weak-to-strong generalization and automated alignment as current research priorities.",
"structuredSummary": null,
"description": "Head of Alignment at Anthropic, formerly led OpenAI's superalignment team",
"ratings": {
"novelty": 2,
"rigor": 3.5,
"actionability": 2,
"completeness": 5
},
"category": "people",
"subcategory": null,
"clusters": [
"ai-safety"
],
"metrics": {
"wordCount": 1118,
"tableCount": 1,
"diagramCount": 0,
"internalLinks": 12,
"externalLinks": 0,
"footnoteCount": 0,
"bulletRatio": 0.61,
"sectionCount": 26,
"hasOverview": false,
"structuralScore": 5
},
"suggestedQuality": 33,
"updateFrequency": 21,
"evergreen": true,
"wordCount": 1118,
"unconvertedLinks": [],
"unconvertedLinkCount": 0,
"convertedLinkCount": 0,
"backlinkCount": 5,
"redundancy": {
"maxSimilarity": 15,
"similarPages": [
{
"id": "ilya-sutskever",
"title": "Ilya Sutskever",
"path": "/knowledge-base/people/ilya-sutskever/",
"similarity": 15
},
{
"id": "paul-christiano",
"title": "Paul Christiano",
"path": "/knowledge-base/people/paul-christiano/",
"similarity": 15
},
{
"id": "connor-leahy",
"title": "Connor Leahy",
"path": "/knowledge-base/people/connor-leahy/",
"similarity": 14
},
{
"id": "dan-hendrycks",
"title": "Dan Hendrycks",
"path": "/knowledge-base/people/dan-hendrycks/",
"similarity": 14
},
{
"id": "dario-amodei",
"title": "Dario Amodei",
"path": "/knowledge-base/people/dario-amodei/",
"similarity": 14
}
]
}
}
Entity Data
{
"id": "jan-leike",
"type": "person",
"title": "Jan Leike",
"description": "Jan Leike is the Head of Alignment at Anthropic, where he leads research on ensuring AI systems remain beneficial as they become more capable. Before joining Anthropic in 2024, he co-led OpenAI's Superalignment team, which was tasked with solving alignment for superintelligent AI systems within four years.\n\nLeike's research has been foundational for modern alignment techniques. He co-authored key papers on learning from human feedback, including \"Deep Reinforcement Learning from Human Preferences\" which helped establish RLHF as the dominant paradigm for aligning large language models. His work on scalable oversight explores how to maintain human control over AI systems even when they become too capable for humans to directly evaluate their outputs.\n\nLeike's departure from OpenAI in May 2024 was publicly significant - he stated that safety had \"taken a backseat to shiny products\" and that the company was not adequately preparing for the challenges of superintelligence. His move to Anthropic, along with several colleagues from the Superalignment team, signaled broader concerns about safety culture at frontier labs. At Anthropic, he continues work on scalable oversight, weak-to-strong generalization, and detecting deceptive behavior in AI systems.\n",
"tags": [
"rlhf",
"scalable-oversight",
"superalignment",
"reward-modeling",
"weak-to-strong-generalization",
"process-supervision",
"deception"
],
"relatedEntries": [
{
"id": "anthropic",
"type": "lab"
},
{
"id": "scalable-oversight",
"type": "safety-agenda"
},
{
"id": "dario-amodei",
"type": "researcher"
},
{
"id": "paul-christiano",
"type": "researcher"
}
],
"sources": [
{
"title": "Jan Leike on X/Twitter",
"url": "https://twitter.com/janleike"
},
{
"title": "Deep RL from Human Preferences",
"url": "https://arxiv.org/abs/1706.03741"
},
{
"title": "OpenAI Superalignment Announcement",
"url": "https://openai.com/blog/introducing-superalignment"
},
{
"title": "Departure Statement",
"url": "https://twitter.com/janleike/status/1790517668677865835"
}
],
"lastUpdated": "2025-12",
"website": "https://anthropic.com",
"customFields": []
}
Canonical Facts (0)
No facts for this entity
External Links
{
"wikidata": "https://www.wikidata.org/wiki/Q123130693"
}
Backlinks (5)
| id | title | type | relationship |
|---|---|---|---|
| anthropic | Anthropic | lab | research |
| openai | OpenAI | lab | research |
| dario-amodei | Dario Amodei | researcher | — |
| ilya-sutskever | Ilya Sutskever | researcher | — |
| paul-christiano | Paul Christiano | researcher | — |
Frontmatter
{
"title": "Jan Leike",
"description": "Head of Alignment at Anthropic, formerly led OpenAI's superalignment team",
"sidebar": {
"order": 4
},
"quality": 27,
"llmSummary": "Comprehensive biography of Jan Leike covering his career from DeepMind through OpenAI's Superalignment team to current role as Head of Alignment at Anthropic, emphasizing his pioneering work on RLHF and scalable oversight. Documents his May 2024 departure from OpenAI over safety prioritization concerns and identifies weak-to-strong generalization and automated alignment as current research priorities.",
"lastEdited": "2026-01-29",
"importance": 22,
"update_frequency": 21,
"ratings": {
"novelty": 2,
"rigor": 3.5,
"actionability": 2,
"completeness": 5
},
"clusters": [
"ai-safety"
],
"entityType": "person"
}
Raw MDX Source
---
title: Jan Leike
description: Head of Alignment at Anthropic, formerly led OpenAI's superalignment team
sidebar:
order: 4
quality: 27
llmSummary: Comprehensive biography of Jan Leike covering his career from DeepMind through OpenAI's Superalignment team to current role as Head of Alignment at Anthropic, emphasizing his pioneering work on RLHF and scalable oversight. Documents his May 2024 departure from OpenAI over safety prioritization concerns and identifies weak-to-strong generalization and automated alignment as current research priorities.
lastEdited: "2026-01-29"
importance: 22
update_frequency: 21
ratings:
novelty: 2
rigor: 3.5
actionability: 2
completeness: 5
clusters: ["ai-safety"]
entityType: person
---
import {DataInfoBox, DataExternalLinks, EntityLink} from '@components/wiki';
<DataExternalLinks pageId="jan-leike" />
<DataInfoBox entityId="E182" />
## Background
Jan Leike is a leading <EntityLink id="E439">AI alignment</EntityLink> researcher currently serving as Head of Alignment at <EntityLink id="E22">Anthropic</EntityLink>. He holds a PhD from the Australian National University, where he worked on AI safety under Marcus Hutter.
His career has been defined by practical, empirical approaches to alignment:
- Early work on safe exploration in reinforcement learning
- Pioneering research on learning from human feedback
- Leadership of alignment teams at DeepMind, <EntityLink id="E218">OpenAI</EntityLink>, and now Anthropic
- Focus on scalable methods that can work with current ML paradigms
## Career Trajectory
### DeepMind (2017-2021)
Worked on the first implementations of learning from human feedback, including:
- Safe exploration methods
- <EntityLink id="E600">Reward modeling</EntityLink>
- Scalable agent alignment
### OpenAI (2021-2024)
- Joined to lead alignment research
- Co-led the Superalignment team (announced July 2023)
- Secured a commitment of 20% of OpenAI's then-secured compute for alignment research
- Departed May 2024 over disagreements about safety prioritization
### Anthropic (2024-present)
- Joined as Head of Alignment
- Reunited with former OpenAI colleagues
- Leading alignment research on Claude and future systems
## Key Contributions
### <EntityLink id="E259">RLHF</EntityLink> Pioneer
Jan was one of the early researchers to demonstrate that reinforcement learning from human feedback (RLHF) could work at scale (a minimal sketch of the underlying preference loss follows this list):
- Co-authored seminal papers on reward learning
- Showed how to train language models to be helpful and harmless
- The methods he developed became standard across the industry
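The core technique behind this work is a reward model trained on pairwise human preferences, introduced in "Deep Reinforcement Learning from Human Preferences" (Christiano, Leike, et al., 2017). Below is a minimal sketch of that preference loss, assuming PyTorch; the tiny `RewardModel` and the random feature tensors are illustrative stand-ins rather than anything from the original codebase.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward model: maps a response/trajectory embedding to a scalar reward."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def preference_loss(r_preferred: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style objective: maximize P(preferred beats rejected) = sigmoid(r_p - r_r)."""
    return -torch.nn.functional.logsigmoid(r_preferred - r_rejected).mean()

# Illustrative training step on random stand-in features for paired comparisons.
model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

preferred, rejected = torch.randn(32, 64), torch.randn(32, 64)
loss = preference_loss(model(preferred), model(rejected))
loss.backward()
optimizer.step()
```

The learned reward then stands in for direct human judgment as the optimization target for an RL policy, which is what lets a limited budget of human comparisons scale to training large models.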
### <EntityLink id="E271">Scalable Oversight</EntityLink> Research
Core focus on how to supervise AI systems more capable than humans:
- Recursive reward modeling
- AI-assisted human evaluation
- <EntityLink id="E455">Process supervision</EntityLink> vs. outcome supervision
- <EntityLink id="E452">Weak-to-strong generalization</EntityLink>
### Superalignment Vision
At OpenAI, co-led (with <EntityLink id="E163">Ilya Sutskever</EntityLink>) the Superalignment team, which aimed to:
- Solve alignment for superintelligent systems
- Use AI to help align even more capable AI
- Achieve this within four years
- Dedicate significant compute resources to the problem
## Views on Key Cruxes
### Risk Assessment
Jan Leike's public statements and research priorities reveal a perspective characterized by high urgency tempered by cautious optimism about technical solutions.
| Crux | Assessment | Reasoning |
|---------------|----------|-----------|
| Alignment urgency | Very high | Jan's May 2024 departure from OpenAI was driven by concerns about insufficient safety prioritization. His public statements emphasized that "building smarter-than-human machines is an inherently dangerous endeavor" and criticized OpenAI for letting "safety culture and processes take a backseat to shiny products." This departure, alongside Ilya Sutskever's exit, signaled deep concern about the urgency of alignment work relative to capability development. |
| Timeline pressure | Next 3-5 years critical | Throughout 2024, Jan repeatedly emphasized the need to solve alignment soon, suggesting we have limited time before transformative AI arrives. The Superalignment team he co-led at OpenAI aimed to solve superintelligence alignment within four years, reflecting his belief that the window for solving these problems is narrow. His move to Anthropic suggests he believes this timeline remains realistic but requires serious institutional commitment. |
| Technical tractability | Difficult but solvable | Despite the urgency, Jan maintains optimism about scalable oversight approaches. His research focus on weak-to-strong generalization, RLHF improvements, and automated alignment research reflects a belief that alignment is tractable through empirical work on increasingly capable systems. However, he acknowledges that current methods like RLHF are insufficient for superintelligence without major improvements, suggesting the problem is solvable but requires significant innovation. |
### Core Beliefs
1. **Alignment is urgent**: We have limited time to solve this before transformative AI
2. **Scalable oversight is key**: Central challenge is supervising superhuman AI
3. **Empirical work is essential**: Need to test alignment techniques on increasingly capable systems
4. **Safety must be prioritized**: Cannot let capability research consistently outpace safety
5. **Current methods are insufficient**: RLHF and similar techniques won't scale to superintelligence without major improvements
### Why He Left OpenAI
In May 2024, Jan departed OpenAI and posted on X (Twitter):
- "Building smarter-than-human machines is an inherently dangerous endeavor"
- "Over the past years, safety culture and processes have taken a backseat to shiny products"
- Expressed concern about compute and priority allocation for safety
This departure, along with Ilya Sutskever's, raised significant questions about OpenAI's commitment to safety research.
## Research Focus
### Current Priorities at Anthropic
1. **Weak-to-strong generalization**: How can weaker systems (including humans) effectively supervise stronger ones? (See the sketch after this list.)
2. **Scalable oversight techniques**: Making human feedback work for superhuman systems
3. **Honest AI systems**: Ensuring AI systems accurately report their reasoning and limitations
4. **Automated alignment research**: Using AI to help solve alignment
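One way to make the weak-to-strong question concrete is the "performance gap recovered" metric from the weak-to-strong generalization work published by OpenAI's Superalignment team: fine-tune a strong model on a weak supervisor's labels, then ask how much of the gap between the weak supervisor and the strong model's ceiling gets closed. The sketch below only computes that metric from accuracy numbers; the example values are hypothetical.

```python
def performance_gap_recovered(weak_acc: float,
                              weak_to_strong_acc: float,
                              strong_ceiling_acc: float) -> float:
    """Fraction of the weak-to-strong performance gap recovered when a strong model
    is trained on labels produced by a weak supervisor.

    1.0 means the strong student matched its ceiling despite noisy weak labels;
    0.0 means it merely imitated the weak supervisor.
    """
    gap = strong_ceiling_acc - weak_acc
    if gap <= 0:
        raise ValueError("Strong ceiling must exceed weak supervisor performance.")
    return (weak_to_strong_acc - weak_acc) / gap

# Hypothetical numbers: weak supervisor is 60% accurate, a strong model trained on
# its labels reaches 75%, and the same model trained on ground truth reaches 90%.
print(performance_gap_recovered(0.60, 0.75, 0.90))  # -> 0.5
```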
### Key Technical Problems
Jan has identified several crucial challenges:
- **<EntityLink id="E253">Reward hacking</EntityLink>**: Systems optimizing proxies rather than true objectives
- **<EntityLink id="E105">Distributional shift</EntityLink>**: Maintaining alignment in novel situations
- **<EntityLink id="E93">Deceptive alignment</EntityLink>**: Preventing systems from appearing aligned while pursuing other goals
- **Superalignment**: Aligning systems smarter than humans
## Public Communication
Jan is known for:
- Clear, technical communication about alignment challenges
- Willingness to raise concerns publicly
- Engagement on Twitter/X about safety issues
- Focus on concrete, actionable research directions
His departure from OpenAI sparked significant public discussion about AI safety prioritization at major labs.
## Strategic Views
### On AI Development
- **Safety must keep pace**: Capability advances should be matched by safety advances
- **Need serious compute**: Alignment research requires significant computational resources
- **Coordination is important**: Labs should share safety insights
- **Race dynamics are dangerous**: Competition that sacrifices safety is unacceptable
### On Research Approach
- **Empirical and theoretical**: Need both practical testing and conceptual work
- **Learn from current systems**: Can make progress by studying existing models
- **Prepare for qualitative jumps**: Current techniques may not suffice for superintelligence
- **Automate alignment work**: Use AI to scale up alignment research itself
## Influence and Impact
### Research Impact
- RLHF work influenced every major language model deployment
- Scalable oversight framework guides significant research programs
- Superalignment vision shaped discourse on superintelligence alignment
### Field Building
- Mentored numerous alignment researchers
- Built and led multiple alignment teams
- Raised profile of alignment research within major labs
### Institutional Influence
- Secured major compute allocation for alignment at OpenAI
- Helped shape Anthropic's research priorities
- Demonstrated importance of independent safety research
## Key Publications
- **"Deep Reinforcement Learning from Human Preferences"** (2017) - Early RLHF paper
- **"Scalable agent alignment via reward modeling"** (2018) - Reward learning framework
- **"Recursively Summarizing Books with Human Feedback"** (2021) - Demonstrating RLHF scaling
- **Various blog posts** on alignment challenges and approaches
## Current Challenges
At Anthropic, Jan faces several key challenges:
1. **Time pressure**: Transformative AI may arrive soon, requiring rapid progress
2. **Scaling RLHF**: Current techniques may not work for superintelligent systems
3. **Evaluation**: How to know if alignment techniques actually work
4. **Automation**: Using AI to help solve alignment before it becomes too capable
5. **Coordination**: Ensuring insights are shared across safety community