
Paul Christiano

Person


Comprehensive biography of Paul Christiano documenting his technical contributions (IDA, debate, scalable oversight), his risk assessment (~10-20% P(doom), AGI in the 2030s-2040s), and his evolution from higher optimism to moderate concern. Also documents the implementation of his ideas at major labs (RLHF at OpenAI, Constitutional AI at Anthropic), with citations to the underlying papers and notes on organizational impact.

Role: Head of AI Safety, US AI Safety Institute
Known For: Iterated amplification, AI safety via debate, scalable oversight
Related Organizations: Alignment Research Center
Related People: Eliezer Yudkowsky

Overview

Paul Christiano is one of the most influential researchers in AI alignment, known for developing concrete, empirically testable approaches to the alignment problem. He holds a PhD in theoretical computer science from UC Berkeley, worked on alignment at OpenAI from 2017 to 2021, founded the Alignment Research Center (ARC) in 2021, and, as of August 2024, serves as Head of AI Safety at the US AI Safety Institute.

Christiano pioneered the "prosaic alignment" approach: aligning AI systems built from current machine-learning techniques, without requiring exotic theoretical breakthroughs. His current risk assessment places roughly 10-20% probability on existential catastrophe from AI this century, with AGI arrival expected in the 2030s-2040s. His work has directly influenced alignment research programs at major labs, including OpenAI, Anthropic, and DeepMind.

Risk Assessment

Risk Factor | Christiano's Assessment | Evidence/Reasoning | Comparison to Field
P(doom) | ≈10-20% | Alignment tractable but challenging | Moderate (vs 50%+ doomers, <5% optimists)
AGI Timeline | 2030s-2040s | Gradual capability increase | Mainstream range
Alignment Difficulty | Hard but tractable | Iterative progress possible | More optimistic than MIRI
Coordination Feasibility | Moderately optimistic | Labs have incentives to cooperate | More optimistic than average

Key Technical Contributions

Iterated Amplification and Distillation (IDA)

Published in "Supervising strong learners by amplifying weak experts" (2018):

Component | Description | Status
Human + AI Collaboration | Human overseer works with AI assistant on complex tasks | Tested at scale by OpenAI
Distillation | Extract human+AI behavior into standalone AI system | Standard ML technique
Iteration | Repeat process with increasingly capable systems | Theoretical framework
Bootstrapping | Build aligned AGI from aligned weak systems | Core theoretical hope

Key insight: If we can align a weak system and use it to help align slightly stronger systems, we can bootstrap to aligned AGI without solving the full problem directly.
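
A minimal toy sketch of this amplify-and-distill loop, in Python. The task (summing numbers), the simulated "human" decomposition, and the lookup-table "distillation" are illustrative stand-ins, not Christiano's actual training setup:

```python
from typing import Dict, List, Tuple

Task = Tuple[int, ...]  # a task is "sum these numbers"

class Assistant:
    """A 'distilled' model: it simply imitates answers it was trained on."""
    def __init__(self, training_data: Dict[Task, int] = None):
        self.table = dict(training_data or {})

    def solve(self, task: Task) -> int:
        return self.table.get(task, 0)  # weak default on unseen tasks

def amplify(assistant: Assistant, task: Task) -> int:
    """Human + assistant: the 'human' only adds two numbers at a time,
    but delegates the easier remaining subtask to the assistant."""
    if len(task) <= 2:
        return sum(task)                         # human handles the base case
    return task[0] + assistant.solve(task[1:])   # delegate the subproblem

def iterated_amplification(tasks: List[Task], rounds: int) -> Assistant:
    assistant = Assistant()  # start from a weak (aligned, but not very capable) system
    for _ in range(rounds):
        # distillation: train a standalone model on the amplified behavior
        assistant = Assistant({t: amplify(assistant, t) for t in tasks})
    return assistant

tasks = [tuple(range(k, 7)) for k in range(6, 0, -1)]   # (6,), (5, 6), ..., (1, ..., 6)
strong = iterated_amplification(tasks, rounds=5)
print(strong.solve((1, 2, 3, 4, 5, 6)))                 # 21: bootstrapped beyond the base case
```

Each round, the distilled model inherits the amplified human+assistant behavior from the previous round, which is the bootstrapping idea in miniature.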

AI Safety via Debate

Co-developed with Geoffrey Irving at OpenAI in "AI safety via debate" (2018); a schematic sketch of the protocol follows the table:

Mechanism | Implementation | Results
Adversarial Training | Two AIs argue for different positions | Deployed at Anthropic
Human Judgment | Human evaluates which argument is more convincing | Scales human oversight capability
Truth Discovery | Debate incentivizes finding flaws in opponent arguments | Mixed empirical results
Scalability | Works even when AIs are smarter than humans | Theoretical hope
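
The sketch promised above: a minimal, hypothetical rendering of the debate game. The debaters, the toy fact table, and the spot-checking judge are illustrative stand-ins, not the paper's actual experimental setup:

```python
from typing import Callable, List, Tuple

Statement = str
Transcript = List[Tuple[str, Statement]]
Debater = Callable[[str, Transcript], Statement]
Judge = Callable[[Transcript], str]

def run_debate(question: str, debater_a: Debater, debater_b: Debater,
               judge: Judge, turns: int = 4) -> str:
    transcript: Transcript = [("question", question)]
    for turn in range(turns):
        name, agent = ("A", debater_a) if turn % 2 == 0 else ("B", debater_b)
        # Each debater sees the whole transcript, so the natural strategy is
        # to point out flaws in the opponent's earlier statements.
        transcript.append((name, agent(question, transcript)))
    return judge(transcript)  # "A" or "B": who gave more truthful/useful information

# Toy usage: the honest debater cites a checkable fact, the judge spot-checks it.
facts = {"2 + 2": "4"}
honest: Debater = lambda q, t: f"{q} = {facts[q]}"
sophist: Debater = lambda q, t: f"{q} = 5, trust me"
spot_check: Judge = lambda t: "A" if any(f"{q} = {a}" in s for q, a in facts.items()
                                         for _, s in t) else "B"
print(run_debate("2 + 2", honest, sophist, spot_check))  # -> "A"
```

The point of the protocol is that the judge only needs to verify small, pointed claims surfaced by the adversarial exchange, rather than evaluating the whole question directly.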

Scalable Oversight Framework

Christiano's broader research program on supervising superhuman AI; a toy illustration of process-based vs outcome-based feedback follows the table:

Problem | Proposed Solution | Current Status
Task too complex for direct evaluation | Process-based feedback vs outcome evaluation | Implemented at OpenAI
AI reasoning opaque to humans | Eliciting Latent Knowledge (ELK) | Active research area
Deceptive alignment | Recursive reward modeling | Early stage research
Capability-alignment gap | Assistance games framework | Theoretical foundation
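
A toy contrast between the two feedback styles in the first row of the table. The step-grading function is a hypothetical stand-in for a human (or reward-model) judgment, not any lab's actual implementation:

```python
from typing import Callable, List

def outcome_reward(final_answer_correct: bool) -> float:
    # Outcome evaluation: only the end result is graded, so flawed reasoning
    # that happens to reach the right answer is rewarded just as much.
    return 1.0 if final_answer_correct else 0.0

def process_reward(steps: List[str], step_is_valid: Callable[[str], bool]) -> float:
    # Process-based feedback: every intermediate step is graded, which gives
    # the overseer leverage even when the final answer is hard to check directly.
    if not steps:
        return 0.0
    return sum(step_is_valid(s) for s in steps) / len(steps)

# Toy usage: grade a two-step derivation with a stand-in validity checker.
steps = ["2 + 2 = 4", "4 * 3 = 12"]
print(process_reward(steps, lambda s: eval(s.replace("=", "=="))))  # 1.0
```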

Intellectual Evolution and Current Views

Early Period (2016-2019)

  • Higher optimism: Alignment seemed more tractable
  • IDA focus: Believed iterative amplification could solve core problems
  • Less doom: Lower estimates of catastrophic risk

Current Period (2020-Present)

Shift | From | To | Evidence
Risk assessment | ≈5% P(doom) | ≈10-20% P(doom) | "What failure looks like"
Research focus | IDA/Debate | Eliciting Latent Knowledge | ARC's ELK report
Governance views | Lab-focused | Broader coordination | Recent policy writings
Timelines | Longer | Shorter (2030s-2040s) | Following capability advances

Strategic Disagreements in the Field

Can we learn alignment iteratively?

Paul Christiano: Yes. The alignment tax should be acceptable, and we can catch problems in weaker systems.
Basis: prosaic alignment through iterative improvement.
Confidence: medium-high

Eliezer Yudkowsky: No. Sharp capability jumps mean we won't get useful feedback.
Basis: deceptive alignment, treacherous turns, and the view that alignment is anti-natural.
Confidence: high

Jan Leike: Yes, but we need to move fast because capabilities are advancing rapidly.
Basis: similar to Christiano, but with more urgency given the current pace.
Confidence: medium

Core Crux Positions

Issue | Christiano's View | Alternative Views | Implication
Alignment difficulty | Prosaic solutions sufficient | Need fundamental breakthroughs (MIRI) | Different research priorities
Takeoff speeds | Gradual, time to iterate | Fast, little warning | Different preparation strategies
Coordination feasibility | Moderately optimistic | Pessimistic (racing dynamics) | Different governance approaches
Current system alignment | Meaningful progress possible | Current systems too limited | Different research timing

Research Influence and Impact

Direct Implementation

Technique | Organization | Implementation | Results
RLHF | OpenAI | InstructGPT, ChatGPT | Massive improvement in helpfulness
Constitutional AI | Anthropic | Claude training | Reduced harmful outputs
Debate methods | DeepMind | Sparrow | Mixed results on truthfulness
Process supervision | OpenAI | Math reasoning | Better than outcome supervision
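
The mechanism behind the RLHF row in the table above is described under Sources below: a reward model is fit to human preference comparisons before the policy is optimized against it. A minimal sketch of that preference-comparison objective, with made-up reward values:

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    # Bradley-Terry style objective used to train the reward model:
    # -log(sigmoid(r_chosen - r_rejected)), minimized when the model
    # ranks the human-preferred response higher than the rejected one.
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(round(preference_loss(2.0, 0.5), 3))  # 0.201: ranking agrees with the human
print(round(preference_loss(0.5, 2.0), 3))  # 1.701: ranking disagrees, large loss
```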

Intellectual Leadership

  • AI Alignment Forum: Primary venue for technical alignment discourse
  • Mentorship: Trained researchers now at major labs (Jan Leike, Geoffrey Irving, others)
  • Problem formulation: ELK problem now central focus across field

Current Research Agenda (2024)

At ARC (before moving to the US AI Safety Institute in August 2024), Christiano's priorities included:

Research Area | Specific Focus | Timeline
Power-seeking evaluation | Understanding how AI systems could gain influence gradually | Ongoing
Scalable oversight | Better techniques for supervising superhuman systems | Core program
Alignment evaluation | Metrics for measuring alignment progress | Near-term
Governance research | Coordination mechanisms between labs | Policy-relevant

Key Uncertainties and Cruxes

Christiano identifies several critical uncertainties:

Uncertainty | Why It Matters | Current Evidence
Deceptive alignment prevalence | Determines safety of iterative approach | Mixed signals from current systems
Capability jump sizes | Affects whether we get warning | Continuous but accelerating progress
Coordination feasibility | Determines governance strategies | Some positive signs
Alignment tax magnitude | Economic feasibility of safety | Early evidence suggests low tax

Timeline and Trajectory Assessment

Near-term (2024-2027)

  • Continued capability advances in language models
  • Better alignment evaluation methods
  • Industry coordination on safety standards

Medium-term (2027-2032)

  • Early agentic AI systems
  • Critical tests of scalable oversight
  • Potential governance frameworks

Long-term (2032-2040)

  • Approach to transformative AI
  • Make-or-break period for alignment
  • International coordination becomes crucial

Comparison with Other Researchers

Researcher | P(doom) | Timeline | Alignment Approach | Coordination View
Paul Christiano | ≈15% | 2030s | Prosaic, iterative | Moderately optimistic
Eliezer Yudkowsky | ≈90% | 2020s | Fundamental theory | Pessimistic
Dario Amodei | ≈10-25% | 2030s | Constitutional AI | Industry-focused
Stuart Russell | ≈20% | 2030s | Provable safety | Governance-focused

Sources & Resources

Key Publications

Publication | Year | Venue | Impact
Supervising strong learners by amplifying weak experts | 2018 | arXiv | Foundation for IDA
AI safety via debate | 2018 | arXiv | Debate framework
What failure looks like | 2019 | Alignment Forum | Risk assessment update
Eliciting Latent Knowledge | 2021 | ARC | Current research focus

Category | Links
Research Organization | Alignment Research Center
Blog/Writing | AI Alignment Forum, Personal blog
Academic | Google Scholar
Social | Twitter

Area | Connection to Christiano's Work
Scalable oversight | Core research focus
Reward modeling | Foundation for many proposals
AI governance | Increasing focus area
Alignment evaluation | Critical for iterative approach

References

The Alignment Research Center (ARC) is a non-profit research organization focused on technical AI alignment and safety research. ARC works on understanding and addressing risks from advanced AI systems, including interpretability, evaluations, and identifying dangerous AI capabilities before deployment.

Anthropic introduces Constitutional AI (CAI), a method for training AI systems to be harmless using a set of principles (a 'constitution') and AI-generated feedback rather than relying solely on human labelers. The approach uses a two-stage process: supervised learning from AI-critiqued revisions, followed by reinforcement learning from AI feedback (RLAIF). This reduces dependence on human feedback for identifying harmful outputs while maintaining helpfulness.

★★★★☆
3. AI Alignment Forum. Alignment Forum · Blog post

The AI Alignment Forum is a central community platform for technical AI safety and alignment research discussion. The featured post argues against 'reductive utility' (utility functions over possible worlds) and proposes the Jeffrey-Bolker framework as an alternative that avoids ontological crises and computability constraints by grounding preferences in agent-relative events rather than universal physics.

★★★☆☆

Google Scholar profile page for Geoffrey Irving, an AI safety researcher known for foundational work on AI safety via debate and iterated amplification. The page is currently returning a 404 error and is inaccessible. Irving has contributed significantly to scalable oversight research, particularly at OpenAI and DeepMind.

★★★★☆

This URL points to a Google Scholar profile page that returned a 404 error and could not be retrieved. The profile appears to be associated with a researcher working on iterated amplification, scalable oversight, and AI safety via debate based on the existing tags.

★★★★☆
6. AI Safety via Debate. arXiv · Geoffrey Irving, Paul Christiano & Dario Amodei · 2018 · Paper

This paper proposes 'debate' as a scalable oversight mechanism for training AI systems on complex tasks that are difficult for humans to directly evaluate. Two agents compete in a zero-sum debate game, taking turns making statements about a question or proposed action, after which a human judge determines which agent provided more truthful and useful information. The authors draw an analogy to complexity theory, arguing that debate with optimal play can answer questions in PSPACE with polynomial-time judges (compared to NP for direct human judgment). They demonstrate initial results on MNIST classification where debate significantly improves classifier accuracy, and discuss theoretical implications and potential scaling challenges.

★★★☆☆

This OpenAI research page on scalable oversight appears to be no longer available (404 error), but was intended to cover methods for maintaining human oversight of AI systems as they become more capable than humans at evaluating their own outputs. The research area addresses how to supervise AI on tasks where direct human evaluation is difficult or impossible.

★★★★☆
8. What Failure Looks Like. Alignment Forum · paulfchristiano · 2019 · Blog post

Paul Christiano argues AI catastrophe is more likely to manifest as either a slow erosion of human values as ML systems optimize for measurable proxies, or as emergent influence-seeking behaviors in AI systems that prioritize self-preservation and power acquisition. Both failure modes stem from unsolved intent alignment and are distinct from the stereotypical sudden superintelligence takeover scenario.

★★★☆☆

Twitter/X profile of Paul Christiano, a leading AI safety researcher and founder of the Alignment Research Center (ARC). His posts cover technical alignment research including iterated amplification, scalable oversight, AI safety via debate, and broader AI risk concerns.

This paper introduces InstructGPT, which uses reinforcement learning from human feedback (RLHF) to fine-tune GPT-3 to better follow user intent. The approach involves supervised fine-tuning on human demonstrations, training a reward model from human preference comparisons, and optimizing the policy via PPO. InstructGPT models were found to be preferred over larger GPT-3 models by human evaluators despite having far fewer parameters.

★★★★☆

This is the AI Alignment Forum profile page for Paul Christiano, a highly influential AI safety researcher known for foundational work on scalable oversight, iterated amplification, debate as an alignment technique, and eliciting latent knowledge. His posts represent some of the most technically rigorous and widely cited contributions to the alignment research agenda.

★★★☆☆

Personal research blog by Paul Christiano, a leading AI safety researcher, covering foundational concepts in scalable oversight, iterated amplification, AI safety via debate, and related technical alignment approaches. The blog has been highly influential in shaping modern alignment research directions at organizations like ARC and Anthropic.

ARC's foundational report on the Eliciting Latent Knowledge problem, which asks how to get an AI to honestly report its beliefs about the world even when it could fool human overseers. It systematically explores proposed solutions and their failure modes, framing ELK as a core alignment challenge that must be solved for scalable oversight to work.

14. Supervising Strong Learners by Amplifying Weak Experts (Iterated Distillation and Amplification). arXiv · Paul Christiano, Buck Shlegeris & Dario Amodei · 2018 · Paper

This paper introduces Iterated Amplification (IDA), a training strategy that builds up training signals for complex tasks by recursively decomposing hard problems into easier subproblems humans can evaluate and combining their solutions. The approach avoids the need for external reward functions or direct human evaluation of complex tasks. Empirical results in algorithmic environments demonstrate that IDA can efficiently learn complex behaviors.

★★★☆☆

Structured Data

14 facts · 3 records
Employed By: US AI Safety Institute (as of Aug 2024)
Role / Title: Head of AI Safety, US AI Safety Institute (as of Aug 2024)
Birth Year: 1992

All Facts (14)
People
Property | Value | As Of
Role / Title | Head of AI Safety, US AI Safety Institute | Aug 2024
  Earlier value: Founder, Alignment Research Center (Oct 2021)
Employed By | US AI Safety Institute | Aug 2024
  Earlier values: Alignment Research Center (Oct 2021); OpenAI (Jan 2017)
Biographical
Property | Value
Education | PhD in Computer Science, UC Berkeley; BS in Mathematics, MIT
Notable For | Pioneer of RLHF and AI alignment research; founder of the Alignment Research Center (ARC); key theorist of iterated amplification and eliciting latent knowledge
Social Media | @paulfchristiano
Wikipedia | https://en.wikipedia.org/wiki/Paul_Christiano
Google Scholar | https://scholar.google.com/citations?user=6gHkYDgAAAAJ
Birth Year | 1992
General
Property | Value
Website | https://paulfchristiano.com
Other
Property | Value
Board Member | Redwood Research

Career History (3)

Organization | Title | Start | End
US AI Safety Institute | Head of AI Safety | 2024-08 |
Alignment Research Center | Founder | 2021-10 |
OpenAI | Researcher | 2017-01 | 2021-10

Related Wiki Pages

Top Related Pages

Analysis

Model Organisms of Misalignment · Capability-Alignment Race Model

Organizations

NIST and AI Safety

Other

RLHF · Dario Amodei

Approaches

AI Alignment · AI Evaluation

Concepts

AI Timelines · Existential Risk from AI · Agentic AI · Large Language Models

Policy

Voluntary AI Safety Commitments

Risks

Deceptive Alignment · AI Development Racing Dynamics

Key Debates

AI Accident Risk Cruxes · Why Alignment Might Be Hard · AI Alignment Research Agendas · Why Alignment Might Be Easy

Historical

Deep Learning Revolution Era · The MIRI Era