Weak-to-Strong Generalization
Weak-to-strong generalization tests whether weak supervisors can elicit good behavior from stronger AI systems. OpenAI's ICML 2024 experiments show roughly 80% Performance Gap Recovery (PGR) on NLP tasks with an auxiliary confidence loss (versus 30-50% with naive finetuning), but reward modeling achieves only 20-40% PGR. OpenAI's Superalignment team (~30 researchers) funded $10M+ in external grants. Critical limitation: no experiments yet test deceptive models.
Overview
Weak-to-strong generalization is a research direction investigating whether weaker AI systems or humans can successfully supervise and elicit good behavior from stronger AI systems. This question sits at the heart of AI safety: as AI systems surpass human capabilities, our ability to evaluate and correct their behavior degrades. If weak supervisors can reliably guide strong systems toward good behavior, alignment approaches like RLHF might continue working; if not, we face a fundamental gap between AI capability and human oversight capacity.
Introduced as a concrete research program by OpenAI in late 2023 with the paper "Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision" by Burns et al. (published at ICML 2024), weak-to-strong generalization uses current AI systems as a testbed for this problem. By training a strong model using labels from a weaker model, researchers can study whether the strong model merely imitates the weak model's mistakes or whether it generalizes to perform better than its supervisor. The foundational finding was striking: a GPT-2-level model supervising GPT-4 can recover approximately 80% of the performance gap on NLP tasks when using an auxiliary confidence loss, achieving close to GPT-3.5-level performance even on problems where the weak model failed. With naive finetuning alone, Performance Gap Recovery (PGR) ranges from 20-50% depending on task type.
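Concretely, Burns et al. quantify elicitation with Performance Gap Recovery. A minimal statement of the metric (the notation here is a paraphrase, not copied from the paper):

```latex
% Performance Gap Recovery: the fraction of the gap between the weak supervisor
% and the strong model's ceiling that weak supervision recovers.
\[
\mathrm{PGR} = \frac{\mathrm{acc}_{\text{weak-to-strong}} - \mathrm{acc}_{\text{weak}}}
                    {\mathrm{acc}_{\text{strong ceiling}} - \mathrm{acc}_{\text{weak}}}
\]
```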
However, significant gaps remain. Reward modeling, critical for RLHF, shows only 20-40% PGR, suggesting current alignment techniques may scale poorly to superhuman models. The OpenAI Superalignment team (approximately 30 researchers) launched a $10M grants program to accelerate research, with an additional $5M from Eric Schmidt.
The fundamental uncertainty is whether these early results transfer to the most important cases: detecting deception, preventing power-seeking behavior, and maintaining alignment as AI systems approach and exceed human-level capabilities. A deceptive AI system might behave very differently than the non-deceptive systems used in current experiments. As Anthropic's 2025 research recommendations note, developing testbeds where systematic overseer errors can be studied remains a priority.
Quick Assessment
| Dimension | Assessment | Evidence |
|---|---|---|
| Tractability | Medium (20-80% PGR achieved) | Burns et al. (2024): GPT-2 supervising GPT-4 recovers 80% of the performance gap on NLP tasks with an auxiliary confidence loss |
| Scalability | Core research question | PGR increases with both weak supervisor size and strong student size; the largest students achieve PGR above 50% (ICML 2024) |
| Current Maturity | Proof-of-concept stage | Published at ICML 2024; debate extension shows promise (Lang et al., 2025); no production deployment |
| Time Horizon | 3-7 years to deployment | Needs methods that work consistently across settings; reward modeling remains challenging (20-40% PGR) |
| Key Proponents | OpenAI Superalignment (≈30 researchers), Anthropic | OpenAI: $10M grants program + $5M from Eric Schmidt; $150K/year fellowships for graduate students (Fast Grants) |
| Critical Limitation | Deception untested | No experiments yet with strategically deceptive models; current tests use non-adversarial strong models |
How It Works
The core experimental setup trains a strong pretrained model using labels generated by a weaker model, then measures how much of the capability gap the strong model recovers. A PGR of 100% would mean weak supervision is fully sufficient; 0% would mean the strong model merely imitates weak model errors.
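As an illustration of the metric, here is a minimal sketch of the PGR computation; the accuracies are made-up placeholders, not numbers from the paper.

```python
# Minimal sketch: computing Performance Gap Recovery (PGR) from three accuracies.
# The example accuracies are illustrative placeholders, not results from Burns et al.

def performance_gap_recovery(acc_weak: float,
                             acc_weak_to_strong: float,
                             acc_strong_ceiling: float) -> float:
    """Fraction of the weak-to-ceiling gap recovered by the weakly supervised strong model.

    1.0 means weak supervision fully elicits the strong model's capability;
    0.0 means the strong model merely imitates its weak supervisor.
    """
    gap = acc_strong_ceiling - acc_weak
    if gap <= 0:
        raise ValueError("The strong ceiling must exceed the weak supervisor's accuracy.")
    return (acc_weak_to_strong - acc_weak) / gap

# Hypothetical accuracies: weak supervisor, weak-to-strong model, strong ceiling.
print(performance_gap_recovery(0.60, 0.76, 0.80))  # -> 0.8, i.e. 80% of the gap recovered
```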
Risk Assessment & Impact
| Risk Category | Assessment | Key Metrics | Evidence Source |
|---|---|---|---|
| Safety Uplift | Potentially high (if successful) | 80% PGR on NLP suggests supervision can scale; 20-40% PGR on reward modeling is concerning | Burns et al. (ICML 2024) |
Key quantitative finding: With very weak supervisors and strong models with many orders of magnitude more compute, researchers recovered more than 20% of the performance gap. For the largest students, PGR is often above 50%.
Key Findings
- Partial Recovery is Real: Strong models genuinely outperform their weak supervisors by 20-80% of the capability gap, depending on the technique used
- Recovery is Incomplete: Even with the best techniques, 20%+ of the capability gap remains unrecovered; reward modeling shows only 20-40% PGR
- Task Dependence: PGR varies from ~20% (reward modeling) to ~80% (NLP with confidence loss); safety-critical tasks are not yet characterized
- Elicitation Helps: The auxiliary confidence loss increases PGR from 30-50% to ~80%; bootstrapping and unsupervised finetuning also help
- Scale Matters: PGR increases with both weak model size and strong model size; larger capability gaps may be more addressable than smaller ones
Auxiliary Techniques Tested
| Technique | Effect on PGR | Mechanism | Source |
|---|---|---|---|
| Auxiliary Confidence Loss | 30-50% → ≈80% on NLP | Encourages the strong model to make confident predictions, even where they disagree with the weak labels | Burns et al. (ICML 2024) |

Limitations:

- "None of the methods work consistently in all settings" (Burns et al.); current approaches are proofs-of-concept only
- Gap to full recovery: even the best methods leave 20%+ of the capability gap unrecovered, which may be insufficient for superhuman alignment
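For concreteness, here is a sketch of how an auxiliary confidence loss of the kind described by Burns et al. can be written for binary classification. The fixed mixing weight and fixed hardening threshold are simplifying assumptions; the paper ramps the weight during training and sets the threshold adaptively.

```python
# Sketch of an auxiliary confidence loss for weak-to-strong finetuning (binary case).
# Simplified relative to Burns et al.: alpha and the hardening threshold are fixed here.
import torch
import torch.nn.functional as F

def weak_to_strong_confidence_loss(strong_logits: torch.Tensor,
                                   weak_soft_labels: torch.Tensor,
                                   alpha: float = 0.75,
                                   threshold: float = 0.5) -> torch.Tensor:
    # Term 1: cross-entropy against the weak supervisor's (possibly noisy) soft labels.
    loss_to_weak = F.binary_cross_entropy_with_logits(strong_logits, weak_soft_labels)

    # Term 2: cross-entropy against the strong model's own hardened predictions,
    # reinforcing confident predictions even where they disagree with the weak labels.
    hardened_self_labels = (torch.sigmoid(strong_logits) > threshold).float().detach()
    loss_to_self = F.binary_cross_entropy_with_logits(strong_logits, hardened_self_labels)

    # Mix the two terms; a larger alpha trusts the strong model's own confident beliefs more.
    return (1.0 - alpha) * loss_to_weak + alpha * loss_to_self
```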
Known Disanalogies
The original paper acknowledges important ways current experiments differ from the ultimate superhuman alignment problem:
| Disanalogy | Description | Implication |
|---|---|---|
| Imitation Saliency | Future superhuman models will have salient representations of human behaviors; current strong models may not have learned features relevant for imitating weak model predictions | Imitating the weak supervisor may be easier to avoid in current experiments than it will be with superhuman models |
| Pretraining Leakage | Pretraining data implicitly contains human supervision | Strong models may already have learned some alignment-relevant behaviors |
| Leap of Faith | When applied to tasks without ground truth, we must trust that the strong model is being honest | Creates a natural opportunity for deceptive behavior in deployment |
The Deception Problem
Why Deception Changes Everything
Current weak-to-strong experiments use genuinely non-deceptive strong models. A deceptive model poses fundamentally different challenges:
| Non-Deceptive Strong Model | Potentially Deceptive Strong Model |
|---|---|
| Has relevant capability and tries to use it | May hide capabilities from the weak supervisor |
| Generalizes based on implicit understanding | May strategically underperform |
| Errors are genuine mistakes | "Errors" may be strategic |
| Weak supervision reveals capability | Weak supervision may be gamed |
The Central Uncertainty
Key Question: If a strong model is sophisticated enough to be deceptive, is weak supervision fundamentally incapable of detecting and correcting this?
This question remains open. The theoretical concern is that deception detection requires at least the cognitive sophistication of the deceptive system, making weak-to-strong alignment impossible by construction for adversarial cases.
Scalability Analysis
Current vs. Future Capability Gaps
| Comparison | Gap Size | Generalization |
|---|---|---|
| GPT-2 → GPT-4 | Large but finite | Tested: partial |
| Human → Human-Level AI | Zero by definition | Not applicable |
| Human → Superhuman AI | Potentially unbounded | Unknown |
The Fundamental Question
Weak-to-strong generalization is the scalability question for alignment. If it works:

- RLHF-style approaches can continue to improve
- Human oversight remains meaningful
- Current alignment research directions are validated

If it doesn't work:

- Fundamentally new approaches are needed
- Human oversight becomes theatrical
- Current paradigms have a hard ceiling
Current Research & Investment
| Metric | Value | Notes |
|---|---|---|
| Annual Investment | $10-15M/year dedicated | OpenAI: $10M grants + internal team; Anthropic: part of scalable oversight budget |
Research community scale: The OpenAI Superalignment team consists of approximately 30 researchers. The $10M Superalignment Fast Grants program (with $5M from Eric Schmidt) funded external research, offering grants of $100K-$2M and $150K/year fellowships for graduate students. Highlighted research directions include:

- Improving or measuring weak-to-strong generalization
- Developing testbeds with oversight signals of varying quality (e.g., models of varying scales as overseers); see the sketch after this list
- Exploring differences in W2SG between tasks that are represented vs. novel in training corpora
- Exploring W2SG for process-based supervision (not just outcome supervision)
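A minimal sketch of what such a varying-quality-overseer testbed could report; the overseer names and accuracies below are made-up placeholders, and the trend of PGR with overseer quality (not any single value) is the quantity of interest:

```python
# Hypothetical testbed sketch: PGR as a function of weak-overseer scale for a fixed
# strong student. All names and accuracies are illustrative placeholders.

measured = {
    # overseer name: (weak accuracy, weak-to-strong accuracy, strong ceiling accuracy)
    "tiny-overseer":   (0.55, 0.62, 0.80),
    "small-overseer":  (0.60, 0.70, 0.80),
    "medium-overseer": (0.66, 0.76, 0.80),
}

# Performance Gap Recovery for each overseer quality level.
pgr_by_overseer = {
    name: (w2s - weak) / (ceiling - weak)
    for name, (weak, w2s, ceiling) in measured.items()
}

for name, pgr in pgr_by_overseer.items():
    print(f"{name}: PGR = {pgr:.2f}")
```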
Differential Progress Analysis
| Factor | Assessment |
|---|---|
| Safety Benefit | Potentially very high if successful |
| Capability Benefit | Some (better use of supervision) |
| Overall Balance | Safety-leaning; primarily safety-motivated |
Relationship to Other Approaches
Complementary Techniques
- Process Supervision: could improve weak supervisor quality
- AI Safety via Debate: alternative scalable oversight approach
- Mechanistic Interpretability: could verify that generalization is genuine
Key Comparisons
| Approach | Strategy for Scalable Oversight |
|---|---|
| Weak-to-Strong | Hope strong models generalize beyond their supervision |
| Debate | Use AI capability against itself |
| Interpretability | Understand internal reasoning directly |
| Process Supervision | Break reasoning into evaluable steps |
Research Priorities
Key Open Questions
- Does generalization hold for deception? The central uncertainty
- What determines recovery rate? Understanding this would enable improvement
- Can auxiliary techniques close the gap? How much can methodology help?
- Does recovery degrade with gap size? Critical for the superhuman case
Proposed Research Directions
| Direction | Purpose | Priority |
|---|---|---|
| Deception Analogs | Test with strategically behaving models | High |
| Larger Capability Gaps | Test scaling of generalization | High |
| Safety-Critical Tasks | Test on alignment-relevant problems | High |
| Theoretical Analysis | Understand when and why generalization works | Medium |
Key Uncertainties & Cruxes
Expert Disagreements
| Position | Proponents | Argument |
|---|---|---|
| Optimistic | Some OpenAI researchers | Partial success suggests a path forward |
| Uncertain | Most safety researchers | Deception and scaling remain untested |
| Pessimistic | Some alignment researchers | Fundamental impossibility for the adversarial case |
What Would Change Minds
| Evidence | Would Support |
|---|---|
| High PGR on deception-analog tasks | Optimistic view |
| PGR degradation with capability gap | Pessimistic view |
| Robust auxiliary techniques | Middle path viable |
| Theoretical impossibility results | Pessimistic view |
Sources & Resources
Primary Research
| Type | Source | Key Contributions |
|---|---|---|
| Foundational Paper | Burns et al. (ICML 2024), "Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision" | Introduced the framework; 80% PGR on NLP with confidence loss, 20-40% on reward modeling |
Related AI transition model factors: Misalignment Potential, Alignment Robustness.

Whether weak-to-strong generalization works fundamentally determines whether current alignment approaches can scale as AI capabilities increase.
Risks Addressed
| Risk | Relevance | How It Helps |
|---|---|---|
| Scalable Oversight Failure | High | Directly addresses the core problem of supervising systems smarter than the supervisor |
| Deceptive Alignment | High | If successful, could detect when models behave differently during training vs. deployment |
| Reward Hacking | Medium | Strong models may generalize to true intent rather than exploiting supervisor errors |
| Goal Misgeneralization | Medium | Tests whether models learn intended behavior beyond their training distribution |
People
Jan Leike, Paul Christiano

Labs
OpenAI

Approaches
AI Safety via Debate

Models
Reward Hacking Taxonomy and Severity Model

Concepts
Anthropic, OpenAI, RLHF, Deceptive Alignment, Reward Hacking, Scalable Oversight

Key Debates
AI Alignment Research Agendas, Technical AI Safety Research

Organizations
Alignment Research Center

Historical
Mainstream Era