Process Supervision
Process supervision trains AI to show correct reasoning steps rather than just final answers, achieving 15-25% absolute improvements on math benchmarks while making reasoning auditable. However, it shares RLHF's fundamental limitation: humans cannot verify superhuman reasoning steps, and models might maintain separate internal reasoning from visible chains.
Related
- Organizations: OpenAI
- Risks: Reward Hacking
- Approaches: RLHF
- Safety Agendas: Scalable Oversight
Quick Assessment

| Dimension | Rating | Notes |
|---|---|---|
| Tractability | High | Well-established technique; automated methods now available |
| Scalability | Medium | Limited by human ability to verify superhuman reasoning steps |
| Current Maturity | Medium-High | Deployed in production (OpenAI o1); active research area |
| Time Horizon | Now-3 years | Already improving math/coding; broader domains in development |
| Key Proponents | OpenAI, DeepMind, Anthropic | |
Process supervision is a training technique that rewards AI models for producing correct intermediate reasoning steps, not just correct final answers. While traditional outcome-based training only provides a training signal based on whether the final answer is right or wrong, process supervision evaluates each step in a chain-of-thought reasoning sequence. This approach emerged from research at OpenAI and others investigating how to improve mathematical reasoning and code generation.
The key insight is that process supervision makes reasoning transparent and auditable. When a model is trained to show its work and each step is verified, it becomes much harder to arrive at a correct answer through flawed reasoning or to hide problematic logic within a chain of thought. This has clear safety benefits: if we can see and verify each reasoning step, we can catch errors, biases, or potentially deceptive reasoning before it leads to harmful outputs.
However, process supervision shares a fundamental limitation with RLHF: it requires humans to evaluate reasoning steps. For complex or superhuman reasoning, humans may not be able to verify whether intermediate steps are valid. Additionally, sufficiently sophisticated models might learn to produce reasoning that appears valid while actually being subtly flawed, or maintain separate internal reasoning that differs from the visible chain of thought.
How It Works
The core innovation is training a Process Reward Model (PRM) that evaluates each intermediate step rather than just the final answer. OpenAI's foundational Let's Verify Step by Step paper released PRM800K, a dataset of 800,000 step-level correctness labels for mathematical reasoning.
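To make the training signal concrete, here is a minimal PyTorch sketch of a step-level reward model head and its loss. The encoder stand-in, dimensions, and function names are illustrative assumptions, not OpenAI's implementation; real PRMs are fine-tuned language models trained on PRM800K-style step labels rather than a small classifier over fixed embeddings.

```python
import torch
import torch.nn as nn

EMBED_DIM = 768  # assumed embedding size for this toy example


class ProcessRewardModel(nn.Module):
    """Scores each intermediate reasoning step as correct or incorrect."""

    def __init__(self, embed_dim: int = EMBED_DIM):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),  # logit for P(step is correct)
        )

    def forward(self, step_embeddings: torch.Tensor) -> torch.Tensor:
        # step_embeddings: (num_steps, embed_dim), one row per reasoning step
        return self.head(step_embeddings).squeeze(-1)  # (num_steps,)


def prm_loss(logits: torch.Tensor, step_labels: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy against human step-level labels
    (the role played by PRM800K-style annotations)."""
    return nn.functional.binary_cross_entropy_with_logits(logits, step_labels)


if __name__ == "__main__":
    prm = ProcessRewardModel()
    # Toy batch: 4 reasoning steps, with the 3rd step labeled incorrect.
    step_embeddings = torch.randn(4, EMBED_DIM)
    step_labels = torch.tensor([1.0, 1.0, 0.0, 1.0])
    loss = prm_loss(prm(step_embeddings), step_labels)
    loss.backward()
    print(f"step-level BCE loss: {loss.item():.3f}")
```

The key contrast with outcome supervision is that the loss is computed per step, so the gradient signal tells the model which part of the chain went wrong, not just that the final answer was wrong.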
Risks Addressed

| Risk | Relevance | How Process Supervision Helps |
|---|---|---|
| Reward Hacking | High | Harder to game step-by-step verification than end-to-end outcomes |
| Deceptive Alignment | Medium | Makes reasoning chains visible and auditable; catches hidden flawed logic |
| Scheming | Medium | Visible reasoning makes certain deception strategies more detectable |
| Sycophancy | Low | Step verification can catch reasoning that reaches user-desired but incorrect conclusions |
Risk Assessment & Impact
Risk Category
Assessment
Key Metrics
Evidence Source
Safety Uplift
Medium
More transparent reasoning; harder to hide bad logic
Process supervision makes certain forms of deception harder:
| Deception Type | Without Process Supervision | With Process Supervision |
|---|---|---|
| Wrong answer, lucky guess | Possible | Blocked |
| Hidden flawed reasoning | Easy | Harder (reasoning visible) |
| Subtle logic errors | Undetectable | Can be caught |
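The table above can be made concrete with a toy comparison of outcome-only and step-level scoring. The numbers, the `Step` dataclass, and the minimum-over-steps aggregation are illustrative assumptions (other aggregations, such as the product of step probabilities, are also used in practice); this is not a production scoring pipeline.

```python
from dataclasses import dataclass


@dataclass
class Step:
    text: str
    verifier_score: float  # P(step is correct) from a PRM or human label


def outcome_reward(final_answer: str, reference: str) -> float:
    # Outcome supervision: only the final answer matters.
    return 1.0 if final_answer == reference else 0.0


def process_reward(steps: list[Step]) -> float:
    # Process supervision (one simple aggregation): the solution is only
    # as good as its weakest verified step.
    return min(s.verifier_score for s in steps)


# Solving 3x + 2 = 14: two compensating arithmetic errors still land on x = 4.
solution = [
    Step("Let x be the number of apples, so 3x + 2 = 14", 0.97),
    Step("Subtract 2 from both sides: 3x = 11", 0.06),  # flawed (should be 12)
    Step("Divide by 3: x = 4", 0.10),                   # flawed, but compensates
    Step("Answer: 4", 0.92),
]

print(outcome_reward("4", reference="4"))  # 1.0  -- lucky compensating errors rewarded
print(process_reward(solution))            # 0.06 -- the flawed step is caught
```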
Remaining Vulnerabilities

| Vulnerability | Description | Mitigation |
|---|---|---|
| Shown vs. Internal Reasoning | Model might have a different internal process | Interpretability research |
| Subtly Flawed Steps | Individual steps valid but combination problematic | Better PRM training |
| Evaluator Limitations | Humans/PRMs might miss sophisticated errors | More capable oversight |
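As a toy illustration of the "Subtly Flawed Steps" row, the sketch below shows how a verifier that scores each step in isolation can pass a chain whose steps contradict one another. The scores and the string-matching consistency check are invented purely for illustration; a real mitigation would involve a verifier conditioned on the whole chain.

```python
# Each step is locally plausible, so a step-local verifier gives it a high score,
# even though steps 2 and 3 assert opposite inequalities.
steps = [
    ("Assume the function f is increasing on [0, 1]", 0.90),
    ("Since f is increasing, f(0) <= f(1)", 0.95),
    ("Because f(1) < f(0), the claim follows", 0.85),  # contradicts step 2
]


def per_step_check(steps, threshold: float = 0.5) -> bool:
    # Step-local verification: passes because every step looks fine alone.
    return all(score >= threshold for _, score in steps)


def cross_step_check(steps) -> bool:
    # A whole-chain check is needed to notice the contradiction; a purely
    # step-local verifier has no way to see it. (Toy string matching only.)
    joined = " ".join(text for text, _ in steps)
    return not ("f(0) <= f(1)" in joined and "f(1) < f(0)" in joined)


print(per_step_check(steps))    # True  -- each step passes in isolation
print(cross_step_check(steps))  # False -- the chain is internally inconsistent
```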
Relationship to Other Approaches
Complementary Techniques
- RLHF: Process supervision extends RLHF to reasoning steps (a hypothetical reward-shaping sketch follows this list)
- Constitutional AI: Can apply principles to the reasoning process
- Mechanistic Interpretability: Could verify that internal reasoning matches the shown reasoning
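One way the RLHF connection could be realized is by folding step-level PRM scores into the scalar reward used during policy optimization. The blend below is a hypothetical sketch: the weighting, the mean-over-steps aggregation, and the function names are assumptions rather than a published lab recipe.

```python
def shaped_reward(step_scores: list[float],
                  outcome_score: float,
                  process_weight: float = 0.5) -> float:
    """Blend an outcome/preference reward with step-level process rewards.

    step_scores: PRM probabilities that each reasoning step is correct.
    outcome_score: reward for the final answer (e.g., from an RLHF reward model).
    process_weight: illustrative blend weight, not a published value.
    """
    process_score = sum(step_scores) / len(step_scores)  # mean step correctness
    return process_weight * process_score + (1 - process_weight) * outcome_score


# A solution with one dubious step is penalized even if the final answer
# would have earned full outcome reward.
print(shaped_reward([0.95, 0.20, 0.90], outcome_score=1.0))  # ~0.84
```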
Within the broader AI transition model, process supervision bears on Misalignment Potential and Alignment Robustness: it improves transparency but doesn't solve fundamental alignment.
Process supervision represents solid incremental progress on making AI reasoning transparent, though it doesn't solve the fundamental challenge of overseeing superhuman systems.
People
Jan Leike, Chris Olah
Labs
Google DeepMind
Approaches
Weak-to-Strong Generalization
Models
Reward Hacking Taxonomy and Severity Model, AI Lab Whistleblower Dynamics Model
Concepts
Anthropic, OpenAI, Constitutional AI, RLHF, Deceptive Alignment, Scheming
Policy
California SB 53, Model Registries