# Process Supervision
Process supervision trains AI to show correct reasoning steps rather than just final answers, achieving 15-25% absolute improvements on math benchmarks while making reasoning auditable. However, it shares RLHF's fundamental limitation: humans cannot verify superhuman reasoning steps, and models might maintain separate internal reasoning from visible chains.
## Quick Assessment

| Dimension | Assessment | Evidence |
|---|---|---|
| Tractability | High | Well-established technique; automated annotation methods now available |
| Scalability | Medium | Limited by human ability to verify superhuman reasoning steps |
| Current Maturity | Medium-High | Deployed in production (OpenAI o1); active research area |
| Time Horizon | Now-3 years | Already improving math/coding; broader domains in development |
| Key Proponents | OpenAI, DeepMind, Anthropic | |
Process supervision is a training technique that rewards AI models for producing correct intermediate reasoning steps, not just correct final answers. While traditional outcome-based training provides a signal based only on whether the final answer is right or wrong, process supervision evaluates each step in a chain-of-thought reasoning sequence. The approach emerged from research at OpenAI and elsewhere on improving mathematical reasoning and code generation.
The key insight is that process supervision makes reasoning transparent and auditable. When a model is trained to show its work and each step is verified, it becomes much harder to arrive at a correct answer through flawed reasoning or to hide problematic logic within a chain of thought. This has clear safety benefits: if we can see and verify each reasoning step, we can catch errors, biases, or potentially deceptive reasoning before it leads to harmful outputs.
However, process supervision shares a fundamental limitation with RLHF: it requires humans to evaluate reasoning steps. For complex or superhuman reasoning, humans may not be able to verify whether intermediate steps are valid. Additionally, sufficiently sophisticated models might learn to produce reasoning that appears valid while being subtly flawed, or maintain internal reasoning that differs from the visible chain of thought.
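The difference between the two training signals can be sketched in a few lines. This is a toy illustration, not OpenAI's actual training code; the `Solution` type and the step labels are assumptions for the example:

```python
from dataclasses import dataclass

@dataclass
class Solution:
    steps: list[str]      # visible chain-of-thought steps
    final_answer: str

def outcome_reward(sol: Solution, gold_answer: str) -> float:
    """Outcome supervision: one sparse signal for the whole trajectory."""
    return 1.0 if sol.final_answer == gold_answer else 0.0

def process_rewards(step_labels: list[int]) -> list[float]:
    """Process supervision: one dense signal per step, from human or
    PRM-assigned correctness labels (1 = valid, 0 = flawed)."""
    return [float(label) for label in step_labels]

# Flawed reasoning that lands on the right answer by luck:
sol = Solution(steps=["2 + 2 = 5", "5 - 1 = 4"], final_answer="4")
print(outcome_reward(sol, "4"))   # 1.0 -> the lucky guess is rewarded
print(process_rewards([0, 1]))    # [0.0, 1.0] -> the flawed step is penalized
```

The dense per-step signal is what makes "right answer, wrong reasoning" visible to the training process at all; the outcome signal cannot distinguish it from a genuinely correct solution.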
## How It Works

```mermaid
flowchart TD
    subgraph Training["Training Pipeline"]
        A[Problem] --> B[Model generates solution steps]
        B --> C{Step Annotation}
        C -->|Human| D[Manual step labels]
        C -->|Automated| E[Monte Carlo estimation]
        D --> F[Process Reward Model]
        E --> F
    end
    subgraph Deployment["Deployment"]
        F --> G[Score each reasoning step]
        G --> H{Verification}
        H -->|Valid| I[Accept solution]
        H -->|Invalid| J[Reject/Resample]
    end
    subgraph Scaling["Test-Time Scaling"]
        K[Generate N solutions] --> L[PRM scores all steps]
        L --> M[Select best solution path]
    end
    style F fill:#4a9eff
    style M fill:#22c55e
```
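The test-time scaling branch amounts to best-of-N selection: score every step of each candidate with the PRM and keep the highest-ranked solution. A minimal sketch follows; `prm_step_scores` is a hypothetical heuristic standing in for a trained model:

```python
import math

def prm_step_scores(steps: list[str]) -> list[float]:
    """Stand-in for a trained Process Reward Model: P(step is valid).
    A toy heuristic here; in practice this is a model trained on
    step-level labels such as PRM800K."""
    return [0.9 if "=" in step else 0.4 for step in steps]

def solution_score(steps: list[str]) -> float:
    """Sum of log step-probabilities: a solution is only as strong as
    its weakest step, so one flawed step drags the whole score down."""
    return sum(math.log(p) for p in prm_step_scores(steps))

def best_of_n(candidates: list[list[str]]) -> list[str]:
    """Generate N candidate solutions elsewhere; keep the PRM's favorite."""
    return max(candidates, key=solution_score)

candidates = [
    ["2 + 2 = 4", "4 / 2 = 2"],            # verifiable steps score 0.9
    ["probably 2", "so the answer is 2"],  # unverifiable steps score 0.4
]
print(best_of_n(candidates))  # -> ["2 + 2 = 4", "4 / 2 = 2"]
```

Aggregating with a product of step probabilities (sum of logs) reflects the view that a chain of reasoning is correct only if every step is.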
The core innovation is training a Process Reward Model (PRM) that evaluates each intermediate step rather than just the final answer. OpenAI's foundational Let's Verify Step by Step paper released PRM800K, a dataset of 800,000 step-level correctness labels for mathematical reasoning.
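Because human step labels like PRM800K are expensive to collect, the pipeline's automated annotation branch estimates them instead: a step's label is the fraction of rollouts continuing from that step that reach the known answer. A minimal sketch, with `toy_policy` as a hypothetical stand-in for the model being supervised:

```python
import random

def monte_carlo_step_label(prefix_steps, continue_fn, gold_answer, n_rollouts=32):
    """Estimate a step's correctness as the fraction of completions from
    this prefix that reach the gold answer (an automated alternative to
    human step labels)."""
    hits = sum(continue_fn(prefix_steps) == gold_answer for _ in range(n_rollouts))
    return hits / n_rollouts

# Hypothetical policy: a valid prefix recovers the answer 80% of the
# time, a corrupted prefix almost never does.
def toy_policy(prefix):
    p_correct = 0.8 if "2 + 2 = 4" in prefix else 0.05
    return "4" if random.random() < p_correct else "5"

random.seed(0)
good = monte_carlo_step_label(["2 + 2 = 4"], toy_policy, "4")
bad = monte_carlo_step_label(["2 + 2 = 5"], toy_policy, "4")
print(good > bad)  # the valid step earns a much higher label
```

The estimate is noisy and only as good as the rollout policy, which is one reason automated labels have not fully replaced human annotation.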
## Risks Addressed

| Risk | Relevance | How Process Supervision Helps |
|---|---|---|
| Reward Hacking | High | Harder to game step-by-step verification than end-to-end outcomes |
| Deceptive Alignment | Medium | Makes reasoning chains visible and auditable; catches hidden flawed logic |
| Scheming | Medium | Visible reasoning makes certain deception strategies more detectable |
| Sycophancy | Low | Step verification can catch reasoning that reaches user-desired but incorrect conclusions |
## Risk Assessment & Impact

| Risk Category | Assessment | Key Metrics | Evidence Source |
|---|---|---|---|
| Safety Uplift | Medium | More transparent reasoning; harder to hide bad logic | |
Process supervision makes certain forms of deception harder:

| Deception Type | Without Process Supervision | With Process Supervision |
|---|---|---|
| Correct answer via lucky guess | Possible | Blocked (each step must check out) |
| Hidden flawed reasoning | Easy | Harder (reasoning visible) |
| Subtle logic errors | Undetectable | Can be caught |
### Remaining Vulnerabilities

| Vulnerability | Description | Mitigation |
|---|---|---|
| Shown vs. internal reasoning | Model might run a different internal process than the visible chain | Interpretability research |
| Subtly flawed steps | Individual steps look valid but their combination is problematic | Better PRM training |
| Evaluator limitations | Humans/PRMs might miss sophisticated errors | More capable oversight |
## Relationship to Other Approaches

### Complementary Techniques

- **RLHF**: Process supervision extends RLHF-style feedback from final outputs to individual reasoning steps
- **Constitutional AI**: Explicit principles can be applied to the reasoning process, not just the final answer
- **Mechanistic Interpretability**: Could verify that internal reasoning matches the shown reasoning
## Key Resources

- **Let's Verify Step by Step (OpenAI)**: Compares outcome supervision (feedback on final answers) with process supervision (feedback on each reasoning step) for training reliable LLMs on complex math reasoning. Process supervision significantly outperforms outcome supervision on the MATH dataset, reaching 78% accuracy. The authors release PRM800K, a dataset of 800,000 step-level human feedback labels, to support further research.
- **PRM800K**: A dataset released by OpenAI containing 800,000 step-level human correctness labels on large language model solutions to MATH competition problems. It supports training and evaluating process reward models (PRMs), which provide feedback on individual reasoning steps rather than final answers, and underpins research into process supervision for LLM reasoning reliability and safety.
- **OpenAI o1**: Introduces the o1 model series, which uses chain-of-thought reasoning during inference to significantly improve performance on complex tasks in science, math, and coding. The model is trained via reinforcement learning to "think" before responding, producing a hidden reasoning trace. This represents a major capability advance, with safety implications around alignment and evaluation.
- **Anthropic's recommended research directions**: Outlines Anthropic's recommended technical research directions for addressing risks from advanced AI systems, spanning capabilities evaluation, model cognition and interpretability, AI control mechanisms, and multi-agent alignment. The document reflects Anthropic's institutional priorities and its view of where safety work is most needed.
## Related Pages

**Risks**: AI Distributional Shift · Scheming · Deceptive Alignment · Sycophancy

**Analysis**: Alignment Robustness Trajectory Model · Reward Hacking Taxonomy and Severity Model

**Approaches**: Weak-to-Strong Generalization · AI Safety via Debate · Capability Elicitation · Reward Modeling · Constitutional AI · Scheming & Deception Detection

**Organizations**: Anthropic

**Key Debates**: Why Alignment Might Be Hard

**Other**: Paul Christiano · Jan Leike

**Concepts**: Alignment Training Overview · Model Registries

**Policy**: California SB 53