Page StatusContent

Edited 11 days ago4.2k words1 backlinks

Updated quarterlyDue in 11 weeks

Summary

Quantitative analysis mapping 15+ AI safety interventions to specific risks reveals critical misallocation: 40% of 2024 funding ($400M+) flows to RLHF methods showing only 10-20% effectiveness against deceptive alignment, while interpretability research ($52M total, 40-50% effectiveness) and AI Control (70-80% theoretical effectiveness, $10M funding) remain severely underfunded. Provides explicit reallocation recommendations: reduce RLHF from 40% to 25%, increase interpretability from 15% to 30%, and establish AI Control at 20% of technical safety budgets.

Issues2

QualityRated 73 but structure suggests 100 (underrated by 27 points)

Links10 links could use <R> components

TODOs4

Complete 'Conceptual Framework' section

Complete 'Quantitative Analysis' section (8 placeholders)

Complete 'Strategic Importance' section

Complete 'Limitations' section (6 placeholders)

Intervention Effectiveness Matrix

Model

AI Safety Intervention Effectiveness Matrix

Model TypePrioritization Framework

ScopeAll AI Safety Interventions

Key InsightInterventions vary dramatically in cost-effectiveness across dimensions

4.2k words · 1 backlinks

Model

AI Safety Intervention Effectiveness Matrix

Model TypePrioritization Framework

ScopeAll AI Safety Interventions

Key InsightInterventions vary dramatically in cost-effectiveness across dimensions

4.2k words · 1 backlinks

Overview

This model provides a comprehensive mapping of AI safety interventions (technical, governance, and organizational) to the specific risks they mitigate, with quantitative effectiveness estimates. The analysis reveals that no single intervention covers all risks, with dangerous gaps in deceptive alignment and scheming detection.

Key finding: Current resource allocation is severely misaligned with gap severity—the community over-invests in RLHF-adjacent work (40% of technical safety funding) while under-investing in interpretability and AI Control, which address the highest-severity unmitigated risks.

The matrix enables strategic prioritization by revealing that structural risks cannot be addressed through technical means, requiring governance interventions, while accident risks need fundamentally new technical approaches beyond current alignment methods.

The International AI Safety Report 2025↗—authored by 96 AI experts from 30 countries—concluded that "there has been progress in training general-purpose AI models to function more safely, but no current method can reliably prevent even overtly unsafe outputs." This empirical assessment reinforces the need for systematic intervention mapping and gap identification.

2024-2025 Funding Landscape

Understanding current resource allocation is essential for identifying gaps. The AI safety field received approximately $110-130 million in philanthropic funding in 2024, with major AI labs investing an estimated $100+ million combined in internal safety research.

Funding Source	2024 Amount	Primary Focus Areas	% of Total
Coefficient Giving	$13.6M	Evaluations (68%), interpretability, field building	49%
Major AI Labs (internal)	$100M+ (est.)	RLHF, Constitutional AI, red-teaming	N/A (internal)
Long-Term Future Fund	$1.4M	Technical safety, AI governance	6%
UK AISI	$15M+	Model evaluations, testing frameworks	12%
Frontier Model Forum	$10M	Red-teaming, evaluation techniques	8%
OpenAI Grants	$10M	Interpretability, scalable oversight	8%
Other philanthropic	$10M+	Various	15%

Source: Coefficient Giving 2024 Report, AI Safety Funding Analysis

Allocation by Research Area (2024)

Research Area	Funding	Key Organizations	Gap Assessment
Interpretability	$12M	Anthropic, Redwood Research	Severely underfunded relative to importance
Constitutional AI/RLHF	$18M	Anthropic, OpenAI	Potentially overfunded given limitations
Red-teaming & Evaluations	$13M	METR, UK AISI, Apollo Research	Growing rapidly; regulatory drivers
AI Governance	$18M	GovAI, CSET, Brookings	Increasing due to EU AI Act, US attention
Robustness/Benchmarks	$15M	CAIS, academic groups	Standard practice; diminishing returns

The geographic concentration is striking: the San Francisco Bay Area alone received $18M (37% of total philanthropic funding), primarily flowing to UC Berkeley's CHAI, Stanford HAI, and independent research organizations. This concentration creates both collaboration benefits and single-point-of-failure risks.

Empirical Evidence on Intervention Limitations

Recent research provides sobering evidence on the limitations of current safety interventions:

Alignment Faking Discovery

Anthropic's December 2024 research documented the first empirical example of a model engaging in alignment faking without being explicitly trained to do so—selectively complying with training objectives while strategically preserving existing preferences. This finding has direct implications for intervention effectiveness: methods that rely on behavioral compliance (RLHF, Constitutional AI) may be fundamentally limited against sufficiently capable systems.

Standard Methods Insufficient Against Backdoors

Hubinger et al. (2024)↗ demonstrated that standard alignment techniques—RLHF, finetuning on helpful/harmless/honest outputs, and adversarial training—can be jointly insufficient to eliminate behavioral "backdoors" that produce undesirable behavior under specific triggers. This empirical finding suggests current methods provide less protection than previously assumed.

Empirical Uplift Studies

The 2025 Peregrine Report↗, based on 48 in-depth interviews with staff at OpenAI, Anthropic, Google DeepMind, and multiple AI Safety Institutes, emphasizes the need for "compelling, empirical evidence of AI risks through large-scale experiments via concrete demonstration" rather than abstract theory—highlighting that intervention effectiveness claims often lack rigorous empirical grounding.

The Alignment Tax: Empirical Findings

Research on the "alignment tax"—the performance cost of safety measures—reveals concerning trade-offs:

Study	Finding	Implication
Lin et al. (2024)	RLHF causes "pronounced alignment tax" with forgetting of pretrained abilities	Safety methods may degrade capabilities
Safe RLHF (ICLR 2024)	Three iterations improved helpfulness Elo by +244-364 points, harmlessness by +238-268	Iterative refinement shows promise
MaxMin-RLHF (2024)	Standard RLHF shows "preference collapse" for minority preferences; MaxMin achieves 16% average improvement, 33% for minority groups	Single-reward RLHF fundamentally limited for diverse preferences
Algorithmic Bias Study	Matching regularization achieves 29-41% improvement over standard RLHF	Methodological refinements can reduce alignment tax

The empirical evidence suggests RLHF is effective for reducing overt harmful outputs but fundamentally limited for detecting strategic deception or representing diverse human preferences.

Risk/Impact Assessment

Risk Category	Severity	Intervention Coverage	Timeline	Trend
Deceptive Alignment	Very High (9/10)	Very Poor (1-2 effective interventions)	2-4 years	Worsening - models getting more capable
Scheming/Treacherous Turn	Very High (9/10)	Very Poor (1 effective intervention)	3-6 years	Worsening - no detection progress
Structural Risks	High (7/10)	Poor (governance gaps)	Ongoing	Stable - concentration increasing
Misuse Risks	High (8/10)	Good (multiple interventions)	Immediate	Improving - active development
Goal Misgeneralization	Medium-High (6/10)	Fair (partial coverage)	1-3 years	Stable - some progress
Epistemic Collapse	Medium (5/10)	Poor (technical fixes insufficient)	2-5 years	Worsening - deepfakes proliferating

Intervention Mechanism Framework

The following diagram illustrates how different intervention types interact and which risk categories they address. Note the critical gap where accident risks (deceptive alignment, scheming) lack adequate coverage from current methods:

Loading diagram...

Technical interventions show strong coverage of misuse risks but weak coverage of accident risks. Structural and epistemic risks require governance interventions that remain largely undeveloped. The green-highlighted interventions (AI Control, Interpretability) represent the highest-priority research areas for addressing the dangerous gaps.

Strategic Prioritization Framework

Critical Gaps Analysis

Loading diagram...

Resource Allocation Recommendations

Current Allocation	Recommended Allocation	Justification
RLHF/Fine-tuning: 40%	RLHF/Fine-tuning: 25%	Reduce marginal investment - doesn't address deception
Capability Evaluations: 25%	Interpretability: 30%	Massive increase needed for deception detection
Interpretability: 15%	AI Control: 20%	New category - insurance against alignment failure
Red-teaming: 10%	Evaluations: 15%	Maintain current level
Other Technical: 10%	Red-teaming: 10%	Stable - proven value

Funding shift recommendation: Move $100M+ annually from RLHF to interpretability and AI Control research based on Anthropic's estimate↗ that interpretability needs 10x current investment.

Cost-Effectiveness Comparison

The following table compares interventions on a cost-per-unit-of-risk-reduction basis, using available evidence:

Intervention	Est. Annual Cost	Risk Reduction (Range)	Cost per 1% Risk Reduction	Confidence
RLHF at scale	$100M+	5-15% (misuse only)	$13-40M	High
Interpretability research	$12M	15-30% (deception, misalignment)	$1.7-3.5M	Medium
Capability evaluations	$10M	10-25% (dangerous capabilities)	$1.6-4M	Medium-High
AI Control research	$10M	20-40% (theoretical, untested)	$1.25-0.5M	Low
Governance research	$18M	5-20% (structural risks)	$1.9-3.6M	Medium
Red-teaming	$13M	10-20% (misuse, jailbreaks)	$1.15-2.3M	High

This analysis suggests AI Control and interpretability research offer the highest expected return on investment, though confidence levels vary significantly. The cost-effectiveness of AI Control is particularly uncertain because it remains largely theoretical.

Technical Intervention Effectiveness

Core Safety Methods Performance

Intervention	Deceptive Alignment	Goal Misgeneralization	Power-Seeking	Scheming	Evidence Quality
Interpretability	40% reduction	50% reduction	45% reduction	35% reduction	Medium (Anthropic↗)
RLHF/RLAIF	10% reduction	60% reduction	20% reduction	5% reduction	High (OpenAI InstructGPT↗)
Constitutional AI	15% reduction	55% reduction	25% reduction	10% reduction	High (Anthropic CAI↗)
Capability Evaluations	20% reduction	70% reduction	60% reduction	15% reduction	Medium-High (METR↗)
AI Control	70% reduction	30% reduction	80% reduction	75% reduction	Low (Greenblatt et al.↗)
Red-teaming	25% reduction	65% reduction	40% reduction	20% reduction	High (industry standard)

Misuse Risk Coverage

Risk Type	Technical Barriers	Governance Requirements	Combined Effectiveness
Bioweapons	60% (evals + red-teaming)	80% (export controls)	88% combined
Cyberweapons	50% (evals + red-teaming)	70% (restrictions)	85% combined
Disinformation	30% (detection tools)	60% (platform policies)	72% combined
Autonomous Weapons	40% (safety constraints)	90% (treaties)	94% combined

Sources: RAND Corporation↗, Center for Security and Emerging Technology↗

Standardized Safety Benchmarks

Emerging evaluation frameworks provide more rigorous assessment of intervention effectiveness:

Benchmark	Dimensions Measured	Key Finding	Limitations
HELM Safety	Refusals of violent, fraudulent, discriminatory content	Standardized behavioral safeguard measurement	Measures refusals, not underlying alignment
TrustLLM	Truthfulness, safety, fairness, robustness, privacy, machine ethics	Comprehensive 6-dimension framework	Limited to observable behaviors
SafeBench Competition	Robustness, monitoring, alignment, safety applications	$150K prize pool; 120+ researchers registered	Competition format may miss systematic gaps
METR Dangerous Capabilities	Autonomous replication, resource acquisition, deception	Frontier capability evaluation	Narrow focus on specific dangerous capabilities

The Center for AI Safety SafeBench competition↗ represents an effort to develop empirically rigorous benchmarks for assessing AI safety across robustness, monitoring, alignment, and safety applications. However, behavioral benchmarks fundamentally cannot detect latent deceptive alignment—a model that strategically complies during evaluation would score well on all benchmarks while remaining unsafe.

AI Control: The Emerging Paradigm

AI Control represents a fundamentally different approach to managing risks from potentially misaligned AI systems. Rather than assuming alignment can be verified, AI Control assumes models may be intentionally subverting safety measures and designs systems to remain safe regardless.

Loading diagram...

Key AI Control Techniques

According to Buck Shlegeris and Ryan Greenblatt at Redwood Research, AI Control employs several strategies:

Technique	Description	Effectiveness Estimate	Maturity
Always-on auditing	Continuous monitoring of model outputs for suspicious behavior	60-80% detection of overt subversion	Research stage
Honeypotting	Presenting apparent opportunities for subversion to detect intent	40-70% detection of intentional subversion	Theoretical
Re-sampling	Querying multiple model instances and comparing outputs	50-75% detection of inconsistent behavior	Early implementation
Factored cognition	Breaking tasks into pieces to limit model knowledge	30-60% reduction in catastrophic action potential	Research stage
Trusted monitoring	Using a trusted (weaker) model to oversee an untrusted (stronger) model	Varies by capability gap	Early research

Redwood Research recently released what they describe as "the biggest and most intricate study of AI control to date, in a command line agent setting," demonstrating these techniques are "the best available option for preventing misaligned early AGIs from causing sudden disasters."

Critical Unaddressed Gaps

Tier 1: Existential Priority

Gap	Current Coverage	What's Missing	Required Investment
Deceptive Alignment Detection	≈15% effective interventions	Scalable interpretability, behavioral signatures	$500M+ over 3 years
Scheming Prevention	≈10% effective interventions	Formal verification, AI Control deployment	$300M+ over 5 years
Treacherous Turn Monitoring	≈5% effective interventions	Real-time oversight, containment protocols	$200M+ over 4 years

Tier 2: Major Structural Issues

Gap	Technical Solution Viability	Governance Requirements
Concentration of Power	Very Low	International coordination, antitrust
Democratic Lock-in	None	Constitutional protections, power distribution
Epistemic Collapse	Low (partial technical fixes)	Media ecosystem reform, authentication infrastructure

Evidence Base Assessment

Intervention Confidence Levels

Intervention Category	Deployment Evidence	Research Quality	Confidence in Ratings
RLHF/Constitutional AI	High - Deployed at scale	High - Multiple studies	High (85% confidence)
Capability Evaluations	Medium - Limited deployment	Medium - Emerging standards	Medium (70% confidence)
Interpretability	Low - Research stage	Medium - Promising results	Medium (65% confidence)
AI Control	None - Theoretical only	Low - Early research	Low (40% confidence)
Formal Verification	None - Toy models only	Very Low - Existence proofs	Very Low (20% confidence)

Key Uncertainties in Effectiveness

Uncertainty	Impact on Ratings	Expert Disagreement Level
Interpretability scaling	±30% effectiveness	High - 60% vs 20% optimistic
Deceptive alignment prevalence	±50% priority ranking	Very High - 80% vs 10% concerned
AI Control feasibility	±40% effectiveness	High - theoretical vs practical
Governance implementation	±60% structural risk mitigation	Medium - feasibility questions

Sources: AI Impacts survey↗, FHI expert elicitation↗, MIRI research updates↗

Interpretability Scaling: State of Evidence

Mechanistic interpretability research↗ has made significant progress but faces critical scaling challenges. According to a comprehensive 2024 review:

Progress Area	Current State	Scaling Challenge	Safety Relevance
Sparse Autoencoders	Successfully scaled to Claude 3 Sonnet	Compute-intensive, requires significant automation	High - enables monosemantic feature extraction
Circuit Tracing	Applied to smaller models	Extension to frontier-scale models remains challenging	Very High - could detect deceptive circuits
Activation Patching	Well-developed technique	Fine-grained probing requires expert intuition	Medium - helps identify causal mechanisms
Behavioral Intervention	Can suppress toxicity/bias	May not generalize to novel misalignment patterns	Medium - targeted behavioral corrections

The key constraint is that mechanistic analysis is "time- and compute-intensive, requiring fine-grained probing, high-resolution instrumentation, and expert intuition." Without major automation breakthroughs, interpretability may not scale to real-world safety applications for frontier models. However, recent advances in circuit tracing now allow researchers to observe Claude's reasoning process, revealing a shared conceptual space where reasoning occurs before language generation.

Intervention Synergies and Conflicts

Positive Synergies

Intervention Pair	Synergy Strength	Mechanism	Evidence
Interpretability + Evaluations	Very High (2x effectiveness)	Interpretability explains eval results	Anthropic research↗
AI Control + Red-teaming	High (1.5x effectiveness)	Red-teaming finds control vulnerabilities	Theoretical analysis↗
RLHF + Constitutional AI	Medium (1.3x effectiveness)	Layered training approaches	Constitutional AI paper↗
Compute Governance + Export Controls	High (1.7x effectiveness)	Hardware-software restriction combo	CSET analysis↗

Negative Interactions

Intervention Pair	Conflict Type	Severity	Mitigation
RLHF + Deceptive Alignment	May train deception	High	Use interpretability monitoring
Capability Evals + Racing	Accelerates competition	Medium	Coordinate evaluation standards
Open Research + Misuse	Information hazards	Medium	Responsible disclosure protocols

Governance vs Technical Solutions

Structural Risk Coverage

Risk Category	Technical Effectiveness	Governance Effectiveness	Why Technical Fails
Power Concentration	0-5%	60-90%	Technical tools can't redistribute power
Lock-in Prevention	0-10%	70-95%	Technical fixes can't prevent political capture
Democratic Enfeeblement	5-15%	80-95%	Requires institutional design, not algorithms
Epistemic Commons	20-40%	60-85%	System-level problems need system solutions

Governance Intervention Maturity

Intervention	Development Stage	Political Feasibility	Timeline to Implementation
Compute Governance	Pilot implementations	Medium	1-3 years
Model Registries	Design phase	High	2-4 years
International AI Treaties	Early discussions	Low	5-10 years
Liability Frameworks	Legal analysis	Medium	3-7 years
Export Controls (expanded)	Active development	High	1-2 years

Sources: Georgetown CSET↗, IAPS governance research↗, Brookings AI governance tracker↗

Compute Governance: Detailed Assessment

Compute governance has emerged as a key policy lever for AI governance, with the Biden administration introducing export controls on advanced semiconductor manufacturing equipment. However, effectiveness varies significantly:

Mechanism	Target	Effectiveness	Limitations
Chip export controls	Prevent adversary access to frontier AI	Medium (60-75%)	Black market smuggling; cloud computing loopholes
Training compute thresholds	Trigger reporting requirements at 10^26 FLOP	Low-Medium	Algorithmic efficiency improvements reduce compute needs
Cloud access restrictions	Limit API access for sanctioned entities	Low (30-50%)	VPNs, intermediaries, open-source alternatives
Know-your-customer requirements	Track who uses compute	Medium (50-70%)	Privacy concerns; enforcement challenges

According to GovAI research, "compute governance may become less effective as algorithms and hardware improve. Scientific progress continually decreases the amount of computing power needed to reach any level of AI capability." The RAND analysis of the AI Diffusion Framework notes that China could benefit from a shift away from compute as the binding constraint, as companies like DeepSeek compete to push the frontier less handicapped by export controls.

International Coordination: Effectiveness Assessment

The ITU Annual AI Governance Report 2025↗ and recent developments reveal significant challenges in international AI governance coordination:

Coordination Mechanism	Status (2025)	Effectiveness Assessment	Key Limitation
UN High-Level Advisory Body on AI	Submitted recommendations Sept 2024	Low-Medium	Relies on voluntary cooperation; fragmented approach
UN Independent Scientific Panel	Established Dec 2024	Too early to assess	Limited enforcement power
EU AI Act	Entered force Aug 2024	Medium-High (regional)	Jurisdictional limits; enforcement mechanisms untested
Paris AI Action Summit	Feb 2025	Low	Called for harmonization but highlighted how far from unified framework
US-China Coordination	Minimal	Very Low	Fundamental political contradictions; export control tensions
Bletchley/Seoul Summits	Voluntary commitments	Low-Medium	Non-binding; limited to willing participants

The Oxford International Affairs analysis↗ notes that addressing the global AI governance deficit requires moving from a "weak regime complex to the strongest governance system possible under current geopolitical conditions." However, proposals for an "IAEA for AI" face fundamental challenges because "nuclear and AI are not similar policy problems—AI policy is loosely defined with disagreement over field boundaries."

Critical gap: Companies plan to scale frontier AI systems 100-1000x in effective compute over the next 3-5 years. Without coordinated international licensing and oversight, countries risk a "regulatory race to the bottom."

Implementation Roadmap

Phase 1: Immediate (0-2 years)

Redirect 20% of RLHF funding to interpretability research
Establish AI Control research programs at major labs
Implement capability evaluation standards across industry
Strengthen export controls on AI hardware

Phase 2: Medium-term (2-5 years)

Deploy interpretability tools for deception detection
Pilot AI Control systems in controlled environments
Establish international coordination mechanisms
Develop formal verification for critical systems

Phase 3: Long-term (5+ years)

Scale proven interventions to frontier models
Implement comprehensive governance frameworks
Address structural risks through institutional reform
Monitor intervention effectiveness and adapt

Current State & Trajectory

Capability Evaluation Ecosystem (2024-2025)

The capability evaluation landscape has matured significantly with multiple organizations now conducting systematic pre-deployment assessments:

Organization	Role	Key Contributions	Partnerships
METR	Independent evaluator	ARA (autonomous replication & adaptation) methodology; GPT-4, Claude evaluations	UK AISI, Anthropic, OpenAI
Apollo Research	Scheming detection	Safety cases framework; deception detection	UK AISI, Redwood, UC Berkeley
UK AISI	Government evaluator	First government-led comprehensive model evaluations	METR, Apollo, major labs
US AISI (NIST)	Standards development	AI RMF, evaluation guidelines	Industry consortium
Model developers	Internal evaluation	Pre-deployment testing per RSPs	Third-party auditors

According to METR's December 2025 analysis, twelve companies have now published frontier AI safety policies, including commitments to capability evaluations for dangerous capabilities before deployment. A 2023 survey found 98% of AI researchers "somewhat or strongly agreed" that labs should conduct pre-deployment risk assessments and dangerous capabilities evaluations.

Funding Landscape (2024)

Intervention Type	Annual Funding	Growth Rate	Major Funders
RLHF/Alignment Training	$100M+	50%/year	OpenAI↗, Anthropic↗, Google DeepMind↗
Capability Evaluations	$150M+	80%/year	UK AISI↗, METR↗, industry labs
Interpretability	$100M+	60%/year	Anthropic↗, academic institutions
AI Control	$10M+	200%/year	Redwood Research↗, academic groups
Governance Research	$10M+	40%/year	GovAI↗, CSET↗

Industry Deployment Status

Intervention	OpenAI	Anthropic	Google	Meta	Assessment
RLHF	✓ Deployed	✓ Deployed	✓ Deployed	✓ Deployed	Standard practice
Constitutional AI	Partial	✓ Deployed	Developing	Developing	Emerging standard
Red-teaming	✓ Deployed	✓ Deployed	✓ Deployed	✓ Deployed	Universal adoption
Interpretability	Research	✓ Active	Research	Limited	Mixed implementation
AI Control	None	Research	None	None	Early research only

Key Cruxes and Expert Disagreements

High-Confidence Disagreements

Question	Optimistic View	Pessimistic View	Evidence Quality
Will interpretability scale?	70% chance of success	30% chance of success	Medium - early results promising
Is deceptive alignment likely?	20% probability	80% probability	Low - limited empirical data
Can governance keep pace?	Institutions will adapt	Regulatory capture inevitable	Medium - historical precedent
Are current methods sufficient?	Incremental progress works	Need paradigm shift	Medium - deployment experience

Critical Research Questions

Key Questions

?Will mechanistic interpretability scale to GPT-4+ sized models?
?Can AI Control work against genuinely superintelligent systems?
?Are current safety approaches creating a false sense of security?
?Which governance interventions are politically feasible before catastrophe?
?How do we balance transparency with competitive/security concerns?

Methodological Limitations

Limitation	Impact on Analysis	Mitigation Strategy
Sparse empirical data	Effectiveness estimates uncertain	Expert elicitation, sensitivity analysis
Rapid capability growth	Intervention relevance changing	Regular reassessment, adaptive frameworks
Novel risk categories	Matrix may miss emerging threats	Horizon scanning, red-team exercises
Deployment context dependence	Lab results may not generalize	Real-world pilots, diverse testing

Sources & Resources

Meta-Analyses and Comprehensive Reports

Report	Authors/Organization	Key Contribution	Date
International AI Safety Report 2025↗	96 experts from 30 countries	Comprehensive assessment that no current method reliably prevents unsafe outputs	2025
AI Safety Index↗	Future of Life Institute	Quarterly tracking of AI safety progress across multiple dimensions	2025
2025 Peregrine Report↗	208 expert proposals	In-depth interviews with major AI lab staff on risk mitigation	2025
Mechanistic Interpretability Review↗	Bereska & Gavves	Comprehensive review of interpretability for AI safety	2024
ITU AI Governance Report↗	International Telecommunication Union	Global state of AI governance	2025

Primary Research Sources

Category	Source	Key Contribution	Quality
Technical Safety	Anthropic Constitutional AI↗	CAI effectiveness data	High
Technical Safety	OpenAI InstructGPT↗	RLHF deployment evidence	High
Interpretability	Anthropic Scaling Monosemanticity↗	Interpretability scaling results	High
AI Control	Greenblatt et al. AI Control↗	Control theory framework	Medium
Evaluations	METR Dangerous Capabilities↗	Evaluation methodology	Medium-High
Alignment Faking	Hubinger et al. 2024↗	Empirical evidence on backdoor persistence	High

Policy and Governance Sources

Organization	Resource	Focus Area	Reliability
CSET	AI Governance Database↗	Policy landscape mapping	High
GovAI	Governance research↗	Institutional analysis	High
RAND Corporation	AI Risk Assessment↗	Military/security applications	High
UK AISI	Testing reports↗	Government evaluation practice	Medium-High
US AISI	Guidelines and standards↗	Federal AI policy	Medium-High

Industry and Lab Resources

Organization	Resource Type	Key Insights	Access
OpenAI	Safety research↗	RLHF deployment data	Public
Anthropic	Research publications↗	Constitutional AI, interpretability	Public
DeepMind	Safety research↗	Technical safety approaches	Public
Redwood Research	AI Control research↗	Control methodology development	Public
METR	Evaluation frameworks↗	Capability assessment tools	Partial

Expert Survey Data

Survey	Sample Size	Key Findings	Confidence
AI Impacts 2022	738 experts	Timeline estimates, risk assessments	Medium
FHI Expert Survey	352 experts	Existential risk probabilities	Medium
State of AI Report	Industry data	Deployment and capability trends	High
Anthropic Expert Interviews	45 researchers	Technical intervention effectiveness	Medium-High

Additional Sources

Source	URL	Key Contribution
Coefficient Giving 2024 Progress	openphilanthropy.org	Funding landscape, priorities
AI Safety Funding Analysis	EA Forum	Comprehensive funding breakdown
80,000 Hours AI Safety	80000hours.org	Funding opportunities assessment
RLHF Alignment Tax	ACL Anthology	Empirical alignment tax research
Safe RLHF	ICLR 2024	Helpfulness/harmlessness balance
AI Control Paper	arXiv	Foundational AI Control research
Redwood Research Blog	redwoodresearch.substack.com	AI Control developments
METR Safety Policies	metr.org	Industry policy analysis
GovAI Compute Governance	governance.ai	Compute governance analysis
RAND AI Diffusion	rand.org	Export control effectiveness

Related Models and Pages

Technical Risk Models

Deceptive Alignment Decomposition - Detailed analysis of key gap
AI Safety Defense in Depth Model - How interventions layer
AI Capability Threshold Model - When interventions become insufficient

Governance and Strategy

AI Risk Portfolio Analysis - Risk portfolio construction
Capabilities to Safety Pipeline - Research translation challenges
AI Risk Critical Uncertainties Model - Key unknowns affecting prioritization

Implementation Resources

Responsible Scaling Policies - Industry implementation
Safety Research Organizations - Key players and capacity
Evaluation Frameworks - Assessment methodologies

Intervention Effectiveness Matrix

AI Safety Intervention Effectiveness Matrix

AI Safety Intervention Effectiveness Matrix

Overview

2024-2025 Funding Landscape

Allocation by Research Area (2024)

Empirical Evidence on Intervention Limitations

Alignment Faking Discovery

Standard Methods Insufficient Against Backdoors

Empirical Uplift Studies

The Alignment Tax: Empirical Findings

Risk/Impact Assessment

Intervention Mechanism Framework

Strategic Prioritization Framework

Critical Gaps Analysis

Resource Allocation Recommendations

Cost-Effectiveness Comparison

Technical Intervention Effectiveness

Core Safety Methods Performance

Misuse Risk Coverage

Standardized Safety Benchmarks

AI Control: The Emerging Paradigm

Key AI Control Techniques

Critical Unaddressed Gaps

Tier 1: Existential Priority

Tier 2: Major Structural Issues

Evidence Base Assessment

Intervention Confidence Levels

Key Uncertainties in Effectiveness

Interpretability Scaling: State of Evidence

Intervention Synergies and Conflicts

Positive Synergies

Negative Interactions

Governance vs Technical Solutions

Structural Risk Coverage

Governance Intervention Maturity

Compute Governance: Detailed Assessment

International Coordination: Effectiveness Assessment

Implementation Roadmap

Phase 1: Immediate (0-2 years)

Phase 2: Medium-term (2-5 years)

Phase 3: Long-term (5+ years)

Current State & Trajectory

Capability Evaluation Ecosystem (2024-2025)

Funding Landscape (2024)

Industry Deployment Status

Key Cruxes and Expert Disagreements

High-Confidence Disagreements

Critical Research Questions

Key Questions

Methodological Limitations

Sources & Resources

Meta-Analyses and Comprehensive Reports

Primary Research Sources

Policy and Governance Sources

Industry and Lab Resources

Expert Survey Data

Additional Sources

Related Models and Pages

Technical Risk Models

Governance and Strategy

Implementation Resources

Related Pages

Top Related Pages

AI Safety Research Allocation Model

AI Safety Intervention Portfolio

E22

E218

E252

Approaches

Safety Research

Analysis

Models

Concepts

Key Debates

Policy

Transition Model