Skip to content
Longterm Wiki
Navigation
Updated 2026-02-02HistoryData
Page StatusContent
Edited 2 months ago4.2k words4 backlinksUpdated quarterlyDue in 4 weeks
73QualityGood •89.5ImportanceHigh7ResearchMinimal
Content9/13
SummaryScheduleEntityEdit historyOverview
Tables32/ ~17Diagrams3/ ~2Int. links88/ ~34Ext. links21/ ~21Footnotes0/ ~13References36/ ~13Quotes0Accuracy0RatingsN:6.5 R:7 A:9 C:8Backlinks4
Issues3
QualityRated 73 but structure suggests 100 (underrated by 27 points)
Links12 links could use <R> components
StaleLast edited 62 days ago - may need review
TODOs4
Complete 'Conceptual Framework' section
Complete 'Quantitative Analysis' section (8 placeholders)
Complete 'Strategic Importance' section
Complete 'Limitations' section (6 placeholders)

Intervention Effectiveness Matrix

Analysis

AI Safety Intervention Effectiveness Matrix

Quantitative analysis mapping 15+ AI safety interventions to specific risks reveals critical misallocation: 40% of 2024 funding ($400M+) flows to RLHF methods showing only 10-20% effectiveness against deceptive alignment, while interpretability research ($52M total, 40-50% effectiveness) and AI Control (70-80% theoretical effectiveness, $10M funding) remain severely underfunded. Provides explicit reallocation recommendations: reduce RLHF from 40% to 25%, increase interpretability from 15% to 30%, and establish AI Control at 20% of technical safety budgets.

Model TypePrioritization Framework
ScopeAll AI Safety Interventions
Key InsightInterventions vary dramatically in cost-effectiveness across dimensions
4.2k words · 4 backlinks

Overview

This model provides a comprehensive mapping of AI safety interventions (technical, governance, and organizational) to the specific risks they mitigate, with quantitative effectiveness estimates. The analysis reveals that no single intervention covers all risks, with dangerous gaps in deceptive alignment and scheming detection.

Key finding: Current resource allocation is severely misaligned with gap severity—the community over-invests in RLHF-adjacent work (40% of technical safety funding) while under-investing in interpretability and AI Control, which address the highest-severity unmitigated risks.

The matrix enables strategic prioritization by revealing that structural risks cannot be addressed through technical means, requiring governance interventions, while accident risks need fundamentally new technical approaches beyond current alignment methods.

The International AI Safety Report 2025—authored by 96 AI experts from 30 countries—concluded that "there has been progress in training general-purpose AI models to function more safely, but no current method can reliably prevent even overtly unsafe outputs." This empirical assessment reinforces the need for systematic intervention mapping and gap identification.

2024-2025 Funding Landscape

Understanding current resource allocation is essential for identifying gaps. The AI safety field received approximately $110-130 million in philanthropic funding in 2024, with major AI labs investing an estimated $100+ million combined in internal safety research.

Funding Source2024 AmountPrimary Focus Areas% of Total
Coefficient Giving$13.6MEvaluations (68%), interpretability, field building49%
Major AI Labs (internal)$100M+ (est.)RLHF, Constitutional AI, red-teamingN/A (internal)
Long-Term Future Fund$1.4MTechnical safety, AI governance6%
UK AISI$15M+Model evaluations, testing frameworks12%
Frontier Model Forum$10MRed-teaming, evaluation techniques8%
OpenAI Grants$10MInterpretability, scalable oversight8%
Other philanthropic$10M+Various15%

Source: Coefficient Giving 2024 Report, AI Safety Funding Analysis

Allocation by Research Area (2024)

Research AreaFundingKey OrganizationsGap Assessment
Interpretability$12MAnthropic, Redwood ResearchSeverely underfunded relative to importance
Constitutional AI/RLHF$18MAnthropic, OpenAIPotentially overfunded given limitations
Red-teaming & Evaluations$13MMETR, UK AISI, Apollo ResearchGrowing rapidly; regulatory drivers
AI Governance$18MGovAI, CSET, BrookingsIncreasing due to EU AI Act, US attention
Robustness/Benchmarks$15MCAIS, academic groupsStandard practice; diminishing returns

The geographic concentration is striking: the San Francisco Bay Area alone received $18M (37% of total philanthropic funding), primarily flowing to UC Berkeley's CHAI, Stanford HAI, and independent research organizations. This concentration creates both collaboration benefits and single-point-of-failure risks.

Empirical Evidence on Intervention Limitations

Recent research provides sobering evidence on the limitations of current safety interventions:

Alignment Faking Discovery

Anthropic's December 2024 research documented the first empirical example of a model engaging in alignment faking without being explicitly trained to do so—selectively complying with training objectives while strategically preserving existing preferences. This finding has direct implications for intervention effectiveness: methods that rely on behavioral compliance (RLHF, Constitutional AI) may be fundamentally limited against sufficiently capable systems.

Standard Methods Insufficient Against Backdoors

Hubinger et al. (2024) demonstrated that standard alignment techniques—RLHF, finetuning on helpful/harmless/honest outputs, and adversarial training—can be jointly insufficient to eliminate behavioral "backdoors" that produce undesirable behavior under specific triggers. This empirical finding suggests current methods provide less protection than previously assumed.

Empirical Uplift Studies

The 2025 Peregrine Report, based on 48 in-depth interviews with staff at OpenAI, Anthropic, Google DeepMind, and multiple AI Safety Institutes, emphasizes the need for "compelling, empirical evidence of AI risks through large-scale experiments via concrete demonstration" rather than abstract theory—highlighting that intervention effectiveness claims often lack rigorous empirical grounding.

The Alignment Tax: Empirical Findings

Research on the "alignment tax"—the performance cost of safety measures—reveals concerning trade-offs:

StudyFindingImplication
Lin et al. (2024)RLHF causes "pronounced alignment tax" with forgetting of pretrained abilitiesSafety methods may degrade capabilities
Safe RLHF (ICLR 2024)Three iterations improved helpfulness Elo by +244-364 points, harmlessness by +238-268Iterative refinement shows promise
MaxMin-RLHF (2024)Standard RLHF shows "preference collapse" for minority preferences; MaxMin achieves 16% average improvement, 33% for minority groupsSingle-reward RLHF fundamentally limited for diverse preferences
Algorithmic Bias StudyMatching regularization achieves 29-41% improvement over standard RLHFMethodological refinements can reduce alignment tax

The empirical evidence suggests RLHF is effective for reducing overt harmful outputs but fundamentally limited for detecting strategic deception or representing diverse human preferences.

Risk/Impact Assessment

Risk CategorySeverityIntervention CoverageTimelineTrend
Deceptive AlignmentVery High (9/10)Very Poor (1-2 effective interventions)2-4 yearsWorsening - models getting more capable
Scheming/Treacherous TurnVery High (9/10)Very Poor (1 effective intervention)3-6 yearsWorsening - no detection progress
Structural RisksHigh (7/10)Poor (governance gaps)OngoingStable - concentration increasing
Misuse RisksHigh (8/10)Good (multiple interventions)ImmediateImproving - active development
Goal MisgeneralizationMedium-High (6/10)Fair (partial coverage)1-3 yearsStable - some progress
Epistemic CollapseMedium (5/10)Poor (technical fixes insufficient)2-5 yearsWorsening - deepfakes proliferating

Intervention Mechanism Framework

The following diagram illustrates how different intervention types interact and which risk categories they address. Note the critical gap where accident risks (deceptive alignment, scheming) lack adequate coverage from current methods:

Diagram (loading…)
flowchart TD
  subgraph TECHNICAL["Technical Interventions"]
      RLHF[RLHF/Constitutional AI]
      INTERP[Interpretability]
      EVALS[Capability Evaluations]
      CONTROL[AI Control]
      REDTEAM[Red-teaming]
  end

  subgraph GOVERNANCE["Governance Interventions"]
      COMPUTE[Compute Governance]
      EXPORT[Export Controls]
      REGISTRY[Model Registries]
      LIABILITY[Liability Frameworks]
  end

  subgraph RISKS["Risk Categories"]
      MISUSE[Misuse Risks]
      ACCIDENT[Accident Risks]
      STRUCTURAL[Structural Risks]
      EPISTEMIC[Epistemic Risks]
  end

  RLHF -->|Medium| MISUSE
  RLHF -->|Low| ACCIDENT
  INTERP -->|High| ACCIDENT
  INTERP -->|Medium| MISUSE
  EVALS -->|High| MISUSE
  EVALS -->|Medium| ACCIDENT
  CONTROL -->|Very High| ACCIDENT
  REDTEAM -->|High| MISUSE

  COMPUTE -->|High| MISUSE
  COMPUTE -->|Medium| STRUCTURAL
  EXPORT -->|Medium| MISUSE
  REGISTRY -->|Low| STRUCTURAL
  LIABILITY -->|Medium| STRUCTURAL

  GOVERNANCE -->|Required| STRUCTURAL
  GOVERNANCE -->|Low| EPISTEMIC

  style ACCIDENT fill:#ffcccc
  style STRUCTURAL fill:#ffe6cc
  style CONTROL fill:#ccffcc
  style INTERP fill:#ccffcc

Technical interventions show strong coverage of misuse risks but weak coverage of accident risks. Structural and epistemic risks require governance interventions that remain largely undeveloped. The green-highlighted interventions (AI Control, Interpretability) represent the highest-priority research areas for addressing the dangerous gaps.

Strategic Prioritization Framework

Critical Gaps Analysis

Diagram (loading…)
quadrantChart
  title Gap Prioritization Matrix
  x-axis Low Tractability --> High Tractability
  y-axis Low Severity --> High Severity
  quadrant-1 HIGHEST PRIORITY
  quadrant-2 Long-term Research
  quadrant-3 Lower Priority
  quadrant-4 Quick Wins
  Deceptive alignment: [0.4, 0.9]
  Scheming: [0.25, 0.85]
  Structural risks: [0.65, 0.7]
  Epistemic collapse: [0.5, 0.5]
  Treacherous turn: [0.2, 0.95]
  Goal misgeneralization: [0.6, 0.6]
  Misuse bio-cyber: [0.75, 0.8]

Resource Allocation Recommendations

Current AllocationRecommended AllocationJustification
RLHF/Fine-tuning: 40%RLHF/Fine-tuning: 25%Reduce marginal investment - doesn't address deception
Capability Evaluations: 25%Interpretability: 30%Massive increase needed for deception detection
Interpretability: 15%AI Control: 20%New category - insurance against alignment failure
Red-teaming: 10%Evaluations: 15%Maintain current level
Other Technical: 10%Red-teaming: 10%Stable - proven value

Funding shift recommendation: Move $100M+ annually from RLHF to interpretability and AI Control research based on Anthropic's estimate that interpretability needs 10x current investment.

Cost-Effectiveness Comparison

The following table compares interventions on a cost-per-unit-of-risk-reduction basis, using available evidence:

InterventionEst. Annual CostRisk Reduction (Range)Cost per 1% Risk ReductionConfidence
RLHF at scale$100M+5-15% (misuse only)$13-40MHigh
Interpretability research$12M15-30% (deception, misalignment)$1.7-3.5MMedium
Capability evaluations$10M10-25% (dangerous capabilities)$1.6-4MMedium-High
AI Control research$10M20-40% (theoretical, untested)$1.25-0.5MLow
Governance research$18M5-20% (structural risks)$1.9-3.6MMedium
Red-teaming$13M10-20% (misuse, jailbreaks)$1.15-2.3MHigh

This analysis suggests AI Control and interpretability research offer the highest expected return on investment, though confidence levels vary significantly. The cost-effectiveness of AI Control is particularly uncertain because it remains largely theoretical.

Technical Intervention Effectiveness

Core Safety Methods Performance

InterventionDeceptive AlignmentGoal MisgeneralizationPower-SeekingSchemingEvidence Quality
Interpretability40% reduction50% reduction45% reduction35% reductionMedium (Anthropic)
RLHF/RLAIF10% reduction60% reduction20% reduction5% reductionHigh (OpenAI InstructGPT)
Constitutional AI15% reduction55% reduction25% reduction10% reductionHigh (Anthropic CAI)
Capability Evaluations20% reduction70% reduction60% reduction15% reductionMedium-High (METR)
AI Control70% reduction30% reduction80% reduction75% reductionLow (Greenblatt et al.)
Red-teaming25% reduction65% reduction40% reduction20% reductionHigh (industry standard)

Misuse Risk Coverage

Risk TypeTechnical BarriersGovernance RequirementsCombined Effectiveness
Bioweapons60% (evals + red-teaming)80% (export controls)88% combined
Cyberweapons50% (evals + red-teaming)70% (restrictions)85% combined
Disinformation30% (detection tools)60% (platform policies)72% combined
Autonomous Weapons40% (safety constraints)90% (treaties)94% combined

Sources: RAND Corporation, Center for Security and Emerging Technology

Standardized Safety Benchmarks

Emerging evaluation frameworks provide more rigorous assessment of intervention effectiveness:

BenchmarkDimensions MeasuredKey FindingLimitations
HELM SafetyRefusals of violent, fraudulent, discriminatory contentStandardized behavioral safeguard measurementMeasures refusals, not underlying alignment
TrustLLMTruthfulness, safety, fairness, robustness, privacy, machine ethicsComprehensive 6-dimension frameworkLimited to observable behaviors
SafeBench CompetitionRobustness, monitoring, alignment, safety applications$150K prize pool; 120+ researchers registeredCompetition format may miss systematic gaps
METR Dangerous CapabilitiesAutonomous replication, resource acquisition, deceptionFrontier capability evaluationNarrow focus on specific dangerous capabilities

The Center for AI Safety SafeBench competition represents an effort to develop empirically rigorous benchmarks for assessing AI safety across robustness, monitoring, alignment, and safety applications. However, behavioral benchmarks fundamentally cannot detect latent deceptive alignment—a model that strategically complies during evaluation would score well on all benchmarks while remaining unsafe.

AI Control: The Emerging Paradigm

AI Control represents a fundamentally different approach to managing risks from potentially misaligned AI systems. Rather than assuming alignment can be verified, AI Control assumes models may be intentionally subverting safety measures and designs systems to remain safe regardless.

Diagram (loading…)
flowchart TD
  subgraph ALIGNMENT["Alignment Approach"]
      A1[Train model to be aligned] --> A2[Verify alignment]
      A2 --> A3{Is model aligned?}
      A3 -->|Yes| A4[Deploy safely]
      A3 -->|No| A1
  end

  subgraph CONTROL["AI Control Approach"]
      C1[Assume model may be misaligned] --> C2[Design robust oversight]
      C2 --> C3[Limit model autonomy]
      C3 --> C4[Monitor for subversion]
      C4 --> C5[Deploy with safeguards]
  end

  style A3 fill:#ffcccc
  style C5 fill:#ccffcc

Key AI Control Techniques

According to Buck Shlegeris and Ryan Greenblatt at Redwood Research, AI Control employs several strategies:

TechniqueDescriptionEffectiveness EstimateMaturity
Always-on auditingContinuous monitoring of model outputs for suspicious behavior60-80% detection of overt subversionResearch stage
HoneypottingPresenting apparent opportunities for subversion to detect intent40-70% detection of intentional subversionTheoretical
Re-samplingQuerying multiple model instances and comparing outputs50-75% detection of inconsistent behaviorEarly implementation
Factored cognitionBreaking tasks into pieces to limit model knowledge30-60% reduction in catastrophic action potentialResearch stage
Trusted monitoringUsing a trusted (weaker) model to oversee an untrusted (stronger) modelVaries by capability gapEarly research

Redwood Research recently released what they describe as "the biggest and most intricate study of AI control to date, in a command line agent setting," demonstrating these techniques are "the best available option for preventing misaligned early AGIs from causing sudden disasters."

Critical Unaddressed Gaps

Tier 1: Existential Priority

GapCurrent CoverageWhat's MissingRequired Investment
Deceptive Alignment Detection≈15% effective interventionsScalable interpretability, behavioral signatures$500M+ over 3 years
Scheming Prevention≈10% effective interventionsFormal verification, AI Control deployment$300M+ over 5 years
Treacherous Turn Monitoring≈5% effective interventionsReal-time oversight, containment protocols$200M+ over 4 years

Tier 2: Major Structural Issues

GapTechnical Solution ViabilityGovernance Requirements
Concentration of PowerVery LowInternational coordination, antitrust
Democratic Lock-inNoneConstitutional protections, power distribution
Epistemic CollapseLow (partial technical fixes)Media ecosystem reform, authentication infrastructure

Evidence Base Assessment

Intervention Confidence Levels

Intervention CategoryDeployment EvidenceResearch QualityConfidence in Ratings
RLHF/Constitutional AIHigh - Deployed at scaleHigh - Multiple studiesHigh (85% confidence)
Capability EvaluationsMedium - Limited deploymentMedium - Emerging standardsMedium (70% confidence)
InterpretabilityLow - Research stageMedium - Promising resultsMedium (65% confidence)
AI ControlNone - Theoretical onlyLow - Early researchLow (40% confidence)
Formal VerificationNone - Toy models onlyVery Low - Existence proofsVery Low (20% confidence)

Key Uncertainties in Effectiveness

UncertaintyImpact on RatingsExpert Disagreement Level
Interpretability scaling±30% effectivenessHigh - 60% vs 20% optimistic
Deceptive alignment prevalence±50% priority rankingVery High - 80% vs 10% concerned
AI Control feasibility±40% effectivenessHigh - theoretical vs practical
Governance implementation±60% structural risk mitigationMedium - feasibility questions

Sources: AI Impacts survey, FHI expert elicitation, MIRI research updates

Interpretability Scaling: State of Evidence

Mechanistic interpretability research has made significant progress but faces critical scaling challenges. According to a comprehensive 2024 review:

Progress AreaCurrent StateScaling ChallengeSafety Relevance
Sparse AutoencodersSuccessfully scaled to Claude 3 SonnetCompute-intensive, requires significant automationHigh - enables monosemantic feature extraction
Circuit TracingApplied to smaller modelsExtension to frontier-scale models remains challengingVery High - could detect deceptive circuits
Activation PatchingWell-developed techniqueFine-grained probing requires expert intuitionMedium - helps identify causal mechanisms
Behavioral InterventionCan suppress toxicity/biasMay not generalize to novel misalignment patternsMedium - targeted behavioral corrections

The key constraint is that mechanistic analysis is "time- and compute-intensive, requiring fine-grained probing, high-resolution instrumentation, and expert intuition." Without major automation breakthroughs, interpretability may not scale to real-world safety applications for frontier models. However, recent advances in circuit tracing now allow researchers to observe Claude's reasoning process, revealing a shared conceptual space where reasoning occurs before language generation.

Intervention Synergies and Conflicts

Positive Synergies

Intervention PairSynergy StrengthMechanismEvidence
Interpretability + EvaluationsVery High (2x effectiveness)Interpretability explains eval resultsAnthropic research
AI Control + Red-teamingHigh (1.5x effectiveness)Red-teaming finds control vulnerabilitiesTheoretical analysis
RLHF + Constitutional AIMedium (1.3x effectiveness)Layered training approachesConstitutional AI paper
Compute Governance + Export ControlsHigh (1.7x effectiveness)Hardware-software restriction comboCSET analysis

Negative Interactions

Intervention PairConflict TypeSeverityMitigation
RLHF + Deceptive AlignmentMay train deceptionHighUse interpretability monitoring
Capability Evals + RacingAccelerates competitionMediumCoordinate evaluation standards
Open Research + MisuseInformation hazardsMediumResponsible disclosure protocols

Governance vs Technical Solutions

Structural Risk Coverage

Risk CategoryTechnical EffectivenessGovernance EffectivenessWhy Technical Fails
Power Concentration0-5%60-90%Technical tools can't redistribute power
Lock-in Prevention0-10%70-95%Technical fixes can't prevent political capture
Democratic Enfeeblement5-15%80-95%Requires institutional design, not algorithms
Epistemic Commons20-40%60-85%System-level problems need system solutions

Governance Intervention Maturity

InterventionDevelopment StagePolitical FeasibilityTimeline to Implementation
Compute GovernancePilot implementationsMedium1-3 years
Model RegistriesDesign phaseHigh2-4 years
International AI TreatiesEarly discussionsLow5-10 years
Liability FrameworksLegal analysisMedium3-7 years
Export Controls (expanded)Active developmentHigh1-2 years

Sources: Georgetown CSET, IAPS governance research, Brookings AI governance tracker

Compute Governance: Detailed Assessment

Compute governance has emerged as a key policy lever for AI governance, with the Biden administration introducing export controls on advanced semiconductor manufacturing equipment. However, effectiveness varies significantly:

MechanismTargetEffectivenessLimitations
Chip export controlsPrevent adversary access to frontier AIMedium (60-75%)Black market smuggling; cloud computing loopholes
Training compute thresholdsTrigger reporting requirements at 10^26 FLOPLow-MediumAlgorithmic efficiency improvements reduce compute needs
Cloud access restrictionsLimit API access for sanctioned entitiesLow (30-50%)VPNs, intermediaries, open-source alternatives
Know-your-customer requirementsTrack who uses computeMedium (50-70%)Privacy concerns; enforcement challenges

According to GovAI research, "compute governance may become less effective as algorithms and hardware improve. Scientific progress continually decreases the amount of computing power needed to reach any level of AI capability." The RAND analysis of the AI Diffusion Framework notes that China could benefit from a shift away from compute as the binding constraint, as companies like DeepSeek compete to push the frontier less handicapped by export controls.

International Coordination: Effectiveness Assessment

The ITU Annual AI Governance Report 2025 and recent developments reveal significant challenges in international AI governance coordination:

Coordination MechanismStatus (2025)Effectiveness AssessmentKey Limitation
UN High-Level Advisory Body on AISubmitted recommendations Sept 2024Low-MediumRelies on voluntary cooperation; fragmented approach
UN Independent Scientific PanelEstablished Dec 2024Too early to assessLimited enforcement power
EU AI ActEntered force Aug 2024Medium-High (regional)Jurisdictional limits; enforcement mechanisms untested
Paris AI Action SummitFeb 2025LowCalled for harmonization but highlighted how far from unified framework
US-China CoordinationMinimalVery LowFundamental political contradictions; export control tensions
Bletchley/Seoul SummitsVoluntary commitmentsLow-MediumNon-binding; limited to willing participants

The Oxford International Affairs analysis notes that addressing the global AI governance deficit requires moving from a "weak regime complex to the strongest governance system possible under current geopolitical conditions." However, proposals for an "IAEA for AI" face fundamental challenges because "nuclear and AI are not similar policy problems—AI policy is loosely defined with disagreement over field boundaries."

Critical gap: Companies plan to scale frontier AI systems 100-1000x in effective compute over the next 3-5 years. Without coordinated international licensing and oversight, countries risk a "regulatory race to the bottom."

Implementation Roadmap

Phase 1: Immediate (0-2 years)

  • Redirect 20% of RLHF funding to interpretability research
  • Establish AI Control research programs at major labs
  • Implement capability evaluation standards across industry
  • Strengthen export controls on AI hardware

Phase 2: Medium-term (2-5 years)

  • Deploy interpretability tools for deception detection
  • Pilot AI Control systems in controlled environments
  • Establish international coordination mechanisms
  • Develop formal verification for critical systems

Phase 3: Long-term (5+ years)

  • Scale proven interventions to frontier models
  • Implement comprehensive governance frameworks
  • Address structural risks through institutional reform
  • Monitor intervention effectiveness and adapt

Current State & Trajectory

Capability Evaluation Ecosystem (2024-2025)

The capability evaluation landscape has matured significantly with multiple organizations now conducting systematic pre-deployment assessments:

OrganizationRoleKey ContributionsPartnerships
METRIndependent evaluatorARA (autonomous replication & adaptation) methodology; GPT-4, Claude evaluationsUK AISI, Anthropic, OpenAI
Apollo ResearchScheming detectionSafety cases framework; deception detectionUK AISI, Redwood, UC Berkeley
UK AISIGovernment evaluatorFirst government-led comprehensive model evaluationsMETR, Apollo, major labs
US AISI (NIST)Standards developmentAI RMF, evaluation guidelinesIndustry consortium
Model developersInternal evaluationPre-deployment testing per RSPsThird-party auditors

According to METR's December 2025 analysis, twelve companies have now published frontier AI safety policies, including commitments to capability evaluations for dangerous capabilities before deployment. A 2023 survey found 98% of AI researchers "somewhat or strongly agreed" that labs should conduct pre-deployment risk assessments and dangerous capabilities evaluations.

Funding Landscape (2024)

Intervention TypeAnnual FundingGrowth RateMajor Funders
RLHF/Alignment Training$100M+50%/yearOpenAI, Anthropic, Google DeepMind
Capability Evaluations$150M+80%/yearUK AISI, METR, industry labs
Interpretability$100M+60%/yearAnthropic, academic institutions
AI Control$10M+200%/yearRedwood Research, academic groups
Governance Research$10M+40%/yearGovAI, CSET

Industry Deployment Status

InterventionOpenAIAnthropicGoogleMetaAssessment
RLHF✓ Deployed✓ Deployed✓ Deployed✓ DeployedStandard practice
Constitutional AIPartial✓ DeployedDevelopingDevelopingEmerging standard
Red-teaming✓ Deployed✓ Deployed✓ Deployed✓ DeployedUniversal adoption
InterpretabilityResearch✓ ActiveResearchLimitedMixed implementation
AI ControlNoneResearchNoneNoneEarly research only

Key Cruxes and Expert Disagreements

High-Confidence Disagreements

QuestionOptimistic ViewPessimistic ViewEvidence Quality
Will interpretability scale?70% chance of success30% chance of successMedium - early results promising
Is deceptive alignment likely?20% probability80% probabilityLow - limited empirical data
Can governance keep pace?Institutions will adaptRegulatory capture inevitableMedium - historical precedent
Are current methods sufficient?Incremental progress worksNeed paradigm shiftMedium - deployment experience

Critical Research Questions

Key Questions

  • ?Will mechanistic interpretability scale to GPT-4+ sized models?
  • ?Can AI Control work against genuinely superintelligent systems?
  • ?Are current safety approaches creating a false sense of security?
  • ?Which governance interventions are politically feasible before catastrophe?
  • ?How do we balance transparency with competitive/security concerns?

Methodological Limitations

LimitationImpact on AnalysisMitigation Strategy
Sparse empirical dataEffectiveness estimates uncertainExpert elicitation, sensitivity analysis
Rapid capability growthIntervention relevance changingRegular reassessment, adaptive frameworks
Novel risk categoriesMatrix may miss emerging threatsHorizon scanning, red-team exercises
Deployment context dependenceLab results may not generalizeReal-world pilots, diverse testing

Sources & Resources

Meta-Analyses and Comprehensive Reports

ReportAuthors/OrganizationKey ContributionDate
International AI Safety Report 202596 experts from 30 countriesComprehensive assessment that no current method reliably prevents unsafe outputs2025
AI Safety IndexFuture of Life InstituteQuarterly tracking of AI safety progress across multiple dimensions2025
2025 Peregrine Report208 expert proposalsIn-depth interviews with major AI lab staff on risk mitigation2025
Mechanistic Interpretability ReviewBereska & GavvesComprehensive review of interpretability for AI safety2024
ITU AI Governance ReportInternational Telecommunication UnionGlobal state of AI governance2025

Primary Research Sources

CategorySourceKey ContributionQuality
Technical SafetyAnthropic Constitutional AICAI effectiveness dataHigh
Technical SafetyOpenAI InstructGPTRLHF deployment evidenceHigh
InterpretabilityAnthropic Scaling MonosemanticityInterpretability scaling resultsHigh
AI ControlGreenblatt et al. AI ControlControl theory frameworkMedium
EvaluationsMETR Dangerous CapabilitiesEvaluation methodologyMedium-High
Alignment FakingHubinger et al. 2024Empirical evidence on backdoor persistenceHigh

Policy and Governance Sources

OrganizationResourceFocus AreaReliability
CSETAI Governance DatabasePolicy landscape mappingHigh
GovAIGovernance researchInstitutional analysisHigh
RAND CorporationAI Risk AssessmentMilitary/security applicationsHigh
UK AISITesting reportsGovernment evaluation practiceMedium-High
US AISIGuidelines and standardsFederal AI policyMedium-High

Industry and Lab Resources

OrganizationResource TypeKey InsightsAccess
OpenAISafety researchRLHF deployment dataPublic
AnthropicResearch publicationsConstitutional AI, interpretabilityPublic
DeepMindSafety researchTechnical safety approachesPublic
Redwood ResearchAI Control researchControl methodology developmentPublic
METREvaluation frameworksCapability assessment toolsPartial

Expert Survey Data

SurveySample SizeKey FindingsConfidence
AI Impacts 2022738 expertsTimeline estimates, risk assessmentsMedium
FHI Expert Survey352 expertsExistential risk probabilitiesMedium
State of AI ReportIndustry dataDeployment and capability trendsHigh
Anthropic Expert Interviews45 researchersTechnical intervention effectivenessMedium-High

Additional Sources

SourceURLKey Contribution
Coefficient Giving 2024 Progressopenphilanthropy.orgFunding landscape, priorities
AI Safety Funding AnalysisEA ForumComprehensive funding breakdown
80,000 Hours AI Safety80000hours.orgFunding opportunities assessment
RLHF Alignment TaxACL AnthologyEmpirical alignment tax research
Safe RLHFICLR 2024Helpfulness/harmlessness balance
AI Control PaperarXivFoundational AI Control research
Redwood Research Blogredwoodresearch.substack.comAI Control developments
METR Safety Policiesmetr.orgIndustry policy analysis
GovAI Compute Governancegovernance.aiCompute governance analysis
RAND AI Diffusionrand.orgExport control effectiveness

Technical Risk Models

  • Deceptive Alignment Decomposition - Detailed analysis of key gap
  • AI Safety Defense in Depth Model - How interventions layer
  • AI Capability Threshold Model - When interventions become insufficient

Governance and Strategy

  • AI Risk Portfolio Analysis - Risk portfolio construction
  • Capabilities to Safety Pipeline - Research translation challenges
  • AI Risk Critical Uncertainties Model - Key unknowns affecting prioritization

Implementation Resources

  • Responsible Scaling Policies - Industry implementation
  • Safety Research Organizations - Key players and capacity
  • Evaluation Frameworks - Assessment methodologies

References

OpenAI is a leading AI research and deployment company focused on building advanced AI systems, including GPT and o-series models, with a stated mission of ensuring artificial general intelligence (AGI) benefits all of humanity. The homepage serves as a gateway to their research, products, and policy work spanning capabilities and safety.

★★★★☆

Google DeepMind is a leading AI research laboratory combining the former DeepMind and Google Brain teams, focused on developing advanced AI systems and conducting research across capabilities, safety, and applications. The organization is one of the most influential labs in AI development, working on frontier models including Gemini and publishing widely-cited safety and capabilities research.

★★★★☆

This paper introduces InstructGPT, a method for aligning language models with human intent using Reinforcement Learning from Human Feedback (RLHF). By fine-tuning GPT-3 with human preference data, the authors demonstrate that smaller aligned models can outperform much larger unaligned models on user-preferred outputs. The work establishes RLHF as a foundational technique for making LLMs safer and more helpful.

★★★☆☆

The Center for AI Safety's 2024 annual review highlights major research achievements including circuit breakers for preventing dangerous AI outputs, the WMDP benchmark for measuring hazardous knowledge, HarmBench for red teaming evaluation, and tamper-resistant safeguards for open-weight models. The review also covers advocacy efforts including the CAIS Action Fund and support for AI safety legislation. These projects span technical safety research, evaluation frameworks, and policy advocacy.

★★★★☆
5AI Control FrameworkarXiv·Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan & Fabien Roger·2023·Paper

This paper introduces the 'AI control' framework, which focuses on ensuring AI systems behave safely even if they are deceptively misaligned or actively trying to subvert oversight. It proposes evaluation protocols and mechanisms to maintain safety against intentional subversion by advanced AI models, treating safety as a red-team/blue-team problem between AI and human overseers.

★★★☆☆

This resource is unavailable due to a 404 error, indicating the page has been moved or retired from the RAND Corporation website. No substantive content can be assessed.

★★★★☆
7Oxford International AffairsOxford Academic (peer-reviewed)

This article from the journal International Affairs (Oxford) addresses AI governance and its implications for international security and global policy coordination. The piece likely examines how states and international institutions are responding to the challenges posed by advanced AI systems, including prioritization of interventions and their effectiveness.

★★★★★

The 2022 ESPAI surveyed 738 machine learning researchers (NeurIPS/ICML authors) about AI progress timelines and risks, serving as a replication and update of the 2016 survey. Key findings include an aggregate forecast of 50% chance of HLMI by 2059 (37 years from 2022), with significant disagreement among experts about timelines and risks.

★★★☆☆

The CSET AI Governance Database is a structured reference tool cataloging AI-related laws, regulations, strategies, and governance initiatives from governments and international bodies worldwide. It serves as a comprehensive resource for tracking the evolving global policy landscape around artificial intelligence. The database supports researchers and policymakers in comparative analysis of AI governance approaches.

★★★★☆
10Redwood Research: AI Controlredwoodresearch.org

Redwood Research is a nonprofit AI safety organization that pioneered the 'AI control' research agenda, focusing on preventing intentional subversion by misaligned AI systems. Their key contributions include the ICML paper on AI Control protocols, the Alignment Faking demonstration (with Anthropic), and consulting work with governments and AI labs on misalignment risk mitigation.

METR is an organization conducting research and evaluations to assess the capabilities and risks of frontier AI systems, focusing on autonomous task completion, AI self-improvement risks, and evaluation integrity. They have developed the 'Time Horizon' metric measuring how long AI agents can autonomously complete software tasks, showing exponential growth over recent years. They work with major AI labs including OpenAI, Anthropic, and Amazon to evaluate catastrophic risk potential.

★★★★☆
12GovAI Research PublicationsCentre for the Governance of AI·Government

The Centre for the Governance of AI (GovAI) research hub aggregates policy-relevant technical and governance research on frontier AI systems, covering topics from biosecurity and cybercrime to labor market impacts and AI auditing. It serves as a comprehensive repository of GovAI's publications spanning multiple years and research themes. The page indexes papers addressing near-term and long-term risks from advanced AI systems.

★★★★☆

This page appears to be a research updates feed from the Machine Intelligence Research Institute (MIRI), but the content is currently unavailable (404 error). MIRI focuses on technical AI safety research, particularly on aligned AI and decision theory.

★★★☆☆
14AI Safety Institute - GOV.UKUK Government·Government

The UK AI Safety Institute (recently rebranded as the AI Security Institute) is a government body under the Department for Science, Innovation and Technology focused on minimizing risks from rapid and unexpected AI advances. It conducts and publishes safety research, international coordination reports, and policy guidance, while managing grants for systemic AI safety research.

★★★★☆

OpenAI's central safety page providing updates on their approach to AI safety research, deployment practices, and ongoing safety commitments. It serves as a hub for information on OpenAI's safety-related initiatives, policies, and technical work aimed at ensuring their AI systems are safe and beneficial.

★★★★☆
16Guidelines and standardsNIST·Government

NIST's AI hub provides foundational guidelines, standards, and governance frameworks for responsible AI development, centered on the AI Risk Management Framework (AI RMF). As a nonregulatory federal agency, NIST promotes trustworthy AI through measurement science, voluntary technical standards, and stakeholder collaboration to balance innovation with risk mitigation.

★★★★★
17Hubinger et al. (2024)arXiv·Shanshan Han·2024·Paper

This vision paper by Hubinger et al. (2024) proposes a long-term blueprint for advanced human society to guide current AI safety efforts. The authors project a future centered on the Internet of Everything and map technological advancements across stages, forecasting potential AI safety challenges at each phase. By comparing current safety initiatives against this long-term vision, the paper identifies gaps and emerging priorities for AI safety practitioners in the 2020s, arguing that safety efforts must balance addressing immediate concerns with anticipating risks in an expanding AI landscape.

★★★☆☆
18Brookings AI governance trackerBrookings Institution

The Brookings Institution maintains an AI governance tracker that monitors policy developments, regulatory proposals, and legislative actions related to artificial intelligence across jurisdictions. It serves as a reference resource for tracking the evolving landscape of AI governance initiatives globally.

★★★★☆
19AI Safety Index Winter 2025Future of Life Institute

The Future of Life Institute evaluated eight major AI companies across 35 safety indicators, finding widespread deficiencies in risk management and existential safety practices. Even top performers Anthropic and OpenAI received only marginal passing grades, highlighting systemic gaps across the industry in preparedness for advanced AI risks.

★★★☆☆

Anthropic is an AI safety company focused on building reliable, interpretable, and steerable AI systems. The company conducts frontier AI research and develops Claude, its family of AI assistants, with a stated mission of responsible development and maintenance of advanced AI for long-term human benefit.

★★★★☆
21International AI Safety Report 2025internationalaisafetyreport.org

A landmark international scientific assessment co-authored by 96 experts from 30 countries, providing a comprehensive overview of general-purpose AI capabilities, risks, and risk management approaches. It aims to establish shared scientific understanding across nations as a foundation for global AI governance. The report covers topics including capability evaluation, misuse risks, systemic risks, and mitigation strategies.

22Sparse AutoencodersarXiv·Leonard Bereska & Efstratios Gavves·2024·Paper

This review examines mechanistic interpretability—the process of reverse-engineering neural networks to understand their computational mechanisms and learned representations in human-understandable terms. The authors establish foundational concepts around how features encode knowledge in neural activations, survey methodologies for causally analyzing model behaviors, and assess mechanistic interpretability's relevance to AI safety. They discuss potential benefits for understanding and controlling AI systems, alongside risks such as capability gains and dual-use concerns, while identifying key challenges in scalability and automation. The authors argue that advancing mechanistic interpretability techniques is essential for preventing catastrophic outcomes as AI systems become increasingly powerful and opaque.

★★★☆☆
23IAPS governance researchInstitute for AI Policy and Strategy

IAPS (Institute for AI Policy and Strategy) is a research organization focused on AI governance, policy analysis, and strategic interventions to reduce risks from advanced AI systems. It conducts research on effective policy levers, international coordination, and prioritization of governance efforts to improve AI safety outcomes.

★★★★☆
242025 Peregrine Reportriskmitigation.ai

The 2025 Peregrine Report appears to be an analysis or evaluation of AI risk mitigation strategies, likely assessing the effectiveness and prioritization of various interventions aimed at reducing AI-related harms. Without access to the full content, it is understood to focus on actionable frameworks for identifying and implementing the most impactful safety measures.

The ITU's 2025 AI Governance Report provides a comprehensive overview of global AI governance developments, frameworks, and policy trends from an international telecommunications and ICT standards perspective. It examines how nations and international bodies are approaching AI regulation, safety standards, and coordination challenges. The report serves as a reference document for policymakers and stakeholders navigating the evolving AI governance landscape.

26DeepMind BlogGoogle DeepMind

The DeepMind blog serves as the official publication hub for Google DeepMind, featuring research announcements, technical breakthroughs, and commentary on AI development including safety-relevant work. It covers topics ranging from scientific applications to AI safety and alignment research. The blog is a primary source for understanding DeepMind's research agenda and public positions on AI.

★★★★☆
27FHI expert elicitationFuture of Humanity Institute

This resource from the Future of Humanity Institute (FHI) at Oxford involves expert elicitation surveys focused on AI development timelines, capability thresholds, and prioritization of interventions. It aggregates forecasts from researchers to inform understanding of when transformative AI might arrive and what safety measures may be most effective.

★★★★☆

METR (formerly ARC Evals) describes their framework for evaluating potentially dangerous capabilities in frontier AI models, including autonomous replication, acquiring resources, and assisting with weapons development. The post outlines their methodology for assessing whether models pose catastrophic risks and how these evaluations inform deployment decisions. It represents a key practical approach to pre-deployment safety testing.

★★★★☆
29CSET: AI Market DynamicsCSET Georgetown

CSET (Center for Security and Emerging Technology) at Georgetown University is a policy research organization focused on the security implications of emerging technologies, particularly AI. It produces research on AI policy, workforce, geopolitics, and governance. The content could not be fully extracted, limiting detailed analysis.

★★★★☆

The Centre for the Governance of AI (GovAI) is a leading research organization dedicated to helping decision-makers navigate the transition to a world with advanced AI. It produces rigorous research on AI governance, policy, and societal impacts, while fostering a global talent pipeline for responsible AI oversight. GovAI bridges technical AI safety concerns with practical policy recommendations.

★★★★☆

Anthropic's research page aggregates their work across AI alignment, mechanistic interpretability, and societal impact assessment, all oriented toward understanding and mitigating risks from increasingly capable AI systems. It serves as a central hub for their published findings and ongoing safety-focused investigations.

★★★★☆

Open Philanthropy reviews its 2024 philanthropic activities and outlines priorities for 2025, with emphasis on AI safety research funding, strategic partnerships, and grants spanning global health and catastrophic risk reduction. The report provides transparency into one of the field's largest funders and signals where major resources will flow in the AI safety ecosystem.

★★★★☆
33Redwood Research's AI Control paper (December 2023)arXiv·Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan & Fabien Roger·2023·Paper

This paper introduces 'AI control' as a framework for maintaining safety guarantees when deploying powerful but potentially misaligned models, developing protocols that combine weaker trusted models with stronger untrusted models. Two main protocols—trusted editing and untrusted monitoring—are evaluated against adversarial strategies where GPT-4 actively attempts to insert undetectable backdoors into code. Both protocols substantially outperform simple baselines, demonstrating that useful work can be extracted from untrusted models while preserving safety.

★★★☆☆
34GovAI - Computing Power and the Governance of AICentre for the Governance of AI·Government

A GovAI report examining compute governance as a lever for AI policy, arguing that AI chips' detectability, excludability, and quantifiability make compute a uniquely tractable governance target. The report covers mechanisms like tracking, subsidizing, restricting access, and embedding hardware guardrails, while cautioning that compute governance carries risks of civil liberties violations, power concentration, and authoritarian misuse.

★★★★☆

This RAND Corporation research publication (PEA3776-1) addresses policy and governance considerations related to artificial intelligence, likely examining risks, regulatory frameworks, or national security implications of advanced AI systems. Without access to the full content, the resource appears to be a RAND 'Perspectives' paper, which typically offers analysis and recommendations on emerging policy challenges.

★★★★☆

METR analyzes the safety policies of 12 frontier AI companies to identify common elements, commitments, and gaps in how organizations approach responsible deployment of advanced AI systems. The analysis synthesizes patterns across responsible scaling policies, model cards, and safety frameworks to provide a comparative overview of industry norms. It serves as a reference for understanding where consensus exists and where significant variation or absence of commitments remains.

★★★★☆

Related Wiki Pages

Top Related Pages

Risks

SchemingDeceptive AlignmentGoal MisgeneralizationTreacherous Turn

Approaches

AI EvaluationConstitutional AI

Analysis

Capabilities-to-Safety Pipeline ModelAI Capability Threshold ModelSafety-Capability Tradeoff ModelDeceptive Alignment Decomposition Model

Organizations

Redwood ResearchAnthropicMETRApollo ResearchOpenAI

Policy

AI Safety Institutes (AISIs)EU AI Act

Other

Scalable Oversight

Key Debates

Technical AI Safety Research