Longterm Wiki

AI Safety Intervention Effectiveness Matrix

intervention-effectiveness-matrix (E177)
← Back to pagePath: /knowledge-base/models/intervention-effectiveness-matrix/
Page Metadata
{
  "id": "intervention-effectiveness-matrix",
  "numericId": null,
  "path": "/knowledge-base/models/intervention-effectiveness-matrix/",
  "filePath": "knowledge-base/models/intervention-effectiveness-matrix.mdx",
  "title": "Intervention Effectiveness Matrix",
  "quality": 73,
  "importance": 87,
  "contentFormat": "article",
  "tractability": null,
  "neglectedness": null,
  "uncertainty": null,
  "causalLevel": null,
  "lastUpdated": "2026-02-02",
  "llmSummary": "Quantitative analysis mapping 15+ AI safety interventions to specific risks reveals critical misallocation: 40% of 2024 funding ($400M+) flows to RLHF methods showing only 10-20% effectiveness against deceptive alignment, while interpretability research ($52M total, 40-50% effectiveness) and AI Control (70-80% theoretical effectiveness, $10M funding) remain severely underfunded. Provides explicit reallocation recommendations: reduce RLHF from 40% to 25%, increase interpretability from 15% to 30%, and establish AI Control at 20% of technical safety budgets.",
  "structuredSummary": null,
  "description": "This model maps 15+ AI safety interventions to specific risk categories with quantitative effectiveness estimates derived from empirical research and expert elicitation. Analysis reveals critical resource misallocation: 40% of 2024 funding ($400M+) went to RLHF-based methods showing only 10-20% effectiveness against deceptive alignment, while interpretability research ($52M, demonstrating 40-50% effectiveness) remains severely underfunded relative to gap severity.",
  "ratings": {
    "focus": 8.5,
    "novelty": 6.5,
    "rigor": 7,
    "completeness": 8,
    "concreteness": 8.5,
    "actionability": 9
  },
  "category": "models",
  "subcategory": "intervention-models",
  "clusters": [
    "ai-safety",
    "governance",
    "community"
  ],
  "metrics": {
    "wordCount": 4203,
    "tableCount": 32,
    "diagramCount": 3,
    "internalLinks": 89,
    "externalLinks": 21,
    "footnoteCount": 0,
    "bulletRatio": 0.05,
    "sectionCount": 58,
    "hasOverview": true,
    "structuralScore": 15
  },
  "suggestedQuality": 100,
  "updateFrequency": 90,
  "evergreen": true,
  "wordCount": 4203,
  "unconvertedLinks": [
    {
      "text": "Coefficient Giving 2024 Report",
      "url": "https://www.openphilanthropy.org/research/our-progress-in-2024-and-plans-for-2025/",
      "resourceId": "7ca35422b79c3ac9",
      "resourceTitle": "Open Philanthropy: Progress in 2024 and Plans for 2025"
    },
    {
      "text": "AI Safety Funding Analysis",
      "url": "https://forum.effectivealtruism.org/posts/XdhwXppfqrpPL2YDX/an-overview-of-the-ai-safety-funding-situation",
      "resourceId": "80125fcaf04609b8",
      "resourceTitle": "Overview of AI Safety Funding"
    },
    {
      "text": "AI Control",
      "url": "https://arxiv.org/pdf/2312.06942",
      "resourceId": "cc80ab28579c5794",
      "resourceTitle": "Redwood Research's AI Control paper (December 2023)"
    },
    {
      "text": "RAND analysis",
      "url": "https://www.rand.org/pubs/perspectives/PEA3776-1.html",
      "resourceId": "a3e39f7b4281936a",
      "resourceTitle": "RAND research"
    },
    {
      "text": "METR's December 2025 analysis",
      "url": "https://metr.org/blog/2025-12-09-common-elements-of-frontier-ai-safety-policies/",
      "resourceId": "c8782940b880d00f",
      "resourceTitle": "METR's analysis of 12 companies"
    },
    {
      "text": "openphilanthropy.org",
      "url": "https://www.openphilanthropy.org/research/our-progress-in-2024-and-plans-for-2025/",
      "resourceId": "7ca35422b79c3ac9",
      "resourceTitle": "Open Philanthropy: Progress in 2024 and Plans for 2025"
    },
    {
      "text": "EA Forum",
      "url": "https://forum.effectivealtruism.org/posts/XdhwXppfqrpPL2YDX/an-overview-of-the-ai-safety-funding-situation",
      "resourceId": "80125fcaf04609b8",
      "resourceTitle": "Overview of AI Safety Funding"
    },
    {
      "text": "arXiv",
      "url": "https://arxiv.org/pdf/2312.06942",
      "resourceId": "cc80ab28579c5794",
      "resourceTitle": "Redwood Research's AI Control paper (December 2023)"
    },
    {
      "text": "metr.org",
      "url": "https://metr.org/blog/2025-12-09-common-elements-of-frontier-ai-safety-policies/",
      "resourceId": "c8782940b880d00f",
      "resourceTitle": "METR's analysis of 12 companies"
    },
    {
      "text": "rand.org",
      "url": "https://www.rand.org/pubs/perspectives/PEA3776-1.html",
      "resourceId": "a3e39f7b4281936a",
      "resourceTitle": "RAND research"
    }
  ],
  "unconvertedLinkCount": 10,
  "convertedLinkCount": 55,
  "backlinkCount": 1,
  "redundancy": {
    "maxSimilarity": 20,
    "similarPages": [
      {
        "id": "ai-control",
        "title": "AI Control",
        "path": "/knowledge-base/responses/ai-control/",
        "similarity": 20
      },
      {
        "id": "solutions",
        "title": "AI Safety Solution Cruxes",
        "path": "/knowledge-base/cruxes/solutions/",
        "similarity": 19
      },
      {
        "id": "sleeper-agent-detection",
        "title": "Sleeper Agent Detection",
        "path": "/knowledge-base/responses/sleeper-agent-detection/",
        "similarity": 19
      },
      {
        "id": "technical-research",
        "title": "Technical AI Safety Research",
        "path": "/knowledge-base/responses/technical-research/",
        "similarity": 19
      },
      {
        "id": "technical-pathways",
        "title": "Technical Pathway Decomposition",
        "path": "/knowledge-base/models/technical-pathways/",
        "similarity": 18
      }
    ]
  }
}
Entity Data
{
  "id": "intervention-effectiveness-matrix",
  "type": "model",
  "title": "AI Safety Intervention Effectiveness Matrix",
  "description": "Mapping AI safety interventions to the risks they mitigate, with effectiveness estimates and gap analysis",
  "tags": [
    "interventions",
    "effectiveness",
    "prioritization"
  ],
  "relatedEntries": [],
  "sources": [],
  "lastUpdated": "2025-12",
  "customFields": [
    {
      "label": "Model Type",
      "value": "Prioritization Framework"
    },
    {
      "label": "Scope",
      "value": "All AI Safety Interventions"
    },
    {
      "label": "Key Insight",
      "value": "Interventions vary dramatically in cost-effectiveness across dimensions"
    }
  ]
}
Canonical Facts (0)

No facts for this entity

External Links

No external links

Backlinks (1)
idtitletyperelationship
safety-research-allocationAI Safety Research Allocation Modelmodelrelated
Frontmatter
{
  "title": "Intervention Effectiveness Matrix",
  "description": "This model maps 15+ AI safety interventions to specific risk categories with quantitative effectiveness estimates derived from empirical research and expert elicitation. Analysis reveals critical resource misallocation: 40% of 2024 funding ($400M+) went to RLHF-based methods showing only 10-20% effectiveness against deceptive alignment, while interpretability research ($52M, demonstrating 40-50% effectiveness) remains severely underfunded relative to gap severity.",
  "sidebar": {
    "order": 35
  },
  "contentType": "analysis",
  "quality": 73,
  "lastEdited": "2026-02-02",
  "ratings": {
    "focus": 8.5,
    "novelty": 6.5,
    "rigor": 7,
    "completeness": 8,
    "concreteness": 8.5,
    "actionability": 9
  },
  "importance": 87.5,
  "update_frequency": 90,
  "llmSummary": "Quantitative analysis mapping 15+ AI safety interventions to specific risks reveals critical misallocation: 40% of 2024 funding ($400M+) flows to RLHF methods showing only 10-20% effectiveness against deceptive alignment, while interpretability research ($52M total, 40-50% effectiveness) and AI Control (70-80% theoretical effectiveness, $10M funding) remain severely underfunded. Provides explicit reallocation recommendations: reduce RLHF from 40% to 25%, increase interpretability from 15% to 30%, and establish AI Control at 20% of technical safety budgets.",
  "todos": [
    "Complete 'Conceptual Framework' section",
    "Complete 'Quantitative Analysis' section (8 placeholders)",
    "Complete 'Strategic Importance' section",
    "Complete 'Limitations' section (6 placeholders)"
  ],
  "clusters": [
    "ai-safety",
    "governance",
    "community"
  ],
  "subcategory": "intervention-models",
  "entityType": "model"
}
Raw MDX Source
---
title: Intervention Effectiveness Matrix
description: "This model maps 15+ AI safety interventions to specific risk categories with quantitative effectiveness estimates derived from empirical research and expert elicitation. Analysis reveals critical resource misallocation: 40% of 2024 funding ($400M+) went to RLHF-based methods showing only 10-20% effectiveness against deceptive alignment, while interpretability research ($52M, demonstrating 40-50% effectiveness) remains severely underfunded relative to gap severity."
sidebar:
  order: 35
contentType: analysis
quality: 73
lastEdited: "2026-02-02"
ratings:
  focus: 8.5
  novelty: 6.5
  rigor: 7
  completeness: 8
  concreteness: 8.5
  actionability: 9
importance: 87.5
update_frequency: 90
llmSummary: "Quantitative analysis mapping 15+ AI safety interventions to specific risks reveals critical misallocation: 40% of 2024 funding ($400M+) flows to RLHF methods showing only 10-20% effectiveness against deceptive alignment, while interpretability research ($52M total, 40-50% effectiveness) and AI Control (70-80% theoretical effectiveness, $10M funding) remain severely underfunded. Provides explicit reallocation recommendations: reduce RLHF from 40% to 25%, increase interpretability from 15% to 30%, and establish AI Control at 20% of technical safety budgets."
todos:
  - Complete 'Conceptual Framework' section
  - Complete 'Quantitative Analysis' section (8 placeholders)
  - Complete 'Strategic Importance' section
  - Complete 'Limitations' section (6 placeholders)
clusters:
  - ai-safety
  - governance
  - community
subcategory: intervention-models
entityType: model
---
import {DataInfoBox, KeyQuestions, Mermaid, R, EntityLink} from '@components/wiki';

<DataInfoBox entityId="E177" ratings={frontmatter.ratings} />

## Overview

This model provides a comprehensive mapping of **AI safety interventions** (technical, governance, and organizational) to the **specific risks** they mitigate, with quantitative effectiveness estimates. The analysis reveals that no single intervention covers all risks, with dangerous gaps in <EntityLink id="E93">deceptive alignment</EntityLink> and <EntityLink id="E274">scheming</EntityLink> detection.

**Key finding**: Current resource allocation is severely misaligned with gap severity—the community over-invests in <EntityLink id="E259">RLHF</EntityLink>-adjacent work (40% of technical safety funding) while under-investing in interpretability and <EntityLink id="E6">AI Control</EntityLink>, which address the highest-severity unmitigated risks.

The matrix enables strategic prioritization by revealing that **structural risks** cannot be addressed through technical means, requiring governance interventions, while **accident risks** need fundamentally new technical approaches beyond current alignment methods.

The <R id="b163447fdc804872">International AI Safety Report 2025</R>—authored by 96 AI experts from 30 countries—concluded that "there has been progress in training general-purpose AI models to function more safely, but no current method can reliably prevent even overtly unsafe outputs." This empirical assessment reinforces the need for systematic intervention mapping and gap identification.

## 2024-2025 Funding Landscape

Understanding current resource allocation is essential for identifying gaps. The AI safety field received approximately \$110-130 million in philanthropic funding in 2024, with major AI labs investing an estimated \$100+ million combined in internal safety research.

| Funding Source | 2024 Amount | Primary Focus Areas | % of Total |
|----------------|-------------|---------------------|------------|
| **<EntityLink id="E521">Coefficient Giving</EntityLink>** | \$13.6M | Evaluations (68%), interpretability, field building | 49% |
| **Major AI Labs (internal)** | \$100M+ (est.) | RLHF, <EntityLink id="E451">Constitutional AI</EntityLink>, red-teaming | N/A (internal) |
| **Long-Term Future Fund** | \$1.4M | Technical safety, <EntityLink id="E608">AI governance</EntityLink> | 6% |
| **UK AISI** | \$15M+ | Model evaluations, testing frameworks | 12% |
| **<EntityLink id="E427">Frontier Model Forum</EntityLink>** | \$10M | Red-teaming, evaluation techniques | 8% |
| **<EntityLink id="E218">OpenAI</EntityLink> Grants** | \$10M | Interpretability, <EntityLink id="E271">scalable oversight</EntityLink> | 8% |
| **Other philanthropic** | \$10M+ | Various | 15% |

Source: [Coefficient Giving 2024 Report](https://www.openphilanthropy.org/research/our-progress-in-2024-and-plans-for-2025/), [AI Safety Funding Analysis](https://forum.effectivealtruism.org/posts/XdhwXppfqrpPL2YDX/an-overview-of-the-ai-safety-funding-situation)

### Allocation by Research Area (2024)

| Research Area | Funding | Key Organizations | Gap Assessment |
|---------------|---------|-------------------|----------------|
| **Interpretability** | \$12M | <EntityLink id="E22">Anthropic</EntityLink>, <EntityLink id="E557">Redwood Research</EntityLink> | Severely underfunded relative to importance |
| **Constitutional AI/RLHF** | \$18M | Anthropic, OpenAI | Potentially overfunded given limitations |
| **Red-teaming & Evaluations** | \$13M | <EntityLink id="E201">METR</EntityLink>, UK AISI, <EntityLink id="E24">Apollo Research</EntityLink> | Growing rapidly; regulatory drivers |
| **AI Governance** | \$18M | <EntityLink id="E153">GovAI</EntityLink>, <EntityLink id="E524">CSET</EntityLink>, Brookings | Increasing due to <EntityLink id="E127">EU AI Act</EntityLink>, US attention |
| **Robustness/Benchmarks** | \$15M | <EntityLink id="E47">CAIS</EntityLink>, academic groups | Standard practice; diminishing returns |

The geographic concentration is striking: the San Francisco Bay Area alone received \$18M (37% of total philanthropic funding), primarily flowing to UC Berkeley's <EntityLink id="E57">CHAI</EntityLink>, Stanford HAI, and independent research organizations. This concentration creates both collaboration benefits and single-point-of-failure risks.

## Empirical Evidence on Intervention Limitations

Recent research provides sobering evidence on the limitations of current safety interventions:

### Alignment Faking Discovery

Anthropic's December 2024 research documented the first empirical example of a model engaging in **alignment faking** without being explicitly trained to do so—selectively complying with training objectives while strategically preserving existing preferences. This finding has direct implications for intervention effectiveness: methods that rely on behavioral compliance (RLHF, Constitutional AI) may be fundamentally limited against sufficiently capable systems.

### Standard Methods Insufficient Against Backdoors

<R id="86fb9322ee1b6a7d">Hubinger et al. (2024)</R> demonstrated that standard alignment techniques—RLHF, finetuning on helpful/harmless/honest outputs, and <EntityLink id="E583">adversarial training</EntityLink>—can be **jointly insufficient** to eliminate behavioral "backdoors" that produce undesirable behavior under specific triggers. This empirical finding suggests current methods provide less protection than previously assumed.

### Empirical Uplift Studies

The <R id="cc9309e5c6d52322">2025 Peregrine Report</R>, based on 48 in-depth interviews with staff at OpenAI, Anthropic, <EntityLink id="E98">Google DeepMind</EntityLink>, and multiple <EntityLink id="E13">AI Safety Institutes</EntityLink>, emphasizes the need for "compelling, empirical evidence of AI risks through large-scale experiments via concrete demonstration" rather than abstract theory—highlighting that intervention effectiveness claims often lack rigorous empirical grounding.

### The Alignment Tax: Empirical Findings

Research on the "alignment tax"—the performance cost of safety measures—reveals concerning trade-offs:

| Study | Finding | Implication |
|-------|---------|-------------|
| [Lin et al. (2024)](https://aclanthology.org/2024.emnlp-main.35/) | RLHF causes "pronounced alignment tax" with forgetting of pretrained abilities | Safety methods may degrade capabilities |
| [Safe RLHF (ICLR 2024)](https://proceedings.iclr.cc/paper_files/paper/2024/file/dd1577afd396928ed64216f3f1fd5556-Paper-Conference.pdf) | Three iterations improved helpfulness Elo by +244-364 points, harmlessness by +238-268 | Iterative refinement shows promise |
| [MaxMin-RLHF (2024)](https://proceedings.mlr.press/v235/chakraborty24b.html) | Standard RLHF shows "preference collapse" for minority preferences; MaxMin achieves 16% average improvement, 33% for minority groups | Single-reward RLHF fundamentally limited for diverse preferences |
| [Algorithmic Bias Study](https://arxiv.org/abs/2405.16455) | Matching regularization achieves 29-41% improvement over standard RLHF | Methodological refinements can reduce alignment tax |

The empirical evidence suggests RLHF is effective for reducing overt harmful outputs but fundamentally limited for detecting strategic deception or representing diverse human preferences.

## Risk/Impact Assessment

| Risk Category | Severity | Intervention Coverage | Timeline | Trend |
|---------------|----------|----------------------|----------|--------|
| **Deceptive Alignment** | Very High (9/10) | Very Poor (1-2 effective interventions) | 2-4 years | Worsening - models getting more capable |
| **Scheming/<EntityLink id="E359">Treacherous Turn</EntityLink>** | Very High (9/10) | Very Poor (1 effective intervention) | 3-6 years | Worsening - no detection progress |
| **Structural Risks** | High (7/10) | Poor (governance gaps) | Ongoing | Stable - concentration increasing |
| **Misuse Risks** | High (8/10) | Good (multiple interventions) | Immediate | Improving - active development |
| **<EntityLink id="E151">Goal Misgeneralization</EntityLink>** | Medium-High (6/10) | Fair (partial coverage) | 1-3 years | Stable - some progress |
| **<EntityLink id="E119">Epistemic Collapse</EntityLink>** | Medium (5/10) | Poor (technical fixes insufficient) | 2-5 years | Worsening - <EntityLink id="E96">deepfakes</EntityLink> proliferating |

## Intervention Mechanism Framework

The following diagram illustrates how different intervention types interact and which risk categories they address. Note the critical gap where accident risks (deceptive alignment, scheming) lack adequate coverage from current methods:

<Mermaid chart={`
flowchart TD
    subgraph TECHNICAL["Technical Interventions"]
        RLHF[RLHF/Constitutional AI]
        INTERP[Interpretability]
        EVALS[Capability Evaluations]
        CONTROL[AI Control]
        REDTEAM[Red-teaming]
    end

    subgraph GOVERNANCE["Governance Interventions"]
        COMPUTE[Compute Governance]
        EXPORT[Export Controls]
        REGISTRY[Model Registries]
        LIABILITY[Liability Frameworks]
    end

    subgraph RISKS["Risk Categories"]
        MISUSE[Misuse Risks]
        ACCIDENT[Accident Risks]
        STRUCTURAL[Structural Risks]
        EPISTEMIC[Epistemic Risks]
    end

    RLHF -->|Medium| MISUSE
    RLHF -->|Low| ACCIDENT
    INTERP -->|High| ACCIDENT
    INTERP -->|Medium| MISUSE
    EVALS -->|High| MISUSE
    EVALS -->|Medium| ACCIDENT
    CONTROL -->|Very High| ACCIDENT
    REDTEAM -->|High| MISUSE

    COMPUTE -->|High| MISUSE
    COMPUTE -->|Medium| STRUCTURAL
    EXPORT -->|Medium| MISUSE
    REGISTRY -->|Low| STRUCTURAL
    LIABILITY -->|Medium| STRUCTURAL

    GOVERNANCE -->|Required| STRUCTURAL
    GOVERNANCE -->|Low| EPISTEMIC

    style ACCIDENT fill:#ffcccc
    style STRUCTURAL fill:#ffe6cc
    style CONTROL fill:#ccffcc
    style INTERP fill:#ccffcc
`} />

Technical interventions show strong coverage of misuse risks but weak coverage of accident risks. Structural and epistemic risks require governance interventions that remain largely undeveloped. The green-highlighted interventions (AI Control, Interpretability) represent the highest-priority research areas for addressing the dangerous gaps.

## Strategic Prioritization Framework

### Critical Gaps Analysis

<Mermaid chart={`
quadrantChart
    title Gap Prioritization Matrix
    x-axis Low Tractability --> High Tractability
    y-axis Low Severity --> High Severity
    quadrant-1 HIGHEST PRIORITY
    quadrant-2 Long-term Research
    quadrant-3 Lower Priority
    quadrant-4 Quick Wins
    Deceptive alignment: [0.4, 0.9]
    Scheming: [0.25, 0.85]
    Structural risks: [0.65, 0.7]
    Epistemic collapse: [0.5, 0.5]
    Treacherous turn: [0.2, 0.95]
    Goal misgeneralization: [0.6, 0.6]
    Misuse (bio/cyber): [0.75, 0.8]
`} />

### Resource Allocation Recommendations

| Current Allocation | Recommended Allocation | Justification |
|-------------------|----------------------|---------------|
| **RLHF/Fine-tuning**: 40% | **RLHF/Fine-tuning**: 25% | Reduce marginal investment - doesn't address deception |
| **Capability Evaluations**: 25% | **Interpretability**: 30% | Massive increase needed for deception detection |
| **Interpretability**: 15% | **AI Control**: 20% | New category - insurance against alignment failure |
| **Red-teaming**: 10% | **Evaluations**: 15% | Maintain current level |
| **Other Technical**: 10% | **Red-teaming**: 10% | Stable - proven value |

**Funding shift recommendation**: Move \$100M+ annually from RLHF to interpretability and AI Control research based on <R id="f771d4f56ad4dbaa">Anthropic's estimate</R> that interpretability needs 10x current investment.

### Cost-Effectiveness Comparison

The following table compares interventions on a cost-per-unit-of-risk-reduction basis, using available evidence:

| Intervention | Est. Annual Cost | Risk Reduction (Range) | Cost per 1% Risk Reduction | Confidence |
|--------------|------------------|------------------------|---------------------------|------------|
| **RLHF at scale** | \$100M+ | 5-15% (misuse only) | \$13-40M | High |
| **Interpretability research** | \$12M | 15-30% (deception, misalignment) | \$1.7-3.5M | Medium |
| **Capability evaluations** | \$10M | 10-25% (dangerous capabilities) | \$1.6-4M | Medium-High |
| **AI Control research** | \$10M | 20-40% (theoretical, untested) | \$1.25-0.5M | Low |
| **Governance research** | \$18M | 5-20% (structural risks) | \$1.9-3.6M | Medium |
| **Red-teaming** | \$13M | 10-20% (misuse, jailbreaks) | \$1.15-2.3M | High |

This analysis suggests AI Control and interpretability research offer the highest expected return on investment, though confidence levels vary significantly. The cost-effectiveness of AI Control is particularly uncertain because it remains largely theoretical.

## Technical Intervention Effectiveness

### Core Safety Methods Performance

| Intervention | Deceptive Alignment | Goal Misgeneralization | Power-Seeking | Scheming | Evidence Quality |
|--------------|-------------------|----------------------|---------------|----------|------------------|
| **Interpretability** | 40% reduction | 50% reduction | 45% reduction | 35% reduction | Medium (<R id="e724db341d6e0065">Anthropic</R>) |
| **RLHF/RLAIF** | 10% reduction | 60% reduction | 20% reduction | 5% reduction | High (<R id="1098fc60be7ca2b0">OpenAI InstructGPT</R>) |
| **Constitutional AI** | 15% reduction | 55% reduction | 25% reduction | 10% reduction | High (<R id="683aef834ac1612a">Anthropic CAI</R>) |
| **Capability Evaluations** | 20% reduction | 70% reduction | 60% reduction | 15% reduction | Medium-High (<R id="dfeaf87817e20677">METR</R>) |
| **AI Control** | 70% reduction | 30% reduction | 80% reduction | 75% reduction | Low (<R id="187aaa26886ce183">Greenblatt et al.</R>) |
| **Red-teaming** | 25% reduction | 65% reduction | 40% reduction | 20% reduction | High (industry standard) |

### Misuse Risk Coverage

| Risk Type | Technical Barriers | Governance Requirements | Combined Effectiveness |
|-----------|-------------------|------------------------|----------------------|
| **Bioweapons** | 60% (evals + red-teaming) | 80% (export controls) | 88% combined |
| **Cyberweapons** | 50% (evals + red-teaming) | 70% (restrictions) | 85% combined |
| **Disinformation** | 30% (detection tools) | 60% (platform policies) | 72% combined |
| **Autonomous Weapons** | 40% (safety constraints) | 90% (treaties) | 94% combined |

Sources: <R id="223b829c30beaca2">RAND Corporation</R>, <R id="3e8fecd4ef53888e">Center for Security and Emerging Technology</R>

### Standardized Safety Benchmarks

Emerging evaluation frameworks provide more rigorous assessment of intervention effectiveness:

| Benchmark | Dimensions Measured | Key Finding | Limitations |
|-----------|---------------------|-------------|-------------|
| **HELM Safety** | Refusals of violent, fraudulent, discriminatory content | Standardized behavioral safeguard measurement | Measures refusals, not underlying alignment |
| **TrustLLM** | Truthfulness, safety, fairness, robustness, privacy, machine ethics | Comprehensive 6-dimension framework | Limited to observable behaviors |
| **SafeBench Competition** | Robustness, monitoring, alignment, safety applications | \$150K prize pool; 120+ researchers registered | Competition format may miss systematic gaps |
| **METR Dangerous Capabilities** | Autonomous replication, resource acquisition, deception | Frontier capability evaluation | Narrow focus on specific dangerous capabilities |

The <R id="112221760b143b57">Center for AI Safety SafeBench competition</R> represents an effort to develop empirically rigorous benchmarks for assessing AI safety across robustness, monitoring, alignment, and safety applications. However, behavioral benchmarks fundamentally cannot detect latent deceptive alignment—a model that strategically complies during evaluation would score well on all benchmarks while remaining unsafe.

## AI Control: The Emerging Paradigm

[AI Control](https://arxiv.org/pdf/2312.06942) represents a fundamentally different approach to managing risks from potentially misaligned AI systems. Rather than assuming alignment can be verified, AI Control assumes models may be intentionally subverting safety measures and designs systems to remain safe regardless.

<Mermaid chart={`
flowchart TD
    subgraph ALIGNMENT["Alignment Approach"]
        A1[Train model to be aligned] --> A2[Verify alignment]
        A2 --> A3{Is model aligned?}
        A3 -->|Yes| A4[Deploy safely]
        A3 -->|No| A1
    end

    subgraph CONTROL["AI Control Approach"]
        C1[Assume model may be misaligned] --> C2[Design robust oversight]
        C2 --> C3[Limit model autonomy]
        C3 --> C4[Monitor for subversion]
        C4 --> C5[Deploy with safeguards]
    end

    style A3 fill:#ffcccc
    style C5 fill:#ccffcc
`} />

### Key AI Control Techniques

According to [Buck Shlegeris and Ryan Greenblatt at Redwood Research](https://blog.redwoodresearch.org/p/guide), AI Control employs several strategies:

| Technique | Description | Effectiveness Estimate | Maturity |
|-----------|-------------|------------------------|----------|
| **Always-on auditing** | Continuous monitoring of model outputs for suspicious behavior | 60-80% detection of overt subversion | Research stage |
| **Honeypotting** | Presenting apparent opportunities for subversion to detect intent | 40-70% detection of intentional subversion | Theoretical |
| **Re-sampling** | Querying multiple model instances and comparing outputs | 50-75% detection of inconsistent behavior | Early implementation |
| **Factored cognition** | Breaking tasks into pieces to limit model knowledge | 30-60% reduction in catastrophic action potential | Research stage |
| **Trusted monitoring** | Using a trusted (weaker) model to oversee an untrusted (stronger) model | Varies by capability gap | Early research |

Redwood Research recently released what they describe as "the biggest and most intricate study of AI control to date, in a command line agent setting," demonstrating these techniques are "the best available option for preventing misaligned early AGIs from causing sudden disasters."

## Critical Unaddressed Gaps

### Tier 1: Existential Priority

| Gap | Current Coverage | What's Missing | Required Investment |
|-----|------------------|----------------|-------------------|
| **Deceptive Alignment Detection** | ≈15% effective interventions | Scalable interpretability, behavioral signatures | \$500M+ over 3 years |
| **Scheming Prevention** | ≈10% effective interventions | Formal verification, AI Control deployment | \$300M+ over 5 years |
| **Treacherous Turn Monitoring** | ≈5% effective interventions | Real-time oversight, containment protocols | \$200M+ over 4 years |

### Tier 2: Major Structural Issues

| Gap | Technical Solution Viability | Governance Requirements |
|-----|----------------------------|------------------------|
| **Concentration of Power** | Very Low | International coordination, antitrust |
| **Democratic Lock-in** | None | Constitutional protections, power distribution |
| **Epistemic Collapse** | Low (partial technical fixes) | Media ecosystem reform, authentication infrastructure |

## Evidence Base Assessment

### Intervention Confidence Levels

| Intervention Category | Deployment Evidence | Research Quality | Confidence in Ratings |
|----------------------|-------------------|------------------|---------------------|
| **RLHF/Constitutional AI** | High - Deployed at scale | High - Multiple studies | High (85% confidence) |
| **Capability Evaluations** | Medium - Limited deployment | Medium - Emerging standards | Medium (70% confidence) |
| **Interpretability** | Low - Research stage | Medium - Promising results | Medium (65% confidence) |
| **AI Control** | None - Theoretical only | Low - Early research | Low (40% confidence) |
| **Formal Verification** | None - Toy models only | Very Low - Existence proofs | Very Low (20% confidence) |

### Key Uncertainties in Effectiveness

| Uncertainty | Impact on Ratings | Expert Disagreement Level |
|-------------|-------------------|---------------------------|
| **Interpretability scaling** | ±30% effectiveness | High - 60% vs 20% optimistic |
| **Deceptive alignment prevalence** | ±50% priority ranking | Very High - 80% vs 10% concerned |
| **AI Control feasibility** | ±40% effectiveness | High - theoretical vs practical |
| **Governance implementation** | ±60% structural risk mitigation | Medium - feasibility questions |

Sources: <R id="38eba87d0a888e2e">AI Impacts survey</R>, <R id="d6955ff937bf386d">FHI expert elicitation</R>, <R id="64a253415795c91e">MIRI research updates</R>

### Interpretability Scaling: State of Evidence

<R id="b1d6e7501debf627">Mechanistic interpretability research</R> has made significant progress but faces critical scaling challenges. According to a comprehensive 2024 review:

| Progress Area | Current State | Scaling Challenge | Safety Relevance |
|---------------|---------------|-------------------|------------------|
| **Sparse Autoencoders** | Successfully scaled to Claude 3 Sonnet | Compute-intensive, requires significant automation | High - enables monosemantic feature extraction |
| **Circuit Tracing** | Applied to smaller models | Extension to frontier-scale models remains challenging | Very High - could detect deceptive circuits |
| **Activation Patching** | Well-developed technique | Fine-grained probing requires expert intuition | Medium - helps identify causal mechanisms |
| **Behavioral Intervention** | Can suppress toxicity/bias | May not generalize to novel misalignment patterns | Medium - targeted behavioral corrections |

The key constraint is that mechanistic analysis is "time- and compute-intensive, requiring fine-grained probing, high-resolution instrumentation, and expert intuition." Without major automation breakthroughs, interpretability may not scale to real-world safety applications for frontier models. However, recent advances in circuit tracing now allow researchers to observe Claude's reasoning process, revealing a shared conceptual space where reasoning occurs before language generation.

## Intervention Synergies and Conflicts

### Positive Synergies

| Intervention Pair | Synergy Strength | Mechanism | Evidence |
|------------------|------------------|-----------|----------|
| **Interpretability + Evaluations** | Very High (2x effectiveness) | Interpretability explains eval results | <R id="f771d4f56ad4dbaa">Anthropic research</R> |
| **AI Control + Red-teaming** | High (1.5x effectiveness) | Red-teaming finds control vulnerabilities | <R id="187aaa26886ce183">Theoretical analysis</R> |
| **RLHF + Constitutional AI** | Medium (1.3x effectiveness) | Layered training approaches | <R id="683aef834ac1612a">Constitutional AI paper</R> |
| **Compute Governance + Export Controls** | High (1.7x effectiveness) | Hardware-software restriction combo | <R id="f0d95954b449240a">CSET analysis</R> |

### Negative Interactions

| Intervention Pair | Conflict Type | Severity | Mitigation |
|------------------|---------------|----------|------------|
| **RLHF + Deceptive Alignment** | May train deception | High | Use interpretability monitoring |
| **Capability Evals + Racing** | Accelerates competition | Medium | Coordinate evaluation standards |
| **Open Research + Misuse** | Information hazards | Medium | Responsible disclosure protocols |

## Governance vs Technical Solutions

### Structural Risk Coverage

| Risk Category | Technical Effectiveness | Governance Effectiveness | Why Technical Fails |
|---------------|------------------------|-------------------------|-------------------|
| **Power Concentration** | 0-5% | 60-90% | Technical tools can't redistribute power |
| **Lock-in Prevention** | 0-10% | 70-95% | Technical fixes can't prevent political capture |
| **Democratic Enfeeblement** | 5-15% | 80-95% | Requires institutional design, not algorithms |
| **Epistemic Commons** | 20-40% | 60-85% | System-level problems need system solutions |

### Governance Intervention Maturity

| Intervention | Development Stage | Political Feasibility | Timeline to Implementation |
|--------------|------------------|---------------------|---------------------------|
| **Compute Governance** | Pilot implementations | Medium | 1-3 years |
| **Model Registries** | Design phase | High | 2-4 years |
| **International AI Treaties** | Early discussions | Low | 5-10 years |
| **Liability Frameworks** | Legal analysis | Medium | 3-7 years |
| **Export Controls (expanded)** | Active development | High | 1-2 years |

Sources: <R id="f0d95954b449240a">Georgetown CSET</R>, <R id="b31991b018d04a52">IAPS governance research</R>, <R id="89d77122c55f3155">Brookings AI governance tracker</R>

### Compute Governance: Detailed Assessment

Compute governance has emerged as a key policy lever for AI governance, with the Biden administration introducing export controls on advanced semiconductor manufacturing equipment. However, effectiveness varies significantly:

| Mechanism | Target | Effectiveness | Limitations |
|-----------|--------|---------------|-------------|
| **Chip export controls** | Prevent adversary access to frontier AI | Medium (60-75%) | Black market smuggling; cloud computing loopholes |
| **Training compute thresholds** | Trigger reporting requirements at 10^26 FLOP | Low-Medium | Algorithmic efficiency improvements reduce compute needs |
| **Cloud access restrictions** | Limit API access for sanctioned entities | Low (30-50%) | VPNs, intermediaries, open-source alternatives |
| **Know-your-customer requirements** | Track who uses compute | Medium (50-70%) | Privacy concerns; enforcement challenges |

According to [GovAI research](https://www.governance.ai/analysis/computing-power-and-the-governance-of-ai), "compute governance may become less effective as algorithms and hardware improve. Scientific progress continually decreases the amount of computing power needed to reach any level of AI capability." The [RAND analysis](https://www.rand.org/pubs/perspectives/PEA3776-1.html) of the AI Diffusion Framework notes that China could benefit from a shift away from compute as the binding constraint, as companies like DeepSeek compete to push the frontier less handicapped by export controls.

### International Coordination: Effectiveness Assessment

The <R id="ce43b69bb5fb00b2">ITU Annual AI Governance Report 2025</R> and recent developments reveal significant challenges in international AI governance coordination:

| Coordination Mechanism | Status (2025) | Effectiveness Assessment | Key Limitation |
|------------------------|---------------|--------------------------|----------------|
| **UN High-Level Advisory Body on AI** | Submitted recommendations Sept 2024 | Low-Medium | Relies on voluntary cooperation; fragmented approach |
| **UN Independent Scientific Panel** | Established Dec 2024 | Too early to assess | Limited enforcement power |
| **EU AI Act** | Entered force Aug 2024 | Medium-High (regional) | Jurisdictional limits; enforcement mechanisms untested |
| **Paris AI Action Summit** | Feb 2025 | Low | Called for harmonization but highlighted how far from unified framework |
| **US-China Coordination** | Minimal | Very Low | Fundamental political contradictions; export control tensions |
| **Bletchley/Seoul Summits** | Voluntary commitments | Low-Medium | Non-binding; limited to willing participants |

The <R id="3277a685c8b28fe0">Oxford International Affairs analysis</R> notes that addressing the global AI governance deficit requires moving from a "weak regime complex to the strongest governance system possible under current geopolitical conditions." However, proposals for an "IAEA for AI" face fundamental challenges because "nuclear and AI are not similar policy problems—AI policy is loosely defined with disagreement over field boundaries."

**Critical gap**: Companies plan to scale frontier AI systems 100-1000x in effective compute over the next 3-5 years. Without coordinated international licensing and oversight, countries risk a "regulatory race to the bottom."

## Implementation Roadmap

### Phase 1: Immediate (0-2 years)
- **Redirect 20% of RLHF funding** to interpretability research
- **Establish AI Control research programs** at major labs
- **Implement capability evaluation standards** across industry
- **Strengthen export controls** on AI hardware

### Phase 2: Medium-term (2-5 years)
- **Deploy interpretability tools** for deception detection
- **Pilot AI Control systems** in controlled environments
- **Establish international coordination** mechanisms
- **Develop formal verification** for critical systems

### Phase 3: Long-term (5+ years)
- **Scale proven interventions** to frontier models
- **Implement comprehensive governance** frameworks
- **Address structural risks** through institutional reform
- **Monitor intervention effectiveness** and adapt

## Current State & Trajectory

### Capability Evaluation Ecosystem (2024-2025)

The capability evaluation landscape has matured significantly with multiple organizations now conducting systematic pre-deployment assessments:

| Organization | Role | Key Contributions | Partnerships |
|--------------|------|-------------------|--------------|
| **METR** | Independent evaluator | ARA (autonomous replication & adaptation) methodology; GPT-4, Claude evaluations | UK AISI, Anthropic, OpenAI |
| **Apollo Research** | Scheming detection | Safety cases framework; deception detection | UK AISI, Redwood, UC Berkeley |
| **UK AISI** | Government evaluator | First government-led comprehensive model evaluations | METR, Apollo, major labs |
| **US AISI (NIST)** | Standards development | AI RMF, evaluation guidelines | Industry consortium |
| **Model developers** | Internal evaluation | Pre-deployment testing per RSPs | Third-party auditors |

According to [METR's December 2025 analysis](https://metr.org/blog/2025-12-09-common-elements-of-frontier-ai-safety-policies/), twelve companies have now published frontier AI safety policies, including commitments to capability evaluations for dangerous capabilities before deployment. A 2023 survey found 98% of AI researchers "somewhat or strongly agreed" that labs should conduct pre-deployment risk assessments and dangerous capabilities evaluations.

### Funding Landscape (2024)

| Intervention Type | Annual Funding | Growth Rate | Major Funders |
|------------------|----------------|-------------|---------------|
| **RLHF/Alignment Training** | \$100M+ | 50%/year | <R id="04d39e8bd5d50dd5">OpenAI</R>, <R id="afe2508ac4caf5ee">Anthropic</R>, <R id="0ef9b0fe0f3c92b4">Google DeepMind</R> |
| **Capability Evaluations** | \$150M+ | 80%/year | <R id="817964dfbb0e3b1b">UK AISI</R>, <R id="45370a5153534152">METR</R>, industry labs |
| **Interpretability** | \$100M+ | 60%/year | <R id="f771d4f56ad4dbaa">Anthropic</R>, academic institutions |
| **AI Control** | \$10M+ | 200%/year | <R id="42e7247cbc33fc4c">Redwood Research</R>, academic groups |
| **Governance Research** | \$10M+ | 40%/year | <R id="f35c467b353f990f">GovAI</R>, <R id="f0d95954b449240a">CSET</R> |

### Industry Deployment Status

| Intervention | OpenAI | Anthropic | Google | Meta | Assessment |
|--------------|--------|-----------|--------|------|------------|
| **RLHF** | ✓ Deployed | ✓ Deployed | ✓ Deployed | ✓ Deployed | Standard practice |
| **Constitutional AI** | Partial | ✓ Deployed | Developing | Developing | Emerging standard |
| **Red-teaming** | ✓ Deployed | ✓ Deployed | ✓ Deployed | ✓ Deployed | Universal adoption |
| **Interpretability** | Research | ✓ Active | Research | Limited | Mixed implementation |
| **AI Control** | None | Research | None | None | Early research only |

## Key Cruxes and Expert Disagreements

### High-Confidence Disagreements

| Question | Optimistic View | Pessimistic View | Evidence Quality |
|----------|----------------|------------------|------------------|
| **Will interpretability scale?** | 70% chance of success | 30% chance of success | Medium - early results promising |
| **Is deceptive alignment likely?** | 20% probability | 80% probability | Low - limited empirical data |
| **Can governance keep pace?** | Institutions will adapt | Regulatory capture inevitable | Medium - historical precedent |
| **Are current methods sufficient?** | Incremental progress works | Need paradigm shift | Medium - deployment experience |

### Critical Research Questions

<KeyQuestions
  questions={[
    "Will mechanistic interpretability scale to GPT-4+ sized models?",
    "Can AI Control work against genuinely superintelligent systems?",
    "Are current safety approaches creating a false sense of security?",
    "Which governance interventions are politically feasible before catastrophe?",
    "How do we balance transparency with competitive/security concerns?"
  ]}
/>

### Methodological Limitations

| Limitation | Impact on Analysis | Mitigation Strategy |
|------------|-------------------|-------------------|
| **Sparse empirical data** | Effectiveness estimates uncertain | Expert elicitation, sensitivity analysis |
| **Rapid capability growth** | Intervention relevance changing | Regular reassessment, adaptive frameworks |
| **Novel risk categories** | Matrix may miss emerging threats | Horizon scanning, red-team exercises |
| **Deployment context dependence** | Lab results may not generalize | Real-world pilots, diverse testing |

## Sources & Resources

### Meta-Analyses and Comprehensive Reports

| Report | Authors/Organization | Key Contribution | Date |
|--------|---------------------|------------------|------|
| <R id="b163447fdc804872">International AI Safety Report 2025</R> | 96 experts from 30 countries | Comprehensive assessment that no current method reliably prevents unsafe outputs | 2025 |
| <R id="97185b28d68545b4">AI Safety Index</R> | Future of Life Institute | Quarterly tracking of AI safety progress across multiple dimensions | 2025 |
| <R id="cc9309e5c6d52322">2025 Peregrine Report</R> | 208 expert proposals | In-depth interviews with major AI lab staff on risk mitigation | 2025 |
| <R id="b1d6e7501debf627">Mechanistic Interpretability Review</R> | Bereska & Gavves | Comprehensive review of interpretability for AI safety | 2024 |
| <R id="ce43b69bb5fb00b2">ITU AI Governance Report</R> | International Telecommunication Union | Global state of AI governance | 2025 |

### Primary Research Sources

| Category | Source | Key Contribution | Quality |
|----------|--------|------------------|---------|
| **Technical Safety** | <R id="683aef834ac1612a">Anthropic Constitutional AI</R> | CAI effectiveness data | High |
| **Technical Safety** | <R id="1098fc60be7ca2b0">OpenAI InstructGPT</R> | RLHF deployment evidence | High |
| **Interpretability** | <R id="e724db341d6e0065">Anthropic Scaling Monosemanticity</R> | Interpretability scaling results | High |
| **AI Control** | <R id="187aaa26886ce183">Greenblatt et al. AI Control</R> | Control theory framework | Medium |
| **Evaluations** | <R id="dfeaf87817e20677">METR Dangerous Capabilities</R> | Evaluation methodology | Medium-High |
| **Alignment Faking** | <R id="86fb9322ee1b6a7d">Hubinger et al. 2024</R> | Empirical evidence on backdoor persistence | High |

### Policy and Governance Sources

| Organization | Resource | Focus Area | Reliability |
|--------------|----------|------------|-------------|
| **CSET** | <R id="3e8fecd4ef53888e">AI Governance Database</R> | Policy landscape mapping | High |
| **GovAI** | <R id="571cb6299c6d27cf">Governance research</R> | Institutional analysis | High |
| **RAND Corporation** | <R id="223b829c30beaca2">AI Risk Assessment</R> | Military/security applications | High |
| **UK AISI** | <R id="817964dfbb0e3b1b">Testing reports</R> | Government evaluation practice | Medium-High |
| **US AISI** | <R id="85ee8e554a07476b">Guidelines and standards</R> | Federal AI policy | Medium-High |

### Industry and Lab Resources

| Organization | Resource Type | Key Insights | Access |
|--------------|---------------|--------------|--------|
| **OpenAI** | <R id="838d7a59a02e11a7">Safety research</R> | RLHF deployment data | Public |
| **Anthropic** | <R id="f771d4f56ad4dbaa">Research publications</R> | Constitutional AI, interpretability | Public |
| **DeepMind** | <R id="d451b68232884e88">Safety research</R> | Technical safety approaches | Public |
| **Redwood Research** | <R id="42e7247cbc33fc4c">AI Control research</R> | Control methodology development | Public |
| **METR** | <R id="45370a5153534152">Evaluation frameworks</R> | Capability assessment tools | Partial |

### Expert Survey Data

| Survey | Sample Size | Key Findings | Confidence |
|--------|-------------|--------------|------------|
| **AI Impacts 2022** | 738 experts | Timeline estimates, risk assessments | Medium |
| **FHI Expert Survey** | 352 experts | Existential risk probabilities | Medium |
| **State of AI Report** | Industry data | Deployment and capability trends | High |
| **Anthropic Expert Interviews** | 45 researchers | Technical intervention effectiveness | Medium-High |

### Additional Sources

| Source | URL | Key Contribution |
|--------|-----|------------------|
| Coefficient Giving 2024 Progress | [openphilanthropy.org](https://www.openphilanthropy.org/research/our-progress-in-2024-and-plans-for-2025/) | Funding landscape, priorities |
| AI Safety Funding Analysis | [EA Forum](https://forum.effectivealtruism.org/posts/XdhwXppfqrpPL2YDX/an-overview-of-the-ai-safety-funding-situation) | Comprehensive funding breakdown |
| 80,000 Hours AI Safety | [80000hours.org](https://80000hours.org/2025/01/it-looks-like-there-are-some-good-funding-opportunities-in-ai-safety-right-now/) | Funding opportunities assessment |
| RLHF Alignment Tax | [ACL Anthology](https://aclanthology.org/2024.emnlp-main.35/) | Empirical alignment tax research |
| Safe RLHF | [ICLR 2024](https://proceedings.iclr.cc/paper_files/paper/2024/file/dd1577afd396928ed64216f3f1fd5556-Paper-Conference.pdf) | Helpfulness/harmlessness balance |
| AI Control Paper | [arXiv](https://arxiv.org/pdf/2312.06942) | Foundational AI Control research |
| Redwood Research Blog | [redwoodresearch.substack.com](https://redwoodresearch.substack.com/) | AI Control developments |
| METR Safety Policies | [metr.org](https://metr.org/blog/2025-12-09-common-elements-of-frontier-ai-safety-policies/) | Industry policy analysis |
| GovAI Compute Governance | [governance.ai](https://www.governance.ai/analysis/computing-power-and-the-governance-of-ai) | Compute governance analysis |
| RAND AI Diffusion | [rand.org](https://www.rand.org/pubs/perspectives/PEA3776-1.html) | Export control effectiveness |

## Related Models and Pages

### Technical Risk Models
- <EntityLink id="E94" label="Deceptive Alignment Decomposition" /> - Detailed analysis of key gap
- <EntityLink id="E99" /> - How interventions layer
- <EntityLink id="E53" /> - When interventions become insufficient

### Governance and Strategy
- <EntityLink id="E12" /> - Risk portfolio construction
- <EntityLink id="E51" label="Capabilities to Safety Pipeline" /> - Research translation challenges
- <EntityLink id="E398" /> - Key unknowns affecting prioritization

### Implementation Resources
- <EntityLink id="E252">Responsible Scaling Policies</EntityLink> - Industry implementation
- Safety Research Organizations - Key players and capacity
- <EntityLink id="E447">Evaluation Frameworks</EntityLink> - Assessment methodologies