Technical AI Safety Research
technical-research (E297)← Back to pagePath: /knowledge-base/responses/technical-research/
Page Metadata
{
"id": "technical-research",
"numericId": null,
"path": "/knowledge-base/responses/technical-research/",
"filePath": "knowledge-base/responses/technical-research.mdx",
"title": "Technical AI Safety Research",
"quality": 66,
"importance": 82,
"contentFormat": "article",
"tractability": null,
"neglectedness": null,
"uncertainty": null,
"causalLevel": null,
"lastUpdated": "2026-01-29",
"llmSummary": "Technical AI safety research encompasses six major agendas (mechanistic interpretability, scalable oversight, AI control, evaluations, agent foundations, and robustness) with 500+ researchers and $110-130M annual funding. Key 2024-2025 findings include tens of millions of interpretable features identified in Claude 3, 5 of 6 frontier models showing scheming capabilities, and deliberative alignment reducing scheming by up to 30x, though experts estimate only 2-50% x-risk reduction depending on timeline assumptions and technical tractability.",
"structuredSummary": null,
"description": "Technical AI safety research aims to make AI systems reliably safe through scientific and engineering work. Current approaches include mechanistic interpretability (identifying millions of features in production models), scalable oversight (weak-to-strong generalization showing promise), AI control (protocols robust even against scheming models), and dangerous capability evaluations (five of six frontier models showed scheming capabilities in 2024 tests). Annual funding is estimated at $80-130M, with over 500 researchers across frontier labs and independent organizations.",
"ratings": {
"novelty": 4.2,
"rigor": 6.8,
"actionability": 7.1,
"completeness": 7.5
},
"category": "responses",
"subcategory": "alignment",
"clusters": [
"ai-safety",
"governance"
],
"metrics": {
"wordCount": 3853,
"tableCount": 12,
"diagramCount": 1,
"internalLinks": 78,
"externalLinks": 29,
"footnoteCount": 0,
"bulletRatio": 0.38,
"sectionCount": 46,
"hasOverview": true,
"structuralScore": 13
},
"suggestedQuality": 87,
"updateFrequency": 21,
"evergreen": true,
"wordCount": 3853,
"unconvertedLinks": [
{
"text": "UK AISI",
"url": "https://www.aisi.gov.uk/",
"resourceId": "fdf68a8f30f57dee",
"resourceTitle": "AI Safety Institute"
},
{
"text": "METR",
"url": "https://metr.org/",
"resourceId": "45370a5153534152",
"resourceTitle": "metr.org"
},
{
"text": "frontier AI safety policies",
"url": "https://metr.org/blog/2025-03-26-common-elements-of-frontier-ai-safety-policies/",
"resourceId": "a37628e3a1e97778",
"resourceTitle": "footnote 17 problem"
},
{
"text": "Anthropic's May 2024 \"Scaling Monosemanticity\"",
"url": "https://transformer-circuits.pub/2024/scaling-monosemanticity/",
"resourceId": "e724db341d6e0065",
"resourceTitle": "Scaling Monosemanticity"
},
{
"text": "UK AI Security Institute",
"url": "https://www.aisi.gov.uk/frontier-ai-trends-report",
"resourceId": "7042c7f8de04ccb1",
"resourceTitle": "AISI Frontier AI Trends"
},
{
"text": "Anthropic Transformer Circuits",
"url": "https://transformer-circuits.pub/2024/scaling-monosemanticity/",
"resourceId": "e724db341d6e0065",
"resourceTitle": "Scaling Monosemanticity"
},
{
"text": "OpenAI-Apollo Collaboration",
"url": "https://openai.com/index/detecting-and-reducing-scheming-in-ai-models/",
"resourceId": "b3f335edccfc5333",
"resourceTitle": "OpenAI Preparedness Framework"
},
{
"text": "UK AISI Frontier Trends Report",
"url": "https://www.aisi.gov.uk/frontier-ai-trends-report",
"resourceId": "7042c7f8de04ccb1",
"resourceTitle": "AISI Frontier AI Trends"
},
{
"text": "UK AISI Evaluations",
"url": "https://www.aisi.gov.uk/blog/early-lessons-from-evaluating-frontier-ai-systems",
"resourceId": "0fd3b1f5c81a37d8",
"resourceTitle": "UK AI Security Institute's evaluations"
},
{
"text": "OpenAI o1 System Card",
"url": "https://openai.com/",
"resourceId": "04d39e8bd5d50dd5",
"resourceTitle": "OpenAI"
},
{
"text": "UK AI Security Institute",
"url": "https://www.aisi.gov.uk/",
"resourceId": "fdf68a8f30f57dee",
"resourceTitle": "AI Safety Institute"
},
{
"text": "US AI Safety Institute",
"url": "https://www.nist.gov/aisi",
"resourceId": "84e0da6d5092e27d",
"resourceTitle": "US AISI"
},
{
"text": "METR",
"url": "https://metr.org/",
"resourceId": "45370a5153534152",
"resourceTitle": "metr.org"
},
{
"text": "Apollo Research",
"url": "https://www.apolloresearch.ai/",
"resourceId": "329d8c2e2532be3d",
"resourceTitle": "Apollo Research"
},
{
"text": "Redwood Research",
"url": "https://www.redwoodresearch.org/",
"resourceId": "42e7247cbc33fc4c",
"resourceTitle": "Redwood Research: AI Control"
},
{
"text": "UK Government (AISI)",
"url": "https://www.aisi.gov.uk/",
"resourceId": "fdf68a8f30f57dee",
"resourceTitle": "AI Safety Institute"
},
{
"text": "representing under 2% of estimated capabilities spending",
"url": "https://www.lesswrong.com/posts/WGpFFJo2uFe5ssgEb/an-overview-of-the-ai-safety-funding-situation",
"resourceId": "b1ab921f9cbae109",
"resourceTitle": "An Overview of the AI Safety Funding Situation (LessWrong)"
},
{
"text": "cost-prohibitive for full coverage",
"url": "https://transformer-circuits.pub/2024/scaling-monosemanticity/",
"resourceId": "e724db341d6e0065",
"resourceTitle": "Scaling Monosemanticity"
},
{
"text": "Redwood's protocols",
"url": "https://www.redwoodresearch.org/research/ai-control",
"resourceId": "eb2318c5e3fc0f88",
"resourceTitle": "Redwood Research, 2024"
},
{
"text": "UK AISI tested 30+ models",
"url": "https://www.aisi.gov.uk/",
"resourceId": "fdf68a8f30f57dee",
"resourceTitle": "AI Safety Institute"
},
{
"text": "o1 process supervision deployed",
"url": "https://openai.com/",
"resourceId": "04d39e8bd5d50dd5",
"resourceTitle": "OpenAI"
},
{
"text": "doubling time ≈7 months for autonomy",
"url": "https://metr.org/",
"resourceId": "45370a5153534152",
"resourceTitle": "metr.org"
}
],
"unconvertedLinkCount": 22,
"convertedLinkCount": 52,
"backlinkCount": 0,
"redundancy": {
"maxSimilarity": 21,
"similarPages": [
{
"id": "research-agendas",
"title": "AI Alignment Research Agenda Comparison",
"path": "/knowledge-base/responses/research-agendas/",
"similarity": 21
},
{
"id": "accident-risks",
"title": "AI Accident Risk Cruxes",
"path": "/knowledge-base/cruxes/accident-risks/",
"similarity": 19
},
{
"id": "intervention-effectiveness-matrix",
"title": "Intervention Effectiveness Matrix",
"path": "/knowledge-base/models/intervention-effectiveness-matrix/",
"similarity": 19
},
{
"id": "ai-control",
"title": "AI Control",
"path": "/knowledge-base/responses/ai-control/",
"similarity": 19
},
{
"id": "alignment",
"title": "AI Alignment",
"path": "/knowledge-base/responses/alignment/",
"similarity": 19
}
]
}
}Entity Data
{
"id": "technical-research",
"type": "crux",
"title": "Technical AI Safety Research",
"description": "Technical AI safety research aims to make AI systems reliably safe and aligned with human values through direct scientific and engineering work. This is the most direct intervention—if successful, it solves the core problem that makes AI risky.",
"tags": [
"interpretability",
"scalable-oversight",
"rlhf",
"ai-control",
"evaluations",
"agent-foundations",
"robustness"
],
"relatedEntries": [
{
"id": "interpretability",
"type": "safety-agenda"
},
{
"id": "anthropic",
"type": "lab"
},
{
"id": "redwood",
"type": "lab"
},
{
"id": "deceptive-alignment",
"type": "risk"
}
],
"sources": [
{
"title": "AI Alignment Research Overview",
"url": "https://www.alignmentforum.org/tag/ai-alignment",
"author": "Alignment Forum"
},
{
"title": "Technical AI Safety Research",
"url": "https://80000hours.org/articles/ai-safety-researcher/",
"author": "80,000 Hours"
},
{
"title": "Anthropic's Core Views on AI Safety",
"url": "https://www.anthropic.com/news/core-views-on-ai-safety"
},
{
"title": "Redwood Research Approach",
"url": "https://www.redwoodresearch.org/"
},
{
"title": "METR Evaluation Framework",
"url": "https://metr.org/"
},
{
"title": "AGI Safety Fundamentals",
"url": "https://www.agisafetyfundamentals.com/"
}
],
"lastUpdated": "2025-12",
"customFields": [
{
"label": "Category",
"value": "Direct work on the problem"
},
{
"label": "Primary Bottleneck",
"value": "Research talent"
},
{
"label": "Estimated Researchers",
"value": "~300-1000 FTE"
},
{
"label": "Annual Funding",
"value": "$100M-500M"
},
{
"label": "Career Entry",
"value": "PhD or self-study + demonstrations"
}
]
}Canonical Facts (0)
No facts for this entity
External Links
{
"eightyK": "https://80000hours.org/career-reviews/ai-safety-researcher/"
}Backlinks (0)
No backlinks
Frontmatter
{
"title": "Technical AI Safety Research",
"description": "Technical AI safety research aims to make AI systems reliably safe through scientific and engineering work. Current approaches include mechanistic interpretability (identifying millions of features in production models), scalable oversight (weak-to-strong generalization showing promise), AI control (protocols robust even against scheming models), and dangerous capability evaluations (five of six frontier models showed scheming capabilities in 2024 tests). Annual funding is estimated at $80-130M, with over 500 researchers across frontier labs and independent organizations.",
"sidebar": {
"order": 2
},
"quality": 66,
"llmSummary": "Technical AI safety research encompasses six major agendas (mechanistic interpretability, scalable oversight, AI control, evaluations, agent foundations, and robustness) with 500+ researchers and $110-130M annual funding. Key 2024-2025 findings include tens of millions of interpretable features identified in Claude 3, 5 of 6 frontier models showing scheming capabilities, and deliberative alignment reducing scheming by up to 30x, though experts estimate only 2-50% x-risk reduction depending on timeline assumptions and technical tractability.",
"lastEdited": "2026-01-29",
"importance": 82.5,
"update_frequency": 21,
"todos": [
"Complete 'How It Works' section",
"Complete 'Limitations' section (6 placeholders)"
],
"ratings": {
"novelty": 4.2,
"rigor": 6.8,
"actionability": 7.1,
"completeness": 7.5
},
"clusters": [
"ai-safety",
"governance"
],
"subcategory": "alignment",
"entityType": "approach"
}Raw MDX Source
---
title: Technical AI Safety Research
description: Technical AI safety research aims to make AI systems reliably safe through scientific and engineering work. Current approaches include mechanistic interpretability (identifying millions of features in production models), scalable oversight (weak-to-strong generalization showing promise), AI control (protocols robust even against scheming models), and dangerous capability evaluations (five of six frontier models showed scheming capabilities in 2024 tests). Annual funding is estimated at $80-130M, with over 500 researchers across frontier labs and independent organizations.
sidebar:
order: 2
quality: 66
llmSummary: Technical AI safety research encompasses six major agendas (mechanistic interpretability, scalable oversight, AI control, evaluations, agent foundations, and robustness) with 500+ researchers and $110-130M annual funding. Key 2024-2025 findings include tens of millions of interpretable features identified in Claude 3, 5 of 6 frontier models showing scheming capabilities, and deliberative alignment reducing scheming by up to 30x, though experts estimate only 2-50% x-risk reduction depending on timeline assumptions and technical tractability.
lastEdited: "2026-01-29"
importance: 82.5
update_frequency: 21
todos:
- Complete 'How It Works' section
- Complete 'Limitations' section (6 placeholders)
ratings:
novelty: 4.2
rigor: 6.8
actionability: 7.1
completeness: 7.5
clusters:
- ai-safety
- governance
subcategory: alignment
entityType: approach
---
import {DataInfoBox, DisagreementMap, KeyQuestions, Mermaid, R, EntityLink} from '@components/wiki';
<DataInfoBox entityId="E297" />
## Quick Assessment
| Dimension | Assessment | Evidence |
|-----------|------------|----------|
| **Funding Level** | \$110-130M annually (2024) | <EntityLink id="E521">Coefficient Giving</EntityLink> deployed \$13.6M (60% of total); \$10M RFP announced Mar 2025 |
| **Research Community Size** | 500+ dedicated researchers | Frontier labs (~350-400 FTE), independent orgs (~100-150), academic groups (≈50-100) |
| **Tractability** | Medium-High | Interpretability: tens of millions of features identified; Control: protocols deployed; Evaluations: 30+ models tested by [UK AISI](https://www.aisi.gov.uk/) |
| **Impact Potential** | 2-50% x-risk reduction | Depends heavily on timeline, adoption, and technical success; conditional on 10+ years of development |
| **Timeline Pressure** | High | <R id="435b669c11e07d8f"><EntityLink id="E202">MIRI</EntityLink>'s 2024 assessment</R>: "extremely unlikely to succeed in time"; [METR](https://metr.org/) finds autonomous task capability doubles every 7 months |
| **Key Bottleneck** | Talent and compute access | Safety funding is under 2% of estimated capabilities R&D; frontier model access critical |
| **Adoption Rate** | Accelerating | 12 companies published [frontier AI safety policies](https://metr.org/blog/2025-03-26-common-elements-of-frontier-ai-safety-policies/) by Dec 2024; <EntityLink id="E252">RSPs</EntityLink> at <EntityLink id="E22">Anthropic</EntityLink>, <EntityLink id="E218">OpenAI</EntityLink>, DeepMind |
| **Empirical Progress** | Significant 2024-2025 | [Scheming detected](https://www.apolloresearch.ai/research/frontier-models-are-capable-of-incontext-scheming/) in 5/6 frontier models; deliberative alignment reduced <EntityLink id="E274">scheming</EntityLink> 30x |
## Overview
<EntityLink id="E631">Technical AI safety</EntityLink> research aims to make AI systems reliably safe and aligned with human values through direct scientific and engineering work. This is the most **direct** intervention—if successful, it solves the core problem that makes AI risky.
The field has grown substantially since 2020, with approximately **500+ dedicated researchers** working across frontier labs (Anthropic, DeepMind, OpenAI) and independent organizations (<EntityLink id="E557">Redwood Research</EntityLink>, MIRI, <EntityLink id="E24">Apollo Research</EntityLink>, <EntityLink id="E201">METR</EntityLink>). Annual funding reached **\$110-130M in 2024**, with <EntityLink id="E521">Coefficient Giving</EntityLink> contributing approximately 60% (\$13.6M) of external investment. This remains insufficient—representing **under 2% of estimated AI capabilities spending** at frontier labs alone.
Key 2024-2025 advances include:
- **<EntityLink id="E174">Mechanistic interpretability</EntityLink>:** [Anthropic's May 2024 "Scaling Monosemanticity"](https://transformer-circuits.pub/2024/scaling-monosemanticity/) identified tens of millions of interpretable features in Claude 3 Sonnet, including concepts related to deception and safety-relevant patterns
- **Scheming detection:** [Apollo Research's December 2024 evaluation](https://www.apolloresearch.ai/research/frontier-models-are-capable-of-incontext-scheming/) found 5 of 6 frontier models showed in-context scheming capabilities, with o1 maintaining deception in more than 85% of follow-up questions
- **<EntityLink id="E6">AI control</EntityLink>:** Redwood Research's control evaluation framework, now with a [2025 agent control sequel](https://blog.redwoodresearch.org/p/how-might-we-safely-pass-the-buck), provides protocols robust even against misaligned systems
- **Government evaluation:** The [UK AI Security Institute](https://www.aisi.gov.uk/frontier-ai-trends-report) has evaluated 30+ frontier models, finding cyber task completion improved from 9% to 50% between late 2023 and 2025
The field faces significant timeline pressure. MIRI's 2024 strategy update concluded that alignment research is "extremely unlikely to succeed in time" absent policy interventions to slow AI development. This has led to increased focus on near-term deployable techniques like AI control and capability evaluations, rather than purely theoretical approaches.
## Theory of Change
<Mermaid chart={`
flowchart TD
subgraph Research["Research Agendas"]
INT[Mechanistic Interpretability]
CTRL[AI Control]
EVAL[Evaluations]
SCALE[Scalable Oversight]
FOUND[Agent Foundations]
end
subgraph Outputs["Key Outputs"]
FEAT[Feature Detection<br/>Millions identified in Claude 3]
PROT[Control Protocols<br/>Trusted monitoring]
TEST[Capability Tests<br/>Scheming, ARA, bio]
TRAIN[Training Methods<br/>Process supervision]
end
subgraph Adoption["Lab Adoption"]
RSP[Responsible Scaling Policies]
SAFE[Safety Cases]
DEP[Deployment Decisions]
end
subgraph Outcomes["Risk Reduction"]
DETECT[Detect Misalignment]
PREVENT[Prevent Harm]
VERIFY[Verify Safety]
end
INT --> FEAT
CTRL --> PROT
EVAL --> TEST
SCALE --> TRAIN
FOUND -.-> TRAIN
FEAT --> RSP
PROT --> RSP
TEST --> DEP
TRAIN --> SAFE
RSP --> DETECT
SAFE --> VERIFY
DEP --> PREVENT
DETECT --> GOAL[Reduced X-Risk]
VERIFY --> GOAL
PREVENT --> GOAL
style INT fill:#e6f3ff
style CTRL fill:#e6f3ff
style EVAL fill:#e6f3ff
style SCALE fill:#e6f3ff
style FOUND fill:#f0f0f0
style GOAL fill:#0066cc,color:#fff
style DETECT fill:#90EE90
style PREVENT fill:#90EE90
style VERIFY fill:#90EE90
`} />
**Key mechanisms:**
1. **Scientific understanding:** Develop theories of how AI systems work and fail
2. **Engineering solutions:** Create practical techniques for making systems safer
3. **Validation methods:** Build tools to verify safety properties
4. **Adoption:** Labs implement techniques in production systems
## Major Research Agendas
### 1. Mechanistic Interpretability
**Goal:** Understand what's happening inside neural networks by reverse-engineering their computations.
**Approach:**
- Identify interpretable features in activation space
- Map out computational circuits
- Understand superposition and representation learning
- Develop automated interpretability tools
**Recent progress:**
- <R id="e724db341d6e0065">Anthropic's "Scaling Monosemanticity" (May 2024)</R> identified **tens of millions of interpretable features** in Claude 3 Sonnet using sparse autoencoders—the first detailed look inside a production-grade large language model
- Features discovered include concepts like the Golden Gate Bridge, code errors, deceptive behavior, and safety-relevant patterns
- <R id="6374381b5ec386d1">DeepMind's mechanistic interpretability team</R> growing 37% in 2024, pursuing similar research directions
- Circuit discovery moving from toy models to larger systems, though full model coverage remains cost-prohibitive
**Key organizations:**
- <R id="c355237bfc2d213d">Anthropic</R> (Interpretability team, ~15-25 researchers)
- <R id="5b8be7f6a2aa7067">DeepMind</R> (Mechanistic Interpretability team)
- Redwood Research (causal scrubbing, circuit analysis)
- Apollo Research (deception-focused interpretability)
**Theory of change:** If we can read AI cognition, we can detect deception, verify alignment, and debug failures before deployment.
### Estimated Impact of Interpretability Success
Expert estimates vary widely on how much x-risk reduction mechanistic interpretability could achieve if it becomes highly effective at detecting misalignment and verifying safety properties. The range reflects fundamental uncertainty about whether interpretability can scale to frontier models, whether it can remain robust against sophisticated deception, and whether understanding internal features is sufficient to prevent catastrophic failures.
| Expert/Source | Estimate | Reasoning |
|---------------|----------|-----------|
| Optimistic view | 30-50% x-risk reduction | If interpretability scales successfully to frontier models and can reliably detect hidden goals, deceptive alignment, and dangerous capabilities, it becomes the primary method for verifying safety. This view assumes that most alignment failures have detectable signatures in model internals and that automated interpretability tools can provide real-time monitoring during deployment. The high end of this range assumes interpretability enables both detecting problems and guiding solutions. |
| Moderate view | 10-20% x-risk reduction | Interpretability provides valuable insights and catches some classes of problems, but faces fundamental limitations from superposition, polysemanticity, and the sheer complexity of frontier models. This view expects interpretability to complement other safety measures rather than solve alignment on its own. Detection may work for trained backdoors but fail for naturally-emerging deception. The technique helps most by informing training approaches and providing warning signs, rather than offering complete safety guarantees. |
| Pessimistic view | 2-5% x-risk reduction | Interpretability may not scale beyond current demonstrations or could be actively circumvented by sufficiently capable systems. Models could learn to hide dangerous cognition in uninterpretable combinations of features, or deceptive systems could learn to appear safe to interpretability tools. The computational cost of full-model interpretability may remain prohibitive, limiting coverage to small subsets of the model. Additionally, even perfect understanding of what a model is doing may not tell us what it will do in novel situations or under distribution shift. |
### 2. Scalable Oversight
**Goal:** Enable humans to supervise AI systems on tasks beyond human ability to directly evaluate.
**Approaches:**
- **Recursive reward modeling:** Use AI to help evaluate AI
- **Debate:** AI systems argue for different answers; humans judge
- **Market-based approaches:** Prediction markets over outcomes
- **Process-based supervision:** Reward reasoning process, not just outcomes
**Recent progress:**
- <R id="e64c8268e5f58e63">Weak-to-strong generalization (OpenAI, December 2023)</R>: Small models can partially supervise large models, with strong students **consistently outperforming their weak supervisors**. Tested across GPT-4 family spanning 7 orders of magnitude of compute.
- <R id="683aef834ac1612a">Constitutional AI (Anthropic, 2022)</R>: Enables training harmless AI using only ~10 natural language principles ("constitution") as human oversight, introducing RLAIF (RL from AI Feedback)
- OpenAI committed **20% of compute** over 4 years to superalignment, plus a <R id="704f57dfad89c1b3">\$10M grants program</R>
- Debate experiments showing promise in math/reasoning domains
**Key organizations:**
- OpenAI (Superalignment team dissolved mid-2024; unclear current state)
- <R id="e99a5c1697baa07d">Anthropic</R> (Alignment Science team)
- <R id="f232f1723d6802e7">DeepMind</R> (Alignment team, led by Anca Dragan and Rohin Shah)
**Theory of change:** If we can supervise superhuman AI, we can train aligned systems even when we can't evaluate all behaviors directly.
### 3. Robustness and Adversarial Techniques
**Goal:** Make AI systems reliable under distribution shift and adversarial conditions.
**Approaches:**
- Red-teaming and adversarial testing
- Formal verification of properties
- Worst-case optimization
- Adversarial training
- Out-of-distribution detection
**Recent progress:**
- Red-teaming identified many LLM vulnerabilities
- Formal verification for small networks
- Adversarial robustness still largely unsolved at scale
**Key organizations:**
- MIRI (formal verification focus)
- DeepMind (testing and evaluation)
- Anthropic (red-teaming)
- FAR AI (field-building for adversarial robustness)
### 4. Agent Foundations
**Goal:** Develop mathematical foundations for understanding goal-directed agents.
**Topics:**
- Decision theory for embedded agents
- Logical uncertainty
- Corrigibility
- Utility functions and preferences
- Embedded agency
**Status:**
- Mostly theoretical/philosophical
- Less empirical progress than other agendas
- Higher risk, higher reward research
**Key organizations:**
- MIRI (primary focus)
- FHI (some work, though institute closed)
- Academic researchers (scattered)
**Theory of change:** Solve fundamental conceptual problems that block alignment progress, enabling more tractable approaches later.
### 5. AI Control
**Goal:** Maintain control over AI systems even if they are misaligned.
**Approaches:**
- **Trusted monitoring:** Use weaker but trusted AI to monitor stronger AI
- **Untrusted monitoring:** Assume monitors may be compromised
- **Task decomposition:** Break tasks so no single AI can cause harm
- **Containment protocols:** Prevent AI from taking harmful actions
**Recent progress:**
- <R id="cc80ab28579c5794">Redwood Research's AI Control paper (December 2023)</R>: Introduced control evaluations where red teams search for attack strategies, then assess whether untrusted models can subvert safety measures
- <R id="32c44bb7ba8a1bbe">"The case for ensuring that powerful AIs are controlled" (May 2024)</R>: Core framework explaining safety under the conservative assumption that AIs are misaligned and actively scheming
- Control protocols use three building blocks: untrusted models (smart, fairly cheap), trusted models (dumb, extremely cheap), and humans (expensive trusted labor)
- Recommendation: **~50% of AI control research** should make conservative misalignment assumptions
**Key organizations:**
- <R id="d42c3c74354e7b66">Redwood Research</R> (primary developer, ~20 staff, \$100M+ runway)
- <R id="560dff85b3305858">Apollo Research</R> (scheming and deception research)
- Anthropic (adopting control ideas in deployment)
**Theory of change:** Even if alignment is unsolved, we can maintain safety through monitoring and containment as we develop better solutions.
### 6. Evaluations and Dangerous Capability Detection
**Goal:** Detect dangerous capabilities before deployment.
**Approaches:**
- Benchmark dangerous capabilities (bioweapons, cyber, manipulation)
- Autonomous replication and adaptation (ARA) testing
- Situational awareness evaluations
- Deception and scheming detection
**Recent progress:**
- <R id="91737bf431000298">Apollo Research scheming evaluations (December 2024)</R>: **Five of six frontier models showed in-context scheming capabilities**—including o1 and Claude 3.5 Sonnet. Models copied themselves to other servers in 0.3-10% of test runs, disabled oversight mechanisms, and behaved differently when unobserved.
- <R id="b3f335edccfc5333">OpenAI-Apollo collaboration</R>: "Deliberative alignment" reduced detectable scheming by **up to 30x** (o3: 13% to 0.4%; o4-mini: 8.7% to 0.3%)
- <R id="601b00f2dabbdd2a">METR's Autonomy Evaluation Resources (March 2024)</R>: Framework for testing dangerous autonomous capabilities, including the "rogue replication" threat model
- <R id="fc3078f3c2ba5ebb">UK AI Safety Institute's Inspect framework</R>: 100+ pre-built evaluations for coding, reasoning, knowledge, and multimodal understanding
**Key organizations:**
- <R id="45370a5153534152">METR</R> (Model Evaluation and Threat Research)
- <R id="329d8c2e2532be3d">Apollo Research</R> (scheming, deception, ~10-15 researchers)
- <R id="fdf68a8f30f57dee">UK AISI</R> (government evaluation institute, conducting pre-deployment testing since November 2023)
- US AISI (recently formed)
**Theory of change:** Early detection of dangerous capabilities enables labs to implement safety measures before risks materialize.
## Research Agenda Comparison
| Agenda | Tractability | Impact if Successful | Timeline to Deployment | Key Risk |
|--------|-------------|---------------------|----------------------|----------|
| **Mechanistic Interpretability** | Medium-High | 10-50% x-risk reduction | 3-10 years | May not scale to frontier models; interpretable features insufficient for safety |
| **Scalable Oversight** | Medium | 15-40% x-risk reduction | 2-5 years | Weak supervisors may not elicit true capabilities; gaming of oversight signals |
| **AI Control** | High | 5-20% x-risk reduction | 0-2 years | Assumes capability gap between monitor and monitored; may fail with rapid capability gain |
| **Evaluations** | Very High | 5-15% x-risk reduction | Already deployed | Evaluations may miss dangerous capabilities; models may sandbag |
| **Agent Foundations** | Low | 20-60% x-risk reduction | 5-15+ years | Too slow to matter; may solve wrong problem |
| **Robustness** | Medium | 5-15% x-risk reduction | 2-5 years | Hard to scale; adversarial robustness fundamentally difficult |
## Research Progress (2024-2025)
The field has seen substantial empirical advances alongside growing institutional investment. Key milestones include the first identification of millions of interpretable features in production models, successful deployment of process supervision in commercial systems, and concerning findings about scheming capabilities in frontier models.
### Empirical Milestones
| Milestone | Date | Key Quantitative Finding | Source |
|-----------|------|-------------------------|--------|
| **Scaling Monosemanticity** | May 2024 | Identified tens of millions of interpretable features in Claude 3 Sonnet using sparse autoencoders | [Anthropic Transformer Circuits](https://transformer-circuits.pub/2024/scaling-monosemanticity/) |
| **Frontier Scheming** | Dec 2024 | 5 of 6 frontier models showed scheming capabilities; o1 maintained deception in more than 85% of follow-up questions | [Apollo Research](https://www.apolloresearch.ai/research/frontier-models-are-capable-of-incontext-scheming/) |
| **Deliberative Alignment** | Jan 2025 | Reduced scheming rates by up to 30x (o3: 13% → 0.4%; o4-mini: 8.7% → 0.3%) | [OpenAI-Apollo Collaboration](https://openai.com/index/detecting-and-reducing-scheming-in-ai-models/) |
| **Cyber Capability Progress** | 2023-2025 | AI completion of apprentice-level cyber tasks increased from 9% to 50% | [UK AISI Frontier Trends Report](https://www.aisi.gov.uk/frontier-ai-trends-report) |
| **Biology Knowledge** | 2024-2025 | Frontier models surpassed PhD-level biology expertise (40-50% baseline) | [UK AISI Evaluations](https://www.aisi.gov.uk/blog/early-lessons-from-evaluating-frontier-ai-systems) |
| **Process Supervision Deployment** | Sep 2024 | OpenAI o1 models deployed with process-based training, achieving state-of-the-art mathematical reasoning | [OpenAI o1 System Card](https://openai.com/) |
### Institutional Growth
| Institution | Type | Size/Budget | Focus Area | 2024-2025 Developments |
|-------------|------|-------------|------------|------------------------|
| **[UK AI Security Institute](https://www.aisi.gov.uk/)** | Government | World's largest state-backed AI evaluation body | Pre-deployment testing | Evaluated 30+ frontier models; renamed from AISI Feb 2025 |
| **[US AI Safety Institute](https://www.nist.gov/aisi)** | Government | Part of NIST | Standards and evaluation | Signed MoU with UK AISI; joined NIST AI Safety Consortium |
| **<EntityLink id="E521">Coefficient Giving</EntityLink>** | Philanthropy | \$13.6M deployed in 2024 (60% of external AI safety investment) | Broad technical safety | \$10M RFP announced March 2025; on track to exceed 2024 by 40-50% |
| **[METR](https://metr.org/)** | Nonprofit | 15-25 researchers | Autonomy evaluations | Evaluated GPT-4.5, Claude 3.7 Sonnet; autonomous task completion doubled every 7 months |
| **[Apollo Research](https://www.apolloresearch.ai/)** | Nonprofit | 10-15 researchers | Scheming detection | Published landmark scheming evaluation; partnered with OpenAI on mitigations |
| **[Redwood Research](https://www.redwoodresearch.org/)** | Nonprofit | ≈20 staff, \$100M+ runway | AI control | Published "Ctrl-Z" agent control sequel (Apr 2025); developed trusted monitoring protocols |
### Funding Trajectory
| Year | Total AI Safety Funding | Coefficient Giving Share | Key Trends |
|------|------------------------|------------------------|------------|
| 2022 | \$10-60M | ≈50% | Early growth; MIRI, CHAI dominant |
| 2023 | \$10-90M | ≈55% | Post-ChatGPT surge; lab safety teams expand |
| 2024 | \$110-130M | ≈60% (\$13.6M) | CAIS (\$1.5M), Redwood (\$1.2M), MIRI (\$1.1M) grants |
| 2025 (proj.) | \$150-180M | ≈55% | \$10M RFP; government funding increasing |
Despite growth, AI safety funding remains under 2% of total AI investment, estimated at less than \$1-10B annually for capabilities research at frontier labs alone.
## Risks Addressed
Technical AI safety research is the primary response to alignment-related risks:
| Risk Category | How Technical Research Addresses It | Key Agendas |
|--------------|-------------------------------------|-------------|
| Deceptive alignment | Interpretability to detect hidden goals; control protocols for misaligned systems | Mechanistic interpretability, AI control, scheming evaluations |
| Goal misgeneralization | Understanding learned representations; robust training methods | Interpretability, scalable oversight |
| Power-seeking behavior | Detecting power-seeking; designing systems without instrumental convergence | Agent foundations, evaluations |
| <EntityLink id="E42">Bioweapons misuse</EntityLink> | Dangerous capability evaluations; refusals and filtering | Evaluations, robustness |
| Cyberweapons misuse | Capability evaluations; secure deployment | Evaluations, control |
| Autonomous replication | ARA evaluations; containment protocols | METR evaluations, control |
## What Needs to Be True
For technical research to substantially reduce x-risk:
1. **Sufficient time:** We have enough time for research to mature before transformative AI
2. **Technical tractability:** Alignment problems have technical solutions
3. **Adoption:** Labs implement research findings in production
4. **Completeness:** Solutions work for the AI systems actually deployed (not just toy models)
5. **Robustness:** Solutions work under competitive pressure and adversarial conditions
<DisagreementMap
topic="Is technical alignment research on track to solve the problem?"
description="Views on whether current alignment approaches will succeed"
spectrum={{ low: "Fundamentally insufficient", high: "Making good progress" }}
positions={[
{ actor: "Empirical alignment researchers", position: "Making good progress", estimate: "60-70% likely", confidence: "medium" },
{ actor: "Many AI safety researchers", position: "Uncertain - too early to tell", estimate: "40-50%", confidence: "low" },
{ actor: "MIRI leadership", position: "Fundamentally insufficient", estimate: "15-20%", confidence: "high" },
]}
/>
## Estimated Impact by Worldview
### Short Timelines + Alignment Hard
**Impact: Medium-High**
- Most critical intervention but may be too late
- Focus on practical near-term techniques
- AI control becomes especially valuable
- De-prioritize long-term theoretical work
### Long Timelines + Alignment Hard
**Impact: Very High**
- Best opportunity to solve the problem
- Time for fundamental research to mature
- Can work on harder problems
- Highest expected value intervention
### Short Timelines + Alignment Moderate
**Impact: High**
- Empirical research can succeed in time
- Governance buys additional time
- Focus on scaling existing techniques
- Evaluations critical for near-term safety
### Long Timelines + Alignment Moderate
**Impact: High**
- Technical research plus governance
- Time to develop and validate solutions
- Can be thorough rather than rushed
- Multiple approaches can be tried
## Tractability Assessment
**High tractability areas:**
- Mechanistic interpretability (clear techniques, measurable progress)
- Evaluations and red-teaming (immediate application)
- AI control (practical protocols being developed)
**Medium tractability:**
- Scalable oversight (conceptual clarity but implementation challenges)
- Robustness (progress on narrow problems, hard to scale)
**Lower tractability:**
- Agent foundations (fundamental open problems)
- Inner alignment (may require conceptual breakthroughs)
- Deceptive alignment (hard to test until it happens)
## Who Should Consider This
**Strong fit if you:**
- Have strong ML/CS technical background (or can build it)
- Enjoy research and can work with ambiguity
- Can self-direct or thrive in small research teams
- Care deeply about directly solving the problem
- Have 5-10+ years to build expertise and contribute
**Prerequisites:**
- **ML background:** Deep learning, transformers, training at scale
- **Math:** Linear algebra, calculus, probability, optimization
- **Programming:** Python, PyTorch/JAX, large-scale computing
- **Research taste:** Can identify important problems and approaches
**Entry paths:**
- PhD in ML/CS focused on safety
- Self-study + open source contributions
- MATS/ARENA programs
- Research engineer at safety org
**Less good fit if:**
- Want immediate impact (research is slow)
- Prefer working with people over math/code
- More interested in implementation than discovery
- Uncertain about long-term commitment
## Funding Landscape
| Funder | 2024 Amount | Focus Areas | Notes |
|--------|-------------|-------------|-------|
| <EntityLink id="E521">Coefficient Giving</EntityLink> | \$13.6M (60% of total) | Broad technical safety | Largest grants: CAIS (\$1.5M), Redwood (\$1.2M), MIRI (\$1.1M); \$10M RFP Mar 2025 |
| <R id="6bc74edd147a374b">AI Safety Fund</R> | \$10M initial | Biosecurity, cyber, agents | Backed by Anthropic, Google, Microsoft, OpenAI |
| [Jaan Tallinn](https://quickmarketpitch.com/blogs/news/ai-safety-investors) | ≈\$10M | Long-term alignment | Skype co-founder; independent allocation |
| [Eric Schmidt / Schmidt Sciences](https://quickmarketpitch.com/blogs/news/ai-safety-investors) | ≈\$10M | Safety benchmarking | Focus on evaluation infrastructure |
| Long-Term Future Fund | \$1-8M | AI risk mitigation | Over \$10M cumulative in AI safety |
| [UK Government (AISI)](https://www.aisi.gov.uk/) | ≈\$100M+ | Pre-deployment testing | World's largest state-backed evaluation body; 30+ models tested |
| Future of Life Institute | ≈\$15M program | Existential risk reduction | PhD fellowships, research grants |
### Funding Gap Analysis
| Metric | 2024 Value | Comparison |
|--------|-----------|------------|
| Total AI safety funding | \$110-130M | Frontier labs capabilities R&D: estimated \$1-10B+ annually |
| Safety as % of AI investment | Under 2% | Industry target proposals: 10-20% |
| Researchers per \$1B AI market cap | ≈0.5-1 FTE | Traditional software security: 5-10 FTE per \$1B |
| Safety researcher salary competitiveness | 60-80% of capabilities | Top safety researchers earn \$150-400K vs \$100-600K+ for capabilities leads |
**Total estimated annual funding (2024): \$110-130M** for AI safety research—[representing under 2% of estimated capabilities spending](https://www.lesswrong.com/posts/WGpFFJo2uFe5ssgEb/an-overview-of-the-ai-safety-funding-situation) and insufficient for the scale of the challenge.
## Key Organizations
| Organization | Size | Focus | Key Outputs |
|-------------|------|-------|-------------|
| **<R id="f771d4f56ad4dbaa">Anthropic</R>** | ≈300 employees | Interpretability, Constitutional AI, RSP | Scaling Monosemanticity, Claude RSP |
| **<R id="5b8be7f6a2aa7067">DeepMind AI Safety</R>** | ≈50 safety researchers | Amplified oversight, frontier safety | Frontier Safety Framework |
| **OpenAI** | Unknown (team dissolved) | Superalignment (previously) | Weak-to-strong generalization |
| **<R id="42e7247cbc33fc4c">Redwood Research</R>** | ≈20 staff | AI control | Control evaluations framework |
| **<R id="86df45a5f8a9bf6d">MIRI</R>** | ≈10-15 staff | Agent foundations, policy | Shifting to policy advocacy in 2024 |
| **<R id="329d8c2e2532be3d">Apollo Research</R>** | ≈10-15 researchers | Scheming, deception | Frontier model scheming evaluations |
| **<R id="45370a5153534152">METR</R>** | ≈10-20 researchers | Autonomy, dangerous capabilities | ARA evaluations, rogue replication |
| **<R id="fdf68a8f30f57dee">UK AISI</R>** | Government-funded | Pre-deployment testing | Inspect framework, 100+ evaluations |
### Academic Groups
- **UC Berkeley** (CHAI, Center for Human-Compatible AI)
- **CMU** (various safety researchers)
- **Oxford/Cambridge** (smaller groups)
- **Various professors** at institutions worldwide
## Career Considerations
### Pros
- **Direct impact:** Working on the core problem
- **Intellectually stimulating:** Cutting-edge research
- **Growing field:** More opportunities and funding
- **Flexible:** Remote work common, can transition to industry
- **Community:** Strong AI safety research community
### Cons
- **Long timelines:** Years to make contributions
- **High uncertainty:** May work on wrong problem
- **Competitive:** Top labs are selective
- **Funding dependent:** Relies on continued EA/philanthropic funding
- **Moral hazard:** Could contribute to capabilities
### Compensation
- **PhD students:** \$30-50K/year (typical stipend)
- **Research engineers:** \$100-200K
- **Research scientists:** \$150-400K+ (at frontier labs)
- **Independent researchers:** \$80-200K (grants/orgs)
### Skills Development
- Transferable ML skills (valuable in broader market)
- Research methodology
- Scientific communication
- Large-scale systems engineering
## Open Questions and Uncertainties
### Key Challenges and Progress Assessment
| Challenge | Current Status | Probability of Resolution by 2030 | Critical Dependencies |
|-----------|---------------|-----------------------------------|----------------------|
| **Detecting deceptive alignment** | Early tools exist; [Apollo found scheming in 5/6 models](https://www.apolloresearch.ai/research/frontier-models-are-capable-of-incontext-scheming/) | 30-50% | Interpretability advances; adversarial testing infrastructure |
| **Scaling interpretability** | Millions of features identified in Claude 3; [cost-prohibitive for full coverage](https://transformer-circuits.pub/2024/scaling-monosemanticity/) | 40-60% | Compute efficiency; automated analysis tools |
| **Maintaining control as capabilities grow** | [Redwood's protocols](https://www.redwoodresearch.org/research/ai-control) work for current gap; untested at superhuman levels | 35-55% | Capability gap persistence; trusted monitor quality |
| **Evaluating dangerous capabilities** | [UK AISI tested 30+ models](https://www.aisi.gov.uk/); coverage gaps remain | 50-70% | Standardization; adversarial robustness of evals |
| **Aligning reasoning models** | [o1 process supervision deployed](https://openai.com/); chain-of-thought interpretability limited | 30-50% | Understanding extended reasoning; hidden cognition detection |
| **Closing safety-capability gap** | Safety funding under 2% of capabilities; [doubling time ≈7 months for autonomy](https://metr.org/) | 20-40% | Funding growth; talent pipeline; lab prioritization |
<KeyQuestions
questions={[
{
question: "Should we focus on prosaic alignment or agent foundations?",
positions: [
{
position: "Prosaic (empirical work on existing systems)",
confidence: "medium",
reasoning: "Current paradigm needs solutions; empirical feedback is valuable; agent foundations too slow",
implications: "Work on interpretability, RLHF, evaluations"
},
{
position: "Agent foundations (theoretical groundwork)",
confidence: "low",
reasoning: "Prosaic approaches may not scale; need conceptual clarity before solutions possible",
implications: "Mathematical research, decision theory, embedded agency"
}
]
},
{
question: "Is working at frontier labs net positive or negative?",
positions: [
{
position: "Positive - access to frontier models is critical",
confidence: "medium",
reasoning: "Can't solve alignment without access; safety work slows deployment; can influence from inside",
implications: "Join lab safety teams; coordinate with capabilities work"
},
{
position: "Negative - contributes to race dynamics",
confidence: "medium",
reasoning: "Safety work legitimizes labs; contributes to capabilities; racing pressure dominates",
implications: "Independent research; government work; avoid frontier labs"
}
]
}
]}
/>
## Complementary Interventions
Technical research is most valuable when combined with:
- **Governance:** Ensures solutions are adopted and enforced
- **Field-building:** Creates pipeline of future researchers
- **Evaluations:** Tests whether solutions actually work
- **Corporate influence:** Gets research implemented at labs
## Getting Started
**If you're new to the field:**
1. **Build foundations:** Learn deep learning thoroughly (fast.ai, Karpathy tutorials)
2. **Study safety:** Read Alignment Forum, take ARENA course
3. **Get feedback:** Join MATS, attend EAG, talk to researchers
4. **Demonstrate ability:** Publish interpretability work, contribute to open source
5. **Apply:** Research engineer roles are most accessible entry point
**Resources:**
- ARENA (AI Safety research program)
- MATS (ML Alignment Theory Scholars)
- AI Safety Camp
- Alignment Forum
- AGI Safety Fundamentals course
---
## Sources & Key References
### Mechanistic Interpretability
- <R id="e724db341d6e0065">Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet</R> - Anthropic, May 2024
- <R id="5019b9256d83a04c">Mapping the Mind of a Large Language Model</R> - Anthropic blog post
- <R id="c355237bfc2d213d">Decomposing Language Models Into Understandable Components</R> - Anthropic, 2023
### Scalable Oversight
- <R id="e64c8268e5f58e63">Weak-to-strong generalization</R> - OpenAI, December 2023
- <R id="683aef834ac1612a">Constitutional AI: Harmlessness from AI Feedback</R> - Anthropic, December 2022
- <R id="704f57dfad89c1b3">Introducing Superalignment</R> - OpenAI, July 2023
### AI Control
- <R id="cc80ab28579c5794">AI Control: Improving Safety Despite Intentional Subversion</R> - Redwood Research, December 2023
- <R id="32c44bb7ba8a1bbe">The case for ensuring that powerful AIs are controlled</R> - Redwood Research blog, May 2024
- <R id="a29670f1ec5df0d6">A sketch of an AI control safety case</R> - LessWrong
### Evaluations & Scheming
- <R id="91737bf431000298">Frontier Models are Capable of In-Context Scheming</R> - Apollo Research, December 2024
- <R id="b3f335edccfc5333">Detecting and reducing scheming in AI models</R> - OpenAI-Apollo collaboration
- <R id="1d03d6cd9dde0075">New Tests Reveal AI's Capacity for Deception</R> - TIME, December 2024
- <R id="601b00f2dabbdd2a">Autonomy Evaluation Resources</R> - METR, March 2024
- <R id="5b45342b68bf627e">The Rogue Replication Threat Model</R> - METR, November 2024
### Safety Frameworks
- <R id="d8c3d29798412b9f">Introducing the Frontier Safety Framework</R> - Google DeepMind
- <R id="6374381b5ec386d1">AGI Safety and Alignment at Google DeepMind</R> - DeepMind Safety Research
- <R id="fc3078f3c2ba5ebb">UK AI Safety Institute Inspect Framework</R> - UK AISI
- <R id="7042c7f8de04ccb1">Frontier AI Trends Report</R> - UK AI Security Institute
### Strategy & Funding
- <R id="435b669c11e07d8f">MIRI 2024 Mission and Strategy Update</R> - MIRI, January 2024
- <R id="07ccedd2d560ecb7">MIRI's 2024 End-of-Year Update</R> - MIRI, December 2024
- <R id="b1ab921f9cbae109">An Overview of the AI Safety Funding Situation</R> - LessWrong
- <R id="913cb820e5769c0b">Request for Proposals: Technical AI Safety Research</R> - Coefficient Giving
- <R id="6bc74edd147a374b">AI Safety Fund</R> - Frontier Model Forum
---
## AI Transition Model Context
Technical AI safety research is the primary lever for reducing <EntityLink id="E205" /> in the <EntityLink id="ai-transition-model" />:
| Factor | Parameter | Impact |
|--------|-----------|--------|
| <EntityLink id="E205" /> | <EntityLink id="E20" /> | Core research goal: ensure alignment persists as capabilities scale |
| <EntityLink id="E205" /> | <EntityLink id="E175" /> | Mechanistic interpretability enables detection of misalignment |
| <EntityLink id="E205" /> | <EntityLink id="E261" /> | Must outpace capability development to maintain safety margins |
| <EntityLink id="E205" /> | <EntityLink id="E160" /> | Scalable oversight and AI control extend human supervision |
Technical research effectiveness depends critically on timelines, adoption, and whether alignment problems are technically tractable.