AI Safety Intervention Portfolio
Provides a strategic framework for AI safety resource allocation by mapping 13+ interventions against 4 risk categories, evaluating each on ITN dimensions, and identifying portfolio gaps (epistemic resilience severely neglected, technical work over-concentrated in frontier labs). Total field investment is ≈$650M annually with about 1,100 FTEs (21% annual growth), but 85% of external funding comes from 5 sources and the safety-to-capabilities spending ratio is only 0.5-1.3%. Recommends rebalancing from very high RLHF investment toward evaluations (very high priority), AI control and compute governance (both high priority), with epistemic resilience increasing from very low to medium allocation.
Quick Assessment
| Dimension | Assessment | Evidence |
|---|---|---|
| Tractability | Medium-High | Varies widely: evaluations (high), compute governance (high), international coordination (low). Coefficient Giving's 2025 RFP allocated $40M for technical safety research. |
| Scalability | High | Portfolio approach scales across 4 risk categories and multiple timelines. AI Safety Field Growth Analysis shows 21% annual FTE growth rate. |
| Current Maturity | Medium | Core interventions established; significant gaps in epistemic resilience (less than 5% of portfolio) and post-incident recovery (under 1%). |
| Research Workforce | ≈1,100 FTEs | 600 technical + 500 non-technical AI safety FTEs in 2025, up from 400 total in 2022 (AI Safety Field Growth Analysis). |
| Time Horizon | Near-Long | Near-term (evaluations, control) complement long-term work (interpretability, governance). International AI Safety Report 2025 emphasizes urgency. |
| Funding Level | $110-130M/year external | 2024 external funding. Early 2025 shows 40-50% acceleration with $67M committed through July. Internal lab spending adds $500-550M for ≈$650M total (Coefficient Giving analysis). |
| Funding Concentration | 85% from 5 sources | Coefficient Giving: $63.6M (60%); Jaan Tallinn: $20M; Eric Schmidt: $10M; AI Safety Fund: $10M; FLI: $5M |
| Safety/Capabilities Ratio | ≈0.5-1.3% | $600-650M safety vs $50B+ capabilities spending. FAS recommends 30% of compute for safety research. |
Overview
This page provides a strategic view of the AI safety intervention landscape, analyzing how different interventions address different risk categories. Rather than examining interventions individually, this portfolio view helps identify coverage gaps, complementarities, and allocation priorities.
The intervention landscape can be divided into several categories: technical approaches (alignment, interpretability, control), governance mechanisms (legislation, compute governance, international coordination), field building (talent, funding, community), and resilience measures (epistemic security, economic adaptation). Each category has different tractability profiles, timelines, and risk coverage—understanding these tradeoffs is essential for strategic resource allocation.
An effective safety portfolio requires both breadth (covering diverse failure modes) and depth (sufficient investment in each area to achieve impact). The current portfolio shows significant concentration in certain areas (RLHF, capability evaluations) while other areas remain relatively neglected (epistemic resilience, international coordination).
Field Growth Trajectory
| Metric | 2022 | 2025 | Growth Rate | Notes |
|---|---|---|---|---|
| Technical AI Safety FTEs | 300 | 600 | 21%/year | AI Safety Field Growth Analysis 2025 |
| Non-Technical AI Safety FTEs | 100 | 500 | 71%/year | Governance, policy, operations |
| Total AI Safety FTEs | 400 | 1,100 | 40%/year | Field-wide compound growth |
| AI Safety Organizations | ≈50 | ≈120 | 24%/year | Exponential growth since 2020 |
| Capabilities FTEs (comparison) | ≈3,000 | ≈15,000 | 30-40%/year | OpenAI alone: 300 → 3,000 |
Critical Comparison: While the AI safety workforce has grown substantially, capabilities research is growing 30-40% per year. The ratio of capabilities to safety researchers has remained roughly constant at 10-15:1, meaning the absolute gap continues to widen.
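The field-wide growth rate and the capabilities-to-safety ratio above can be checked directly from the table's endpoint headcounts. A minimal sketch in Python (the per-category growth figures come from the underlying analysis and are not recomputed here):

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate implied by two endpoint values."""
    return (end / start) ** (1 / years) - 1

# Endpoints from the Field Growth Trajectory table (2022 -> 2025, 3 years)
total_safety_2022, total_safety_2025 = 400, 1_100
capabilities_2025 = 15_000

print(f"Field-wide safety FTE growth: {cagr(total_safety_2022, total_safety_2025, 3):.0%}/year")  # ~40%
print(f"Capabilities-to-safety FTE ratio (2025): {capabilities_2025 / total_safety_2025:.0f}:1")   # ~14:1
```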
Top Research Categories (by FTEs):
- Miscellaneous technical AI safety research
- LLM safety
- Interpretability
Intervention Categories and Risk Coverage
```mermaid
flowchart TD
subgraph Technical["Technical Approaches"]
INT[Interpretability]
CTRL[AI Control]
ALIGN[Alignment Research]
EVAL[Evaluations]
end
subgraph Governance["Governance"]
COMP[Compute Governance]
LEG[Legislation]
INTL[International Coordination]
RSP[Responsible Scaling]
end
subgraph Meta["Field Building & Resilience"]
FIELD[Field Building]
EPIST[Epistemic Resilience]
ECON[Economic Resilience]
end
subgraph Risks["Risk Categories"]
ACC[Accident Risks]
MIS[Misuse Risks]
STR[Structural Risks]
EPI[Epistemic Risks]
end
INT --> ACC
CTRL --> ACC
ALIGN --> ACC
EVAL --> ACC
EVAL --> MIS
COMP --> MIS
COMP --> STR
LEG --> MIS
LEG --> STR
INTL --> STR
RSP --> ACC
RSP --> MIS
FIELD --> ACC
FIELD --> STR
EPIST --> EPI
ECON --> STR
style ACC fill:#ffcccc
style MIS fill:#ffe6cc
style STR fill:#fff3cc
style EPI fill:#e6ccff
style Technical fill:#cce6ff
style Governance fill:#ccffcc
style Meta fill:#ffccff
```
Intervention by Risk Matrix
This matrix shows how strongly each major intervention addresses each risk category. Ratings are based on current evidence and expert assessments.
| Intervention | Accident Risks | Misuse Risks | Structural Risks | Epistemic Risks | Primary Mechanism |
|---|---|---|---|---|---|
| Interpretability | High | Low | Low | -- | Detect deception and misalignment in model internals |
| AI Control | High | Medium | -- | -- | External constraints regardless of AI intentions |
| Evaluations | High | Medium | Low | -- | Pre-deployment testing for dangerous capabilities |
| RLHF/Constitutional AI | Medium | Medium | -- | -- | Train models to follow human preferences |
| Scalable Oversight | Medium | Low | -- | -- | Human supervision of superhuman systems |
| Compute Governance | Low | High | Medium | -- | Hardware chokepoints limit access |
| Export Controls | Low | High | Medium | -- | Restrict adversary access to training compute |
| Responsible Scaling | Medium | Medium | Low | -- | Capability thresholds trigger safety requirements |
| International Coordination | Low | Medium | High | -- | Reduce racing dynamics through agreements |
| AI Safety Institutes | Medium | Medium | Medium | -- | Government capacity for evaluation and oversight |
| Field Building | Medium | Low | Medium | Low | Grow talent pipeline and research capacity |
| Epistemic Security | -- | Low | Low | High | Protect collective truth-finding capacity |
| Content Authentication | -- | Medium | -- | High | Verify authentic content in synthetic era |
Legend: High = primary focus, addresses directly; Medium = secondary impact; Low = indirect or limited; -- = minimal relevance
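For readers who want to work with the matrix programmatically, a minimal sketch encoding the ratings above as a data structure and aggregating coverage per risk category; the numeric weights assigned to High/Medium/Low are illustrative assumptions rather than part of the source assessment:

```python
# Illustrative weights for the qualitative ratings (assumption, not from the source)
WEIGHT = {"High": 3, "Medium": 2, "Low": 1, None: 0}

# (accident, misuse, structural, epistemic) ratings per intervention, copied from the matrix
MATRIX = {
    "Interpretability":           ("High", "Low", "Low", None),
    "AI Control":                 ("High", "Medium", None, None),
    "Evaluations":                ("High", "Medium", "Low", None),
    "RLHF/Constitutional AI":     ("Medium", "Medium", None, None),
    "Scalable Oversight":         ("Medium", "Low", None, None),
    "Compute Governance":         ("Low", "High", "Medium", None),
    "Export Controls":            ("Low", "High", "Medium", None),
    "Responsible Scaling":        ("Medium", "Medium", "Low", None),
    "International Coordination": ("Low", "Medium", "High", None),
    "AI Safety Institutes":       ("Medium", "Medium", "Medium", None),
    "Field Building":             ("Medium", "Low", "Medium", "Low"),
    "Epistemic Security":         (None, "Low", "Low", "High"),
    "Content Authentication":     (None, "Medium", None, "High"),
}

RISKS = ("Accident", "Misuse", "Structural", "Epistemic")

for i, risk in enumerate(RISKS):
    total = sum(WEIGHT[ratings[i]] for ratings in MATRIX.values())
    print(f"{risk} risks: aggregate weighted coverage {total}")
```

On these illustrative weights, epistemic risks receive by far the least aggregate coverage, consistent with the gaps discussed below.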
Prioritization Framework
This framework evaluates interventions across the standard Importance-Tractability-Neglectedness (ITN) dimensions, with additional consideration for timeline fit and portfolio complementarity.
| Intervention | Tractability | Impact Potential | Neglectedness | Timeline Fit | Overall Priority |
|---|---|---|---|---|---|
| Interpretability | Medium | High | Low | Long | High |
| AI Control | High | Medium-High | Medium | Near | Very High |
| Evaluations | High | Medium | Low | Near | High |
| Compute Governance | High | High | Low | Near | Very High |
| International Coordination | Low | Very High | High | Long | High |
| Field Building | High | Medium | Medium | Ongoing | Medium-High |
| Epistemic Resilience | Medium | Medium | High | Near-Long | Medium-High |
| Scalable Oversight | Medium-Low | High | Medium | Long | Medium |
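The Overall Priority column reflects qualitative judgment rather than a formula, but the ITN aggregation can be sketched as a weighted score. The numeric scale and equal weights below are illustrative assumptions:

```python
# Illustrative numeric mapping for the qualitative ratings (an assumption, not from the source)
SCALE = {"Low": 1.0, "Medium-Low": 1.5, "Medium": 2.0, "Medium-High": 2.5,
         "High": 3.0, "Very High": 4.0}

def itn_score(tractability: str, impact: str, neglectedness: str,
              weights: tuple = (1.0, 1.0, 1.0)) -> float:
    """Weighted sum across the ITN dimensions (equal weights by default)."""
    w_t, w_i, w_n = weights
    return w_t * SCALE[tractability] + w_i * SCALE[impact] + w_n * SCALE[neglectedness]

# Ratings copied from the table. Note that the Overall Priority column also folds in
# timeline fit and portfolio complementarity, so a raw ITN score will not reproduce it exactly.
print(itn_score("High", "Medium-High", "Medium"))  # AI Control
print(itn_score("High", "High", "Low"))            # Compute Governance
print(itn_score("Low", "Very High", "High"))       # International Coordination
```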
Prioritization Rationale
Very High Priority:
- AI Control scores highly because it provides near-term safety benefits (70-85% tractability for human-level systems) regardless of whether alignment succeeds. It represents a practical bridge during the transition period. Redwood Research received $1.2M for control research in 2024.
- Compute Governance is one of few levers creating physical constraints on AI development. Hardware chokepoints exist, some measures are already implemented (EU AI Act compute thresholds, US export controls), and impact potential is substantial. GovAI produces leading research on compute governance mechanisms.
High Priority:
- Interpretability is potentially essential if alignment proves difficult (only reliable way to detect sophisticated deception). MIT Technology Review named mechanistic interpretability a 2026 Breakthrough Technology. Anthropic's attribution graphs revealed hidden reasoning in Claude 3.5 Haiku. FAS recommends federal R&D funding through DARPA and NSF.
- Evaluations provide measurable near-term impact and are already standard practice at major labs. Coefficient Giving launched an RFP for capability evaluations ($200K-$5M grants). METR partners with Anthropic and OpenAI on frontier model evaluations. NIST invested $20M in AI Economic Security Centers.
- International Coordination has very high impact potential for addressing structural risks like racing dynamics, but low tractability given current geopolitical tensions. The International AI Safety Report 2025, led by Yoshua Bengio with 100+ authors from 30 countries, represents the largest global collaboration to date.
Medium-High Priority:
- Field Building and Epistemic Resilience are relatively neglected meta-level interventions that multiply the effectiveness of direct technical and governance work. 80,000 Hours notes that good funding opportunities exist in AI safety for qualified researchers.
Portfolio Gaps and Complementarities
Coverage Gaps
Analysis of the current intervention portfolio reveals several areas where coverage is thin:
| Gap Area | Current Investment | Risk Exposure | Recommended Action |
|---|---|---|---|
| Epistemic Risks | Under 5% of portfolio ($3-5M/year) | Epistemic collapse, reality fragmentation | Increase to 8-10% of portfolio; invest in content authentication and epistemic infrastructure |
| Long-term Structural Risks | 4-6% of portfolio; international coordination has low tractability | Lock-in, concentration of power | Develop alternative coordination mechanisms; invest in governance research |
| Post-Incident Recovery | Under 1% of portfolio | All risk categories | Develop recovery protocols and resilience measures; allocate 3-5% of portfolio |
| Misuse by State Actors | Export controls are primary lever; $5-10M in policy research | Authoritarian tools, surveillance | Research additional governance mechanisms; increase to $15-25M |
| Independent Evaluation Capacity | 70%+ of evals done by labs themselves | Conflict of interest, verification gaps | Coefficient Giving's eval RFP addresses this with $200K-$5M grants |
Key Complementarities
Certain interventions work better together than in isolation:
Technical + Governance:
- AI Evaluations inform Responsible Scaling Policies thresholds
- Interpretability enables verification for governance commitments and audits
- AI Control provides safety margin while governance matures
Near-term + Long-term:
- Compute Governance buys time for Interpretability research
- AI Evaluations identify near-term risks while Scalable Oversight develops
- AI Safety Field Building and Community ensures capacity for future technical work
Prevention + Resilience:
- Technical safety research aims to prevent failures
- AI-Era Epistemic Security and economic resilience limit damage if prevention fails
- Both are needed for robust defense-in-depth
Portfolio Funding Allocation
The following table estimates 2024 funding levels by intervention area and compares them to recommended allocations based on neglectedness and impact potential. Total external AI safety funding was approximately $110-130 million in 2024, with Coefficient Giving providing ~60% of this amount.
| Intervention Area | Est. 2024 Funding | % of Total | Recommended Shift | Key Funders |
|---|---|---|---|---|
| RLHF/Training Methods | $15-35M | ≈25% | Decrease to 20% | Frontier labs (internal), academic grants |
| Interpretability | $15-25M | ≈18% | Maintain | Coefficient Giving, Superalignment Fast Grants ($10M) |
| Evaluations & Evals Infrastructure | $12-18M | ≈13% | Increase to 20% | CAIS ($1.5M), UK AISI, labs |
| AI Control Research | $1-12M | ≈9% | Increase to 15% | Redwood Research ($1.2M), Anthropic |
| Compute Governance | $1-10M | ≈7% | Increase to 12% | Government programs, policy organizations |
| Field Building & Talent | $10-15M | ≈11% | Maintain | 80,000 Hours, MATS, various fellowships |
| Governance & Policy | $1-12M | ≈9% | Increase to 12% | Coefficient Giving policy grants, government initiatives |
| International Coordination | $1-5M | ≈4% | Increase to 8% | UK/EU government initiatives (≈$14M total) |
| Epistemic Resilience | $1-4M | ≈3% | Increase to 8% | Very few dedicated funders |
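One way to read the Recommended Shift column is as target shares of the external budget. A minimal sketch translating current and recommended shares into dollar changes, using the ~$120M midpoint of the 2024 external estimate (shares are the table's approximate figures):

```python
EXTERNAL_TOTAL_M = 120  # midpoint of the $110-130M external estimate for 2024

# (current share, recommended share) as fractions, from the allocation table
# (areas marked "Maintain" are omitted)
shifts = {
    "RLHF/Training Methods": (0.25, 0.20),
    "Evaluations & Evals Infrastructure": (0.13, 0.20),
    "AI Control Research": (0.09, 0.15),
    "Compute Governance": (0.07, 0.12),
    "Governance & Policy": (0.09, 0.12),
    "International Coordination": (0.04, 0.08),
    "Epistemic Resilience": (0.03, 0.08),
}

for area, (current, target) in shifts.items():
    print(f"{area}: {(target - current) * EXTERNAL_TOTAL_M:+.1f}M USD/year at a constant total")

# The recommended shares sum to more than the current ones, so the shifts implicitly
# assume overall funding growth (consistent with the 2025 trajectory noted below).
net = sum((target - current) * EXTERNAL_TOTAL_M for current, target in shifts.values())
print(f"Net change across shifted areas: {net:+.0f}M USD/year")
```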
2025 Funding Landscape Update
| Funder | 2024 Allocation | Focus Areas | Notes |
|---|---|---|---|
| Coefficient Giving | $63.6M | Technical safety, governance, field building | 60% of external funding |
| Jaan Tallinn | $20M | Long-term alignment research | Personal foundation |
| Eric Schmidt (Schmidt Sciences) | $10M | Safety benchmarking, adversarial evaluation | Quick Market Pitch |
| AI Safety Fund | $10M | Collaborative research (Anthropic, Google, Microsoft, OpenAI) | Frontier Model Forum |
| Future of Life Institute | $5M | Smaller grants, fellowships | Diverse portfolio |
| Steven Schuurman Foundation | €5M/year | Various AI safety initiatives | Elastic co-founder |
| Total External | $110-130M | — | 2024 estimate |
2025 Trajectory: Early data (through July 2025) shows $67M already committed, putting the year on track to exceed 2024 totals by 40-50%.
Funding Gap Analysis
The funding landscape reveals several structural imbalances:
| Gap Type | Current State | Impact | Recommended Action |
|---|---|---|---|
| Climate vs AI safety | Climate philanthropy: ≈$1-15B; AI safety: ≈$130M | ≈100x disparity despite comparable catastrophic potential | Increase AI safety funding to at least $100M-1B annually |
| Capabilities vs safety | ≈$100B in AI data center capex (2024) vs ≈$130M safety | ≈1500:1 ratio | Redirect 0.5-1% of capabilities spending to safety |
| Funder concentration | Coefficient Giving: 60% of external funding | Single point of failure; limits diversity | Diversify funding sources; new initiatives like Humanity AI ($100M) |
| Talent pipeline | Over-optimized for researchers | Shortage in governance, operations, advocacy | Expand non-research talent programs |
Resource Allocation Assessment
Current vs. Recommended Allocation
| Area | Current Allocation | Recommended | Rationale |
|---|---|---|---|
| RLHF/Training | Very High | High | Deployed at scale but limited effectiveness against deceptive alignment |
| Interpretability | High | High | Rapid progress; potential for fundamental breakthroughs |
| Evaluations | High | Very High | Critical for identifying dangerous capabilities pre-deployment |
| AI Control | Medium | High | Near-term tractable; provides safety regardless of alignment |
| Compute Governance | Medium | High | One of few physical levers; already showing policy impact |
| International Coordination | Low | Medium | Low tractability but very high stakes |
| Epistemic Resilience | Very Low | Medium | Highly neglected; addresses underserved risk category |
| Field Building | Medium | Medium | Maintain current investment; returns are well-established |
Investment Concentration Risks
The current portfolio shows several structural vulnerabilities:
| Concentration Type | Current State | Risk | Mitigation |
|---|---|---|---|
| Funder concentration | Coefficient Giving provides ≈60% of external funding | Strategy changes affect entire field | Cultivate diverse funding sources |
| Geographic concentration | US and UK receive majority of funding | Limited global coordination capacity | Support emerging hubs (Berlin, Canada, Australia) |
| Frontier lab dependence | Most technical safety at Anthropic, OpenAI, DeepMind | Conflicts of interest; limited independent verification | Increase funding to MIRI ($1.1M), Redwood, ARC |
| Research over operations | Pipeline over-optimized for researchers | Shortage of governance, advocacy, operations talent | Expand non-research career paths |
| Technical over governance | Technical ~60% vs governance ≈15% of funding | Governance may be more neglected and tractable | Rebalance toward policy research |
| Prevention over resilience | Minimal investment in post-incident recovery | No fallback if prevention fails | Develop recovery protocols |
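Funder concentration can be quantified with a standard measure such as the Herfindahl-Hirschman Index over funder shares. A minimal sketch; the split of the non-Coefficient remainder is an illustrative assumption, and the conclusion does not depend on it:

```python
def hhi(shares_percent):
    """Herfindahl-Hirschman Index: sum of squared percentage shares (0 to 10,000)."""
    return sum(s ** 2 for s in shares_percent)

# Coefficient Giving's ~60% share alone yields an index of 3,600, above the ~2,500
# level conventionally treated as "highly concentrated", regardless of how the
# remaining ~40% is split among smaller funders.
print(hhi([60]))
print(hhi([60, 17, 8, 8, 4, 3]))  # illustrative split of the remainder across other funders
```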
Strategic Considerations
Worldview Dependencies
Different beliefs about AI risk lead to different portfolio recommendations:
| Worldview | Prioritize | Deprioritize |
|---|---|---|
| Alignment is very hard | Interpretability, Control, International coordination | RLHF, Voluntary commitments |
| Misuse is the main risk | Compute governance, Content authentication, Legislation | Interpretability, Agent foundations |
| Short timelines | AI Control, Evaluations, Responsible scaling | Long-term governance research |
| Racing dynamics dominate | International coordination, Compute governance | Unilateral safety research |
| Epistemic collapse is likely | Epistemic security, Content authentication | Technical alignment |
Portfolio Robustness
A robust portfolio should satisfy the following criteria, which can help evaluate current gaps and guide future allocation:
| Robustness Criterion | Current Status | Gap Assessment | Target |
|---|---|---|---|
| Cover multiple failure modes | Accident risks: 60% coverage; Misuse: 50%; Structural: 30%; Epistemic: under 15% | Medium gap | 70%+ coverage across all categories |
| Prevention and resilience | ~95% prevention, ≈5% resilience | Large gap | 80% prevention, 20% resilience |
| Near-term and long-term balance | 55% near-term (evals, control), 45% long-term (interpretability, governance) | Small gap | Maintain current balance |
| Independent research capacity | Frontier labs: 70%+ of technical safety; Independents: under 30% | Medium gap | 50/50 split between labs and independents |
| Support multiple worldviews | Most interventions robust across scenarios | Small gap | Maintain |
| Geographic diversity | US/UK: 80%+ of funding; EU: 10%; ROW: under 10% | Medium gap | US/UK: 60%, EU: 20%, ROW: 20% |
| Funder diversity | 5 funders provide 85% of external funding; Coefficient Giving alone provides 60% | Large gap | No single funder greater than 25% |
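These criteria lend themselves to simple threshold checks, which makes gap tracking easy to automate. A minimal sketch; the current-state and target figures are taken from the table, and the pass/fail logic is an illustrative simplification:

```python
# (current value, target value, higher_is_better) per criterion, figures from the table
criteria = {
    "Epistemic risk coverage (%)":               (15, 70, True),
    "Resilience share of portfolio (%)":         (5, 20, True),
    "Independent share of technical safety (%)": (30, 50, True),
    "Largest single funder share (%)":           (60, 25, False),
    "US/UK share of funding (%)":                (80, 60, False),
}

for name, (current, target, higher_is_better) in criteria.items():
    met = current >= target if higher_is_better else current <= target
    print(f"{name}: current {current}, target {target} -> {'met' if met else 'gap'}")
```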
Key Sources
| Source | Type | Relevance |
|---|---|---|
| Coefficient Giving Progress 2024 | Funder Report | Primary data on AI safety funding levels and priorities |
| AI Safety Funding Situation Overview | Analysis | Comprehensive breakdown of funding sources and gaps |
| AI Safety Needs More Funders | Policy Brief | Comparison to other catastrophic risk funding |
| AI Safety Field Growth Analysis 2025 | Research | Field growth metrics, 1,100 FTEs, 21% annual growth |
| International AI Safety Report 2025 | Global Report | 100+ authors, 30 countries, Yoshua Bengio lead |
| Future of Life AI Safety Index 2025 | Industry Assessment | 33 indicators across 6 domains for 7 leading companies |
| Coefficient Giving Technical AI Safety RFP | Grant Program | $40M allocation for technical safety research |
| Coefficient Giving Capability Evaluations RFP | Grant Program | $200K-$5M grants for evaluation infrastructure |
| America's AI Action Plan (July 2025) | Policy | US government AI priorities including evaluations ecosystem |
| Accelerating AI Interpretability (FAS) | Policy Brief | Federal funding recommendations for interpretability |
| 80,000 Hours: AI Risk | Career Guidance | Intervention prioritization and neglectedness analysis |
| RLHF Limitations Paper | Research | Evidence on limitations of current alignment methods |
| Carnegie AI Safety as Global Public Good | Policy Analysis | International coordination challenges and research priorities |
| ITU Annual AI Governance Report 2025 | Global Report | AI governance landscape across nations |
References
Open Philanthropy issued a request for proposals seeking technical AI safety research projects, signaling funding priorities and research directions the organization considers most valuable. The RFP outlines areas of interest including interpretability, scalable oversight, and related alignment challenges, aiming to grow the field by supporting researchers and organizations working on these problems.
A landmark international scientific assessment co-authored by 96 experts from 30 countries, providing a comprehensive overview of general-purpose AI capabilities, risks, and risk management approaches. It aims to establish shared scientific understanding across nations as a foundation for global AI governance. The report covers topics including capability evaluation, misuse risks, systemic risks, and mitigation strategies.
This research piece from Coefficient Giving argues that AI safety and security research is significantly underfunded relative to the risks involved, and makes the case for philanthropists and funders to increase financial support for the field. It examines funding gaps, highlights promising organizations and research areas, and encourages diversification of the funder base beyond a few major donors.
Open Philanthropy is a major philanthropic organization that funds work across global health, AI safety, biosecurity, and other cause areas. Their grants database provides transparency into which organizations and research directions receive funding. They are one of the largest funders of AI safety and existential risk research.
The AI Safety Fund (AISF) is a $10 million+ collaborative initiative launched in October 2023 by Anthropic, Google, Microsoft, and OpenAI (via the Frontier Model Forum) along with philanthropic partners to fund independent AI safety and security research. It has distributed two rounds of grants focused on responsible frontier AI development, public safety risk reduction, and standardized third-party capability evaluations. The fund is now directly managed by the Frontier Model Forum following the closure of its original administrator, the Meridian Institute.
Open Philanthropy reviews its 2024 philanthropic activities and outlines priorities for 2025, with emphasis on AI safety research funding, strategic partnerships, and grants spanning global health and catastrophic risk reduction. The report provides transparency into one of the field's largest funders and signals where major resources will flow in the AI safety ecosystem.
The Centre for the Governance of AI (GovAI) research hub aggregates policy-relevant technical and governance research on frontier AI systems, covering topics from biosecurity and cybercrime to labor market impacts and AI auditing. It serves as a comprehensive repository of GovAI's publications spanning multiple years and research themes. The page indexes papers addressing near-term and long-term risks from advanced AI systems.
MIT Technology Review highlights mechanistic interpretability as one of its top breakthrough technologies of 2026, summarizing progress by Anthropic, OpenAI, and Google DeepMind in mapping LLM internal features and tracing model reasoning pathways. The piece covers both sparse autoencoder-based feature mapping and chain-of-thought monitoring as complementary tools for understanding model behavior. It notes ongoing debate about whether LLMs will ever be fully interpretable.
METR is an organization conducting research and evaluations to assess the capabilities and risks of frontier AI systems, focusing on autonomous task completion, AI self-improvement risks, and evaluation integrity. They have developed the 'Time Horizon' metric measuring how long AI agents can autonomously complete software tasks, showing exponential growth over recent years. They work with major AI labs including OpenAI, Anthropic, and Amazon to evaluate catastrophic risk potential.
NIST awarded $20 million to MITRE Corporation to establish two AI Economic Security Centers focused on advancing AI in U.S. manufacturing productivity and protecting critical infrastructure from cyberthreats. The initiative implements recommendations from the White House's America's AI Action Plan and represents a public-private partnership model for accelerating AI development and deployment in national priority areas.
OpenAI's Superalignment team announced a fast grants program to fund external researchers working on technical alignment and interpretability research, aiming to solve the problem of aligning superintelligent AI systems within four years. The program offers grants ranging from $100K to $2M to support academic labs, graduate students, and independent researchers. This reflects OpenAI's strategy of leveraging external talent to accelerate progress on their superalignment research agenda.
An investigative journalism piece examining the philanthropic landscape funding AI regulation and safety efforts, identifying key donors, foundations, and grant recipients shaping the AI governance space. The article maps financial flows from major funders to policy organizations, research groups, and advocacy efforts focused on AI oversight.
This paper provides a critical sociotechnical analysis of Reinforcement Learning from Human Feedback (RLHF) and AI Feedback (RLAIF) as alignment approaches for large language models. The authors argue that while RLHF aims to achieve honesty, harmlessness, and helpfulness, these methods have significant theoretical and practical limitations in capturing the complexity of human ethics and ensuring genuine AI safety. The paper identifies inherent tensions in alignment goals and highlights neglected ethical issues, ultimately calling for a more nuanced and reflective approach to RLxF implementation in AI development.
The Future of Life Institute's AI Safety Index Summer 2025 systematically evaluates leading AI companies on safety practices, finding widespread deficiencies across risk management, transparency, and existential safety planning. Anthropic receives the highest grade of C+, indicating that even the best-performing company falls significantly short of adequate safety standards. The report serves as a comparative benchmark for industry accountability.
This 80,000 Hours problem profile argues that AI systems pursuing goals misaligned with human values could seek to accumulate power and resources in ways that permanently undermine human control. It outlines why this risk is among the most pressing long-term problems and explains the mechanisms by which advanced AI could pose catastrophic or existential threats. The piece serves as an accessible entry point into the case for prioritizing AI safety work.
The ITU's 2025 AI Governance Report provides a comprehensive overview of global AI governance developments, frameworks, and policy trends from an international telecommunications and ICT standards perspective. It examines how nations and international bodies are approaching AI regulation, safety standards, and coordination challenges. The report serves as a reference document for policymakers and stakeholders navigating the evolving AI governance landscape.