Compounding Risks Analysis
AI Compounding Risks Analysis Model
Mathematical framework quantifying how AI risks compound beyond additive effects through four mechanisms (multiplicative probability, severity multiplication, defense negation, nonlinear effects), with racing+deceptive alignment showing 3-8% catastrophic probability and interaction coefficients of 2-10x. Provides specific cost-effectiveness estimates for interventions targeting compound pathways ($1-4M per 1% risk reduction) and demonstrates systematic 2-5x underestimation by traditional additive models.
Overview
When multiple AI risks occur simultaneously, their combined impact often dramatically exceeds simple addition. This mathematical framework analyzes how racing dynamics, deceptive alignment, and lock-in scenarios interact through four compounding mechanisms. The central insight: a world with three moderate risks isn't 3x as dangerous as one with a single risk—it can be 10-20x more dangerous due to multiplicative interactions.
Analysis of high-risk combinations reveals that racing+deceptive alignment scenarios carry 3-8% catastrophic probability, while mesa-optimization+scheming pathways show 2-6% existential risk. Traditional additive risk models systematically underestimate total danger by factors of 2-5x because they ignore how risks amplify each other's likelihood, severity, and defensive evasion.
The framework provides quantitative interaction coefficients (α values of 2-10x for severity multiplication, 3-6x for probability amplification) and mathematical models to correct this systematic underestimation. This matters for resource allocation: reducing compound pathways often provides higher leverage than addressing individual risks in isolation.
Risk Compounding Assessment
| Risk Combination | Interaction Type | Compound Probability | Severity Multiplier | Confidence Level |
|---|---|---|---|---|
| Racing + Deceptive Alignment | Probability multiplication | 15.8% vs 4.5% baseline | 3.5x | Medium |
| Deceptive + Lock-in | Severity multiplication | 8% | 8-10x | Medium |
| Expertise Atrophy + Corrigibility Failure | Defense negation | Variable | 3.3x | Medium-High |
| Mesa-opt + Scheming | Nonlinear combined | 2-6% catastrophic | Discontinuous | Medium |
| Epistemic Collapse + Democratic Failure | Threshold crossing | 8-20% | Qualitative change | Low |
Compounding Mechanisms Framework
Mathematical Foundation
Traditional additive models dramatically underestimate compound risk:
| Model Type | Formula | Typical Underestimate | Use Case |
|---|---|---|---|
| Naive Additive | 2-5x underestimate | Individual risk planning | |
| Multiplicative | 1.5-3x underestimate | Overlapping vulnerabilities | |
| Synergistic (Recommended) | Baseline accuracy | Compound risk assessment |
Synergistic Model (Full Specification):
Where α coefficients represent pairwise interaction strength and β coefficients capture three-way interactions.
Type 1: Multiplicative Probability
When Risk A increases the likelihood of Risk B:
| Scenario | P(Mesa-opt) | P(Deceptive | Mesa-opt) | Combined Probability | Compounding Factor |
|---|---|---|---|---|
| Baseline (no racing) | 15% | 30% | 4.5% | 1x |
| Moderate racing | 25% | 40% | 10% | 2.2x |
| Intense racing | 35% | 45% | 15.8% | 3.5x |
| Extreme racing | 50% | 55% | 27.5% | 6.1x |
Mechanism: Racing dynamics compress safety timelines → inadequate testing → higher probability of mesa-optimization → higher probability of deceptive alignment.
Type 2: Severity Multiplication
When combined risks produce worse outcomes than the sum of individual impacts:
| Individual Risk | Standalone Impact | Combined with Lock-in | Severity Multiplier |
|---|---|---|---|
| Deceptive Alignment | High (misaligned optimization) | Catastrophic (trapped forever) | 8-10x |
| Corrigibility Failure | Medium (hard to correct) | Severe (permanent embedding) | 4-6x |
| Sycophancy | Low (biased advice) | High (institutionalized bias) | 3-5x |
Type 3: Defense Negation
When Risk A disables safeguards against Risk B:
| Human Expertise Level | P(Detect Deceptive Behavior) | P(Catastrophe | Undetected) | Total Catastrophe Risk |
|---|---|---|---|
| Full expertise maintained | 60% | 15% | 6% |
| Moderate expertise atrophy | 35% | 30% | 19.5% (3.3x) |
| Severe expertise atrophy | 15% | 50% | 42.5% (7x) |
Type 4: Nonlinear Combined Effects
When interactions produce qualitatively different outcomes:
| Combined Stressors | Individual Effect | Compound Effect | Threshold Behavior |
|---|---|---|---|
| Epistemic degradation alone | Manageable stress on institutions | - | Linear response |
| Political polarization alone | Manageable stress on institutions | - | Linear response |
| Both together | - | Democratic system failure | Phase transition |
Diagram (loading…)
flowchart TD A[Individual Risks] --> B[Additive Model<br/>R₁ + R₂ + R₃] A --> C[Compound Model<br/>Σ + ΣΣα + ΣΣΣβ] B --> D[Underestimate<br/>2-5x too low] C --> E[Accurate Assessment<br/>Captures interactions] F[Racing Dynamics] --> G[Higher Mesa-opt Probability] G --> H[Higher Deceptive Alignment] H --> I[Lock-in Risk] I --> J[Catastrophic Outcome<br/>3-8% probability] style D fill:#ffcccc style E fill:#ccffcc style J fill:#ff9999
High-Risk Compound Combinations
Critical Interaction Matrix
| Risk A | Risk B | Interaction Strength (α) | Combined Catastrophe Risk | Evidence Source |
|---|---|---|---|---|
| Racing | Deceptive Alignment | 3.0-5.0 | 3-8% | Amodei et al. (2016)↗📄 paper★★★☆☆arXivConcrete Problems in AI SafetyWidely considered one of the most influential foundational papers in technical AI safety; frequently cited as a key reference for the research agenda pursued by groups like OpenAI, Anthropic, and DeepMind safety teams.Dario Amodei, Chris Olah, Jacob Steinhardt et al. (2016)2,962 citationsThis foundational paper by Amodei et al. identifies five practical AI safety research problems: avoiding side effects, avoiding reward hacking, scalable oversight, safe explorat...ai-safetyalignmenttechnical-safetyevaluation+5Source ↗ |
| Deceptive Alignment | Lock-in | 5.0-10.0 | 8-15% | Carlsmith (2021)↗📄 paper★★★☆☆arXivCarlsmith (2021)This paper addresses controllability in neural data-to-text generation through a Plan-then-Generate framework, relevant to AI safety concerns about controlling neural model outputs and ensuring generated text follows desired structures and constraints.Yixuan Su, David Vandyke, Sihui Wang et al. (2021)91 citationsThis paper introduces PlanGen, a Plan-then-Generate framework designed to enhance controllability in neural data-to-text generation models. The approach addresses a key limitati...capabilitiesevaluationeconomicrisk-interactions+1Source ↗ |
| Mesa-optimization | Scheming | 3.0-6.0 | 2-6% | Hubinger et al. (2019)↗📄 paper★★★☆☆arXivRisks from Learned OptimizationFoundational paper introducing mesa-optimization, analyzing risks when learned models become optimizers themselves, directly addressing transparency and safety concerns in advanced ML systems.Evan Hubinger, Chris van Merwijk, Vladimir Mikulik et al. (2019)This paper introduces the concept of mesa-optimization, where a learned model (such as a neural network) functions as an optimizer itself. The authors analyze two critical safet...alignmentsafetymesa-optimizationrisk-interactions+1Source ↗ |
| Expertise Atrophy | Corrigibility Failure | 2.0-4.0 | 5-12% | RAND Corporation↗🔗 web★★★★☆RAND CorporationRAND Report: Compounding Risks and Systemic Interactions in AI Safety (RRA2747-1)A RAND Corporation research report (RRA2747-1) examining compounding and interacting risks in complex systems; relevant to AI safety practitioners and policymakers thinking about systemic and cascading failure scenarios beyond single-point risk analysis.This RAND Corporation research report examines how multiple risks can interact and compound in complex AI and technology systems, applying systems-thinking frameworks to underst...existential-riskgovernancepolicyrisk-interactions+6Source ↗ |
| Concentration | Authoritarian Tools | 3.0-5.0 | 5-12% | Center for AI Safety↗🔗 web★★★★☆Center for AI SafetyCenter for AI Safety (CAIS) – HomepageCAIS is one of the leading AI safety research organizations; this homepage provides an entry point to their research, public statements, and field-building initiatives relevant to anyone working in or entering AI safety.The Center for AI Safety (CAIS) is a research organization focused on mitigating catastrophic and existential risks from advanced AI systems. It conducts technical research, pub...ai-safetyexistential-riskalignmentfield-building+4Source ↗ |
Three-Way Compound Scenarios
| Scenario | Risk Combination | Compound Probability | Recovery Likelihood | Assessment |
|---|---|---|---|---|
| Technical Cascade | Racing + Mesa-opt + Deceptive | 3-8% | Very Low | Most dangerous technical pathway |
| Structural Lock-in | Deceptive + Lock-in + Authoritarian | 5-12% | Near-zero | Permanent misaligned control |
| Oversight Failure | Sycophancy + Expertise + Corrigibility | 5-15% | Low | No human check on behavior |
| Coordination Collapse | Epistemic + Trust + Democratic | 8-20% | Medium | Civilization coordination failure |
Quantitative Risk Calculation
Worked Example: Racing + Deceptive + Lock-in
Base Probabilities:
- Racing dynamics (R₁): 30%
- Deceptive alignment (R₂): 15%
- Lock-in scenario (R₃): 20%
Interaction Coefficients:
- α₁₂ = 2.0 (racing increases deceptive probability)
- α₁₃ = 1.5 (racing increases lock-in probability)
- α₂₃ = 3.0 (deceptive alignment strongly increases lock-in severity)
Calculation:
Interpretation: 92% probability that at least one major compound effect occurs, with severity multiplication making outcomes far worse than individual risks would suggest.
Scenario Probability Analysis
| Scenario | 2030 Probability | 2040 Probability | Compound Risk Level | Primary Drivers |
|---|---|---|---|---|
| Correlated Realization | 8% | 15% | Critical (0.9+) | Competitive pressure drives all risks |
| Gradual Compounding | 25% | 40% | High (0.6-0.8) | Slow interaction buildup |
| Successful Decoupling | 15% | 25% | Moderate (0.3-0.5) | Interventions break key links |
| Threshold Cascade | 12% | 20% | Variable | Sudden phase transition |
Expected Compound Risk by 2040:
Current State & Trajectory
Present Compound Risk Indicators
| Indicator | Current Level | Trend | 2030 Projection | Key Evidence |
|---|---|---|---|---|
| Racing intensity | Moderate-High | ↗ Increasing | High | AI lab competition↗🔗 web★★★★☆AnthropicAnthropic's Core Views on AI SafetyThis is Anthropic's official statement of organizational philosophy and research strategy, written in March 2023. It serves as a foundational document for understanding Anthropic's motivations and approach, making it essential reading for understanding one of the leading AI safety-focused labs.Anthropic outlines its foundational beliefs that transformative AI may arrive within a decade, that no one currently knows how to train robustly safe powerful AI systems, and th...ai-safetyalignmentexistential-riskcapabilities+6Source ↗, compute scaling↗🔗 web★★★★☆Epoch AIEpoch AI - AI Research and Forecasting OrganizationEpoch AI is a key reference organization for empirical data on AI scaling trends; their compute and training run databases are widely cited in AI safety and governance discussions.Epoch AI is a research organization focused on investigating and forecasting trends in artificial intelligence, particularly around compute, training data, and algorithmic progr...capabilitiescomputegovernancepolicy+4Source ↗ |
| Technical risk correlation | Medium | ↗ Increasing | Medium-High | Mesa-optimization research↗✏️ blog★★★☆☆Alignment ForumAI Alignment ForumThe AI Alignment Forum is the primary online community for technical AI safety research; the featured post represents foundational agent-foundations work questioning utility function orthodoxy in decision theory.The AI Alignment Forum is a central community platform for technical AI safety and alignment research discussion. The featured post argues against 'reductive utility' (utility f...alignmentai-safetytechnical-safetydecision-theory+1Source ↗ |
| Lock-in pressure | Low-Medium | ↗ Increasing | Medium-High | Market concentration↗🔗 webMarket concentrationThis resource is a broken link (404 error) to a fiscal policy page unrelated to AI safety; it should likely be removed from the knowledge base or its URL corrected.This URL leads to a 404 error page on the Committee for a Responsible Federal Budget (CRFB) website. The intended resource appears to be a national debt tracking tool, but the p...policygovernanceSource ↗ |
| Expertise preservation | Medium | ↘ Decreasing | Low-Medium | RAND workforce analysis↗🔗 web★★★★☆RAND CorporationRAND Provides Objective Research Services and Public Policy AnalysisRAND Corporation's homepage serves as an entry point to a large body of policy-relevant research on AI governance, national security, and emerging technology risks, useful as a reference for policymakers and researchers in the AI safety space.RAND Corporation is a nonprofit research organization providing objective analysis and policy recommendations across a wide range of topics including national security, technolo...governancepolicyai-safetycybersecurity+4Source ↗ |
| Defensive capabilities | Medium | → Stable | Medium | AI safety funding↗🔗 web★★★☆☆AI ImpactsAI ImpactsAI Impacts is a key empirical research hub for AI safety; its expert surveys and wiki pages are frequently cited in discussions about AI timelines, risk probability, and strategic forecasting within the broader AI safety community.AI Impacts is a research organization that investigates empirical questions relevant to AI forecasting and safety, including AI timelines, discontinuous progress risks, and exis...ai-safetyexistential-riskcapabilitiesevaluation+3Source ↗ |
Key Trajectory Drivers
Accelerating Factors:
- Geopolitical competition intensifying AI race
- Scaling laws driving capability advances
- Economic incentives favoring rapid deployment
- Regulatory lag behind capability development
Mitigating Factors:
- Growing AI safety community and funding
- Industry voluntary commitments
- International coordination efforts (Seoul Declaration)
- Technical progress on interpretability and alignment
High-Leverage Interventions
Intervention Effectiveness Matrix
| Intervention | Compound Pathways Addressed | Risk Reduction | Annual Cost | Cost-Effectiveness |
|---|---|---|---|---|
| Reduce racing dynamics | Racing × all technical risks | 40-60% | $500M-1B | $2-4M per 1% reduction |
| Preserve human expertise | Expertise × all oversight risks | 30-50% | $200M-500M | $1-3M per 1% reduction |
| Prevent lock-in | Lock-in × all structural risks | 50-70% | $300M-600M | $1-2M per 1% reduction |
| Maintain epistemic health | Epistemic × democratic risks | 30-50% | $100M-300M | $1-2M per 1% reduction |
| International coordination | Racing × concentration × authoritarian | 30-50% | $200M-500M | $1-3M per 1% reduction |
Breaking Compound Cascades
Diagram (loading…)
flowchart TD A[Racing Dynamics] -->|α=2.0| B[Technical Risks] B -->|α=4.0| C[Lock-in Effects] C -->|α=3.5| D[Structural Risks] I1[Slow racing] -.->|Intervention 1| A I2[Preserve expertise] -.->|Intervention 2| B I3[Prevent lock-in] -.->|Intervention 3| C I4[Democratic safeguards] -.->|Intervention 4| D style A fill:#ffcccc style B fill:#ffcccc style C fill:#ffcccc style D fill:#ff9999 style I1 fill:#ccffcc style I2 fill:#ccffcc style I3 fill:#ccffcc style I4 fill:#ccffcc
Strategic Insights:
- Early intervention (before racing intensifies) provides highest leverage
- Breaking any major pathway (racing→technical, technical→lock-in) dramatically reduces compound risk
- Preserving human oversight capabilities acts as universal circuit breaker
Key Uncertainties & Cruxes
Critical Unknowns
Key Questions
- ?Are interaction coefficients stable across different AI capability levels?
- ?Which three-way combinations pose the highest existential risk?
- ?Can we detect threshold approaches before irreversible cascades begin?
- ?Do positive interactions (risks that reduce each other) meaningfully offset negative ones?
- ?How do defensive interventions interact - do they compound positively?
Expert Disagreement Areas
| Uncertainty | Optimistic View | Pessimistic View | Current Evidence |
|---|---|---|---|
| Interaction stability | Coefficients decrease as AI improves | Coefficients increase with capability | Mixed signals from capability research |
| Threshold existence | Gradual degradation, no sharp cutoffs | Clear tipping points exist | Limited historical analogies |
| Intervention effectiveness | Targeted interventions highly effective | System too complex for reliable intervention | Early positive results from responsible scaling |
| Timeline urgency | Compound effects emerge slowly (10+ years) | Critical combinations possible by 2030 | AGI timeline uncertainty |
Limitations & Model Validity
Methodological Constraints
Interaction coefficient uncertainty: α values are based primarily on expert judgment and theoretical reasoning rather than empirical measurement. Different analysts could reasonably propose coefficients differing by 2-3x, dramatically changing risk estimates. The Center for AI Safety↗🔗 web★★★★☆Center for AI SafetyCenter for AI Safety (CAIS) – HomepageCAIS is one of the leading AI safety research organizations; this homepage provides an entry point to their research, public statements, and field-building initiatives relevant to anyone working in or entering AI safety.The Center for AI Safety (CAIS) is a research organization focused on mitigating catastrophic and existential risks from advanced AI systems. It conducts technical research, pub...ai-safetyexistential-riskalignmentfield-building+4Source ↗ and Future of Humanity Institute↗🔗 web★★★★☆Future of Humanity Institute**Future of Humanity Institute**FHI was a pioneering institution in AI safety and existential risk; this archived homepage is useful for historical context and understanding the institutional origins of the field, though the site is no longer actively updated following its April 2024 closure.The official website of the Future of Humanity Institute (FHI), an Oxford University research center that was foundational in establishing the fields of existential risk researc...ai-safetyexistential-riskalignmentgovernance+3Source ↗ have noted similar calibration challenges in compound risk assessment.
Higher-order effects: The model focuses on pairwise interactions but real catastrophic scenarios likely require 4+ simultaneous risks. The AI Risk Portfolio Analysis suggests higher-order terms may dominate in extreme scenarios.
Temporal dynamics: Risk probabilities and interaction strengths evolve as AI capabilities advance. Racing dynamics mild today may intensify rapidly; interaction effects manageable at current capability levels may become overwhelming as systems become more powerful.
Validation Challenges
| Challenge | Impact | Mitigation Strategy |
|---|---|---|
| Pre-catastrophe validation impossible | Cannot test model accuracy without experiencing failures | Use historical analogies, stress-test assumptions |
| Expert disagreement on coefficients | 2-3x uncertainty in final estimates | Report ranges, sensitivity analysis |
| Intervention interaction effects | Reducing one risk might increase others | Model defensive interactions explicitly |
| Threshold precision claims | False precision in "tipping point" language | Emphasize continuous degradation |
Sources & Resources
Academic Literature
| Source | Focus | Key Finding | Relevance |
|---|---|---|---|
| Amodei et al. (2016)↗📄 paper★★★☆☆arXivConcrete Problems in AI SafetyWidely considered one of the most influential foundational papers in technical AI safety; frequently cited as a key reference for the research agenda pursued by groups like OpenAI, Anthropic, and DeepMind safety teams.Dario Amodei, Chris Olah, Jacob Steinhardt et al. (2016)2,962 citationsThis foundational paper by Amodei et al. identifies five practical AI safety research problems: avoiding side effects, avoiding reward hacking, scalable oversight, safe explorat...ai-safetyalignmenttechnical-safetyevaluation+5Source ↗ | AI safety problems | Risk interactions in reward systems | High - foundational framework |
| Carlsmith (2021)↗📄 paper★★★☆☆arXivCarlsmith (2021)This paper addresses controllability in neural data-to-text generation through a Plan-then-Generate framework, relevant to AI safety concerns about controlling neural model outputs and ensuring generated text follows desired structures and constraints.Yixuan Su, David Vandyke, Sihui Wang et al. (2021)91 citationsThis paper introduces PlanGen, a Plan-then-Generate framework designed to enhance controllability in neural data-to-text generation models. The approach addresses a key limitati...capabilitiesevaluationeconomicrisk-interactions+1Source ↗ | Power-seeking AI | Lock-in mechanism analysis | High - severity multiplication |
| Hubinger et al. (2019)↗📄 paper★★★☆☆arXivRisks from Learned OptimizationFoundational paper introducing mesa-optimization, analyzing risks when learned models become optimizers themselves, directly addressing transparency and safety concerns in advanced ML systems.Evan Hubinger, Chris van Merwijk, Vladimir Mikulik et al. (2019)This paper introduces the concept of mesa-optimization, where a learned model (such as a neural network) functions as an optimizer itself. The authors analyze two critical safet...alignmentsafetymesa-optimizationrisk-interactions+1Source ↗ | Mesa-optimization | Deceptive alignment pathways | High - compound technical risks |
| Russell (2019)↗🔗 webHuman Compatible: Artificial Intelligence and the Problem of Control (Russell, 2019)Note: the page content appears to have loaded incorrectly showing Carlo Rovelli's 'The Order of Time' instead of Russell's book; metadata reflects the intended resource 'Human Compatible' by Stuart Russell, a foundational AI safety text widely cited in the field.Stuart Russell's 'Human Compatible' argues that the standard model of AI development—building systems that optimize fixed objectives—is fundamentally flawed and poses existentia...ai-safetyalignmentexistential-risktechnical-safety+2Source ↗ | AI alignment | Compound failure modes | Medium - conceptual framework |
Research Organizations
| Organization | Contribution | Key Publications |
|---|---|---|
| Anthropic↗🔗 web★★★★☆AnthropicAnthropic - AI Safety Company HomepageAnthropic is a primary institutional actor in AI safety; understanding their research agenda and deployment philosophy is relevant context for the broader AI safety ecosystem, though this homepage itself is a reference point rather than a primary technical resource.Anthropic is an AI safety company focused on building reliable, interpretable, and steerable AI systems. The company conducts frontier AI research and develops Claude, its famil...ai-safetyalignmentcapabilitiesinterpretability+6Source ↗ | Compound risk research | Constitutional AI↗📄 paper★★★☆☆arXivConstitutional AI: Harmlessness from AI FeedbackConstitutional AI paper presenting a method for training AI systems to be harmless using AI feedback based on a set of constitutional principles, addressing a fundamental challenge in AI alignment and safety.Yanuo Zhou (2025)2,673 citationsanthropickb-sourceSource ↗ |
| Center for AI Safety↗🔗 web★★★★☆Center for AI SafetyCenter for AI Safety (CAIS) – HomepageCAIS is one of the leading AI safety research organizations; this homepage provides an entry point to their research, public statements, and field-building initiatives relevant to anyone working in or entering AI safety.The Center for AI Safety (CAIS) is a research organization focused on mitigating catastrophic and existential risks from advanced AI systems. It conducts technical research, pub...ai-safetyexistential-riskalignmentfield-building+4Source ↗ | Risk interaction analysis | AI Risk Statement↗🔗 web★★★★☆Center for AI SafetyStatement on AI Risk - Center for AI SafetyThis landmark 2023 open letter is frequently cited as a turning point in mainstream acknowledgment of existential AI risk, bringing together signatories from across the AI industry and policy world under a single succinct statement.A concise open letter coordinated by the Center for AI Safety stating that mitigating extinction-level risk from AI should be a global priority alongside pandemics and nuclear w...existential-riskai-safetygovernancepolicy+3Source ↗ |
| RAND Corporation↗🔗 web★★★★☆RAND CorporationRAND Provides Objective Research Services and Public Policy AnalysisRAND Corporation's homepage serves as an entry point to a large body of policy-relevant research on AI governance, national security, and emerging technology risks, useful as a reference for policymakers and researchers in the AI safety space.RAND Corporation is a nonprofit research organization providing objective analysis and policy recommendations across a wide range of topics including national security, technolo...governancepolicyai-safetycybersecurity+4Source ↗ | Expertise atrophy studies | AI Workforce Analysis↗🔗 web★★★★☆RAND CorporationRAND Report: Compounding Risks and Systemic Interactions in AI Safety (RRA2747-1)A RAND Corporation research report (RRA2747-1) examining compounding and interacting risks in complex systems; relevant to AI safety practitioners and policymakers thinking about systemic and cascading failure scenarios beyond single-point risk analysis.This RAND Corporation research report examines how multiple risks can interact and compound in complex AI and technology systems, applying systems-thinking frameworks to underst...existential-riskgovernancepolicyrisk-interactions+6Source ↗ |
| Future of Humanity Institute↗🔗 web★★★★☆Future of Humanity Institute**Future of Humanity Institute**FHI was a pioneering institution in AI safety and existential risk; this archived homepage is useful for historical context and understanding the institutional origins of the field, though the site is no longer actively updated following its April 2024 closure.The official website of the Future of Humanity Institute (FHI), an Oxford University research center that was foundational in establishing the fields of existential risk researc...ai-safetyexistential-riskalignmentgovernance+3Source ↗ | Existential risk modeling | Global Catastrophic Risks↗🔗 web★★★★☆Future of Humanity InstituteFuture of Humanity Institute (2019)FHI at Oxford was one of the foundational institutions in AI safety and existential risk research; this page documents its research agenda circa 2019, useful for understanding the intellectual origins of many key ideas in the field.This page outlines the major research areas pursued by the Future of Humanity Institute (FHI) at Oxford University, covering existential risk, AI safety, macrostrategy, and huma...existential-riskai-safetygovernancepolicy+4Source ↗ |
Policy & Governance
| Resource | Focus | Application |
|---|---|---|
| NIST AI Risk Management Framework↗🏛️ government★★★★★NISTNIST AI Risk Management FrameworkThe NIST AI RMF is a widely referenced U.S. government standard for AI risk governance, frequently cited in policy discussions and used by organizations building internal AI safety and compliance programs; relevant to AI safety researchers tracking institutional governance approaches.The NIST AI RMF is a voluntary, consensus-driven framework released in January 2023 to help organizations identify, assess, and manage risks associated with AI systems while pro...governancepolicyai-safetydeployment+4Source ↗ | Risk assessment methodology | Compound risk evaluation |
| UK AI Safety Institute↗🏛️ government★★★★☆UK AI Safety InstituteUK AI Safety Institute (AISI)AISI is a key institutional actor in AI safety, representing one of the first government-led efforts to systematically evaluate frontier AI models; its work and publications are directly relevant to governance, evaluation methodology, and international AI safety coordination.The UK AI Safety Institute (AISI) is the UK government's dedicated body for evaluating and mitigating risks from advanced AI systems. It conducts technical safety research, deve...ai-safetygovernancepolicyevaluation+5Source ↗ | Safety evaluation | Interaction testing protocols |
| EU AI Act↗🔗 web★★★★☆European UnionEU Artificial Intelligence Act - Original Commission Proposal (2021)This is the original 2021 European Commission proposal for the EU AI Act; the final regulation was adopted in 2024 and is considered a landmark in AI governance, directly relevant to deployment standards and safety requirements for AI systems in the EU market.The European Commission's 2021 legislative proposal establishing harmonized rules for artificial intelligence across the EU, introducing a risk-based regulatory framework. It cl...governancepolicyai-safetydeployment+5Source ↗ | Regulatory framework | Compound risk regulation |
References
RAND Corporation is a nonprofit research organization providing objective analysis and policy recommendations across a wide range of topics including national security, technology, governance, and emerging risks. It produces influential studies on AI policy, cybersecurity, and global governance challenges. RAND's work is frequently cited by governments and policymakers worldwide.
Epoch AI is a research organization focused on investigating and forecasting trends in artificial intelligence, particularly around compute, training data, and algorithmic progress. They produce empirical analyses and datasets to inform understanding of AI development trajectories and support better decision-making in AI governance and safety.
The official website of the Future of Humanity Institute (FHI), an Oxford University research center that was foundational in establishing the fields of existential risk research and AI safety. FHI closed on 16 April 2024 after approximately two decades of influential work. The site now serves as an archived record of the institution's history, research agenda, and legacy.
Stuart Russell's 'Human Compatible' argues that the standard model of AI development—building systems that optimize fixed objectives—is fundamentally flawed and poses existential risks. Russell proposes a new framework based on machines that are uncertain about human preferences and defer to humans, making AI inherently beneficial and safe by design.
The AI Alignment Forum is a central community platform for technical AI safety and alignment research discussion. The featured post argues against 'reductive utility' (utility functions over possible worlds) and proposes the Jeffrey-Bolker framework as an alternative that avoids ontological crises and computability constraints by grounding preferences in agent-relative events rather than universal physics.
The European Commission's 2021 legislative proposal establishing harmonized rules for artificial intelligence across the EU, introducing a risk-based regulatory framework. It classifies AI systems into prohibited, high-risk, and lower-risk categories, imposing requirements for transparency, human oversight, and conformity assessments on high-risk applications. This proposal initiated the legislative process that culminated in the world's first comprehensive AI regulation.
AI Impacts is a research organization that investigates empirical questions relevant to AI forecasting and safety, including AI timelines, discontinuous progress risks, and existential risk arguments. It maintains a wiki and blog featuring expert surveys, historical analyses, and structured arguments about transformative AI development. Notable outputs include periodic expert surveys on AI progress timelines.
This URL leads to a 404 error page on the Committee for a Responsible Federal Budget (CRFB) website. The intended resource appears to be a national debt tracking tool, but the page no longer exists at this location.
A concise open letter coordinated by the Center for AI Safety stating that mitigating extinction-level risk from AI should be a global priority alongside pandemics and nuclear war. The statement has been signed by hundreds of leading AI researchers, executives, and public figures including Geoffrey Hinton, Yoshua Bengio, Sam Altman, and Demis Hassabis, lending significant institutional credibility to existential AI risk concerns.
The NIST AI RMF is a voluntary, consensus-driven framework released in January 2023 to help organizations identify, assess, and manage risks associated with AI systems while promoting trustworthiness across design, development, deployment, and evaluation. It provides structured guidance organized around core functions and is accompanied by a Playbook, Roadmap, and a Generative AI Profile (2024) addressing risks specific to generative AI systems.
Anthropic outlines its foundational beliefs that transformative AI may arrive within a decade, that no one currently knows how to train robustly safe powerful AI systems, and that a multi-faceted empirically-driven approach to safety research is urgently needed. The post explains Anthropic's strategic rationale for pursuing safety work across multiple scenarios and research directions including scalable oversight, mechanistic interpretability, and process-oriented learning.
This paper introduces PlanGen, a Plan-then-Generate framework designed to enhance controllability in neural data-to-text generation models. The approach addresses a key limitation of existing neural models—their inability to control output structure—by separating planning from generation. Evaluated on ToTTo and WebNLG benchmarks, PlanGen demonstrates improved control over both intra-sentence and inter-sentence structure while achieving better generation quality and output diversity compared to previous state-of-the-art methods, as validated through human and automatic evaluations.
This page outlines the major research areas pursued by the Future of Humanity Institute (FHI) at Oxford University, covering existential risk, AI safety, macrostrategy, and human enhancement. It serves as a hub for understanding FHI's interdisciplinary approach to long-term risks facing humanity. The institute applies philosophy, mathematics, and social sciences to identify and mitigate catastrophic and existential risks.
The Center for AI Safety (CAIS) is a research organization focused on mitigating catastrophic and existential risks from advanced AI systems. It conducts technical research, publishes surveys and statements, and supports field-building efforts across academia and industry. CAIS is notable for its broad coalition-building, including its widely-cited statement on AI extinction risk signed by leading researchers.
Anthropic is an AI safety company focused on building reliable, interpretable, and steerable AI systems. The company conducts frontier AI research and develops Claude, its family of AI assistants, with a stated mission of responsible development and maintenance of advanced AI for long-term human benefit.
This paper introduces the concept of mesa-optimization, where a learned model (such as a neural network) functions as an optimizer itself. The authors analyze two critical safety concerns: (1) identifying when and why learned models become optimizers, and (2) understanding how a mesa-optimizer's objective function may diverge from its training loss and how to ensure alignment. The paper provides a comprehensive framework for understanding these phenomena and outlines important directions for future research in AI safety and transparency.
This foundational paper by Amodei et al. identifies five practical AI safety research problems: avoiding side effects, avoiding reward hacking, scalable oversight, safe exploration, and robustness to distributional shift. It frames these as concrete technical challenges arising from real-world ML system design, providing a research agenda that has significantly shaped the field of AI safety.
This RAND Corporation research report examines how multiple risks can interact and compound in complex AI and technology systems, applying systems-thinking frameworks to understand cascading failures and emergent dangers. It likely analyzes risk interactions that are not captured when evaluating hazards in isolation, offering policy-relevant insights for AI governance and safety planning.
The UK AI Safety Institute (AISI) is the UK government's dedicated body for evaluating and mitigating risks from advanced AI systems. It conducts technical safety research, develops evaluation frameworks for frontier AI models, and works with international partners to inform global AI governance and policy.