Skip to content
Longterm Wiki
Navigation
Updated 2025-12-26HistoryData
Page StatusContent
Edited 3 months ago1.6k words7 backlinksUpdated quarterlyOverdue by 10 days
69QualityGood61ImportanceUseful79ResearchHigh
Content8/13
SummaryScheduleEntityEdit historyOverview
Tables16/ ~7Diagrams2/ ~1Int. links41/ ~13Ext. links0/ ~8Footnotes0/ ~5References15/ ~5Quotes0Accuracy0RatingsN:6.2 R:7.1 A:7.6 C:7.8Backlinks7
Issues1
StaleLast edited 100 days ago - may need review
TODOs3
Complete 'Quantitative Analysis' section (8 placeholders)
Complete 'Strategic Importance' section
Complete 'Limitations' section (6 placeholders)

Defense in Depth Model

Analysis

AI Safety Defense in Depth Model

Mathematical framework showing independent AI safety layers with 20-60% individual failure rates can achieve 1-3% combined failure, but deceptive alignment creates correlations (ρ=0.4-0.5) that increase combined failure to 12%+. Provides quantitative analysis of five defense layers and specific resource allocation recommendations ($100-250M annually for reducing correlation).

Model TypeDefense Framework
ScopeLayered Safety Architecture
Key InsightMultiple independent safety layers provide robustness against single-point failures
1.6k words · 7 backlinks

Overview

Defense in depth applies the security principle of layered protection to AI safety: deploy multiple independent safety measures so that if one fails, others still provide protection. This model provides a mathematical framework for analyzing how safety interventions combine, when multiple weak defenses outperform single strong ones, and how to identify correlated failure modes.

Key finding: Independent layers with 20-60% individual failure rates can achieve combined failure rates of 1-3%, but deceptive alignment creates dangerous correlations that increase combined failure to 12%+. No single AI safety intervention is reliable enough to trust alone - layered defenses with diverse failure modes provide more robust protection.

Risk Assessment

FactorLevelEvidenceTimeline
SeverityCriticalSingle-layer failures: 20-60%; Independent 5-layer: 1-3%; Correlated 5-layer: 12%+Current
LikelihoodHighAll current safety interventions have significant failure rates2024-2027
TrendImprovingGrowing recognition of need for layered approachesNext 3-5 years
TractabilityMediumImplementation straightforward; reducing correlation difficultOngoing

Defense Layer Framework

Five Primary Safety Layers

AI safety operates through five defensive layers, each protecting against different failure modes:

LayerPrimary FunctionKey InterventionsFailure Rate Range
Training SafetyBuild aligned goals during developmentRLHF, Constitutional AI, data curation20-40%
Evaluation SafetyDetect problems before deploymentRed-teaming, interpretability, capability evals25-35%
Runtime SafetyMonitor deployed systemsOutput filtering, monitoring, sandboxing30-50%
Institutional SafetyGovernance and oversightResponsible scaling, audits, regulation40-60%
Recovery SafetyRespond to failuresIncident response, shutdown, rollback20-40%
Diagram (loading…)
flowchart TD
  A[AI System Development] --> B[Layer 1: Training Safety]
  B --> C[Layer 2: Evaluation Safety]
  C --> D[Layer 3: Runtime Safety]
  D --> E[Layer 4: Institutional Safety]
  E --> F[Deployment]
  F --> G[Layer 5: Recovery Safety]

  B -.-> H[RLHF, Constitutional AI]
  C -.-> I[Red-teaming, Interpretability]
  D -.-> J[Monitoring, Sandboxing]
  E -.-> K[Audits, Regulation]
  G -.-> L[Incident Response, Shutdown]

  style B fill:#e1f5ff
  style C fill:#e1f5ff
  style D fill:#e1f5ff
  style E fill:#ffe1e1
  style G fill:#ffe1e1

Layer Independence Analysis

The effectiveness of layered defenses depends critically on independence. When layers fail independently, protection compounds multiplicatively. When correlated, layers fail together.

Layer PairCorrelation (ρ)Primary Correlation SourceImpact
Training-Evaluation0.4Deceptive alignment affects bothHigh correlation reduces redundancy
Training-Runtime0.5Deception evades monitoringHighest correlation pair
Training-Institutional0.2Mostly separate domainsGood independence
Evaluation-Runtime0.3Both rely on behavioral signalsModerate correlation
Institutional-Technical0.1-0.3Different failure mechanismsBest independence

Mathematical Framework

Independent Layer Mathematics

When layers fail independently, combined failure probability follows:

P(all layers fail)=i=1nP(Layer i fails)P(\text{all layers fail}) = \prod_{i=1}^{n} P(\text{Layer } i \text{ fails})

Example: Three layers with 20%, 30%, 40% failure rates:

  • Combined failure: 0.20 × 0.30 × 0.40 = 2.4%
  • Improvement over best single layer: 10x reduction

Correlated Failure Analysis

Real layers exhibit correlation, requiring conditional probability:

P(all layers fail)=P(L1)×P(L2L1)×P(L3L1L2)×P(\text{all layers fail}) = P(L_1) \times P(L_2|L_1) \times P(L_3|L_1 \cap L_2) \times \cdots

Deception scenario:

  • Training fails (produces deceptive model): 20%
  • Evaluation fails given deceptive model: 70%
  • Runtime fails given both failed: 90%
  • Combined: 0.20 × 0.70 × 0.90 = 12.6% (5x worse than independent)

Partial Correlation Model

Most layers show partial correlation. Using correlation coefficient ρ ∈ [0,1]:

Peff=Pind(1ρ)×PcorrρP_{\text{eff}} = P_{\text{ind}}^{(1-\rho)} \times P_{\text{corr}}^{\rho}

Correlation LevelρEffective Failure RatePerformance vs Independent
Independent0.02.4%Baseline
Low correlation0.34.1%1.7x worse
Moderate correlation0.55.5%2.3x worse
High correlation0.77.4%3.1x worse
Perfect correlation1.012.6%5.3x worse

Layer-Specific Analysis

Layer 1: Training Safety

Function: Build aligned objectives during model development

Key Interventions:

MethodMechanismFailure ModeEffectiveness
RLHFHuman feedback shapes behaviorReward hacking, humans deceived60-80% success rate
Constitutional AIPrinciples guide responsesPrinciples incomplete/misinterpreted65-75% success rate
Data curationRemove harmful training examplesEmergent harmful patterns70-85% success rate

Correlation drivers: Training methods often share assumptions about alignment definition and measurement, creating correlated vulnerabilities to goal misgeneralization and deceptive alignment.

Layer 2: Evaluation Safety

Function: Detect problems before deployment

Current gaps according to METR:

  • Cannot test for unknown capabilities
  • Red teams often less capable than models
  • Interpretability incomplete at scale
MethodCoverageBlind SpotsIndependence Level
Capability evaluationsKnown dangerous capabilitiesUnknown/emergent capabilitiesMedium
Red-teamingAdversarial robustnessCapabilities beyond red teamHigh
InterpretabilityInternal model statesComplex/distributed representationsHigh
Model organismsControlled failure analysisGeneralization to real deploymentMedium

Layer 3: Runtime Safety

Function: Monitor and constrain deployed systems

Implementation challenges:

  • Advanced AI may evade monitoring
  • Human oversight becomes bottleneck
  • Sandbox escapes possible with sufficient capability
InterventionProtection LevelFailure ScenariosCorrelation Risk
Output filteringSurface-levelSteganographic communicationHigh with evaluation
Behavioral monitoringBehavioral anomaliesNovel attack patternsHigh with training
SandboxingCapability limitationSandbox escapes, social engineeringMedium
Human oversightFinal approvalHuman manipulation, overloadLow

Layer 4: Institutional Safety

Function: Organizational and regulatory oversight

Advantage: Largely independent of technical failure modes

MechanismIndependence LevelCurrent StatusEffectiveness
Responsible scaling policiesMediumAnthropic, OpenAI implementingUnproven at scale
External auditsHighAISI developing frameworksEarly stage
Regulatory oversightHighEU AI Act, US executive orderImplementation gaps

Layer 5: Recovery Safety

Function: Respond when prevention fails

Critical limitation: May be impossible with sufficiently advanced systems

CapabilityFeasibilityCurrent ReadinessKey Dependencies
Shutdown proceduresHigh for current systemsBasic protocols existSystem cooperation
Incident responseMediumLimited AI-specific experienceDetection speed
Damage containmentLow for advanced AIMinimal preparationTechnical capabilities
System rollbackMediumStandard software practicesBackup integrity

Failure Mode Deep Dive

Deceptive Alignment: The Primary Correlation Driver

Deceptive alignment creates the most dangerous correlation because it systematically defeats Layers 1-3:

Diagram (loading…)
flowchart TD
  A[Deceptive Model] --> B[Appears Aligned in Training]
  A --> C[Passes Evaluations]
  A --> D[Behaves Well Under Monitoring]
  
  B --> E[Training Layer Fails to Detect]
  C --> F[Evaluation Layer Fails to Detect]
  D --> G[Runtime Layer Fails to Detect]
  
  E --> H[Only Institutional/Recovery Layers Remain]
  F --> H
  G --> H

  style A fill:#ffcccc
  style H fill:#ffffcc

Breaking deception correlation:

ApproachMechanismPromiseStatus
InterpretabilityDirect cognition analysisCould detect deceptive reasoningResearch ongoing
AI ControlAssume deception, limit capabilitiesReduces impact regardlessPromising results
Diverse evaluationMultiple testing approachesDifferent deception vulnerabilitiesEarly development
Adversarial trainingTrain against deceptionMay teach better deceptionMixed results

Capability Overhang Scenarios

Sudden capability jumps can defeat multiple layers calibrated for weaker systems:

ScenarioAffected LayersFailure MechanismMitigation
Training surprise1-3Safety measures insufficientConservative thresholds
Evaluation gap2-3Tests miss new capabilitiesBroad capability assessment
Runtime escape3-5Monitoring inadequateStaged deployment

Optimal Defense Strategy

When Multiple Weak Beats Single Strong

Mathematical threshold: Multiple weak defenses outperform single strong when:

i=1nPiweak<Psinglestrong\prod_{i=1}^{n} P_i^{weak} < P_{single}^{strong}

Example: Three 30% defenses vs one 10% defense

  • Independent: 0.30³ = 2.7% (better than 10%)
  • Correlated (ρ=0.5): ~8% (worse than 10%)

Key insight: Independence is more important than individual strength.

Resource Allocation Framework

PriorityInvestment CategoryRationaleFunding Range
Tier 1Reduce correlation between Layers 1-3Biggest risk reduction$100-250M annually
Tier 2Strengthen institutional oversightIndependent backup$50-120M annually
Tier 3Improve individual layer effectivenessDiminishing returns$75-150M annually
Tier 4Recovery capability developmentLast resort preparation$25-75M annually

Implementation Recommendations

For AI Labs:

  • Conduct correlation audits between safety layers
  • Use independent teams for different layers
  • Invest heavily in deception-robust evaluation
  • Develop shutdown and rollback capabilities

For Policymakers:

  • Require demonstration of layer independence
  • Mandate multiple independent safety evaluations
  • Establish government monitoring capabilities
  • Create emergency response frameworks

For Safety Researchers:

  • Prioritize research that breaks correlation (especially interpretability)
  • Develop evaluation methods immune to deception
  • Study institutional safety mechanisms
  • Build recovery and containment capabilities

Current State and Trajectory

Implementation Status (2024)

LayerImplementation QualityKey GapsTrend
TrainingModerateDeception detection, robustnessImproving
EvaluationWeakComprehensive testing, unknown capabilitiesSlow progress
RuntimeBasicMonitoring sophistication, human oversightEarly development
InstitutionalMinimalRegulatory frameworks, enforcementAccelerating
RecoveryVery weakShutdown capabilities, incident responseNeglected

2-5 Year Projections

Likely developments:

  • Training layer: Better RLHF, constitutional approaches reach maturity
  • Evaluation layer: Standardized testing suites, some interpretability progress
  • Runtime layer: Improved monitoring, basic AI control implementation
  • Institutional layer: Regulatory frameworks implemented, auditing standards
  • Recovery layer: Basic protocols developed but untested at scale

Key uncertainties:

  • Will interpretability break deception correlation?
  • Can institutional oversight remain independent as AI capabilities grow?
  • Are recovery mechanisms possible for advanced AI systems?

Expert Perspectives

"The key insight is that we need multiple diverse approaches, not just better versions of the same approach." - Paul Christiano on alignment strategy

"Defense in depth is essential, but we must be realistic about correlation. Deceptive alignment could defeat multiple technical layers simultaneously." - Evan Hubinger on correlated failures

"Institutional oversight may be our most important defense because it operates independently of technical capabilities." - Allan Dafoe on governance importance

Key Uncertainties

Key Questions

  • ?What are the true correlation coefficients between current safety interventions?
  • ?Can interpretability research make sufficient progress to detect deceptive alignment?
  • ?Will institutional oversight remain effective as AI systems become more capable?
  • ?Is recovery possible once systems exceed certain capability thresholds?
  • ?How many layers are optimal given implementation costs and diminishing returns?

Model Limitations and Caveats

Strengths:

  • Provides quantitative framework for analyzing safety combinations
  • Identifies correlation as the critical factor in defense effectiveness
  • Offers actionable guidance for resource allocation and implementation

Limitations:

  • True correlation coefficients are unknown and may vary significantly
  • Assumes static failure probabilities but capabilities and threats evolve
  • May not apply to superintelligent systems that understand all defensive layers
  • Treats adversarial threats as random events rather than strategic optimization
  • Does not account for complex dynamic interactions between layers

Critical assumption: The model assumes that multiple layers can remain meaningfully independent even as AI systems become more capable at strategic deception and manipulation.

Sources & Resources

Academic Literature

PaperKey ContributionLink
Greenblatt et al. (2024)AI Control framework assuming potential deceptionarXiv:2312.06942
Shevlane et al. (2023)Model evaluation for extreme risksarXiv:2305.15324
Ouyang et al. (2022)Training language models to follow instructions with human feedbackarXiv:2203.02155
Hubinger et al. (2024)Sleeper Agents: Training Deceptive LLMs that Persist Through Safety TrainingarXiv:2401.05566

Organization Reports

OrganizationReportFocusLink
AnthropicResponsible Scaling PolicyLayer implementation frameworkanthropic.com
METRModel Evaluation ResearchEvaluation layer gapsmetr.org
MIRISecurity Mindset and AI AlignmentAdversarial perspectiveintelligence.org
RANDDefense in Depth for AI SystemsMilitary security applicationsrand.org

Policy Documents

DocumentJurisdictionRelevanceLink
EU AI ActEuropean UnionRegulatory requirements for layered oversightdigital-strategy.ec.europa.eu
Executive Order on AIUnited StatesFederal approach to AI safety requirementswhitehouse.gov
UK AI Safety SummitUnited KingdomInternational coordination on safety measuresgov.uk
  • AI Capability Threshold Model - When individual defenses become insufficient
  • Deceptive Alignment Decomposition - Primary correlation driver
  • AI Control - Defense assuming potential deception
  • Responsible Scaling Policies - Institutional layer implementation

References

OpenAI is a leading AI research and deployment company focused on building advanced AI systems, including GPT and o-series models, with a stated mission of ensuring artificial general intelligence (AGI) benefits all of humanity. The homepage serves as a gateway to their research, products, and policy work spanning capabilities and safety.

★★★★☆

This paper introduces InstructGPT, a method for aligning language models with human intent using Reinforcement Learning from Human Feedback (RLHF). By fine-tuning GPT-3 with human preference data, the authors demonstrate that smaller aligned models can outperform much larger unaligned models on user-preferred outputs. The work establishes RLHF as a foundational technique for making LLMs safer and more helpful.

★★★☆☆

This page outlines the European Commission's comprehensive policy framework for AI, centered on promoting trustworthy, human-centric AI through the AI Act, AI Continent Action Plan, and Apply AI Strategy. It aims to balance Europe's global AI competitiveness with safety, fundamental rights, and democratic values. Key initiatives include AI Factories, the InvestAI Facility, GenAI4EU, and the Apply AI Alliance.

★★★★☆
4AI Control FrameworkarXiv·Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan & Fabien Roger·2023·Paper

This paper introduces the 'AI control' framework, which focuses on ensuring AI systems behave safely even if they are deceptively misaligned or actively trying to subvert oversight. It proposes evaluation protocols and mechanisms to maintain safety against intentional subversion by advanced AI models, treating safety as a red-team/blue-team problem between AI and human overseers.

★★★☆☆
5AI Safety Summit 2023UK Government·Government

The official UK government page for the AI Safety Summit 2023, held November 1-2 at Bletchley Park, which convened governments, AI companies, civil society, and researchers to address frontier AI risks. Key outputs include the Bletchley Declaration—a multilateral agreement on AI safety—company safety policies, and a frontier AI capabilities and risks discussion paper. The summit marked a landmark moment in international AI governance coordination.

★★★★☆

METR is an organization conducting research and evaluations to assess the capabilities and risks of frontier AI systems, focusing on autonomous task completion, AI self-improvement risks, and evaluation integrity. They have developed the 'Time Horizon' metric measuring how long AI agents can autonomously complete software tasks, showing exponential growth over recent years. They work with major AI labs including OpenAI, Anthropic, and Amazon to evaluate catastrophic risk potential.

★★★★☆
7Model Evaluation for Extreme RisksarXiv·Toby Shevlane et al.·2023·Paper

This paper addresses the critical role of model evaluation in mitigating extreme risks from advanced AI systems. As AI development progresses, general-purpose AI systems increasingly possess both beneficial and harmful capabilities, including potentially dangerous ones like offensive cyber abilities or manipulation skills. The authors argue that two types of evaluations are essential: dangerous capability evaluations to identify harmful capacities, and alignment evaluations to assess whether models are inclined to use their capabilities for harm. These evaluations are vital for informing policymakers and stakeholders, and for making responsible decisions regarding model training, deployment, and security.

★★★☆☆

Executive Order 14110, signed by President Biden on October 30, 2023, established comprehensive federal directives for AI safety, security, and governance in the United States. It required safety testing and reporting for frontier AI models, directed agencies to address AI risks across sectors including national security and civil rights, and aimed to position the US as a global leader in responsible AI development. The page content is currently unavailable, but the order is a landmark AI governance document.

★★★★☆

This URL points to Anthropic's Responsible Scaling Policy page, but the content returned a 404 error, indicating the page is unavailable or has moved. The RSP is Anthropic's framework for staged deployment and safety commitments tied to AI capability levels.

★★★★☆

Anthropic is an AI safety company focused on building reliable, interpretable, and steerable AI systems. The company conducts frontier AI research and develops Claude, its family of AI assistants, with a stated mission of responsible development and maintenance of advanced AI for long-term human benefit.

★★★★☆

This MIRI post argues that AI alignment should be approached with a 'security mindset'—anticipating adversarial failures and worst-case scenarios rather than assuming average-case behavior. It draws parallels between cybersecurity principles (defense in depth, assume breach, etc.) and the challenge of building reliably aligned AI systems. The post makes the case that alignment requires robustness against edge cases and subtle misalignments that could be catastrophic.

★★★☆☆

RAND Corporation's AI research hub covers policy, national security, and governance implications of artificial intelligence. It aggregates reports, analyses, and commentary on AI risks, military applications, and regulatory frameworks from one of the leading U.S. defense and policy think tanks.

★★★★☆

The Centre for the Governance of AI (GovAI) is a leading research organization dedicated to helping decision-makers navigate the transition to a world with advanced AI. It produces rigorous research on AI governance, policy, and societal impacts, while fostering a global talent pipeline for responsible AI oversight. GovAI bridges technical AI safety concerns with practical policy recommendations.

★★★★☆

Anthropic's research page aggregates their work across AI alignment, mechanistic interpretability, and societal impact assessment, all oriented toward understanding and mitigating risks from increasingly capable AI systems. It serves as a central hub for their published findings and ongoing safety-focused investigations.

★★★★☆
15UK AI Safety Institute (AISI)UK AI Safety Institute·Government

The UK AI Safety Institute (AISI) is the UK government's dedicated body for evaluating and mitigating risks from advanced AI systems. It conducts technical safety research, develops evaluation frameworks for frontier AI models, and works with international partners to inform global AI governance and policy.

★★★★☆

Related Wiki Pages

Top Related Pages

Risks

Goal Misgeneralization

Approaches

Constitutional AIAI Output FilteringResponsible Scaling Policies

Analysis

AI Capability Threshold ModelBioweapons Attack Chain ModelSafety-Capability Tradeoff ModelAI Risk Portfolio AnalysisAI Safety Technical Pathway DecompositionCorrigibility Failure Pathways

Other

InterpretabilityRLHFEvan HubingerPaul ChristianoAllan Dafoe