Page StatusContent

Edited 7 weeks ago1.6k words1 backlinks

Updated quarterlyDue in 6 weeks

Summary

Mathematical framework showing independent AI safety layers with 20-60% individual failure rates can achieve 1-3% combined failure, but deceptive alignment creates correlations (ρ=0.4-0.5) that increase combined failure to 12%+. Provides quantitative analysis of five defense layers and specific resource allocation recommendations ($100-250M annually for reducing correlation).

TODOs3

Complete 'Quantitative Analysis' section (8 placeholders)

Complete 'Strategic Importance' section

Complete 'Limitations' section (6 placeholders)

Defense in Depth Model

Model

AI Safety Defense in Depth Model

Model TypeDefense Framework

ScopeLayered Safety Architecture

Key InsightMultiple independent safety layers provide robustness against single-point failures

Parameters

1.6k words · 1 backlinks

Model

AI Safety Defense in Depth Model

Model TypeDefense Framework

ScopeLayered Safety Architecture

Key InsightMultiple independent safety layers provide robustness against single-point failures

Parameters

1.6k words · 1 backlinks

Overview

Defense in depth applies the security principle of layered protection to AI safety: deploy multiple independent safety measures so that if one fails, others still provide protection. This model provides a mathematical framework for analyzing how safety interventions combine, when multiple weak defenses outperform single strong ones, and how to identify correlated failure modes.

Key finding: Independent layers with 20-60% individual failure rates can achieve combined failure rates of 1-3%, but deceptive alignment creates dangerous correlations that increase combined failure to 12%+. No single AI safety intervention is reliable enough to trust alone - layered defenses with diverse failure modes provide more robust protection.

Risk Assessment

Factor	Level	Evidence	Timeline
Severity	Critical	Single-layer failures: 20-60%; Independent 5-layer: 1-3%; Correlated 5-layer: 12%+	Current
Likelihood	High	All current safety interventions have significant failure rates	2024-2027
Trend	Improving	Growing recognition of need for layered approaches	Next 3-5 years
Tractability	Medium	Implementation straightforward; reducing correlation difficult	Ongoing

Defense Layer Framework

Five Primary Safety Layers

AI safety operates through five defensive layers, each protecting against different failure modes:

Layer	Primary Function	Key Interventions	Failure Rate Range
Training Safety	Build aligned goals during development	RLHF↗, Constitutional AI, data curation	20-40%
Evaluation Safety	Detect problems before deployment	Red-teaming, interpretability, capability evals	25-35%
Runtime Safety	Monitor deployed systems	Output filtering, monitoring, sandboxing	30-50%
Institutional Safety	Governance and oversight	Responsible scaling, audits, regulation	40-60%
Recovery Safety	Respond to failures	Incident response, shutdown, rollback	20-40%

Loading diagram...

Layer Independence Analysis

The effectiveness of layered defenses depends critically on independence. When layers fail independently, protection compounds multiplicatively. When correlated, layers fail together.

Layer Pair	Correlation (ρ)	Primary Correlation Source	Impact
Training-Evaluation	0.4	Deceptive alignment affects both	High correlation reduces redundancy
Training-Runtime	0.5	Deception evades monitoring	Highest correlation pair
Training-Institutional	0.2	Mostly separate domains	Good independence
Evaluation-Runtime	0.3	Both rely on behavioral signals	Moderate correlation
Institutional-Technical	0.1-0.3	Different failure mechanisms	Best independence

Mathematical Framework

Independent Layer Mathematics

When layers fail independently, combined failure probability follows:

$P(\text{all layers fail}) = \prod_{i=1}^{n} P(\text{Layer } i \text{ fails})$

Example: Three layers with 20%, 30%, 40% failure rates:

Combined failure: 0.20 × 0.30 × 0.40 = 2.4%
Improvement over best single layer: 10x reduction

Correlated Failure Analysis

Real layers exhibit correlation, requiring conditional probability:

$P(\text{all layers fail}) = P(L_1) \times P(L_2|L_1) \times P(L_3|L_1 \cap L_2) \times \cdots$

Deception scenario:

Training fails (produces deceptive model): 20%
Evaluation fails given deceptive model: 70%
Runtime fails given both failed: 90%
Combined: 0.20 × 0.70 × 0.90 = 12.6% (5x worse than independent)

Partial Correlation Model

Most layers show partial correlation. Using correlation coefficient ρ ∈ [0,1]:

$P_{\text{eff}} = P_{\text{ind}}^{(1-\rho)} \times P_{\text{corr}}^{\rho}$

Correlation Level	ρ	Effective Failure Rate	Performance vs Independent
Independent	0.0	2.4%	Baseline
Low correlation	0.3	4.1%	1.7x worse
Moderate correlation	0.5	5.5%	2.3x worse
High correlation	0.7	7.4%	3.1x worse
Perfect correlation	1.0	12.6%	5.3x worse

Layer-Specific Analysis

Layer 1: Training Safety

Function: Build aligned objectives during model development

Key Interventions:

Method	Mechanism	Failure Mode	Effectiveness
RLHF	Human feedback shapes behavior	Reward hacking, humans deceived	60-80% success rate
Constitutional AI	Principles guide responses	Principles incomplete/misinterpreted	65-75% success rate
Data curation	Remove harmful training examples	Emergent harmful patterns	70-85% success rate

Correlation drivers: Training methods often share assumptions about alignment definition and measurement, creating correlated vulnerabilities to goal misgeneralization and deceptive alignment.

Layer 2: Evaluation Safety

Function: Detect problems before deployment

Current gaps according to METR↗:

Cannot test for unknown capabilities
Red teams often less capable than models
Interpretability incomplete at scale

Method	Coverage	Blind Spots	Independence Level
Capability evaluations	Known dangerous capabilities	Unknown/emergent capabilities	Medium
Red-teaming	Adversarial robustness	Capabilities beyond red team	High
Interpretability	Internal model states	Complex/distributed representations	High
Model organisms	Controlled failure analysis	Generalization to real deployment	Medium

Layer 3: Runtime Safety

Function: Monitor and constrain deployed systems

Implementation challenges:

Advanced AI may evade monitoring
Human oversight becomes bottleneck
Sandbox escapes possible with sufficient capability

Intervention	Protection Level	Failure Scenarios	Correlation Risk
Output filtering	Surface-level	Steganographic communication	High with evaluation
Behavioral monitoring	Behavioral anomalies	Novel attack patterns	High with training
Sandboxing	Capability limitation	Sandbox escapes, social engineering	Medium
Human oversight	Final approval	Human manipulation, overload	Low

Layer 4: Institutional Safety

Function: Organizational and regulatory oversight

Advantage: Largely independent of technical failure modes

Mechanism	Independence Level	Current Status	Effectiveness
Responsible scaling policies	Medium	Anthropic↗, OpenAI↗ implementing	Unproven at scale
External audits	High	AISI↗ developing frameworks	Early stage
Regulatory oversight	High	EU AI Act↗, US executive order↗	Implementation gaps

Layer 5: Recovery Safety

Function: Respond when prevention fails

Critical limitation: May be impossible with sufficiently advanced systems

Capability	Feasibility	Current Readiness	Key Dependencies
Shutdown procedures	High for current systems	Basic protocols exist	System cooperation
Incident response	Medium	Limited AI-specific experience	Detection speed
Damage containment	Low for advanced AI	Minimal preparation	Technical capabilities
System rollback	Medium	Standard software practices	Backup integrity

Failure Mode Deep Dive

Deceptive Alignment: The Primary Correlation Driver

Deceptive alignment creates the most dangerous correlation because it systematically defeats Layers 1-3:

Loading diagram...

Breaking deception correlation:

Approach	Mechanism	Promise	Status
Interpretability	Direct cognition analysis	Could detect deceptive reasoning	Research ongoing↗
AI Control	Assume deception, limit capabilities	Reduces impact regardless	Promising results↗
Diverse evaluation	Multiple testing approaches	Different deception vulnerabilities	Early development
Adversarial training	Train against deception	May teach better deception	Mixed results

Capability Overhang Scenarios

Sudden capability jumps can defeat multiple layers calibrated for weaker systems:

Scenario	Affected Layers	Failure Mechanism	Mitigation
Training surprise	1-3	Safety measures insufficient	Conservative thresholds
Evaluation gap	2-3	Tests miss new capabilities	Broad capability assessment
Runtime escape	3-5	Monitoring inadequate	Staged deployment↗

Optimal Defense Strategy

When Multiple Weak Beats Single Strong

Mathematical threshold: Multiple weak defenses outperform single strong when:

$\prod_{i=1}^{n} P_i^{weak} < P_{single}^{strong}$

Example: Three 30% defenses vs one 10% defense

Independent: 0.30³ = 2.7% (better than 10%)
Correlated (ρ=0.5): ~8% (worse than 10%)

Key insight: Independence is more important than individual strength.

Resource Allocation Framework

Priority	Investment Category	Rationale	Funding Range
Tier 1	Reduce correlation between Layers 1-3	Biggest risk reduction	$100-250M annually
Tier 2	Strengthen institutional oversight	Independent backup	$50-120M annually
Tier 3	Improve individual layer effectiveness	Diminishing returns	$75-150M annually
Tier 4	Recovery capability development	Last resort preparation	$25-75M annually

Implementation Recommendations

For AI Labs:

Conduct correlation audits between safety layers
Use independent teams for different layers
Invest heavily in deception-robust evaluation
Develop shutdown and rollback capabilities

For Policymakers:

Require demonstration of layer independence
Mandate multiple independent safety evaluations
Establish government monitoring capabilities
Create emergency response frameworks

For Safety Researchers:

Prioritize research that breaks correlation (especially interpretability)
Develop evaluation methods immune to deception
Study institutional safety mechanisms
Build recovery and containment capabilities

Current State and Trajectory

Implementation Status (2024)

Layer	Implementation Quality	Key Gaps	Trend
Training	Moderate	Deception detection, robustness	Improving
Evaluation	Weak	Comprehensive testing, unknown capabilities	Slow progress
Runtime	Basic	Monitoring sophistication, human oversight	Early development
Institutional	Minimal	Regulatory frameworks, enforcement	Accelerating
Recovery	Very weak	Shutdown capabilities, incident response	Neglected

2-5 Year Projections

Likely developments:

Training layer: Better RLHF, constitutional approaches reach maturity
Evaluation layer: Standardized testing suites, some interpretability progress
Runtime layer: Improved monitoring, basic AI control implementation
Institutional layer: Regulatory frameworks implemented, auditing standards
Recovery layer: Basic protocols developed but untested at scale

Key uncertainties:

Will interpretability break deception correlation?
Can institutional oversight remain independent as AI capabilities grow?
Are recovery mechanisms possible for advanced AI systems?

Expert Perspectives

"The key insight is that we need multiple diverse approaches, not just better versions of the same approach." - Paul Christiano on alignment strategy

"Defense in depth is essential, but we must be realistic about correlation. Deceptive alignment could defeat multiple technical layers simultaneously." - Evan Hubinger↗ on correlated failures

"Institutional oversight may be our most important defense because it operates independently of technical capabilities." - Allan Dafoe↗ on governance importance

Key Uncertainties

Key Questions

?What are the true correlation coefficients between current safety interventions?
?Can interpretability research make sufficient progress to detect deceptive alignment?
?Will institutional oversight remain effective as AI systems become more capable?
?Is recovery possible once systems exceed certain capability thresholds?
?How many layers are optimal given implementation costs and diminishing returns?

Model Limitations and Caveats

Strengths:

Provides quantitative framework for analyzing safety combinations
Identifies correlation as the critical factor in defense effectiveness
Offers actionable guidance for resource allocation and implementation

Limitations:

True correlation coefficients are unknown and may vary significantly
Assumes static failure probabilities but capabilities and threats evolve
May not apply to superintelligent systems that understand all defensive layers
Treats adversarial threats as random events rather than strategic optimization
Does not account for complex dynamic interactions between layers

Critical assumption: The model assumes that multiple layers can remain meaningfully independent even as AI systems become more capable at strategic deception and manipulation.

Sources & Resources

Academic Literature

Paper	Key Contribution	Link
Greenblatt et al. (2024)	AI Control framework assuming potential deception	arXiv:2312.06942↗
Shevlane et al. (2023)	Model evaluation for extreme risks	arXiv:2305.15324↗
Ouyang et al. (2022)	Training language models to follow instructions with human feedback	arXiv:2203.02155↗
Hubinger et al. (2024)	Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training	arXiv:2401.05566↗

Organization Reports

Organization	Report	Focus	Link
Anthropic	Responsible Scaling Policy	Layer implementation framework	anthropic.com↗
METR	Model Evaluation Research	Evaluation layer gaps	metr.org↗
MIRI	Security Mindset and AI Alignment	Adversarial perspective	intelligence.org↗
RAND	Defense in Depth for AI Systems	Military security applications	rand.org↗

Policy Documents

Document	Jurisdiction	Relevance	Link
EU AI Act	European Union	Regulatory requirements for layered oversight	digital-strategy.ec.europa.eu↗
Executive Order on AI	United States	Federal approach to AI safety requirements	whitehouse.gov↗
UK AI Safety Summit	United Kingdom	International coordination on safety measures	gov.uk↗

Related Models and Concepts

AI Capability Threshold Model - When individual defenses become insufficient
Deceptive Alignment Decomposition - Primary correlation driver
AI Control - Defense assuming potential deception
Responsible Scaling Policies - Institutional layer implementation

Defense in Depth Model

AI Safety Defense in Depth Model

AI Safety Defense in Depth Model

Overview

Risk Assessment

Defense Layer Framework

Five Primary Safety Layers

Layer Independence Analysis

Mathematical Framework

Independent Layer Mathematics

Correlated Failure Analysis

Partial Correlation Model

Layer-Specific Analysis

Layer 1: Training Safety

Layer 2: Evaluation Safety

Layer 3: Runtime Safety

Layer 4: Institutional Safety

Layer 5: Recovery Safety

Failure Mode Deep Dive

Deceptive Alignment: The Primary Correlation Driver

Capability Overhang Scenarios

Optimal Defense Strategy

When Multiple Weak Beats Single Strong

Resource Allocation Framework

Implementation Recommendations

Current State and Trajectory

Implementation Status (2024)

2-5 Year Projections

Expert Perspectives

Key Uncertainties

Key Questions

Model Limitations and Caveats

Sources & Resources

Academic Literature

Organization Reports

Policy Documents

Related Models and Concepts

Related Pages

Top Related Pages

Societal Resilience

E252

E451

E259

E93

Models

Concepts

Transition Model