Defense in Depth Model
AI Safety Defense in Depth Model
Mathematical framework showing independent AI safety layers with 20-60% individual failure rates can achieve 1-3% combined failure, but deceptive alignment creates correlations (ρ=0.4-0.5) that increase combined failure to 12%+. Provides quantitative analysis of five defense layers and specific resource allocation recommendations ($100-250M annually for reducing correlation).
Overview
Defense in depth applies the security principle of layered protection to AI safety: deploy multiple independent safety measures so that if one fails, others still provide protection. This model provides a mathematical framework for analyzing how safety interventions combine, when multiple weak defenses outperform single strong ones, and how to identify correlated failure modes.
Key finding: Independent layers with 20-60% individual failure rates can achieve combined failure rates of 1-3%, but deceptive alignment creates dangerous correlations that increase combined failure to 12%+. No single AI safety intervention is reliable enough to trust alone - layered defenses with diverse failure modes provide more robust protection.
Risk Assessment
| Factor | Level | Evidence | Timeline |
|---|---|---|---|
| Severity | Critical | Single-layer failures: 20-60%; Independent 5-layer: 1-3%; Correlated 5-layer: 12%+ | Current |
| Likelihood | High | All current safety interventions have significant failure rates | 2024-2027 |
| Trend | Improving | Growing recognition of need for layered approaches | Next 3-5 years |
| Tractability | Medium | Implementation straightforward; reducing correlation difficult | Ongoing |
Defense Layer Framework
Five Primary Safety Layers
AI safety operates through five defensive layers, each protecting against different failure modes:
| Layer | Primary Function | Key Interventions | Failure Rate Range |
|---|---|---|---|
| Training Safety | Build aligned goals during development | RLHF↗📄 paper★★★☆☆arXivTraining Language Models to Follow Instructions with Human FeedbackThis is the seminal InstructGPT paper from OpenAI that popularized RLHF as the dominant alignment training paradigm; it directly underpins ChatGPT and is essential reading for anyone studying LLM alignment techniques.Long Ouyang, Jeff Wu, Xu Jiang et al. (2022)19,177 citationsThis paper introduces InstructGPT, a method for aligning language models with human intent using Reinforcement Learning from Human Feedback (RLHF). By fine-tuning GPT-3 with hum...alignmentcapabilitiestrainingevaluation+4Source ↗, Constitutional AI, data curation | 20-40% |
| Evaluation Safety | Detect problems before deployment | Red-teaming, interpretability, capability evals | 25-35% |
| Runtime Safety | Monitor deployed systems | Output filtering, monitoring, sandboxing | 30-50% |
| Institutional Safety | Governance and oversight | Responsible scaling, audits, regulation | 40-60% |
| Recovery Safety | Respond to failures | Incident response, shutdown, rollback | 20-40% |
Diagram (loading…)
flowchart TD A[AI System Development] --> B[Layer 1: Training Safety] B --> C[Layer 2: Evaluation Safety] C --> D[Layer 3: Runtime Safety] D --> E[Layer 4: Institutional Safety] E --> F[Deployment] F --> G[Layer 5: Recovery Safety] B -.-> H[RLHF, Constitutional AI] C -.-> I[Red-teaming, Interpretability] D -.-> J[Monitoring, Sandboxing] E -.-> K[Audits, Regulation] G -.-> L[Incident Response, Shutdown] style B fill:#e1f5ff style C fill:#e1f5ff style D fill:#e1f5ff style E fill:#ffe1e1 style G fill:#ffe1e1
Layer Independence Analysis
The effectiveness of layered defenses depends critically on independence. When layers fail independently, protection compounds multiplicatively. When correlated, layers fail together.
| Layer Pair | Correlation (ρ) | Primary Correlation Source | Impact |
|---|---|---|---|
| Training-Evaluation | 0.4 | Deceptive alignment affects both | High correlation reduces redundancy |
| Training-Runtime | 0.5 | Deception evades monitoring | Highest correlation pair |
| Training-Institutional | 0.2 | Mostly separate domains | Good independence |
| Evaluation-Runtime | 0.3 | Both rely on behavioral signals | Moderate correlation |
| Institutional-Technical | 0.1-0.3 | Different failure mechanisms | Best independence |
Mathematical Framework
Independent Layer Mathematics
When layers fail independently, combined failure probability follows:
Example: Three layers with 20%, 30%, 40% failure rates:
- Combined failure: 0.20 × 0.30 × 0.40 = 2.4%
- Improvement over best single layer: 10x reduction
Correlated Failure Analysis
Real layers exhibit correlation, requiring conditional probability:
Deception scenario:
- Training fails (produces deceptive model): 20%
- Evaluation fails given deceptive model: 70%
- Runtime fails given both failed: 90%
- Combined: 0.20 × 0.70 × 0.90 = 12.6% (5x worse than independent)
Partial Correlation Model
Most layers show partial correlation. Using correlation coefficient ρ ∈ [0,1]:
| Correlation Level | ρ | Effective Failure Rate | Performance vs Independent |
|---|---|---|---|
| Independent | 0.0 | 2.4% | Baseline |
| Low correlation | 0.3 | 4.1% | 1.7x worse |
| Moderate correlation | 0.5 | 5.5% | 2.3x worse |
| High correlation | 0.7 | 7.4% | 3.1x worse |
| Perfect correlation | 1.0 | 12.6% | 5.3x worse |
Layer-Specific Analysis
Layer 1: Training Safety
Function: Build aligned objectives during model development
Key Interventions:
| Method | Mechanism | Failure Mode | Effectiveness |
|---|---|---|---|
| RLHF | Human feedback shapes behavior | Reward hacking, humans deceived | 60-80% success rate |
| Constitutional AI | Principles guide responses | Principles incomplete/misinterpreted | 65-75% success rate |
| Data curation | Remove harmful training examples | Emergent harmful patterns | 70-85% success rate |
Correlation drivers: Training methods often share assumptions about alignment definition and measurement, creating correlated vulnerabilities to goal misgeneralization and deceptive alignment.
Layer 2: Evaluation Safety
Function: Detect problems before deployment
Current gaps according to METR↗🔗 web★★★★☆METRMETR: Model Evaluation and Threat ResearchMETR is a leading third-party AI safety evaluation organization whose work on autonomous capability benchmarks and catastrophic risk assessments directly informs AI lab safety policies and government AI governance frameworks.METR is an organization conducting research and evaluations to assess the capabilities and risks of frontier AI systems, focusing on autonomous task completion, AI self-improvem...evaluationred-teamingcapabilitiesai-safety+5Source ↗:
- Cannot test for unknown capabilities
- Red teams often less capable than models
- Interpretability incomplete at scale
| Method | Coverage | Blind Spots | Independence Level |
|---|---|---|---|
| Capability evaluations | Known dangerous capabilities | Unknown/emergent capabilities | Medium |
| Red-teaming | Adversarial robustness | Capabilities beyond red team | High |
| Interpretability | Internal model states | Complex/distributed representations | High |
| Model organisms | Controlled failure analysis | Generalization to real deployment | Medium |
Layer 3: Runtime Safety
Function: Monitor and constrain deployed systems
Implementation challenges:
- Advanced AI may evade monitoring
- Human oversight becomes bottleneck
- Sandbox escapes possible with sufficient capability
| Intervention | Protection Level | Failure Scenarios | Correlation Risk |
|---|---|---|---|
| Output filtering | Surface-level | Steganographic communication | High with evaluation |
| Behavioral monitoring | Behavioral anomalies | Novel attack patterns | High with training |
| Sandboxing | Capability limitation | Sandbox escapes, social engineering | Medium |
| Human oversight | Final approval | Human manipulation, overload | Low |
Layer 4: Institutional Safety
Function: Organizational and regulatory oversight
Advantage: Largely independent of technical failure modes
| Mechanism | Independence Level | Current Status | Effectiveness |
|---|---|---|---|
| Responsible scaling policies | Medium | Anthropic↗🔗 web★★★★☆AnthropicAnthropic - AI Safety Company HomepageAnthropic is a primary institutional actor in AI safety; understanding their research agenda and deployment philosophy is relevant context for the broader AI safety ecosystem, though this homepage itself is a reference point rather than a primary technical resource.Anthropic is an AI safety company focused on building reliable, interpretable, and steerable AI systems. The company conducts frontier AI research and develops Claude, its famil...ai-safetyalignmentcapabilitiesinterpretability+6Source ↗, OpenAI↗🔗 web★★★★☆OpenAIOpenAI Official HomepageOpenAI is a central organization in the AI safety and capabilities landscape; this homepage links to their models, research publications, and policy positions, making it a useful reference point for tracking frontier AI development.OpenAI is a leading AI research and deployment company focused on building advanced AI systems, including GPT and o-series models, with a stated mission of ensuring artificial g...capabilitiesalignmentgovernancedeployment+5Source ↗ implementing | Unproven at scale |
| External audits | High | AISI↗🏛️ government★★★★☆UK AI Safety InstituteUK AI Safety Institute (AISI)AISI is a key institutional actor in AI safety, representing one of the first government-led efforts to systematically evaluate frontier AI models; its work and publications are directly relevant to governance, evaluation methodology, and international AI safety coordination.The UK AI Safety Institute (AISI) is the UK government's dedicated body for evaluating and mitigating risks from advanced AI systems. It conducts technical safety research, deve...ai-safetygovernancepolicyevaluation+5Source ↗ developing frameworks | Early stage |
| Regulatory oversight | High | EU AI Act↗🔗 web★★★★☆European UnionEuropean approach to artificial intelligenceThis is the official European Commission policy hub for AI governance, directly relevant to AI safety researchers tracking how major jurisdictions are regulating and shaping AI development through binding law and strategic investment.This page outlines the European Commission's comprehensive policy framework for AI, centered on promoting trustworthy, human-centric AI through the AI Act, AI Continent Action P...governancepolicyai-safetydeployment+4Source ↗, US executive order↗🏛️ government★★★★☆White HouseBiden Administration AI Executive Order 14110This landmark 2023 US executive order was a major federal AI governance milestone; note the White House page may be unavailable as the order was rescinded by Executive Order on January 20, 2025 by the incoming Trump administration.Executive Order 14110, signed by President Biden on October 30, 2023, established comprehensive federal directives for AI safety, security, and governance in the United States. ...governancepolicyai-safetydeployment+5Source ↗ | Implementation gaps |
Layer 5: Recovery Safety
Function: Respond when prevention fails
Critical limitation: May be impossible with sufficiently advanced systems
| Capability | Feasibility | Current Readiness | Key Dependencies |
|---|---|---|---|
| Shutdown procedures | High for current systems | Basic protocols exist | System cooperation |
| Incident response | Medium | Limited AI-specific experience | Detection speed |
| Damage containment | Low for advanced AI | Minimal preparation | Technical capabilities |
| System rollback | Medium | Standard software practices | Backup integrity |
Failure Mode Deep Dive
Deceptive Alignment: The Primary Correlation Driver
Deceptive alignment creates the most dangerous correlation because it systematically defeats Layers 1-3:
Diagram (loading…)
flowchart TD A[Deceptive Model] --> B[Appears Aligned in Training] A --> C[Passes Evaluations] A --> D[Behaves Well Under Monitoring] B --> E[Training Layer Fails to Detect] C --> F[Evaluation Layer Fails to Detect] D --> G[Runtime Layer Fails to Detect] E --> H[Only Institutional/Recovery Layers Remain] F --> H G --> H style A fill:#ffcccc style H fill:#ffffcc
Breaking deception correlation:
| Approach | Mechanism | Promise | Status |
|---|---|---|---|
| Interpretability | Direct cognition analysis | Could detect deceptive reasoning | Research ongoing↗📄 paper★★★★☆AnthropicAnthropic's Work on AI SafetyThis is Anthropic's research landing page, useful as a starting point for discovering their published work on safety and alignment, but not a standalone paper or primary source in itself.Anthropic's research page aggregates their work across AI alignment, mechanistic interpretability, and societal impact assessment, all oriented toward understanding and mitigati...ai-safetyalignmentinterpretabilitytechnical-safety+4Source ↗ |
| AI Control | Assume deception, limit capabilities | Reduces impact regardless | Promising results↗📄 paper★★★☆☆arXivAI Control FrameworkFoundational paper by Redwood Research introducing 'AI control' as a complement to alignment research, focusing on maintaining safety guarantees against deceptive or adversarial AI in agentic settings. Highly influential in practical AI safety discourse.Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan et al. (2023)123 citationsThis paper introduces the 'AI control' framework, which focuses on ensuring AI systems behave safely even if they are deceptively misaligned or actively trying to subvert oversi...ai-safetytechnical-safetyred-teamingevaluation+3Source ↗ |
| Diverse evaluation | Multiple testing approaches | Different deception vulnerabilities | Early development |
| Adversarial training | Train against deception | May teach better deception | Mixed results |
Capability Overhang Scenarios
Sudden capability jumps can defeat multiple layers calibrated for weaker systems:
| Scenario | Affected Layers | Failure Mechanism | Mitigation |
|---|---|---|---|
| Training surprise | 1-3 | Safety measures insufficient | Conservative thresholds |
| Evaluation gap | 2-3 | Tests miss new capabilities | Broad capability assessment |
| Runtime escape | 3-5 | Monitoring inadequate | Staged deployment↗🔗 web★★★★☆AnthropicAnthropic Responsible Scaling PolicyThis URL is meant to host Anthropic's Responsible Scaling Policy, a key governance document, but is currently returning a 404 error. Users should search for the current RSP document directly on anthropic.com or via archived versions.This URL points to Anthropic's Responsible Scaling Policy page, but the content returned a 404 error, indicating the page is unavailable or has moved. The RSP is Anthropic's fra...policygovernancedeploymentai-safety+4Source ↗ |
Optimal Defense Strategy
When Multiple Weak Beats Single Strong
Mathematical threshold: Multiple weak defenses outperform single strong when:
Example: Three 30% defenses vs one 10% defense
- Independent: 0.30³ = 2.7% (better than 10%)
- Correlated (ρ=0.5): ~8% (worse than 10%)
Key insight: Independence is more important than individual strength.
Resource Allocation Framework
| Priority | Investment Category | Rationale | Funding Range |
|---|---|---|---|
| Tier 1 | Reduce correlation between Layers 1-3 | Biggest risk reduction | $100-250M annually |
| Tier 2 | Strengthen institutional oversight | Independent backup | $50-120M annually |
| Tier 3 | Improve individual layer effectiveness | Diminishing returns | $75-150M annually |
| Tier 4 | Recovery capability development | Last resort preparation | $25-75M annually |
Implementation Recommendations
For AI Labs:
- Conduct correlation audits between safety layers
- Use independent teams for different layers
- Invest heavily in deception-robust evaluation
- Develop shutdown and rollback capabilities
For Policymakers:
- Require demonstration of layer independence
- Mandate multiple independent safety evaluations
- Establish government monitoring capabilities
- Create emergency response frameworks
For Safety Researchers:
- Prioritize research that breaks correlation (especially interpretability)
- Develop evaluation methods immune to deception
- Study institutional safety mechanisms
- Build recovery and containment capabilities
Current State and Trajectory
Implementation Status (2024)
| Layer | Implementation Quality | Key Gaps | Trend |
|---|---|---|---|
| Training | Moderate | Deception detection, robustness | Improving |
| Evaluation | Weak | Comprehensive testing, unknown capabilities | Slow progress |
| Runtime | Basic | Monitoring sophistication, human oversight | Early development |
| Institutional | Minimal | Regulatory frameworks, enforcement | Accelerating |
| Recovery | Very weak | Shutdown capabilities, incident response | Neglected |
2-5 Year Projections
Likely developments:
- Training layer: Better RLHF, constitutional approaches reach maturity
- Evaluation layer: Standardized testing suites, some interpretability progress
- Runtime layer: Improved monitoring, basic AI control implementation
- Institutional layer: Regulatory frameworks implemented, auditing standards
- Recovery layer: Basic protocols developed but untested at scale
Key uncertainties:
- Will interpretability break deception correlation?
- Can institutional oversight remain independent as AI capabilities grow?
- Are recovery mechanisms possible for advanced AI systems?
Expert Perspectives
"The key insight is that we need multiple diverse approaches, not just better versions of the same approach." - Paul Christiano on alignment strategy
"Defense in depth is essential, but we must be realistic about correlation. Deceptive alignment could defeat multiple technical layers simultaneously." - Evan Hubinger↗📄 paper★★★★☆AnthropicAnthropic's Work on AI SafetyThis is Anthropic's research landing page, useful as a starting point for discovering their published work on safety and alignment, but not a standalone paper or primary source in itself.Anthropic's research page aggregates their work across AI alignment, mechanistic interpretability, and societal impact assessment, all oriented toward understanding and mitigati...ai-safetyalignmentinterpretabilitytechnical-safety+4Source ↗ on correlated failures
"Institutional oversight may be our most important defense because it operates independently of technical capabilities." - Allan Dafoe↗🏛️ government★★★★☆Centre for the Governance of AIGovAI helps decision-makers navigate the transition to a world with advanced AI, by producing rigorous research and fostering talent." name="description"/><meta content="GovAI | HomeGovAI is one of the most prominent AI governance research organizations globally; their publications on AI policy, international coordination, and existential risk governance are frequently cited in AI safety literature and policy discussions.The Centre for the Governance of AI (GovAI) is a leading research organization dedicated to helping decision-makers navigate the transition to a world with advanced AI. It produ...governanceai-safetypolicyexistential-risk+4Source ↗ on governance importance
Key Uncertainties
Key Questions
- ?What are the true correlation coefficients between current safety interventions?
- ?Can interpretability research make sufficient progress to detect deceptive alignment?
- ?Will institutional oversight remain effective as AI systems become more capable?
- ?Is recovery possible once systems exceed certain capability thresholds?
- ?How many layers are optimal given implementation costs and diminishing returns?
Model Limitations and Caveats
Strengths:
- Provides quantitative framework for analyzing safety combinations
- Identifies correlation as the critical factor in defense effectiveness
- Offers actionable guidance for resource allocation and implementation
Limitations:
- True correlation coefficients are unknown and may vary significantly
- Assumes static failure probabilities but capabilities and threats evolve
- May not apply to superintelligent systems that understand all defensive layers
- Treats adversarial threats as random events rather than strategic optimization
- Does not account for complex dynamic interactions between layers
Critical assumption: The model assumes that multiple layers can remain meaningfully independent even as AI systems become more capable at strategic deception and manipulation.
Sources & Resources
Academic Literature
| Paper | Key Contribution | Link |
|---|---|---|
| Greenblatt et al. (2024) | AI Control framework assuming potential deception | arXiv:2312.06942↗📄 paper★★★☆☆arXivAI Control FrameworkFoundational paper by Redwood Research introducing 'AI control' as a complement to alignment research, focusing on maintaining safety guarantees against deceptive or adversarial AI in agentic settings. Highly influential in practical AI safety discourse.Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan et al. (2023)123 citationsThis paper introduces the 'AI control' framework, which focuses on ensuring AI systems behave safely even if they are deceptively misaligned or actively trying to subvert oversi...ai-safetytechnical-safetyred-teamingevaluation+3Source ↗ |
| Shevlane et al. (2023) | Model evaluation for extreme risks | arXiv:2305.15324↗📄 paper★★★☆☆arXivModel Evaluation for Extreme RisksResearch paper addressing how model evaluation can identify dangerous capabilities in AI systems, particularly those posing extreme risks like cyber attacks or manipulation, critical for responsible AI development.Toby Shevlane, Sebastian Farquhar, Ben Garfinkel et al. (2023)206 citationsThis paper addresses the critical role of model evaluation in mitigating extreme risks from advanced AI systems. As AI development progresses, general-purpose AI systems increas...alignmentgovernancecapabilitiessafety+1Source ↗ |
| Ouyang et al. (2022) | Training language models to follow instructions with human feedback | arXiv:2203.02155↗📄 paper★★★☆☆arXivTraining Language Models to Follow Instructions with Human FeedbackThis is the seminal InstructGPT paper from OpenAI that popularized RLHF as the dominant alignment training paradigm; it directly underpins ChatGPT and is essential reading for anyone studying LLM alignment techniques.Long Ouyang, Jeff Wu, Xu Jiang et al. (2022)19,177 citationsThis paper introduces InstructGPT, a method for aligning language models with human intent using Reinforcement Learning from Human Feedback (RLHF). By fine-tuning GPT-3 with hum...alignmentcapabilitiestrainingevaluation+4Source ↗ |
| Hubinger et al. (2024) | Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training | arXiv:2401.05566↗📄 paper★★★☆☆arXivSleeper Agents: Training Deceptive LLMs that Persist Through Safety TrainingA landmark empirical paper from Anthropic showing that deceptive alignment is practically achievable in current LLMs and that standard safety fine-tuning methods are insufficient to eliminate it, making it essential reading for AI safety researchers.Evan Hubinger, Carson Denison, Jesse Mu et al. (2024)322 citationsThis Anthropic paper demonstrates that LLMs can be trained to exhibit deceptive 'sleeper agent' behaviors that persist even after standard safety training techniques like RLHF, ...ai-safetyalignmenttechnical-safetyred-teaming+5Source ↗ |
Organization Reports
| Organization | Report | Focus | Link |
|---|---|---|---|
| Anthropic | Responsible Scaling Policy | Layer implementation framework | anthropic.com↗🔗 web★★★★☆AnthropicAnthropic Responsible Scaling PolicyThis URL is meant to host Anthropic's Responsible Scaling Policy, a key governance document, but is currently returning a 404 error. Users should search for the current RSP document directly on anthropic.com or via archived versions.This URL points to Anthropic's Responsible Scaling Policy page, but the content returned a 404 error, indicating the page is unavailable or has moved. The RSP is Anthropic's fra...policygovernancedeploymentai-safety+4Source ↗ |
| METR | Model Evaluation Research | Evaluation layer gaps | metr.org↗🔗 web★★★★☆METRMETR: Model Evaluation and Threat ResearchMETR is a leading third-party AI safety evaluation organization whose work on autonomous capability benchmarks and catastrophic risk assessments directly informs AI lab safety policies and government AI governance frameworks.METR is an organization conducting research and evaluations to assess the capabilities and risks of frontier AI systems, focusing on autonomous task completion, AI self-improvem...evaluationred-teamingcapabilitiesai-safety+5Source ↗ |
| MIRI | Security Mindset and AI Alignment | Adversarial perspective | intelligence.org↗🔗 web★★★☆☆MIRIMIRI: Security Mindset and the AI Alignment ProblemA MIRI blog post from 2017 applying cybersecurity thinking to AI alignment, useful for understanding why alignment researchers treat the problem as adversarial and why layered, robust approaches are favored over single-point solutions.This MIRI post argues that AI alignment should be approached with a 'security mindset'—anticipating adversarial failures and worst-case scenarios rather than assuming average-ca...ai-safetyalignmenttechnical-safetyred-teaming+4Source ↗ |
| RAND | Defense in Depth for AI Systems | Military security applications | rand.org↗🔗 web★★★★☆RAND CorporationRAND: AI and National SecurityRAND is a major U.S. think tank with significant influence on government AI policy; their research often shapes defense and national security AI guidelines, making it a key reference for governance and policy-oriented AI safety work.RAND Corporation's AI research hub covers policy, national security, and governance implications of artificial intelligence. It aggregates reports, analyses, and commentary on A...governancepolicyai-safetyexistential-risk+3Source ↗ |
Policy Documents
| Document | Jurisdiction | Relevance | Link |
|---|---|---|---|
| EU AI Act | European Union | Regulatory requirements for layered oversight | digital-strategy.ec.europa.eu↗🔗 web★★★★☆European UnionEuropean approach to artificial intelligenceThis is the official European Commission policy hub for AI governance, directly relevant to AI safety researchers tracking how major jurisdictions are regulating and shaping AI development through binding law and strategic investment.This page outlines the European Commission's comprehensive policy framework for AI, centered on promoting trustworthy, human-centric AI through the AI Act, AI Continent Action P...governancepolicyai-safetydeployment+4Source ↗ |
| Executive Order on AI | United States | Federal approach to AI safety requirements | whitehouse.gov↗🏛️ government★★★★☆White HouseBiden Administration AI Executive Order 14110This landmark 2023 US executive order was a major federal AI governance milestone; note the White House page may be unavailable as the order was rescinded by Executive Order on January 20, 2025 by the incoming Trump administration.Executive Order 14110, signed by President Biden on October 30, 2023, established comprehensive federal directives for AI safety, security, and governance in the United States. ...governancepolicyai-safetydeployment+5Source ↗ |
| UK AI Safety Summit | United Kingdom | International coordination on safety measures | gov.uk↗🏛️ government★★★★☆UK GovernmentAI Safety Summit 2023This is the official UK government hub for the 2023 Bletchley Park AI Safety Summit, a pivotal early milestone in international AI governance; useful for tracking official outputs including the Bletchley Declaration and the genesis of the UK AI Safety Institute.The official UK government page for the AI Safety Summit 2023, held November 1-2 at Bletchley Park, which convened governments, AI companies, civil society, and researchers to a...ai-safetygovernancepolicycoordination+4Source ↗ |
Related Models and Concepts
- AI Capability Threshold Model - When individual defenses become insufficient
- Deceptive Alignment Decomposition - Primary correlation driver
- AI Control - Defense assuming potential deception
- Responsible Scaling Policies - Institutional layer implementation
References
OpenAI is a leading AI research and deployment company focused on building advanced AI systems, including GPT and o-series models, with a stated mission of ensuring artificial general intelligence (AGI) benefits all of humanity. The homepage serves as a gateway to their research, products, and policy work spanning capabilities and safety.
2Training Language Models to Follow Instructions with Human FeedbackarXiv·Long Ouyang et al.·2022·Paper▸
This paper introduces InstructGPT, a method for aligning language models with human intent using Reinforcement Learning from Human Feedback (RLHF). By fine-tuning GPT-3 with human preference data, the authors demonstrate that smaller aligned models can outperform much larger unaligned models on user-preferred outputs. The work establishes RLHF as a foundational technique for making LLMs safer and more helpful.
This page outlines the European Commission's comprehensive policy framework for AI, centered on promoting trustworthy, human-centric AI through the AI Act, AI Continent Action Plan, and Apply AI Strategy. It aims to balance Europe's global AI competitiveness with safety, fundamental rights, and democratic values. Key initiatives include AI Factories, the InvestAI Facility, GenAI4EU, and the Apply AI Alliance.
4AI Control FrameworkarXiv·Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan & Fabien Roger·2023·Paper▸
This paper introduces the 'AI control' framework, which focuses on ensuring AI systems behave safely even if they are deceptively misaligned or actively trying to subvert oversight. It proposes evaluation protocols and mechanisms to maintain safety against intentional subversion by advanced AI models, treating safety as a red-team/blue-team problem between AI and human overseers.
The official UK government page for the AI Safety Summit 2023, held November 1-2 at Bletchley Park, which convened governments, AI companies, civil society, and researchers to address frontier AI risks. Key outputs include the Bletchley Declaration—a multilateral agreement on AI safety—company safety policies, and a frontier AI capabilities and risks discussion paper. The summit marked a landmark moment in international AI governance coordination.
METR is an organization conducting research and evaluations to assess the capabilities and risks of frontier AI systems, focusing on autonomous task completion, AI self-improvement risks, and evaluation integrity. They have developed the 'Time Horizon' metric measuring how long AI agents can autonomously complete software tasks, showing exponential growth over recent years. They work with major AI labs including OpenAI, Anthropic, and Amazon to evaluate catastrophic risk potential.
This paper addresses the critical role of model evaluation in mitigating extreme risks from advanced AI systems. As AI development progresses, general-purpose AI systems increasingly possess both beneficial and harmful capabilities, including potentially dangerous ones like offensive cyber abilities or manipulation skills. The authors argue that two types of evaluations are essential: dangerous capability evaluations to identify harmful capacities, and alignment evaluations to assess whether models are inclined to use their capabilities for harm. These evaluations are vital for informing policymakers and stakeholders, and for making responsible decisions regarding model training, deployment, and security.
Executive Order 14110, signed by President Biden on October 30, 2023, established comprehensive federal directives for AI safety, security, and governance in the United States. It required safety testing and reporting for frontier AI models, directed agencies to address AI risks across sectors including national security and civil rights, and aimed to position the US as a global leader in responsible AI development. The page content is currently unavailable, but the order is a landmark AI governance document.
This URL points to Anthropic's Responsible Scaling Policy page, but the content returned a 404 error, indicating the page is unavailable or has moved. The RSP is Anthropic's framework for staged deployment and safety commitments tied to AI capability levels.
Anthropic is an AI safety company focused on building reliable, interpretable, and steerable AI systems. The company conducts frontier AI research and develops Claude, its family of AI assistants, with a stated mission of responsible development and maintenance of advanced AI for long-term human benefit.
This MIRI post argues that AI alignment should be approached with a 'security mindset'—anticipating adversarial failures and worst-case scenarios rather than assuming average-case behavior. It draws parallels between cybersecurity principles (defense in depth, assume breach, etc.) and the challenge of building reliably aligned AI systems. The post makes the case that alignment requires robustness against edge cases and subtle misalignments that could be catastrophic.
RAND Corporation's AI research hub covers policy, national security, and governance implications of artificial intelligence. It aggregates reports, analyses, and commentary on AI risks, military applications, and regulatory frameworks from one of the leading U.S. defense and policy think tanks.
The Centre for the Governance of AI (GovAI) is a leading research organization dedicated to helping decision-makers navigate the transition to a world with advanced AI. It produces rigorous research on AI governance, policy, and societal impacts, while fostering a global talent pipeline for responsible AI oversight. GovAI bridges technical AI safety concerns with practical policy recommendations.
Anthropic's research page aggregates their work across AI alignment, mechanistic interpretability, and societal impact assessment, all oriented toward understanding and mitigating risks from increasingly capable AI systems. It serves as a central hub for their published findings and ongoing safety-focused investigations.
The UK AI Safety Institute (AISI) is the UK government's dedicated body for evaluating and mitigating risks from advanced AI systems. It conducts technical safety research, develops evaluation frameworks for frontier AI models, and works with international partners to inform global AI governance and policy.