Provably Safe AI (davidad agenda)
Davidad's provably safe AI agenda aims to create AI systems with mathematical safety guarantees through formal verification of world models and values, primarily funded by ARIA's £59M Safeguarded AI programme. The approach faces extreme technical challenges (world modeling, value specification) with uncertain tractability but would provide very high effectiveness if successful, addressing misalignment, deception, and power-seeking through proof-based constraints.
Overview
Provably Safe AI represents one of the most ambitious research agendas in AI safety: designing advanced AI systems where safety is guaranteed by mathematical construction rather than verified through empirical testing. Led primarily by David "davidad" Dalrymple---the youngest-ever MIT graduate degree recipient at age 16, former Filecoin co-inventor, and current Programme Director at ARIA---this agenda seeks to create AI systems whose safety properties can be formally proven before deployment, potentially enabling safe development even of superintelligent systems.
The foundational paper *Towards Guaranteed Safe AI*, published in May 2024 by Dalrymple, Joar Skalse, and co-authors including Yoshua Bengio, Stuart Russell, Max Tegmark, and Sanjit Seshia, introduces the core framework of "guaranteed safe (GS) AI." The approach requires three components: a world model (a mathematical description of how the AI affects the outside world), a safety specification (a mathematical description of acceptable effects), and a verifier (an auditable proof certificate that the AI satisfies the specification relative to the world model).
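The following sketch is a rough, assumed illustration of how these three components fit together: a `WorldModel` predicts the effect of an action, a `SafetySpec` judges whether the predicted effect is acceptable, and a `verify` function accepts only actions whose predicted effects satisfy the specification. The thermostat domain, names, and numbers are all made up for illustration; in a real GS AI system the verification step would be a machine-checked proof covering all model-consistent outcomes, not a single simulation.

```python
"""Toy sketch of the three GS AI components described above (world model,
safety specification, verifier). The thermostat domain and all names here
are illustrative assumptions, not taken from the paper."""

from dataclasses import dataclass


@dataclass
class WorldModel:
    """Mathematical description of how an action affects the world.
    Here: a trivial linear model of a heater's effect on temperature."""
    heat_per_unit: float = 2.0  # degrees gained per unit of heater power

    def predict(self, temperature: float, heater_power: float) -> float:
        return temperature + self.heat_per_unit * heater_power


@dataclass
class SafetySpec:
    """Mathematical description of acceptable effects: stay below a limit."""
    max_temperature: float = 100.0

    def acceptable(self, temperature: float) -> bool:
        return temperature <= self.max_temperature


def verify(model: WorldModel, spec: SafetySpec,
           state: float, action: float) -> bool:
    """Verifier: accept the action only if, according to the world model,
    its effect satisfies the safety specification. A real GS AI verifier
    would check a proof covering all reachable states, not one rollout."""
    return spec.acceptable(model.predict(state, action))


if __name__ == "__main__":
    model, spec = WorldModel(), SafetySpec()
    print(verify(model, spec, state=90.0, action=3.0))  # True: 90 + 6 <= 100
    print(verify(model, spec, state=90.0, action=8.0))  # False: 90 + 16 > 100
```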
The core insight motivating this work is that current AI safety approaches fundamentally rely on empirical testing and behavioral observation, which may be insufficient for systems capable of strategic deception or operating in novel circumstances. If we could instead construct AI systems where safety properties are mathematically guaranteed, we could have much stronger assurance that these properties will hold regardless of the system's capabilities or intentions. This parallels how we trust airplanes because their structural properties are verified through engineering analysis, not just flight testing. Crucially, this approach does not require interpretability to be solved and could still rule out deceptive alignment.
The agenda faces extraordinary technical challenges. It requires solving multiple hard problems including: constructing formal world models that accurately represent relevant aspects of reality, specifying human values in mathematically precise terms, building AI systems that can reason within these formal frameworks, and proving that the resulting systems satisfy desired safety properties. Many researchers consider one or more of these components to be fundamentally intractable. However, proponents argue that even partial progress could provide valuable safety assurances, and that the potential payoff justifies significant research investment.
Risk Assessment & Impact
| Dimension | Assessment | Evidence | Timeline |
|---|---|---|---|
| Safety Uplift | Critical (if it works) | Would provide mathematical safety guarantees | Long-term (10+ years) |
| Capability Uplift | Tax | Constraints likely reduce capabilities | Expected |
| Net World Safety | Helpful | Best case: solves alignment; currently research only | Long-term |
| Lab Incentive | Weak | Very long-term; no near-term commercial value | Current |
| Research Investment | $10-50M/yr | ARIA funding; some academic work | Current |
| Current Adoption | None | Research only; ARIA programme | Current |
How It Works
The Guaranteed Safe AI framework operates through a verification pipeline that checks every action before execution: the bounded agent proposes an action together with a safety proof, the verifier checks that proof against the world model and safety specification, and only actions carrying a valid proof are executed.
The key insight is that the proof checker can be much simpler than the proof generator. While generating proofs of safety may require sophisticated AI reasoning, verifying those proofs can be done by a small, formally verified checker---much as checking a completed proof line by line is far easier than the creative work of discovering it.
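As a toy illustration of this asymmetry (an assumed example, not drawn from the GS AI literature), the sketch below contrasts an expensive "prover" that searches for a factorisation certificate with a tiny "checker" that validates it in a single multiplication, plus a hypothetical `gated_execute` helper showing how a pipeline might refuse any action whose certificate fails the check.

```python
"""Toy illustration (assumed example, not from the GS AI literature) of the
prover/checker asymmetry: producing a certificate is expensive, checking it
is a handful of operations simple enough to be formally verified themselves."""


def generate_certificate(n: int) -> tuple[int, int]:
    """'Prover': expensive trial-division search for a factorisation of n."""
    for p in range(2, int(n ** 0.5) + 1):
        if n % p == 0:
            return p, n // p
    raise ValueError("no nontrivial factorisation found")


def check_certificate(n: int, certificate: tuple[int, int]) -> bool:
    """'Checker': one multiplication and two comparisons."""
    p, q = certificate
    return 1 < p < n and p * q == n


def gated_execute(action, certificate, checker):
    """Hypothetical pipeline gate: run the action only if its certificate
    passes the (small, trusted) checker; otherwise refuse."""
    if checker(certificate):
        return action()
    raise PermissionError("certificate rejected; action blocked")


if __name__ == "__main__":
    n = 1_000_003 * 999_983            # composite with two large factors
    cert = generate_certificate(n)     # slow: ~10^6 trial divisions
    print(check_certificate(n, cert))  # fast: a single multiplication -> True
    gated_execute(lambda: print("action executed"), cert,
                  lambda c: check_certificate(n, c))
```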
The davidad Agenda
Core Components
| Component | Description | Challenge Level |
|---|---|---|
| Formal World Model | Mathematical representation of relevant reality | Extreme |
| Value Specification | Precise formal statement of human values/safety | Extreme |
| Bounded Agent | AI system operating within formal constraints | High |
| Proof Generation | AI produces proofs that actions are safe | Very High |
| Proof Verification | Independent checking of safety proofs | Medium |
Key Technical Requirements
| Requirement | Description | Current Status |
|---|---|---|
| World Model Expressiveness | Must capture relevant aspects of reality | Theoretical |
| World Model Accuracy | Model must correspond to the actual world | Unknown how to guarantee |
| Value Completeness | Must specify all relevant values | Philosophical challenge |
| Proof Tractability | Proofs must be feasible to generate | Open question |
| Verification Independence | Checker must be simpler than prover | Possible in principle |
ARIA Programme Structure
ARIA's Safeguarded AI programme is the primary funder of provably safe AI research globally. The programme aims to develop quantitative safety guarantees for AI "in the way we have come to expect for nuclear power and passenger aviation."
Supportive View
Proponents point to the remarkable success of formal verification in other domains. CompCert, a formally verified C compiler, was tested for six CPU-years by the Csmith random testing tool without finding a single wrong-code error---the only compiler tested for which this was true. seL4, a formally verified operating system microkernel, has machine-checked mathematical proofs for Arm, RISC-V, and Intel architectures and is used in aerospace, autonomous aviation, and IoT platforms. These demonstrate that formal verification can scale to real systems.
| Argument | Support | Proponents |
|---|---|---|
| Formal methods have scaled before | CompCert, seL4 showed scaling is possible | Formal methods community |
| AI can help with proofs | Automated theorem proving advancing rapidly | AI researchers |
| Partial solutions valuable | Even limited guarantees help | davidad |
| Alternative to behavioral safety | Fundamentally different approach | Safety researchers |
Skeptical View
| Concern | Argument | Proponents |
|---|---|---|
| World model problem | Can't formally specify reality | Many researchers |
| Value specification problem | Human values resist formalization | Philosophers |
| Capability tax | Constraints make systems unusable | Capability researchers |
| Verification complexity | Proofs may be intractable | Complexity theorists |
Key Uncertainties
| Uncertainty | Range of Views | Critical For |
|---|---|---|
| World model feasibility | Possible to impossible | Entire agenda |
| Value specification feasibility | Possible to impossible | Entire agenda |
| Capability preservation | Minimal to prohibitive tax | Practical deployment |
| Proof tractability | Feasible to intractable | Runtime safety |
Comparison with Related Approaches
| Approach | Similarity | Key Difference |
|---|---|---|
| Formal Verification (AI Safety) | Both use proofs | Builds provably safe systems from scratch rather than verifying existing models |
| Constitutional AI | Both specify rules | Mathematical vs. natural-language specification |
| AI Control | Both aim for safety | Internal guarantees vs. external constraints |
| Interpretability | Both seek understanding | Formal specification vs. empirical analysis |
Scalability Assessment
| Dimension | Assessment | Rationale |
|---|---|---|
| Technical Scalability | Unknown | Core question of the research agenda |
| Deception Robustness | Strong (by design) | Proofs rule out deception |
| SI Readiness | Yes (if it works) | Designed for this; success uncertain |
| Capability Scalability | Unknown | Capability tax may be prohibitive |
Quick Assessment
| Dimension | Rating | Notes |
|---|---|---|
| Tractability | Low | Core components (world modeling, value specification) may be fundamentally intractable; requires major breakthroughs |
| Scalability | Unknown | If solved, would apply to arbitrarily capable systems by design |
| Current Maturity | Early research | No working prototypes; theoretical frameworks only |
| Time Horizon | 10-20+ years | Long-term moonshot research program |
| Key Proponents | ARIA, davidad | £59M committed; some academic groups |
| Effectiveness if Achieved | Very High | Would provide mathematical guarantees against misalignment and deception |
Risks Addressed
If successful, provably safe AI would provide mathematical guarantees against several core AI safety risks:
| Risk | Relevance | How It Helps |
|---|---|---|
| Misalignment Potential | High | Proofs constrain behavior to the safety specification, addressing misalignment by construction |
| Deceptive Alignment | High | Proofs preclude deception by construction; no gap between training and deployment behavior |
| Power-Seeking AI | High | Actions mathematically constrained to a proven-safe subset; cannot acquire unauthorized resources |
| Goal Misgeneralization | High | Formal world model and specification prevent unintended optimization targets |
| Scheming | Medium | Bounded agent cannot pursue hidden agendas if all actions must pass verification |
Current State
As of early 2025, provably safe AI remains in the early research phase with no working prototypes:
| Milestone | Status | Details |
|---|---|---|
| Foundational Theory | Published | *Towards Guaranteed Safe AI* (Dalrymple et al.) outlines the framework (May 2024) |
| ARIA Funding | Active | £59M programme launched; TA1 Phase 2 opens June 2025 |
| Demonstrations | None | No demonstrations of GS AI components on realistic systems |
The International Neural Network Verification Competition represents related work on verifying properties of neural networks, with tools like α,β-CROWN achieving significant speedups over traditional algorithms. However, this work addresses a narrower problem than full GS AI.
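To give a flavour of what such tools certify, the sketch below bounds a tiny ReLU network's output over an input box using plain interval arithmetic: if the certified upper bound stays below a threshold, the property holds for every input in the box. The two-layer network and its weights are made-up toy values, and real tools such as α,β-CROWN compute far tighter bounds via linear relaxations and branch-and-bound; this is only meant to convey the shape of the guarantee.

```python
"""Simplified sketch of neural network verification via interval bound
propagation: given an input box, compute a guaranteed output range for a tiny
ReLU network. The weights below are made-up toy values; real tools such as
α,β-CROWN compute far tighter bounds."""

import numpy as np


def interval_affine(lower, upper, W, b):
    """Propagate the box [lower, upper] through x -> W @ x + b soundly."""
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    new_lower = W_pos @ lower + W_neg @ upper + b
    new_upper = W_pos @ upper + W_neg @ lower + b
    return new_lower, new_upper


def interval_relu(lower, upper):
    """ReLU is monotone, so it maps the box endpoint-wise."""
    return np.maximum(lower, 0.0), np.maximum(upper, 0.0)


if __name__ == "__main__":
    # Toy 2-2-1 network with arbitrary weights (illustrative only).
    W1, b1 = np.array([[1.0, -1.0], [0.5, 2.0]]), np.array([0.0, -1.0])
    W2, b2 = np.array([[1.0, 1.0]]), np.array([0.5])

    # Input box: each input perturbed by at most 0.1 around (1.0, 0.0).
    lo, hi = np.array([0.9, -0.1]), np.array([1.1, 0.1])

    lo, hi = interval_relu(*interval_affine(lo, hi, W1, b1))
    lo, hi = interval_affine(lo, hi, W2, b2)

    # If the certified upper bound stays below a threshold, the property
    # "output < 2.5 for all inputs in the box" is proven (soundly, if loosely).
    print(f"certified output range: [{lo[0]:.2f}, {hi[0]:.2f}]")
    print("property output < 2.5 holds:", hi[0] < 2.5)
```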
Limitations
- **May Be Impossible**: Core components may be fundamentally intractable
- **Capability Tax**: Constraints may make systems unusable
- **World Model Problem**: Reality cannot be formally specified comprehensively
- **Value Specification Problem**: Human values resist precise formalization
- **Long Timeline**: Decades of research may be needed
- **Alternative Approaches May Suffice**: Empirical safety might be enough
- **Verification Complexity**: Proofs may be too expensive to generate
Open Questions
| Question | Importance | Status |
|---|---|---|
| Can world models be expressive enough? | Critical | Active research; no consensus |
| How do we specify human values formally? | Critical | Philosophical and technical challenge |
| Can proofs scale to complex actions? | High | Unknown; depends on proof automation advances |
| What capability tax is acceptable? | High | Depends on risk level and alternatives |
| Can AI help generate its own safety proofs? | High | Promising but creates a bootstrap problem |
| What partial guarantees are valuable? | Medium | May enable incremental progress |
Sources & Resources
Key Documents
| Document | Author | Contribution |
|---|---|---|
| *Towards Guaranteed Safe AI* | Dalrymple et al. (2024) | Foundational agenda paper with Bengio, Russell, Tegmark, and Seshia |
Provably safe AI affects the AI Transition Model through fundamental safety guarantees:

| Factor | Parameter | Impact |
|---|---|---|
| Alignment Robustness | Verification strength | Would provide mathematical alignment guarantees |
| Misalignment Potential | Alignment approach | Eliminates misalignment by construction |
| Safety-Capability Gap | Gap closure | Verified systems have provable safety properties |
The provably safe AI agenda represents a moonshot approach that could fundamentally solve AI safety if successful. The extreme difficulty of the technical challenges means success is uncertain, but the potential value of even partial progress justifies significant research investment. This approach is particularly valuable as a complement to empirical safety work, providing a fundamentally different path to safe AI.
Related
Approaches: Eliciting Latent Knowledge (ELK)
Concepts: Constitutional AI, Deceptive Alignment, Scheming, Power-Seeking AI, Formal Verification (AI Safety), AI Control