Longterm Wiki · Updated 2026-03-22
Provable / Guaranteed Safe AI

Concept

Provably Safe AI uses formal verification to provide mathematical safety guarantees; the UK's ARIA is investing £59M in the approach through 2028. Current verification handles ~10^6 parameters while frontier models exceed 10^12, a gap of six orders of magnitude. The approach has an estimated 1-5% probability of being dominant at transformative AI, and its critical unsolved challenge is verifying that world models match reality.

Related

Approaches: Formal Verification (AI Safety)
Capabilities: Neuro-Symbolic Hybrid Systems
Concepts: Dense Transformers · Heavy Scaffolding / Agentic Systems

Quick Assessment

| Dimension | Assessment | Evidence |
|---|---|---|
| Technical Maturity | Early-stage (TRL 2-3) | VNN-COMP 2024 verified networks with ≈10^6 parameters; frontier models have 10^12+ |
| Funding Commitment | Substantial (£59M+) | UK ARIA's Safeguarded AI programme, 3-5 year timeline through 2028 |
| Research Community | Growing coalition | 17 co-authors on foundational paper, including Bengio, Russell, Tegmark, Szegedy |
| Scalability Gap | 6 orders of magnitude | Current verification handles ~1M parameters; GPT-4 class models have ≈1.7T parameters |
| Timeline to Competitiveness | 2032+ for general AI | ARIA targets Stage 3 by 2028; competitive general AI remains speculative |
| Adoption Likelihood | 1-5% at TAI | Strong safety properties but severe capability-competitiveness uncertainty |
| Key Bottleneck | World model verification | Proving formal models match physical reality is fundamentally undecidable in general |
| Source | Link |
|---|---|
| Official Website | alignmentforum.org |
| Wikipedia | en.wikipedia.org |
| LessWrong | lesswrong.com |
| arXiv | arxiv.org |

Overview

Provable or "Guaranteed Safe" AI refers to approaches that aim to build AI systems with mathematical safety guarantees rather than empirical safety measures. The core premise is that we should be able to prove that an AI system will behave safely, not merely hope based on testing.

The most prominent current effort is davidad's Open Agency Architecture, funded through the UK's ARIA (Advanced Research and Invention Agency) Safeguarded AI programme. This approach represents a fundamentally different philosophy from mainstream AI development: safety through formal verification rather than alignment through training.

Estimated probability of being dominant at transformative AI: 1-5%. This is ambitious and faces significant scalability questions, but offers uniquely strong safety properties if achievable.

Formal Verification Approaches

The field of guaranteed safe AI encompasses multiple verification methodologies, each with different tradeoffs between expressiveness, scalability, and the strength of guarantees provided.

flowchart TB
  subgraph approaches["Guaranteed Safe AI Approaches"]
      direction TB
      subgraph formal["Formal Verification"]
          smt["SMT-Based<br/>(Satisfiability Modulo Theories)"]
          abstract["Abstract Interpretation"]
          bound["Bound Propagation<br/>(α,β-CROWN)"]
      end

      subgraph hybrid["Hybrid Approaches"]
          neuro["Neuro-Symbolic<br/>Integration"]
          shield["Verification-Guided<br/>Shielding"]
          runtime["Runtime<br/>Monitoring"]
      end

      subgraph world["World Model Approaches"]
          physics["Physics-Based<br/>Simulators"]
          causal["Causal<br/>Models"]
          probabilistic["Probabilistic<br/>Programs"]
      end
  end

  subgraph guarantees["Safety Guarantee Types"]
      local["Local Robustness<br/>(perturbation bounds)"]
      global["Global Properties<br/>(invariants)"]
      temporal["Temporal Properties<br/>(always/eventually)"]
  end

  formal --> local
  formal --> global
  hybrid --> temporal
  world --> global

  style approaches fill:#e8f4e8
  style guarantees fill:#fff3e0

Verification Methodology Comparison

| Approach | Scalability | Guarantee Strength | Current Capability | Key Tools |
|---|---|---|---|---|
| SMT-Based | Low (≈10^4 neurons) | Complete | Image classifiers | Reluplex, Marabou |
| Bound Propagation | Medium (≈10^6 neurons) | Sound but incomplete | CNNs, transformers | α,β-CROWN, auto_LiRPA |
| Abstract Interpretation | Medium | Over-approximate | Safety-critical systems | DeepPoly, PRIMA |
| Runtime Monitoring | High | Empirical | Production systems | NNV 2.0, SafeRL |
| Compositional | Potentially high | Modular proofs | Research stage | ARIA programme |

VNN-COMP 2024 Results

The 5th International Verification of Neural Networks Competition (VNN-COMP 2024) provides the most recent benchmarks for the field:

| Metric | VNN-COMP 2024 | Trend |
|---|---|---|
| Participating teams | 8 | Stable |
| Benchmark categories | 12 regular + 8 extended | Expanding |
| Top performer | α,β-CROWN | Consistent leader since 2021 |
| Largest verified network | ≈10^6 parameters | ≈10x improvement since 2021 |
| Average solve time | Minutes to hours | Improving 30-50% annually |

Key finding: The best tools can now verify local robustness properties on networks with millions of parameters in reasonable time, but this remains 6 orders of magnitude below frontier model scale.
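The sound-but-incomplete character of bound propagation can be illustrated with a minimal sketch. The toy interval bound propagation below (pure Python; the network, weights, and margin property are all invented for illustration, and real tools such as α,β-CROWN use far tighter linear relaxations) soundly checks a local robustness property on a two-layer ReLU network: it returns True only when the property provably holds for every input in the perturbation box, but it may return False even for properties that do hold.

```python
# Toy interval bound propagation (IBP) for a two-layer ReLU network.
# Sound: if verify_margin returns True, the property holds for all
# perturbed inputs. Incomplete: loose intervals can cause False even
# when the property is actually true.

def affine_bounds(lo, hi, W, b):
    """Propagate an input box [lo, hi] through y = W x + b."""
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        l = h = bias
        for w, xl, xh in zip(row, lo, hi):
            if w >= 0:
                l += w * xl; h += w * xh
            else:
                l += w * xh; h += w * xl
        out_lo.append(l); out_hi.append(h)
    return out_lo, out_hi

def relu_bounds(lo, hi):
    # ReLU is monotone, so clamping the endpoints is exact per-neuron.
    return [max(0.0, l) for l in lo], [max(0.0, h) for h in hi]

def verify_margin(x, eps, W1, b1, W2, b2):
    """Soundly check: for all inputs within eps of x (box norm),
    output[0] exceeds output[1]."""
    lo = [xi - eps for xi in x]
    hi = [xi + eps for xi in x]
    lo, hi = affine_bounds(lo, hi, W1, b1)
    lo, hi = relu_bounds(lo, hi)
    lo, hi = affine_bounds(lo, hi, W2, b2)
    # Worst-case class-0 score must still beat best-case class-1 score.
    return lo[0] > hi[1]
```

With identity weights and input [1.0, 0.0], the margin verifies at eps = 0.1 but not at eps = 0.6, where the intervals overlap and the checker must conservatively refuse.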

Conceptual Framework

flowchart TB
  subgraph world["World Model (Verified)"]
      physics["Physics Simulator"]
      state["State Representation"]
      predict["Prediction Engine"]
  end

  subgraph agent["Agent (Bounded)"]
      goals["Formal Goal Spec"]
      planner["Verified Planner"]
      bounds["Safety Bounds"]
  end

  subgraph verify["Verification Layer"]
      proofs["Safety Proofs"]
      monitor["Runtime Monitor"]
      audit["Audit Trail"]
  end

  goals --> planner
  planner --> world
  world --> predict
  predict --> bounds
  bounds --> proofs
  proofs --> monitor
  monitor --> |"Safe Action"| output["Output"]
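The runtime-monitor layer in this framework can be sketched as a wrapper that executes an action only if a deterministic safety check passes, and otherwise substitutes a known-safe fallback. This is an illustrative toy, not ARIA's design; the thermostat rule and all names are invented for the example.

```python
# Minimal runtime-monitor sketch: unverifiable actions are blocked
# (fail-safe default) and replaced by a known-safe fallback action.
# All names and the safety predicate are hypothetical.

def make_monitor(is_provably_safe, fallback):
    def monitored(action, state):
        if is_provably_safe(action, state):
            return action
        return fallback  # blocked: substitute the known-safe action
    return monitored

# Toy instantiation: a thermostat whose verified bound forbids
# running the heater above 25 degrees.
def safe(action, state):
    return not (action == "heat_on" and state["temp"] > 25.0)

monitor = make_monitor(safe, fallback="heat_off")
```

The design choice mirrored here is that the monitor only ever *restricts* behavior: the agent proposes, the checker disposes, so safety does not depend on the agent's internals.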

Key Properties

| Property | Rating | Assessment |
|---|---|---|
| White-box Access | HIGH | Designed for formal analysis from the ground up |
| Trainability | DIFFERENT | May use verified synthesis, not just gradient descent |
| Predictability | HIGH | Behavior bounded by formal proofs |
| Modularity | HIGH | Compositional by design for modular proofs |
| Formal Verifiability | HIGH | This is the core value proposition |

ARIA's Safeguarded AI Programme

The UK's Advanced Research and Invention Agency (ARIA) has committed £59 million over 3-5 years to the Safeguarded AI programme, led by Programme Director David "davidad" Dalrymple with Scientific Director Yoshua Bengio (joined August 2024). This represents the largest government investment specifically targeting provably safe AI. In November 2025, ARIA announced a significant pivot: TA1's scope was expanded to broader "mathematical assurance and auditability"; TA2 Phase 2 (the £18M "AI Production Facility") was abandoned as frontier AI advances made dedicated ML capability development less valuable; and TA3 Phase 2 (£8.4M) was cancelled and replaced by a cybersecurity focus on formally verified firewalls for critical infrastructure. Kathleen Fisher (former DARPA HACMS programme lead) is incoming as ARIA CEO.

The Guaranteed Safe AI Framework

The foundational paper "Towards Guaranteed Safe AI" (May 2024) was co-authored by 17 prominent researchers including Yoshua Bengio, Stuart Russell, Max Tegmark, Sanjit Seshia, Steve Omohundro, Christian Szegedy, Joe Halpern, Clark Barrett, Jeannette Wing, and Joshua Tenenbaum. It defines the core architecture:

Core Components

| Component | Function | Verification Approach | Current Status |
|---|---|---|---|
| World Model | Formal representation of environment physics | Must be verifiably accurate for relevant domains | Research phase; physics simulators exist but verification gap remains |
| Safety Specification | Mathematical statement of acceptable effects | Formal language being developed by ARIA TA 1.1 | Language spec targeted for 2025-2026 |
| Verifier | Provides auditable proof certificates | Must check frontier model outputs against specs | Architecture defined; implementation pending |
| Runtime Monitor | Checks all actions against safety proofs | Blocks any action that can't be verified safe | Demonstrated in narrow domains (ACAS Xu) |

ARIA Programme Structure

| Technical Area | Focus | Funding | Timeline |
|---|---|---|---|
| TA 1.1 | Formal proof language development | £3.5M (10-16 grants) | April-May 2024 round completed |
| TA 2 | AI Production Facility | £18M (abandoned) | Abandoned in the Nov 2025 pivot; frontier AI advances made dedicated ML capability development less valuable |
| TA 3 | Real-world applications | £5.4M Phase 1 (continuing); Phase 2 cancelled | Phase 2 (£8.4M) replaced by a cybersecurity pivot targeting formally verified firewalls for critical infrastructure |

Key quote from davidad: "The frontier model needs to submit a proof certificate, which is something that is written in a formal language that we're defining in another part of the program... This new language for proofs will hopefully be easy for frontier models to generate and then also easy for a deterministic, human-audited algorithm to check."

Key Technical Challenges

| Challenge | Difficulty | Current Status | Quantified Gap |
|---|---|---|---|
| World model accuracy | CRITICAL | Core unsolved problem: verifying that the model matches reality is undecidable in general | Physics simulators achieve <1% error in constrained domains; open-world accuracy unknown |
| Specification completeness | HIGH | Formal specs may miss important values/constraints | An estimated 60-80% of human values may be difficult to formalize |
| Computational tractability | HIGH | Verification can be exponentially expensive | SAT solving is NP-complete; neural network verification is co-NP-complete |
| Scalability gap | VERY HIGH | Current tools verify ~10^6 parameters; frontier models have ≈10^12 | 6 orders of magnitude; closing at ≈10x per 3 years |
| Capability competitiveness | HIGH | Unknown if verified systems can match unverified ones | No competitive benchmarks yet exist |
| Handling uncertainty | MEDIUM | Formal methods traditionally assume certainty | Probabilistic verification emerging but less mature |

According to the Stanford Center for AI Safety whitepaper, the complexity and heterogeneity of AI systems mean that formal verification of specifications is undecidable in general: even deciding whether a state of a linear hybrid system is reachable is undecidable.
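On a *finite* abstraction of the state space, however, reachability becomes decidable and reduces to plain graph search, which is one reason abstraction is central to practical verification. A minimal sketch (the transition system here is a made-up example, not any real abstraction):

```python
# Reachability is undecidable for linear hybrid systems, but on a finite
# transition system it is decidable by breadth-first search.
from collections import deque

def reachable(transitions, start, bad_states):
    """Return True iff some bad state is reachable from start.
    transitions: dict mapping a state to an iterable of successor states."""
    seen, frontier = {start}, deque([start])
    while frontier:
        s = frontier.popleft()
        if s in bad_states:
            return True
        for t in transitions.get(s, ()):
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return False
```

A safety proof over the abstraction ("no bad state is reachable") transfers to the concrete system only if the abstraction over-approximates its behavior, which is exactly where the world-model accuracy problem reappears.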

Safety Implications

Advantages

| Advantage | Explanation |
|---|---|
| Mathematical guarantees | If the proofs hold, safety follows logically, not probabilistically |
| Auditable by construction | Every decision can be traced to verified components |
| Clear failure modes | When something fails, you know which assumption was violated |
| No emergent deception | Behavior is bounded by proofs, not learned tendencies |
| Compositionality | Larger safe systems can be built from smaller verified components |
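The compositionality advantage can be made concrete with an assume-guarantee style check: a pipeline of two verified components is safe if the first component's guaranteed output range is contained in the second's verified input range. A toy sketch (the interval-contract representation is illustrative, not a real proof format):

```python
# Assume-guarantee composition sketch: component A guarantees its output
# lies in a_output_range for all valid inputs; component B is verified
# safe for any input in b_input_range. The composed system A -> B is
# safe whenever A's guarantee is contained in B's assumption.

def compose_contracts(a_output_range, b_input_range):
    a_lo, a_hi = a_output_range
    b_lo, b_hi = b_input_range
    # Containment of intervals discharges B's assumption, so the two
    # component proofs combine into a proof for the pipeline.
    return b_lo <= a_lo and a_hi <= b_hi
```

The point of the pattern is that each component is verified once, in isolation, and system-level safety follows from cheap containment checks rather than re-verifying the whole pipeline.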

Limitations and Risks

| Limitation | Severity | Explanation |
|---|---|---|
| May not scale to AGI | HIGH | Unknown if formal methods can handle AGI-level complexity |
| Capability tax | HIGH | Verified systems may be significantly less capable |
| World model verification | CRITICAL | The world model must be correct; this is extremely hard |
| Specification gaming | MEDIUM | The system optimizes the formal spec, which may miss true intent |
| Implementation bugs | MEDIUM | The implementation must match the formal model |

Research Landscape

Key Organizations

| Organization | Role | Funding | Focus |
|---|---|---|---|
| ARIA (UK) | Primary funder of Safeguarded AI | £59M committed | End-to-end verified AI systems |
| Stanford Center for AI Safety | Academic research | University + grants | Formal methods for ML |
| George Mason ROARS Lab | Verification tool development | NSF funding | NeuralSAT verifier (2nd place, VNN-COMP 2024) |
| CMU + Berkeley | α,β-CROWN development | DARPA, NSF | Leading verification tool |
| MIRI | Theoretical foundations | Private donors (≈$1M/year) | Agent foundations theory |

Key People

| Person | Affiliation | Contribution |
|---|---|---|
| David "davidad" Dalrymple | ARIA | Safeguarded AI programme lead; GS AI framework author |
| Yoshua Bengio | Mila/ARIA | GS AI paper co-author; joined as ARIA Scientific Director in August 2024 |
| Kathleen Fisher | ARIA (incoming CEO) | Former DARPA HACMS programme lead |
| Stuart Russell | UC Berkeley | GS AI co-author; provably beneficial AI advocate |
| Sanjit Seshia | UC Berkeley | Verified AI; Scenic language for testing |
| Clark Barrett | Stanford | SMT solver development (CVC5) |
| Jeannette Wing | Columbia | Trustworthy AI initiative |

Key Publications

| Publication | Year | Authors | Impact |
|---|---|---|---|
| Towards Guaranteed Safe AI | 2024 | Dalrymple, Skalse, Bengio, Russell, Tegmark + 12 others | Foundational framework paper defining GS AI |
| VNN-COMP 2024 Report | 2024 | Brix, Bak, Johnson, Wu | State-of-the-art verification benchmarks |
| Safety Verification of DNNs | 2017 | Huang et al. | First systematic DNN verification framework |
| ARIA Programme Thesis v1.2 | 2024 | Dalrymple | Detailed technical roadmap |
| Survey: LLM Safety via V&V | 2024 | Huang et al. | Comprehensive review of verification for LLMs |
| Toward Verified AI | 2017 | Seshia, Sadigh, Sastry | CACM overview of challenges |

Comparison with Other Approaches

| Aspect | Provable Safe | Heavy Scaffolding | Standard RLHF |
|---|---|---|---|
| Safety guarantees | STRONG (if achievable) | PARTIAL (scaffold only) | WEAK (empirical) |
| Current capability | LOW | MEDIUM | HIGH |
| Scalability | UNKNOWN | PROVEN | PROVEN |
| Development cost | VERY HIGH | MEDIUM | LOW |
| Interpretability | HIGH by design | MEDIUM | LOW |

Probability Assessment

Quantified Probability Factors

| Factor | Estimate | Reasoning |
|---|---|---|
| Technical feasibility by 2035 | 15-30% | Scalability gap closing at ≈10x/3 years; still 3-4 doublings needed |
| Competitive capability if feasible | 20-40% | Compositional approaches may enable capability parity |
| Adoption if competitive | 40-60% | Regulatory pressure and liability concerns favor verified systems |
| Compound probability | 1.2-7.2% | Product of the above factors |
| Central estimate | 3-5% | Accounting for correlation and unknown unknowns |
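The compound range in the table is just the product of the per-factor endpoints, which can be checked directly:

```python
# Reproducing the compound-probability arithmetic: multiply the low and
# high ends of each (assumed independent) factor.
factors = {
    "technical_feasibility": (0.15, 0.30),
    "competitive_if_feasible": (0.20, 0.40),
    "adoption_if_competitive": (0.40, 0.60),
}

low = high = 1.0
for lo, hi in factors.values():
    low *= lo
    high *= hi

print(f"compound range: {low:.1%} - {high:.1%}")  # 1.2% - 7.2%
```

Multiplying endpoints assumes the factors are independent; the lower 3-5% central estimate in the table reflects the correlations and unknown unknowns that this simple product ignores.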

Arguments for Higher Probability (greater than 5%)

  1. Strong safety incentives - As AI gets more powerful, demand for guarantees may increase; International AI Safety Report 2025 emphasizes need for mathematical guarantees in safety-critical domains
  2. Regulatory pressure - EU AI Act requires risk assessment for high-risk AI; formal verification may become compliance pathway
  3. ARIA funding - £59M+ committed with clear milestones; additional £110M in related ARIA funding in 2024
  4. Compositionality - Verified components could be combined with other approaches; modular verification enables scaling

Arguments for Lower Probability (less than 1%)

  1. Capability gap - Unverified systems may always be 10-100x more capable due to verification overhead
  2. Speed of AI progress - TAI may arrive by 2030-2035; verification likely not ready
  3. World model problem - Alignment Forum analysis shows verifying world models may be fundamentally impossible in open domains
  4. Economic incentives - Labs optimizing for capabilities face 10-30% performance tax from verification requirements

Key Uncertainties

  1. Can world models be verified? The entire approach depends on having provably accurate world models. This is the crux.

  2. What's the capability ceiling? We don't know if verified systems can be competitive with unverified ones.

  3. Can it handle novel situations? Formal verification typically requires known domains; how to handle unprecedented situations?

  4. Will anyone adopt it? Even if possible, will competitive pressures prevent adoption?

Implications for Safety Research

This Approach Changes the Game If...

  • World model verification is solved
  • Verified systems reach capability parity
  • Regulatory frameworks require formal safety proofs
  • A major AI accident increases demand for guarantees

Research That Contributes

  • Formal methods - Core verification techniques
  • World modeling - Understanding physical systems
  • Specification learning - Translating values to formal specs
  • Compositional verification - Building larger proofs from smaller ones

Timeline Considerations

| Milestone | Estimated Timeline | Confidence | Key Dependencies |
|---|---|---|---|
| ARIA TA1 proof language + expanded assurance | 2025-2026 | HIGH | Funding secured; teams selected; scope expanded Nov 2025 |
| AI Production Facility | Abandoned (Nov 2025) | N/A | TA2 Phase 2 cancelled; frontier AI advances made dedicated capability development unnecessary |
| Proof-of-concept (constrained domain) | 2026-2027 | MEDIUM | World model quality for target domain |
| ARIA Stage 3 (integrated demo) | 2028 | MEDIUM | Per ARIA roadmap; dependent on prior stages |
| Competitive narrow AI (e.g., grid control) | 2028-2032 | LOW | Requires capability parity in domain |
| Competitive general AI | 2032+? | VERY LOW | Fundamental scalability breakthrough needed |
| TAI-level verified system | Unknown | SPECULATIVE | May require a paradigm shift in verification |

Historical Progress in Neural Network Verification

| Year | Largest Verified Network | Improvement | Key Advance |
|---|---|---|---|
| 2017 | ≈10^3 neurons | Baseline | First DNN verification (Reluplex) |
| 2019 | ≈10^4 neurons | 10x | Complete verification methods |
| 2021 | ≈10^5 neurons | 10x | GPU-accelerated bound propagation |
| 2024 | ≈10^6 neurons | 10x | α,β-CROWN optimizations |
| 2027 (projected) | ≈10^7 neurons | 10x | If trend continues |
| 2030 (projected) | ≈10^8 neurons | 10x | Still 4 orders of magnitude from frontier |

Critical observation: At the current rate of ~10x improvement every 3 years, verification tools will not reach frontier model scale until 2039-2042—potentially after transformative AI has already been developed via other paradigms.
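The arithmetic behind this projection is straightforward (assuming the 10x-per-3-years trend holds exactly, which is of course the key uncertainty):

```python
# Back-of-envelope extrapolation: starting from ~10^6 verified parameters
# in 2024 and improving ~10x every 3 years, when does verification reach
# a 10^12-parameter frontier model?

def year_verification_reaches(target_exp, start_year=2024, start_exp=6,
                              years_per_10x=3):
    gap_orders = target_exp - start_exp       # orders of magnitude to close
    return start_year + gap_orders * years_per_10x

print(year_verification_reaches(12))  # 2042, the upper end of the range above
```

The lower end of the 2039-2042 range corresponds to a modestly faster pace (roughly 10x every 2.5 years); either way the crossover lands well after most transformative-AI timeline estimates.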

Sources and Further Reading

References

1. AI safety (Wikipedia) · Reference

A comprehensive Wikipedia overview of AI safety as an interdisciplinary field, covering its core components including AI alignment, risk monitoring, and robustness, as well as the policy landscape and institutional developments through 2023. The article surveys motivations ranging from near-term risks like bias and surveillance to speculative existential risks from AGI, and documents the field's rapid growth following generative AI advances.

★★★☆☆
2. Towards Guaranteed Safe AI (arXiv, David "davidad" Dalrymple et al., 2024) · Paper

This paper introduces Guaranteed Safe (GS) AI, an approach to AI safety that aims to equip AI systems with high-assurance quantitative safety guarantees. The framework operates through three core components: a world model (mathematical description of how the AI system affects the world), a safety specification (mathematical description of acceptable effects), and a verifier (providing auditable proof that the AI satisfies the safety specification). The authors outline approaches for creating each component, discuss technical challenges, and argue for the necessity of this approach over alternative safety methods.

★★★☆☆
3. International AI Safety Report 2025 (internationalaisafetyreport.org)

A landmark international scientific assessment co-authored by 96 experts from 30 countries, providing a comprehensive overview of general-purpose AI capabilities, risks, and risk management approaches. It aims to establish shared scientific understanding across nations as a foundation for global AI governance. The report covers topics including capability evaluation, misuse risks, systemic risks, and mitigation strategies.

Related Wiki Pages


Approaches

Provably Safe AI (davidad agenda)

Organizations

LessWrong

Other

Yoshua Bengio · Stuart Russell · David Dalrymple

Key Debates

Technical AI Safety Research