Agent Foundations
Agent foundations research (MIRI's mathematical frameworks for aligned agency) shows low tractability after 10+ years with core problems unsolved, leading to MIRI's 2024 strategic pivot away from the field. Our assessment: ~15-25% probability the work is essential, 60-75% confidence in low tractability, and 3-5x higher value under long-timeline assumptions.
Overview
Agent foundations research seeks to develop mathematical frameworks for understanding agency, goals, and alignment before advanced AI systems manifest the problems these frameworks aim to solve. The approach treats alignment as requiring theoretical prerequisites: formal definitions of concepts like "corrigibility," "embedded reasoning," and "value learning" that must be established before building reliably aligned systems.
This research program was pioneered by the Machine Intelligence Research Institute (MIRI) starting in 2013. Key contributions include embedded agency (how agents reason about themselves within their environment), logical induction (managing uncertainty about mathematical truths), functional decision theory (how agents should make decisions under uncertainty), and formal corrigibility analysis (conditions under which agents accept correction).
The field faces a critical juncture. In January 2024, MIRI announced a strategic pivot away from agent foundations research, stating that "alignment research at MIRI and in the larger field had gone too slowly" and that they "now believed this research was extremely unlikely to succeed in time to prevent an unprecedented catastrophe." MIRI has shifted focus to policy and communications work, with their agent foundations team discontinued. This development has intensified debate about whether agent foundations represents essential groundwork or a research dead end.
Quick Assessment
| Dimension | Assessment | Confidence | Notes |
|---|---|---|---|
| Tractability | Low | Medium | MIRI's 2024 pivot suggests internal pessimism; key problems remain open after 10+ years |
| Value if alignment hard | High | Medium | May be necessary for robust solutions to deceptive alignment |
| Value if alignment easy | Low | High | Probably not needed if empirical methods suffice |
| Neglectedness | Very High | High | Fewer than 20 full-time researchers globally; MIRI's discontinuation reduces this further |
| Current momentum | Declining | High | Primary organization pivoted away; limited successor efforts |
| Theory-practice gap | Unknown | Low | Core question: will formalisms apply to neural networks? |
Evaluation by Worldview
| Worldview | Assessment | Reasoning |
|---|---|---|
| Doomer | Mixed | May need foundations for robust alignment, but pessimistic about tractability |
| Long timelines | Positive | More time for theoretical progress to compound |
| Optimistic | Low priority | Empirical methods likely sufficient |
| Governance-focused | Low priority | Policy interventions more tractable in relevant timeframes |
Core Research Areas
Agent foundations research spans several interconnected problem domains. The 2018 "Embedded Agency" framework by Abram Demski and Scott Garrabrant unified these into a coherent research agenda, identifying four core subproblems of embedded agency: decision theory, embedded world-models, robust delegation, and subsystem alignment.
```mermaid
flowchart TD
    EA[Embedded Agency] --> DT[Decision Theory]
    EA --> EW[Embedded World-Models]
    EA --> RD[Robust Delegation]
    EA --> SA[Subsystem Alignment]
    DT --> FDT[Functional Decision Theory]
    DT --> LU[Logical Uncertainty]
    EW --> NF[Naturalized Reasoning]
    RD --> CORR[Corrigibility]
    RD --> VL[Value Learning]
    SA --> INNER[Inner Alignment]
    SA --> MESA[Mesa-Optimization]
    style EA fill:#e1f5fe
    style DT fill:#fff3e0
    style EW fill:#fff3e0
    style RD fill:#fff3e0
    style SA fill:#fff3e0
```
Research Progress by Area
| Area | Core Question | Key Results | Status |
|---|---|---|---|
| Embedded agency | How can an agent reason about itself within the world? | Demski & Garrabrant (2018) framework identifies four subproblems | Conceptual clarity achieved; formal solutions lacking |
| Decision theory | How should AI make decisions under uncertainty? | FDT/UDT developed as alternatives to CDT/EDT; logical causality still undefined | Active research; practical relevance debated |
| Logical induction | How to have calibrated beliefs about math? | Garrabrant et al. (2016) algorithm achieves key desiderata | Major success; integration with decision theory incomplete |
| Corrigibility | Can we formally define "willing to be corrected"? | Soares et al. (2015) proved utility indifference approach has fundamental limits | Open problem; no complete solution |
| Value learning | How to learn values from observations? | CIRL (Hadfield-Menell et al. 2016) formalizes cooperative learning | Theoretical framework exists; practical applicability unclear |
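The value-learning entry above can be made concrete with a toy sketch: a Bayesian observer inferring which of two candidate reward functions a demonstrator is optimizing. This illustrates the inference at the heart of IRL-style value learning, not the CIRL formalism itself; all names and numbers here are illustrative assumptions.

```python
# Toy Bayesian value learning: an observer infers which candidate reward
# function a demonstrator is optimizing, from observed action choices.
# A simplified sketch of the inference underlying IRL/CIRL, not the CIRL
# formalism itself (hypothesis names and numbers are made up).
import math

ACTIONS = ["left", "right"]

# Candidate reward functions the observer considers possible.
REWARD_HYPOTHESES = {
    "prefers_left":  {"left": 1.0, "right": 0.0},
    "prefers_right": {"left": 0.0, "right": 1.0},
}

def action_likelihood(action, rewards, beta=2.0):
    """Boltzmann-rational demonstrator: P(a) is proportional to exp(beta * R(a))."""
    weights = {a: math.exp(beta * rewards[a]) for a in ACTIONS}
    return weights[action] / sum(weights.values())

def update_posterior(prior, observed_actions):
    """Bayes' rule over reward hypotheses given a demonstration."""
    posterior = dict(prior)
    for action in observed_actions:
        for name, rewards in REWARD_HYPOTHESES.items():
            posterior[name] *= action_likelihood(action, rewards)
        norm = sum(posterior.values())
        posterior = {k: v / norm for k, v in posterior.items()}
    return posterior

prior = {"prefers_left": 0.5, "prefers_right": 0.5}
posterior = update_posterior(prior, ["right", "right", "right"])
# After three "right" choices, belief shifts strongly toward prefers_right.
```

The gap the table flags remains visible even in the toy: the inference only works given a hypothesis space of rewards and a rationality model for the demonstrator, and both are exactly what is unclear for real systems.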
Embedded Agency: The Central Framework
Traditional models of rational agents treat them as separate from their environment, able to model it completely and act on it from "outside." Real AI systems are embedded: they exist within the world they're modeling, are made of the same stuff as their environment, and cannot fully model themselves.
The embedded agency sequence identifies four interconnected challenges:
Decision Theory: How should an embedded agent make decisions when its decision procedure is part of the world being reasoned about? Standard decision theories assume the agent can consider counterfactual actions, but an embedded agent's reasoning is itself a physical process that affects outcomes.
Embedded World-Models: How can an agent maintain beliefs about a world that contains the agent itself? Self-referential reasoning creates paradoxes similar to Gödel's incompleteness theorems.
Robust Delegation: How can an agent create successor agents that reliably pursue the original agent's goals? This connects to corrigibility and value learning.
Subsystem Alignment: How can an agent ensure its internal optimization processes remain aligned with its overall goals? This prefigured the mesa-optimization concerns that later became central to alignment discourse.
Logical Induction: A Landmark Result
Logical induction (Garrabrant et al., 2016) is widely considered agent foundations' most significant concrete achievement. The algorithm provides a computable method for assigning probabilities to mathematical statements, analogous to how Solomonoff induction handles empirical uncertainty.
A logical inductor assigns probabilities to mathematical claims that:
- Converge to truth: Probabilities approach 1 for true statements, 0 for false ones
- Outpace deduction: Assigns high probability to provable claims before proofs are found
- Resist exploitation: No computationally bounded trader can profit systematically
This addresses a fundamental gap: Bayesian reasoning assumed logical omniscience, but real reasoners (including AI systems) must be uncertain about mathematical facts. Logical induction provides the first satisfactory theory for such uncertainty.
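The "outpace deduction" desideratum can be caricatured in a few lines. The real logical inductor runs a market of trading algorithms (Garrabrant et al., 2016); the sketch below only mimics one behavior logical inductors provably exhibit, namely pricing unproven claims in a class near the empirical truth frequency of already-settled claims in that class. It is a cartoon under stated assumptions, not the algorithm.

```python
# Cartoon of the "outpace deduction" property: before a claim is settled by
# the (slow) deductive process, estimate its probability from the truth
# frequency of already-settled claims in the same class. Real logical
# inductors achieve this (and much more) via a trading market; this toy
# only mimics that one observable behavior.

def class_frequency_price(settled):
    """Price an unsettled claim at the truth frequency of settled peers.
    `settled` is a list of booleans for claims already decided."""
    if not settled:
        return 0.5  # no evidence yet: stay at the uninformative prior
    return sum(settled) / len(settled)

# A deductive process slowly settles claims like "the n-th case holds",
# which happen to be true 90% of the time.
settled_so_far = [True] * 9 + [False]      # 9 of 10 settled claims true
price = class_frequency_price(settled_so_far)
# The inductor now bets ~0.9 on the next, still-unproven claim in the
# class, well before its proof (or disproof) is found.
```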
However, integration with decision theory remains incomplete. As Wei Dai and others have noted, the framework doesn't yet combine smoothly with updateless decision theory, limiting its practical applicability.
Corrigibility: An Open Problem
Corrigibility (Soares et al., 2015) formalized the challenge of building AI systems that accept human correction. A corrigible agent should:
- Not manipulate its operators
- Preserve oversight mechanisms
- Follow shutdown instructions
- Ensure these properties transfer to successors
The paper proved that naive approaches fail. "Utility indifference"—giving equal utility to shutdown and continued operation—creates perverse incentives: if shutdown utility is too high, agents may engineer their own shutdown; if too low, they resist it.
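The perverse-incentive argument can be made concrete with a stylized expected-utility calculation. This is a sketch with made-up numbers and policy names, not the formalism of the paper:

```python
# Toy illustration of the utility-indifference failure described above:
# depending on how shutdown is valued relative to continued operation, an
# expected-utility maximizer either engineers its own shutdown or resists
# it. All policies and numbers are illustrative.

def best_action(u_shutdown, u_continue=1.0, p_operator_presses=0.5):
    """Pick the stylized policy with highest expected utility."""
    options = {
        # Let the operator decide: shutdown with prob p, continue otherwise.
        "comply": p_operator_presses * u_shutdown
                  + (1 - p_operator_presses) * u_continue,
        # Disable the button: always continue operating.
        "resist": u_continue,
        # Press the button yourself: guaranteed shutdown.
        "self_shutdown": u_shutdown,
    }
    return max(options, key=options.get)

best_action(u_shutdown=5.0)   # shutdown overvalued -> "self_shutdown"
best_action(u_shutdown=0.0)   # shutdown undervalued -> "resist"
```

Because "comply" is a mixture of the other two options, it can never strictly beat both, which is why the naive fix is to set the utilities exactly equal (indifference), and why that fix brings its own problems.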
No complete solution exists. Stuart Armstrong's work at the Future of Humanity Institute continued exploring corrigibility, but fundamental obstacles remain. The problem connects to instrumental convergence: almost any goal creates incentives to resist modification.
Decision Theory: FDT, UDT, and Open Questions
Agent foundations researchers developed novel decision theories to address problems with standard frameworks:
| Theory | Key Insight | Limitations |
|---|---|---|
| CDT (Causal) | Choose action with best causal consequences | Fails on Newcomb-like problems |
| EDT (Evidential) | Choose action with best correlated outcomes | Can recommend "managing the news" |
| FDT (Functional) | Choose as if choosing the output of your decision algorithm | Requires undefined "logical causation" |
| UDT (Updateless) | Pre-commit to policies before receiving observations | Doesn't integrate well with logical induction |
Functional Decision Theory (Yudkowsky & Soares, 2017) and Updateless Decision Theory (Wei Dai, 2009) were designed to handle cases where the agent's decision procedure is predicted by other agents or affects multiple instances of itself. These theories outperform CDT and EDT in specific scenarios but require technical machinery—particularly "logical counterfactuals"—that remains undefined.
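The disagreement is easiest to see on Newcomb's problem. The sketch below uses the standard expected-value analysis with simplified framings of each theory; payoffs and the predictor-accuracy parameter are illustrative.

```python
# Toy Newcomb's problem: a predictor with accuracy p fills the opaque box
# with $1M only if it predicted you would one-box; the transparent box
# always holds $1k. Each theory's framing is simplified for illustration.

def edt_value(action, p=0.99, big=1_000_000, small=1_000):
    """EDT conditions on the action as evidence about the prediction."""
    if action == "one_box":
        return p * big                    # predictor likely foresaw one-boxing
    return p * small + (1 - p) * (big + small)

def cdt_value(action, p_box_filled, big=1_000_000, small=1_000):
    """CDT holds the box contents fixed: two-boxing always adds $1k."""
    base = p_box_filled * big
    return base + (small if action == "two_box" else 0)

# EDT (and FDT, whose verdict coincides on this problem) recommends
# one-boxing:
assert edt_value("one_box") > edt_value("two_box")
# CDT recommends two-boxing for any fixed belief about the box contents:
assert cdt_value("two_box", 0.5) > cdt_value("one_box", 0.5)
```

What the sketch cannot show is precisely the open part: FDT justifies one-boxing by intervening on "the output of your decision algorithm," and formalizing that logical counterfactual is the undefined machinery the table refers to.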
The MIRI/Coefficient Giving exchange on decision theory (Bensinger, 2021) reveals ongoing disagreement about whether this research direction is productive or has generated more puzzles than solutions.
The Core Debate: Agent Foundations vs. Empirical Safety
The most consequential disagreement in AI safety research methodology concerns whether theoretical foundations are necessary for alignment or whether empirical investigation of neural networks is more productive.
The Case for Agent Foundations
Proponents argue that alignment requires understanding agency at a fundamental level:
Theoretical prerequisites exist: Just as designing safe bridges requires understanding materials science, designing aligned AI may require understanding what goals, agency, and alignment formally mean. Ad hoc solutions may fail unpredictably on more capable systems.
Current methods won't scale: Techniques like RLHF work on current systems but may fail catastrophically on more capable ones. The sharp left turn hypothesis suggests capabilities may generalize while alignment doesn't—requiring foundational understanding of why.
Deceptive alignment is fundamental: Deceptive alignment and mesa-optimization may require formal analysis rather than empirical debugging. These failure modes are precisely about systems that pass empirical tests while remaining misaligned.
Conceptual clarity enables coordination: Even if formal solutions aren't achieved, precise problem definitions help researchers coordinate and avoid talking past each other.
The Case for Empirical Safety Research
Critics, including many who previously supported agent foundations, argue the approach has failed:
Slow progress on key problems: After 10+ years of work, corrigibility remains unsolved, decision theory has generated more puzzles than solutions, and logical induction doesn't integrate with the rest of the framework. MIRI's 2024 pivot implicitly acknowledges this.
Neural networks may not fit formalisms: Agent foundations research deliberately makes few assumptions about the "gears-level" properties of AI systems, but modern AI is almost certain to be built from neural networks. Mechanistic interpretability and empirical safety research can assume neural-network architectures and make faster progress.
The Science of Deep Learning argument: Researchers like Neel Nanda argue that understanding how neural networks actually work (through mechanistic interpretability, scaling laws, etc.) is more tractable than developing abstract theories. Neural network properties may generalize even if specific alignment techniques don't.
Theory-practice gap is uncertain: It's unknown whether theoretical results will apply to systems that emerge from gradient descent. The gap between "ideal agents" studied in agent foundations and actual neural networks may be unbridgeable.
Resolution Attempts
| Position | Probability | Implications |
|---|---|---|
| Agent foundations essential | 15-25% | Major research effort needed; current neglect is harmful |
| Agent foundations helpful but not necessary | 30-40% | Worth some investment; complements empirical work |
| Agent foundations irrelevant to actual AI | 25-35% | Resources better allocated to empirical safety |
| Agent foundations actively harmful (delays practical work) | 5-15% | Should be deprioritized or defunded |
Key Cruxes
Crux 1: Tractability
| Evidence for tractability | Evidence against |
|---|---|
| Logical induction solved a decades-old problem | Core problems (corrigibility, decision theory) remain open after 10+ years |
| Progress on conceptual clarity (embedded agency framework) | MIRI's 2024 pivot suggests insider pessimism |
| Problems are well-defined and mathematically precise | Precision hasn't translated to solutions |
| Small field—more researchers might help | Field has been trying to expand for years with limited success |
Current assessment: Low tractability (60-75% confidence). The difficulty appears intrinsic rather than due to resource constraints.
Crux 2: Theory-Practice Transfer
| Evidence for transfer | Evidence against |
|---|---|
| Mesa-optimization concepts from agent foundations now guide empirical research | Actual neural networks may not have the structure assumed by formalisms |
| Conceptual frameworks help identify what to look for empirically | Gradient descent produces systems very different from idealized agents |
| Mathematical guarantees could provide confidence | Assumptions required for guarantees may not hold in practice |
Current assessment: Uncertain (low confidence). This is the crux where evidence is weakest.
Crux 3: Timeline Sensitivity
| Short timelines (AGI by 2030) | Long timelines (AGI 2040+) |
|---|---|
| Agent foundations unlikely to produce results in time | More time for theoretical progress to compound |
| Must work with practical methods on current systems | Can develop foundations before crisis |
| MIRI's stated reason for pivoting | Theoretical work becomes higher-value investment |
Current assessment: Conditional value. Agent foundations' expected value is 3-5x higher under long-timeline assumptions.
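The timeline-sensitivity claim is simple expected-value arithmetic. The sketch below stipulates a unit payoff and a 4x long-timeline multiplier purely for illustration; only the ~15-25% and 3-5x figures come from the assessment above.

```python
# Illustrative arithmetic for timeline sensitivity: holding the probability
# that the research is essential fixed, a 3-5x payoff multiplier under long
# timelines scales expected value by the same factor. Payoff units are
# stipulated, not estimates.

def expected_value(p_essential, payoff_if_essential, payoff_otherwise=0.0):
    return (p_essential * payoff_if_essential
            + (1 - p_essential) * payoff_otherwise)

short = expected_value(p_essential=0.20, payoff_if_essential=1.0)
long_ = expected_value(p_essential=0.20, payoff_if_essential=4.0)
# With a 4x payoff multiplier, expected value is roughly 4x higher,
# consistent with the 3-5x range quoted above.
```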
Who Should Work on This?
Good Fit Indicators
| Factor | Weight | Notes |
|---|---|---|
| Strong math/philosophy background | High | Research requires formal methods, logic, decision theory |
| Belief alignment is fundamentally hard | High | Must expect empirical methods to be insufficient |
| Long timeline assumptions | Medium | Affects expected value of theoretical work |
| Tolerance for slow, uncertain progress | High | Field has historically moved slowly |
| Interest in problems for their own sake | Medium | May need intrinsic motivation given uncertain practical payoff |
Career Considerations
Advantages: Very neglected (high marginal value if tractable); intellectually interesting problems; potential for high-impact breakthroughs if successful.
Disadvantages: Low tractability; uncertain practical relevance; declining institutional support (MIRI pivot); difficult to evaluate progress.
Alternatives to consider: Mechanistic interpretability, scalable oversight, AI control, governance research.
Sources and Further Reading
Primary Research
- Embedded Agency (Demski & Garrabrant, 2018): AI Alignment Forum sequence presenting the unified framework
- Logical Induction (Garrabrant et al., 2016): MIRI announcement of the paper
- Corrigibility (Soares et al., 2015): MIRI paper (PDF)
- Functional Decision Theory (Yudkowsky & Soares, 2017): MIRI announcement of the paper
- Agent Foundations Technical Agenda (Soares & Fallenstein, 2017): MIRI technical agenda (PDF)
Strategic and Meta-Level
- MIRI 2024 Mission and Strategy Update: MIRI blog post explaining the pivot away from agent foundations
- MIRI/Coefficient Giving Exchange on Decision Theory: AI Alignment Forum dialogue (Bensinger, 2021)
- Science of Deep Learning vs. Agent Foundations: LessWrong post (NickGabs, 2023) arguing for empirical approaches
Value Learning and CIRL
- Cooperative Inverse Reinforcement Learning (Hadfield-Menell et al., 2016): arXiv paper formalizing value learning as a cooperative game
- Inverse Reinforcement Learning: AI Alignment Forum wiki entry
Related Concepts
- Mechanistic Interpretability - Empirical alternative to theoretical foundations
- Mesa-Optimization - Risk identified through agent foundations research
- Deceptive Alignment - Core problem agent foundations aims to address