MIRI Technical Agenda: Agent Foundations for Aligning Machine Intelligence with Human Interests
Web Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: MIRI
Published by MIRI (Machine Intelligence Research Institute), this agenda was highly influential in shaping early technical AI safety research and represents a formal-methods, agent-theoretic approach to alignment distinct from more recent machine learning-focused paradigms.
Metadata
Summary
This document outlines MIRI's research agenda focused on foundational mathematical and logical problems in AI alignment, particularly around building provably safe and reliably aligned AI agents. It identifies key technical challenges including logical uncertainty, decision theory, embedded agency, and value learning as core unsolved problems prerequisite to building trustworthy advanced AI systems.
Key Points
- Identifies foundational agent theory problems (logical uncertainty, decision theory, Vingean reflection) that must be solved before advanced AI can be reliably aligned.
- Argues that current AI development lacks the formal foundations needed to build systems whose behavior can be mathematically verified or guaranteed.
- Proposes research on 'naturalized induction' to handle agents embedded in the environments they reason about, a key gap in classical AI theory.
- Addresses the problem of utility indifference and corrigibility: ensuring advanced AI systems remain responsive to human correction and oversight.
- Frames alignment as requiring new mathematics, not just engineering fixes, emphasizing the need for pre-paradigmatic foundational research.
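The utility-indifference idea mentioned above can be made concrete with a toy sketch (my own simplified framing in the spirit of Armstrong's proposal, not code from the agenda): add a compensating term to the agent's utility when a shutdown button is pressed, so that the pressed and unpressed branches are worth exactly the same and the agent has no incentive to cause or prevent the press. All names and numbers below are illustrative assumptions.

```python
# Toy sketch of utility indifference (simplified; not the paper's formalism).
from typing import Callable

def indifferent_utility(
    u_normal: Callable[[str], float],    # utility while operating normally
    u_shutdown: Callable[[str], float],  # utility once shut down
    outcome: str,
    button_pressed: bool,
) -> float:
    """Utility with a compensation term that cancels the value of shutdown."""
    if button_pressed:
        # The compensation makes the pressed branch worth exactly as much as
        # the unpressed branch, so the agent is indifferent to the press.
        compensation = u_normal(outcome) - u_shutdown(outcome)
        return u_shutdown(outcome) + compensation
    return u_normal(outcome)

# Demo: the agent gains nothing by manipulating the button in either outcome.
u_run = lambda o: {"task_done": 10.0, "task_failed": 0.0}[o]
u_off = lambda o: 1.0  # shutdown pays a flat toy value
for outcome in ("task_done", "task_failed"):
    assert indifferent_utility(u_run, u_off, outcome, True) == \
           indifferent_utility(u_run, u_off, outcome, False)
```

Because the two branches are always equal in value, an expected-utility maximizer has no instrumental reason to resist (or seek) shutdown, which is the corrigibility property the agenda is after; the open problems concern making this robust under self-modification and learning.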
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Agent Foundations | Approach | 59.0 |
Cached Content Preview
# Agent Foundations for Aligning Machine Intelligence with Human Interests: A Technical Research Agenda
In The Technological Singularity: Managing the Journey. Springer. 2017
Nate Soares and Benya Fallenstein
Machine Intelligence Research Institute
{nate,benya}@intelligence.org
# Contents
1 Introduction
1.1 Why These Problems?
2 Highly Reliable Agent Designs
2.1 Realistic World-Models
2.2 Decision Theory
2.3 Logical Uncertainty
2.4 Vingean Reflection
3 Error-Tolerant Agent Designs
4 Value Specification
5 Discussion
5.1 Toward a Formal Understanding of the Problem
5.2 Why Start Now?
# 1 Introduction
The property that has given humans a dominant advantage over other species is not strength or speed, but intelligence. If progress in artificial intelligence continues unabated, AI systems will eventually exceed humans in general reasoning ability. A system that is “superintelligent” in the sense of being “smarter than the best human brains in practically every field” could have an enormous impact upon humanity (Bostrom 2014). Just as human intelligence has allowed us to develop tools and strategies for controlling our environment, a superintelligent system would likely be capable of developing its own tools and strategies for exerting control (Muehlhauser and Salamon 2012). In light of this potential, it is essential to use caution when developing AI systems that can exceed human levels of general intelligence, or that can facilitate the creation of such systems.
Since artificial agents would not share our evolutionary history, there is no reason to expect them to be driven by human motivations such as lust for power. However, nearly all goals can be better met with more resources (Omohundro 2008). This suggests that, by default, superintelligent agents would have incentives to acquire resources currently being used by humanity. (Just as artificial agents would not automatically acquire a lust for power, they would not automatically acquire a human sense of fairness, compassion, or conservatism.) Thus, most goals would put the agent at odds with human interests, giving it incentives to deceive or manipulate its human operators and resist interventions designed to change or debug its behavior (Bostrom 2014, chap. 8).
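Omohundro's point that "nearly all goals can be better met with more resources" can be illustrated with a toy model (my own illustration under assumed toy numbers, not an example from the paper): for several unrelated terminal goals, the plan that first acquires more resources scores higher than pursuing the goal directly, so resource acquisition emerges as a convergent instrumental subgoal.

```python
# Toy illustration of instrumental convergence: unrelated goals all favor
# the resource-acquiring plan. Goal names and difficulties are made up.

def achieved(goal_difficulty: float, resources: float) -> float:
    """Fraction of a goal achieved given available resources (toy model)."""
    return min(1.0, resources / goal_difficulty)

PLANS = {
    "pursue_goal_directly": 1.0,    # spend only the initial resources
    "acquire_resources_first": 4.0, # quadruple resources, then pursue goal
}

GOALS = {"make_paperclips": 8.0, "prove_theorems": 3.0, "cure_disease": 5.0}

for goal, difficulty in GOALS.items():
    best = max(PLANS, key=lambda plan: achieved(difficulty, PLANS[plan]))
    print(f"{goal}: best plan = {best}")
```

In this toy setting every goal selects `acquire_resources_first`, mirroring the argument that resource acquisition is instrumentally useful by default rather than a motive the agent must be given.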
Care must be taken to avoid constructing systems that exhibit this default behavior. In order to ensure that the development of smarter-than-human intelligence has a positive impact on the world, we must meet three formidable challenges: How can we create an agent that will reliably pursue the goals it is given? How can we formally specify beneficial goals? And how can we ensure that this agent will assist and cooperate with its programmers as they improve its design, given that mistakes in early AI systems are inevitable?
This agenda discusses technical research that is tractable today, which the authors think will make it easier to confront these three challenges in the future. Sections
... (truncated, 68 KB total)