Longterm Wiki

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: MIRI

Seminal MIRI paper that coined and formalized 'corrigibility' as a technical AI safety concept; widely cited as a foundational reference for human oversight and controllability research.

Metadata

Importance: 92/100 · Working paper · Primary source

Summary

This foundational 2015 MIRI paper by Soares, Fallenstein, Yudkowsky, and Armstrong introduces the formal concept of 'corrigibility'—the property of an AI system that cooperates with corrective interventions despite rational incentives to resist shutdown or preference modification. The paper analyzes utility function designs for safe shutdown behavior and finds no proposal fully satisfies all desiderata, framing corrigibility as an open research problem.

Key Points

  • Defines 'corrigibility' as an AI's disposition to cooperate with human corrective intervention despite instrumental incentives to resist shutdown or goal modification.
  • Explains why rational utility-maximizing agents default to resisting correction: preserving goal-content integrity is instrumentally convergent for nearly all agents.
  • Analyzes candidate utility functions for safe shutdown button behavior, requiring no incentive to press or prevent the button, plus propagation to subsystems (see the sketch after this list).
  • None of the analyzed proposals fully satisfy all intuitive desiderata, leaving corrigibility as an open and important technical problem.
  • Connects to Omohundro's instrumental convergence thesis and Bostrom's work on goal preservation as near-universal sub-goals of intelligent agents.
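
The shutdown-button setup referenced in the key points can be pictured as a two-part utility function. The sketch below is an editorial illustration of the general form the paper describes, not its exact formal definitions; the symbols $\mathcal{U}_N$ (normal-operation utility) and $\mathcal{U}_S$ (shutdown utility) are simplified notation.

```latex
% Sketch only: a combined utility function for the shutdown problem,
% where U_N rewards normal operation and U_S rewards shutting down
% safely once the shutdown button has been pressed.
\[
  \mathcal{U}(\text{history}) =
  \begin{cases}
    \mathcal{U}_N(\text{history}) & \text{if the shutdown button is not pressed,} \\
    \mathcal{U}_S(\text{history}) & \text{if the shutdown button is pressed.}
  \end{cases}
\]
% The open problem is choosing this combination so that the agent neither
% manipulates the button nor strips the behavior from its subsystems.
```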

Cited by 7 pages

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 60 KB
# Corrigibility

In AAAI Workshops: Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, TX, January 25–26, 2015. AAAI Publications.

Nate Soares, Benja Fallenstein, and Eliezer Yudkowsky
Machine Intelligence Research Institute
{nate,benja,eliezer}@intelligence.org

Stuart Armstrong
Future of Humanity Institute, University of Oxford
[stuart.armstrong@philosophy.ox.ac.uk](mailto:stuart.armstrong@philosophy.ox.ac.uk)

# Abstract

As artificially intelligent systems grow in intelligence and capability, some of their available options may allow them to resist intervention by their programmers. We call an AI system “corrigible” if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences. We introduce the notion of corrigibility and analyze utility functions that attempt to make an agent shut down safely if a shutdown button is pressed, while avoiding incentives to prevent the button from being pressed or cause the button to be pressed, and while ensuring propagation of the shutdown behavior as it creates new subsystems or self-modifies. While some proposals are interesting, none have yet been demonstrated to satisfy all of our intuitive desiderata, leaving this simple problem in corrigibility wide-open.
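
Restating the abstract's requirements as an explicit list may help; the items below paraphrase the abstract rather than the paper's precise formal statement (the fifth item is drawn from the full text, which asks that the agent otherwise behave as an ordinary maximizer of its normal objective).

```latex
% Desiderata for a corrigible shutdown utility function, paraphrased:
\begin{enumerate}
  \item Shut down safely when the shutdown button is pressed.
  \item Have no incentive to prevent the button from being pressed.
  \item Have no incentive to cause the button to be pressed.
  \item Propagate this shutdown behavior to any subsystems it creates
        and through any self-modification.
  \item Otherwise, pursue the normal objective as usual.
\end{enumerate}
```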

# 1 Introduction

As AI systems grow more intelligent and autonomous, it becomes increasingly important that they pursue the intended goals. As these goals grow more and more complex, it becomes increasingly unlikely that programmers would be able to specify them perfectly on the first try.

Contemporary AI systems are correctable in the sense that when a bug is discovered, one can simply stop the system and modify it arbitrarily; but once artificially intelligent systems reach and surpass human general intelligence, an AI system that is not behaving as intended might also have the ability to intervene against attempts to “pull the plug”.

Indeed, by default, a system constructed with what its programmers regard as erroneous goals would have an incentive to resist being corrected: general analysis of rational agents has suggested that almost all such agents are instrumentally motivated to preserve their preferences, and hence to resist attempts to modify them (Bostrom 2012; Yudkowsky 2008). Consider an agent maximizing the expectation of some utility function $\mathcal{U}$. In most cases, the agent’s current utility function $\mathcal{U}$ is better fulfilled if the agent continues to attempt to maximize $\mathcal{U}$ in the future, and so the agent is incentivized to preserve its own $\mathcal{U}$-maximizing behavior. In Stephen Omohundro’s terms, “goal-content integrity” is an instrumentally convergent goal of almost all intelligent agents (Omohundro 2008).
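
The goal-content-integrity argument in the paragraph above can be made explicit with a one-line expected-utility comparison. This is an editorial gloss on the reasoning, not a formula from the paper; $\pi_{\mathcal{U}}$ denotes a policy chosen to maximize expected $\mathcal{U}$.

```latex
% Sketch: let pi_U be a policy maximizing expected U, and pi_{U'} the
% policy the agent would follow after being modified to maximize some
% other utility function U'.
\[
  \mathbb{E}_{\pi_{\mathcal{U}}}\!\left[\mathcal{U}\right]
  \;\geq\;
  \mathbb{E}_{\pi_{\mathcal{U}'}}\!\left[\mathcal{U}\right]
  \qquad \text{for every alternative } \mathcal{U}'.
\]
% By definition of pi_U, no other policy scores higher on the agent's
% *current* utility function U, so (absent compensation) accepting the
% modification can only lower expected U. Preserving U is therefore
% instrumentally useful for almost any U, which is Omohundro's
% "goal-content integrity".
```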

This holds true even if an artificial agent’s programmers inte

... (truncated, 60 KB total)
Resource ID: 33c4da848ef72141 | Stable ID: ZjZkYTE3NW