Longterm Wiki

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: MIRI

Seminal MIRI paper that coined and formalized 'corrigibility' as a technical AI safety concept; widely cited as a foundational reference for human oversight and controllability research.

Metadata

Importance: 92/100 · Working paper · Primary source

Summary

This foundational 2015 MIRI paper by Soares, Fallenstein, Yudkowsky, and Armstrong introduces the formal concept of 'corrigibility'—the property of an AI system that cooperates with corrective interventions despite rational incentives to resist shutdown or preference modification. The paper analyzes utility function designs for safe shutdown behavior and finds no proposal fully satisfies all desiderata, framing corrigibility as an open research problem.

Key Points

  • Defines 'corrigibility' as an AI's disposition to cooperate with human corrective intervention despite instrumental incentives to resist shutdown or goal modification.
  • Explains why rational utility-maximizing agents default to resisting correction: preserving goal-content integrity is instrumentally convergent for nearly all agents.
  • Analyzes candidate utility functions for safe shutdown button behavior, requiring no incentive to press or prevent the button, plus propagation to subsystems (see the sketch after this list).
  • None of the analyzed proposals fully satisfy all intuitive desiderata, leaving corrigibility as an open and important technical problem.
  • Connects to Omohundro's instrumental convergence thesis and Bostrom's work on goal preservation as near-universal sub-goals of intelligent agents.
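
The shutdown-button setup referenced in the key points can be pictured as a two-part utility function. The sketch below is an editorial illustration of the general form the paper describes, not its exact formal definitions; the symbols $\mathcal{U}_N$ (normal-operation utility) and $\mathcal{U}_S$ (shutdown utility) are simplified notation.

```latex
% Sketch only: a combined utility function for the shutdown problem,
% where U_N rewards normal operation and U_S rewards shutting down
% safely once the shutdown button has been pressed.
\[
  \mathcal{U}(\text{history}) =
  \begin{cases}
    \mathcal{U}_N(\text{history}) & \text{if the shutdown button is not pressed,} \\
    \mathcal{U}_S(\text{history}) & \text{if the shutdown button is pressed.}
  \end{cases}
\]
% The open problem is choosing this combination so that the agent neither
% manipulates the button nor strips the behavior from its subsystems.
```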

Cited by 7 pages

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 60 KB
# Corrigibility

In AAAI Workshops: Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, TX, January 25–26, 2015. AAAI Publications.

Nate Soares, Benja Fallenstein, and Eliezer Yudkowsky
Machine Intelligence Research Institute
{nate,benja,eliezer}@intelligence.org

Stuart Armstrong
Future of Humanity Institute, University of Oxford
[stuart.armstrong@philosophy.ox.ac.uk](mailto:stuart.armstrong@philosophy.ox.ac.uk)

# Abstract

As artificially intelligent systems grow in intelligence and capability, some of their available options may allow them to resist intervention by their programmers. We call an AI system “corrigible” if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences. We introduce the notion of corrigibility and analyze utility functions that attempt to make an agent shut down safely if a shutdown button is pressed, while avoiding incentives to prevent the button from being pressed or cause the button to be pressed, and while ensuring propagation of the shutdown behavior as it creates new subsystems or self-modifies. While some proposals are interesting, none have yet been demonstrated to satisfy all of our intuitive desiderata, leaving this simple problem in corrigibility wide-open.
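
Restating the abstract's requirements as an explicit list may help; the items below paraphrase the abstract rather than the paper's precise formal statement (the fifth item is drawn from the full text, which asks that the agent otherwise behave as an ordinary maximizer of its normal objective).

```latex
% Desiderata for a corrigible shutdown utility function, paraphrased:
\begin{enumerate}
  \item Shut down safely when the shutdown button is pressed.
  \item Have no incentive to prevent the button from being pressed.
  \item Have no incentive to cause the button to be pressed.
  \item Propagate this shutdown behavior to any subsystems it creates
        and through any self-modification.
  \item Otherwise, pursue the normal objective as usual.
\end{enumerate}
```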

# 1 Introduction

As AI systems grow more intelligent and autonomous, it becomes increasingly important that they pursue the intended goals. As these goals grow more and more complex, it becomes increasingly unlikely that programmers would be able to specify them perfectly on the first try.

Contemporary AI systems are correctable in the sense that when a bug is discovered, one can simply stop the system and modify it arbitrarily; but once artificially intelligent systems reach and surpass human general intelligence, an AI system that is not behaving as intended might also have the ability to intervene against attempts to “pull the plug”.

Indeed, by default, a system constructed with what its programmers regard as erroneous goals would have an incentive to resist being corrected: general analysis of rational agents has suggested that almost all such agents are instrumentally motivated to preserve their preferences, and hence to resist attempts to modify them (Bostrom 2012; Yudkowsky 2008). Consider an agent maximizing the expectation of some utility function $\mathcal{U}$. In most cases, the agent’s current utility function $\mathcal{U}$ is better fulfilled if the agent continues to attempt to maximize $\mathcal{U}$ in the future, and so the agent is incentivized to preserve its own $\mathcal{U}$-maximizing behavior. In Stephen Omohundro’s terms, “goal-content integrity” is an instrumentally convergent goal of almost all intelligent agents (Omohundro 2008).
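
The goal-content-integrity argument in the paragraph above can be made explicit with a one-line expected-utility comparison. This is an editorial gloss on the reasoning, not a formula from the paper; $\pi_{\mathcal{U}}$ denotes a policy chosen to maximize expected $\mathcal{U}$.

```latex
% Sketch: let pi_U be a policy maximizing expected U, and pi_{U'} the
% policy the agent would follow after being modified to maximize some
% other utility function U'.
\[
  \mathbb{E}_{\pi_{\mathcal{U}}}\!\left[\mathcal{U}\right]
  \;\geq\;
  \mathbb{E}_{\pi_{\mathcal{U}'}}\!\left[\mathcal{U}\right]
  \qquad \text{for every alternative } \mathcal{U}'.
\]
% By definition of pi_U, no other policy scores higher on the agent's
% *current* utility function U, so (absent compensation) accepting the
% modification can only lower expected U. Preserving U is therefore
% instrumentally useful for almost any U, which is Omohundro's
% "goal-content integrity".
```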

This holds true even if an artificial agent’s programmers inte

... (truncated, 60 KB total)
Resource ID: 33c4da848ef72141 | Stable ID: ZjZkYTE3NW