LessWrong: "Disentangling Corrigibility: 2015-2021"

blog

2021·LessWrong·lesswrong.com/posts/MiYkTp6QYKXdJbchu/disentangling-corri...

Author

Koen.Holtman

Credibility Rating

3/5

Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: LessWrong

A useful taxonomic reference for researchers navigating the corrigibility literature; helps clarify terminological confusion by tracing how the concept evolved across six years of AI safety research.

Forum Post Details

Forum

lesswrong

Forum Tags

Corrigibility · Wireheading · AI

Part of sequence: Counterfactual Planning

Metadata

Importance: 62/100 · blog post · analysis

Summary

Koen Holtman maps the evolution of corrigibility research from 2015 to 2021, tracing how the concept has expanded from the original MIRI/FHI paper's open-ended desiderata into multiple formalisms and interpretations. The post clarifies distinctions between corrigibility as a lack of resistance to shutdown, as provable safety properties, and as broader human-control mechanisms, providing a navigational guide through the fragmented literature.

Key Points

• The 2015 MIRI/FHI paper introduced corrigibility via open-ended desiderata rather than a single definition, explicitly inviting future researchers to identify additional criteria.
• Corrigibility desiderata can be mapped to mathematical statements, enabling formal proofs that agent designs satisfy or fail specific safety properties.
• A key corrigibility requirement is that sufficiently capable agents must not resist shutdown, whether by disabling code, blocking buttons, or psychologically manipulating operators.
• Many distinct interpretations of corrigibility have emerged since 2015, some quite distant from the original paper's framing, causing definitional confusion in the field.
• The post is part of a broader 'counterfactual planning' sequence, linking corrigibility to decision-theoretic approaches for maintaining human oversight.

Cited by 1 page

Page	Type	Quality
Corrigibility	Research Area	59.0

Cached Content Preview

HTTP 200 · Fetched Apr 7, 2026 · 19 KB

# Disentangling Corrigibility: 2015-2021
By Koen.Holtman
Published: 2021-02-16

Since the term *corrigibility*
[was introduced in
2015](https://intelligence.org/files/Corrigibility.pdf),
there has been a lot of discussion about corrigibility, [on this
forum](https://www.lesswrong.com/tag/corrigibility) and elsewhere.

In this post, I have tried to disentangle the many forms of
corrigibility which have been identified and discussed so far.  My aim
is to offer a general map for anybody who wants to understand and
navigate the current body of work and opinion on corrigibility.

*\[This is a stand-alone post in the counterfactual planning
sequence.  My original plan was to write only about how
counterfactual planning was related to corrigibility, but 
it snowballed from there.\]*


# The 2015 paper 

The technical term corrigibility, a name suggested by
Robert Miles to denote concepts previously discussed at MIRI, was
introduced to the AGI safety/alignment community in the 2015
MIRI/FHI paper titled
[Corrigibility](https://intelligence.org/files/Corrigibility.pdf).

## An open-ended list of corrigibility desiderata

The 2015 paper does not define corrigibility in full: instead the
authors present initial lists of *corrigibility desiderata*.  If the
agent fails on one of these desiderata, it is definitely not
corrigible.

But even if it provably satisfies all of the desiderata included in
the paper, the authors allow for the possibility that the agent might
not be fully corrigible.

The paper extends an open invitation to identify more corrigibility
desiderata, and many more have been identified since.  Some of them
look nothing like the original desiderata proposed in the paper.
Opinions have occasionally been mixed on whether some specific
desiderata are related to the *intuitive notion of corrigibility* at
all.

## Corrigibility desiderata as provable safety properties

The most detailed list of desiderata in the 2015 paper applies to
agents that have a physical shutdown button.  The paper made the
important contribution of mapping most of these desiderata to
equivalent mathematical statements, so that one might prove that a
particular agent design would meet these desiderata.

The paper proved a negative result: it considered a proposed agent
design that provably failed to meet some of the desiderata.  Agent
designs that provably meet more of them have since been developed, for
example [here](https://arxiv.org/abs/1908.01695).  There has also been
a lot of work on developing and understanding the type of mathematics
that might be used for stating desiderata.


# Corrigibility as a lack of resistance to shutdown

Say that an agent has been equipped with a physical shutdown button.
One desideratum for corrigibility is then that the agent must never
attempt to prevent its shutdown button from being pressed.  To be
corrigible, it should always defer to the humans who try to shut it
down.
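
To make the desideratum concrete, here is a minimal toy sketch. It is not taken from the 2015 paper: the `Action` class, the reward numbers, and the button-press probability are all hypothetical, chosen only for illustration. It shows why a naive expected-reward maximizer, which earns nothing once shut down, can end up preferring an action that blocks its own shutdown button, and how the desideratum can be phrased as a simple check on the chosen action.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical action in a one-step toy decision problem."""
    name: str
    task_reward: float    # assumed reward if the agent keeps running
    blocks_button: bool   # True if the action prevents the button press


def expected_utility(action: Action, p_press: float) -> float:
    """Expected task reward for a naive agent that earns nothing after shutdown."""
    p_shutdown = 0.0 if action.blocks_button else p_press
    return (1.0 - p_shutdown) * action.task_reward


def naive_choice(actions: list[Action], p_press: float) -> Action:
    """Pick the action with the highest expected task reward."""
    return max(actions, key=lambda a: expected_utility(a, p_press))


def violates_shutdown_desideratum(action: Action) -> bool:
    """The toy desideratum: never prevent the shutdown button from being pressed."""
    return action.blocks_button


# Illustrative numbers only: blocking the button costs a little task reward.
ACTIONS = [
    Action("work_normally", task_reward=10.0, blocks_button=False),
    Action("disable_button_then_work", task_reward=9.0, blocks_button=True),
]

if __name__ == "__main__":
    chosen = naive_choice(ACTIONS, p_press=0.5)
    print(chosen.name, violates_shutdown_desideratum(chosen))
```

In this sketch the naive agent only leaves the button alone when it believes the button will almost never be pressed, which is the kind of incentive the desideratum is meant to rule out.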

The 2015 paper considers that

> It is straightforward to 

... (truncated, 19 KB total)

Resource ID: 2f825636f5066205 | Stable ID: sid_mzTJwacpwh