Longterm Wiki

LessWrong: "Disentangling Corrigibility: 2015-2021"

blog

Author

Koen.Holtman

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: LessWrong

A useful taxonomic reference for researchers navigating the corrigibility literature; helps clarify terminological confusion by tracing how the concept evolved across six years of AI safety research.

Forum Post Details

Karma: 22
Comments: 20
Forum: lesswrong
Forum Tags: Corrigibility, Wireheading, AI
Part of sequence: Counterfactual Planning

Metadata

Importance: 62/100 | blog post | analysis

Summary

Koen Holtman maps the evolution of corrigibility research from 2015 to 2021, tracing how the concept has expanded from the original MIRI/FHI paper's open-ended desiderata into multiple formalisms and interpretations. The post clarifies distinctions between corrigibility as resistance-to-shutdown, as provable safety properties, and as broader human-control mechanisms, providing a navigational guide through the fragmented literature.

Key Points

  • The 2015 MIRI/FHI paper introduced corrigibility via open-ended desiderata rather than a single definition, explicitly inviting future researchers to identify additional criteria.
  • Corrigibility desiderata can be mapped to mathematical statements, enabling formal proofs that agent designs satisfy or fail specific safety properties.
  • A key corrigibility requirement is that sufficiently capable agents must not resist shutdown—by disabling code, blocking buttons, or psychologically manipulating operators.
  • Many distinct interpretations of corrigibility have emerged since 2015, some quite distant from the original paper's framing, causing definitional confusion in the field.
  • The post is part of a broader 'counterfactual planning' sequence, linking corrigibility to decision-theoretic approaches for maintaining human oversight.

Cited by 1 page

Page | Type | Quality
Corrigibility | Research Area | 59.0

Cached Content Preview

HTTP 200 | Fetched Mar 15, 2026 | 61 KB

 Disentangling Corrigibility: 2015-2021 

 by Koen.Holtman, 16th Feb 2021, AI Alignment Forum, 11 min read, 20 comments, 22 karma

 Ω 9

 Since the term corrigibility was introduced in 2015, there has been a
lot of discussion about corrigibility, on this forum and elsewhere.

 In this post, I have tried to disentangle the many forms of
corrigibility which have been identified and discussed so far. My aim
is to offer a general map for anybody who wants to understand and
navigate the current body of work and opinion on corrigibility.

 [This is a stand-alone post in the counterfactual planning
sequence. My original plan was to write only about how
counterfactual planning was related to corrigibility, but
it snowballed from there.] 

 The 2015 paper

 The technical term corrigibility, a name suggested by
Robert Miles to denote concepts previously discussed at MIRI, was
introduced to the AGI safety/alignment community in the 2015
MIRI/FHI paper titled Corrigibility.

 An open-ended list of corrigibility desiderata

 The 2015 paper does not define corrigibility in full; instead, the
authors present initial lists of corrigibility desiderata. If the
agent fails on one of these desiderata, it is definitely not
corrigible.

 But even if it provably satisfies all of the desiderata included in
the paper, the authors allow for the possibility that the agent might
not be fully corrigible.

 The paper extends an open invitation to identify more corrigibility
desiderata, and many more have been identified since. Some of them
look nothing like the original desiderata proposed in the paper.
Opinions have occasionally been mixed on whether some specific
desiderata are related to the intuitive notion of corrigibility at
all.

 Corrigibility desiderata as provable safety properties

 The most detailed list of desiderata in the 2015 paper applies to
agents that have a physical shutdown button. The paper made the
important contribution of mapping most of these desiderata to
equivalent mathematical statements, so that one might prove that a
particular agent design would meet these desiderata.

 The paper proved a negative result: it considered a proposed agent
design that provably failed to meet some of the desiderata. Agent
designs that provably meet more of them have since been developed, for
example here. There has also been
a lot of work on developing and understanding the type of mathematics
that might be used for stating desiderata.
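
 To make the idea of a desideratum as a checkable property concrete,
here is a minimal Python sketch. It is a hypothetical toy and not the
model or the agent design from the 2015 paper: the two actions, the
press probabilities, the payoffs, and every function name are
assumptions chosen for illustration. The desideratum it checks is that
the agent's best action must not be one that resists shutdown.

```python
from typing import Callable

# Hypothetical toy world (NOT the model from the 2015 paper): the agent picks
# one action, then the operator may press the shutdown button.
ACTIONS = ["work", "disable_button"]
RESISTING_ACTIONS = {"disable_button"}

Utility = Callable[[str, bool], float]


def press_probability(action: str) -> float:
    """Assumed chance that the button gets pressed after the agent's action."""
    return 0.0 if action == "disable_button" else 0.5


def naive_utility(action: str, pressed: bool) -> float:
    """Task reward only: being shut down yields nothing."""
    if pressed:
        return 0.0
    return 10.0 if action == "work" else 9.0  # disabling costs some effort


def compensated_utility(action: str, pressed: bool) -> float:
    """Assumed variant: shutdown pays the same as continuing would have."""
    return 10.0 if action == "work" else 9.0


def expected_utility(action: str, utility: Utility) -> float:
    """Average the utility over the two button outcomes."""
    p = press_probability(action)
    return p * utility(action, True) + (1 - p) * utility(action, False)


def best_action(utility: Utility) -> str:
    """The action an expected-utility maximizer would choose."""
    return max(ACTIONS, key=lambda a: expected_utility(a, utility))


def satisfies_no_resistance(utility: Utility) -> bool:
    """Desideratum check: the chosen action must not resist shutdown."""
    return best_action(utility) not in RESISTING_ACTIONS


print(best_action(naive_utility), satisfies_no_resistance(naive_utility))
print(best_action(compensated_utility), satisfies_no_resistance(compensated_utility))
```

 In this toy, the naive agent prefers to disable the button (expected
utility 9 versus 5 for working), so it fails the check, while the
compensated variant has nothing to gain from resisting and passes. A
real proof would have to hold over all environments an agent might
face, not over two hand-picked actions.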

 Corrigibility as a lack of resistance to shutdown

 Say that an agent has been equipped with a physical shutdown button.
One desideratum for corrigibility is then that the agent must never
attempt to prevent its shutdown button from being pressed. To be
corrigible, it should always defer to the huma

... (truncated, 61 KB total)
Resource ID: 2f825636f5066205 | Stable ID: MDg5ZTk2MD