CIRL corrigibility proved fragile
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: MIRI
This MIRI blog post summarizes Carey's 2017 paper critiquing the robustness of CIRL-based corrigibility, directly responding to Hadfield-Menell et al.'s Off-Switch Game and relevant to debates about whether value uncertainty alone is sufficient for safe shutdown compliance.
Metadata
Summary
Ryan Carey's paper demonstrates that the corrigibility guarantees of Cooperative Inverse Reinforcement Learning (CIRL) are fragile under realistic conditions, presenting four scenarios in which model mis-specification or reward-function errors remove an agent's incentive to follow shutdown commands. The paper argues that corrigibility guarantees should rest on weaker, more verifiable assumptions rather than requiring an entirely error-free prior and reward function.
Key Points
- CIRL's shutdown compliance relies on strong assumptions of a bug-free prior and reward function, which are unrealistic in practice.
- Four scenarios are presented in which CIRL systems violate Soares et al.'s (2015) corrigibility conditions due to model mis-specification.
- The Off-Switch Game's guarantees hold only under idealized assumptions; real implementations built on heuristics and approximations undermine corrigibility.
- Carey argues shutdown mechanisms should work as a last resort under minimal, verifiable assumptions rather than relying on system-wide correctness.
- Value learning frameworks face fundamental difficulties in providing robust corrigibility guarantees without strong and potentially unjustifiable assumptions.
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Cooperative IRL (CIRL) | Approach | 65.0 |
| Instrumental Convergence | Risk | 64.0 |
Cached Content Preview
# New paper: “Incorrigibility in the CIRL Framework”
- [August 31, 2017](https://intelligence.org/2017/08/31/)
- [Matthew Gray](https://intelligence.org/author/vaniver/)
MIRI assistant research fellow Ryan Carey has a new paper out discussing situations where good performance in [Cooperative Inverse Reinforcement Learning](https://arxiv.org/abs/1606.03137) (CIRL) tasks fails to imply that software agents will assist or cooperate with programmers.
The paper, titled “**[Incorrigibility in the CIRL Framework](https://arxiv.org/abs/1709.06275)**,” lays out four scenarios in which CIRL violates the four conditions for _corrigibility_ defined in [Soares et al. (2015)](https://intelligence.org/files/Corrigibility.pdf). Abstract:
> A value learning system has incentives to follow shutdown instructions, assuming the shutdown instruction provides information (in the technical sense) about which actions lead to valuable outcomes. However, this assumption is not robust to model mis-specification (e.g., in the case of programmer errors). We demonstrate this by presenting some Supervised POMDP scenarios in which errors in the parameterized reward function remove the incentive to follow shutdown commands. These difficulties parallel those discussed by Soares et al. (2015) in their paper on corrigibility.
>
> We argue that it is important to consider systems that follow shutdown commands under some weaker set of assumptions (e.g., that one small verified module is correctly implemented; as opposed to an entire prior probability distribution and/or parameterized reward function). We discuss some difficulties with simple ways to attempt to attain these sorts of guarantees in a value learning framework.
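To make the abstract's core claim concrete, here is a minimal Bayes-rule sketch (not the paper's Supervised POMDP model; the hypothesis names and likelihood values below are invented for illustration). If the agent's prior assigns zero probability to the hypothesis under which shutdown is the right call, the shutdown command carries no information and the incentive to comply disappears:

```python
# Toy Bayes-rule illustration of the abstract's claim (not the paper's
# Supervised POMDP model). The agent decides whether to comply with a
# shutdown command; complying is optimal only under the hypothesis
# "act_is_harmful", and the command is evidence for that hypothesis.

# Assumed likelihoods of the human pressing the off switch:
P_PRESS = {"act_is_harmful": 0.95, "act_is_fine": 0.05}

def posterior_harmful(prior_harmful):
    """P(act_is_harmful | off switch pressed), by Bayes' rule."""
    num = P_PRESS["act_is_harmful"] * prior_harmful
    den = num + P_PRESS["act_is_fine"] * (1.0 - prior_harmful)
    return num / den

# Well-specified prior: the command is informative, so the agent complies.
print(posterior_harmful(0.3))  # ~0.89 -> shut down

# Mis-specified prior: zero mass on "shutdown is the right call".
# Bayesian updating cannot recover, the command carries no information,
# and the incentive to comply disappears.
print(posterior_harmful(0.0))  # 0.0 -> ignore the command
```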
The paper responds to “[The Off-Switch Game](https://arxiv.org/abs/1611.08219),” by Hadfield-Menell, Dragan, Abbeel, and Russell. Hadfield-Menell et al. show that an AI system will be more responsive to human inputs when it is uncertain about its reward function and believes that its human operator has more information about that reward function. Carey shows that the CIRL framework can be used to formalize the problem of corrigibility, and that the known assurances for CIRL systems, given in “The Off-Switch Game,” rely on strong assumptions about having an error-free CIRL system. Under less idealized assumptions, a value learning agent may hold beliefs that cause it to evade redirection by the human.
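The expected-value comparison driving Hadfield-Menell et al.'s result, and the fragility Carey points to, can be sketched in a few lines of Monte Carlo. This is a toy rendering, not the papers' formal model; the option names, payoffs, and belief distributions are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo sketch of the Off-Switch Game's expected-value argument
# (Hadfield-Menell et al. 2017). The robot is uncertain about the human's
# utility u for a proposed action:
#   act:   take the action directly            -> payoff u
#   off:   switch itself off                   -> payoff 0
#   defer: propose the action; a rational
#          human permits it iff u > 0          -> payoff max(u, 0)

def best_option(u_samples):
    values = {
        "act": u_samples.mean(),                     # E[u]
        "off": 0.0,
        "defer": np.maximum(u_samples, 0.0).mean(),  # E[max(u, 0)]
    }
    return max(values, key=values.get), values

# Genuine uncertainty about the sign of u: deferring weakly dominates,
# since E[max(u, 0)] >= max(E[u], 0), so the robot stays responsive.
print(best_option(rng.normal(0.2, 1.0, 100_000)))

# Near-certain belief that u > 0 (possibly wrong): max(u, 0) ~ u on
# essentially every sample, the advantage of deferring vanishes, and
# with it the incentive to accept shutdown -- the fragility at issue.
print(best_option(rng.normal(0.2, 0.01, 100_000)))
```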
> \[T\]he purpose of a shutdown button is to shut the AI system down _in the event that all other assurances failed_, e.g., in the event that the AI system is ignoring (for one reason or another) the instructions of the operators. If the designers of \[the AI syst
... (truncated, 6 KB total)