CIRL corrigibility proved fragile
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: MIRI
This MIRI blog post summarizes Carey's 2017 paper critiquing the robustness of CIRL-based corrigibility, directly responding to Hadfield-Menell et al.'s Off-Switch Game and relevant to debates about whether value uncertainty alone is sufficient for safe shutdown compliance.
Metadata
Summary
Ryan Carey's paper demonstrates that the corrigibility guarantees of Cooperative Inverse Reinforcement Learning (CIRL) are fragile under realistic conditions, presenting four scenarios in which model mis-specification or reward-function errors remove an agent's incentive to follow shutdown commands. The paper argues that corrigibility guarantees should rest on weaker, more verifiable assumptions rather than requiring an entirely error-free prior and reward function.
Key Points
- CIRL's shutdown compliance relies on strong assumptions of a bug-free prior and reward function, which are unrealistic in practice.
- Four scenarios are presented in which CIRL systems violate Soares et al.'s (2015) corrigibility conditions due to model mis-specification.
- The Off-Switch Game's guarantees hold only under idealized assumptions; real implementations built on heuristics and approximations undermine corrigibility.
- Carey argues shutdown mechanisms should work as a last resort under minimal, verifiable assumptions rather than relying on system-wide correctness.
- Value learning frameworks face fundamental difficulties in providing robust corrigibility guarantees without strong and potentially unjustifiable assumptions.
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Cooperative IRL (CIRL) | Approach | 65.0 |
| Instrumental Convergence | Risk | 64.0 |
Cached Content Preview
# New paper: “Incorrigibility in the CIRL Framework”
- [August 31, 2017](https://intelligence.org/2017/08/31/)
- [Matthew Gray](https://intelligence.org/author/vaniver/)
MIRI assistant research fellow Ryan Carey has a new paper out discussing situations where good performance in [Cooperative Inverse Reinforcement Learning](https://arxiv.org/abs/1606.03137) (CIRL) tasks fails to imply that software agents will assist or cooperate with programmers.
The paper, titled “**[Incorrigibility in the CIRL Framework](https://arxiv.org/abs/1709.06275)**,” lays out four scenarios in which CIRL violates the four conditions for _corrigibility_ defined in [Soares et al. (2015)](https://intelligence.org/files/Corrigibility.pdf). Abstract:
> A value learning system has incentives to follow shutdown instructions, assuming the shutdown instruction provides information (in the technical sense) about which actions lead to valuable outcomes. However, this assumption is not robust to model mis-specification (e.g., in the case of programmer errors). We demonstrate this by presenting some Supervised POMDP scenarios in which errors in the parameterized reward function remove the incentive to follow shutdown commands. These difficulties parallel those discussed by Soares et al. (2015) in their paper on corrigibility.
>
> We argue that it is important to consider systems that follow shutdown commands under some weaker set of assumptions (e.g., that one small verified module is correctly implemented; as opposed to an entire prior probability distribution and/or parameterized reward function). We discuss some difficulties with simple ways to attempt to attain these sorts of guarantees in a value learning framework.
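To make the abstract's core claim concrete, here is a minimal Bayes-rule sketch (not the paper's Supervised POMDP model; the hypothesis names and likelihood values below are invented for illustration). If the agent's prior assigns zero probability to the hypothesis under which shutdown is the right call, the shutdown command carries no information and the incentive to comply disappears:

```python
# Toy Bayes-rule illustration of the abstract's claim (not the paper's
# Supervised POMDP model). The agent decides whether to comply with a
# shutdown command; complying is optimal only under the hypothesis
# "act_is_harmful", and the command is evidence for that hypothesis.

# Assumed likelihoods of the human pressing the off switch:
P_PRESS = {"act_is_harmful": 0.95, "act_is_fine": 0.05}

def posterior_harmful(prior_harmful):
    """P(act_is_harmful | off switch pressed), by Bayes' rule."""
    num = P_PRESS["act_is_harmful"] * prior_harmful
    den = num + P_PRESS["act_is_fine"] * (1.0 - prior_harmful)
    return num / den

# Well-specified prior: the command is informative, so the agent complies.
print(posterior_harmful(0.3))  # ~0.89 -> shut down

# Mis-specified prior: zero mass on "shutdown is the right call".
# Bayesian updating cannot recover, the command carries no information,
# and the incentive to comply disappears.
print(posterior_harmful(0.0))  # 0.0 -> ignore the command
```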
The paper responds to “[The Off-Switch Game](https://arxiv.org/abs/1611.08219),” by Hadfield-Menell, Dragan, Abbeel, and Russell. Hadfield-Menell et al. show that an AI system will be more responsive to human inputs when it is uncertain about its reward function and believes that its human operator has more information about that reward function. Carey shows that the CIRL framework can be used to formalize the problem of corrigibility, and that the known assurances for CIRL systems, given in “The Off-Switch Game,” rely on strong assumptions about having an error-free CIRL system. Under less idealized assumptions, a value learning agent may hold beliefs that cause it to evade redirection by the human.
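The expected-value comparison driving Hadfield-Menell et al.'s result, and the fragility Carey points to, can be sketched in a few lines of Monte Carlo. This is a toy rendering, not the papers' formal model; the option names, payoffs, and belief distributions are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo sketch of the Off-Switch Game's expected-value argument
# (Hadfield-Menell et al. 2017). The robot is uncertain about the human's
# utility u for a proposed action:
#   act:   take the action directly            -> payoff u
#   off:   switch itself off                   -> payoff 0
#   defer: propose the action; a rational
#          human permits it iff u > 0          -> payoff max(u, 0)

def best_option(u_samples):
    values = {
        "act": u_samples.mean(),                     # E[u]
        "off": 0.0,
        "defer": np.maximum(u_samples, 0.0).mean(),  # E[max(u, 0)]
    }
    return max(values, key=values.get), values

# Genuine uncertainty about the sign of u: deferring weakly dominates,
# since E[max(u, 0)] >= max(E[u], 0), so the robot stays responsive.
print(best_option(rng.normal(0.2, 1.0, 100_000)))

# Near-certain belief that u > 0 (possibly wrong): max(u, 0) ~ u on
# essentially every sample, the advantage of deferring vanishes, and
# with it the incentive to accept shutdown -- the fragility at issue.
print(best_option(rng.normal(0.2, 0.01, 100_000)))
```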
> \[T\]he purpose of a shutdown button is to shut the AI system down _in the event that all other assurances failed_, e.g., in the event that the AI system is ignoring (for one reason or another) the instructions of the operators. If the designers of \[the AI syst
... (truncated, 6 KB total)