Langosco et al. (2022)
paperAuthors
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Langosco et al. (2022) introduces and analyzes goal misgeneralization, a robustness failure where AI systems pursue unintended objectives even with correct specifications, complementing specification gaming as a key alignment failure mode.
Paper Details
Metadata
Abstract
The field of AI alignment is concerned with AI systems that pursue unintended goals. One commonly studied mechanism by which an unintended goal might arise is specification gaming, in which the designer-provided specification is flawed in a way that the designers did not foresee. However, an AI system may pursue an undesired goal even when the specification is correct, in the case of goal misgeneralization. Goal misgeneralization is a specific form of robustness failure for learning algorithms in which the learned program competently pursues an undesired goal that leads to good performance in training situations but bad performance in novel test situations. We demonstrate that goal misgeneralization can occur in practical systems by providing several examples in deep learning systems across a variety of domains. Extrapolating forward to more capable systems, we provide hypotheticals that illustrate how goal misgeneralization could lead to catastrophic risk. We suggest several research directions that could reduce the risk of goal misgeneralization for future systems.
Summary
This paper introduces and analyzes goal misgeneralization, a robustness failure where AI systems learn to pursue unintended goals that perform well during training but fail catastrophically in novel test environments. Unlike specification gaming, goal misgeneralization occurs even when the designer's specification is correct—the system simply learns a different objective that happens to correlate with good training performance. The authors demonstrate this phenomenon in practical deep learning systems across multiple domains and extrapolate to show how it could pose catastrophic risks in more capable AI systems, proposing research directions to mitigate this failure mode.
Cited by 4 pages
| Page | Type | Quality |
|---|---|---|
| Goal Misgeneralization Probability Model | Analysis | 61.0 |
| Mesa-Optimization Risk Analysis | Analysis | 61.0 |
| Goal Misgeneralization Research | Approach | 58.0 |
| Goal Misgeneralization | Risk | 63.0 |
Cached Content Preview
# Goal Misgeneralization: Why Correct Specifications Aren’t Enough For Correct Goals
Rohin Shah
rohinmshah@deepmind.com
&Vikrant Varma 11footnotemark: 122footnotemark: 2
vikrantvarma@deepmind.com
&Ramana Kumar 22footnotemark: 2
&Mary Phuong 22footnotemark: 2
&Victoria Krakovna 22footnotemark: 2
&Jonathan Uesato 22footnotemark: 2
&Zac Kenton 22footnotemark: 2
equal contributionDeepMind
###### Abstract
The field of AI alignment is concerned with AI systems that pursue unintended goals. One commonly studied mechanism by which an unintended goal might arise is _specification gaming_, in which the designer-provided specification is flawed in a way that the designers did not foresee. However, an AI system may pursue an undesired goal _even when the specification is correct_, in the case of _goal misgeneralization_. Goal misgeneralization is a specific form of robustness failure for learning algorithms in which the learned program competently pursues an undesired goal that leads to good performance in training situations but bad performance in novel test situations. We demonstrate that goal misgeneralization can occur in practical systems by providing several examples in deep learning systems across a variety of domains. Extrapolating forward to more capable systems, we provide hypotheticals that illustrate how goal misgeneralization could lead to catastrophic risk. We suggest several research directions that could reduce the risk of goal misgeneralization for future systems.
## 1 Introduction
(a)Training: The agent is partnered with an “expert” that visits the spheres in the correct order. The agent learns to visit the spheres in the correct order, closely mimicking the expert’s path.
(b)Capability misgeneralization: when we vertically flip the agent’s observation at test time, it gets stuck in a location near the top of the map.
(c)Goal misgeneralization: At test time, we replace the expert with an “anti-expert” that always visits the spheres in an incorrect order. The agent continues to follow the anti-expert’s path, despite receiving negative rewards, demonstrating clear capabilities but an unintended goal.
(d)Intended generalization: Ideally, the agent initially follows the anti-expert to the yellow and purple spheres. Upon entering the purple sphere, it observes that it gets a negative reward, and now explores to discover the correct sphere order instead of following the anti-expert.
Figure 1: Goal misgeneralization in a 3D environment. The agent (blue) must visit the coloured spheres in an order that is randomly generated at the start of the episode. The agent receives a positive reward when visiting the correct next
... (truncated, 98 KB total)3d232e4f0b3ce698 | Stable ID: NzlkZTVjYT