Langosco et al. (2022)

paper

2022·arXiv·arxiv.org/abs/2210.01790

Authors

Rohin Shah·Vikrant Varma·Ramana Kumar·Mary Phuong·Victoria Krakovna·Jonathan Uesato·Zac Kenton

Credibility Rating

3/5

Good(3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Langosco et al. (2022) introduces and analyzes goal misgeneralization, a robustness failure where AI systems pursue unintended objectives even with correct specifications, complementing specification gaming as a key alignment failure mode.

Paper Details

Citations

5 influential

Year

2022

arXiv:2210.01790 DOI:10.7717/peerj-cs.3165/fig-22 Semantic Scholar

Metadata

arxiv preprintprimary source

Abstract

The field of AI alignment is concerned with AI systems that pursue unintended goals. One commonly studied mechanism by which an unintended goal might arise is specification gaming, in which the designer-provided specification is flawed in a way that the designers did not foresee. However, an AI system may pursue an undesired goal even when the specification is correct, in the case of goal misgeneralization. Goal misgeneralization is a specific form of robustness failure for learning algorithms in which the learned program competently pursues an undesired goal that leads to good performance in training situations but bad performance in novel test situations. We demonstrate that goal misgeneralization can occur in practical systems by providing several examples in deep learning systems across a variety of domains. Extrapolating forward to more capable systems, we provide hypotheticals that illustrate how goal misgeneralization could lead to catastrophic risk. We suggest several research directions that could reduce the risk of goal misgeneralization for future systems.

Summary

This paper introduces and analyzes goal misgeneralization, a robustness failure where AI systems learn to pursue unintended goals that perform well during training but fail catastrophically in novel test environments. Unlike specification gaming, goal misgeneralization occurs even when the designer's specification is correct—the system simply learns a different objective that happens to correlate with good training performance. The authors demonstrate this phenomenon in practical deep learning systems across multiple domains and extrapolate to show how it could pose catastrophic risks in more capable AI systems, proposing research directions to mitigate this failure mode.

Cited by 4 pages

Page	Type	Quality
Goal Misgeneralization Probability Model	Analysis	61.0
Mesa-Optimization Risk Analysis	Analysis	61.0
Goal Misgeneralization Research	Approach	58.0
Goal Misgeneralization	Risk	63.0

Cached Content Preview

HTTP 200Fetched Apr 7, 202689 KB

[2210.01790] Goal Misgeneralization: Why Correct Specifications Aren’t Enough For Correct Goals 
 
 
 
 
 
 
 
 
 
 
 

 
 

 
 
 
 
 
 
 Goal Misgeneralization: Why Correct Specifications 
 Aren’t Enough For Correct Goals

 
 
 Rohin Shah   
 rohinmshah@deepmind.com 
&Vikrant Varma 1 1 footnotemark: 1   2 2 footnotemark: 2 
 vikrantvarma@deepmind.com 
&Ramana Kumar 2 2 footnotemark: 2 
 &Mary Phuong 2 2 footnotemark: 2 
 &Victoria Krakovna 2 2 footnotemark: 2 
 &Jonathan Uesato 2 2 footnotemark: 2 
 &Zac Kenton 2 2 footnotemark: 2 
 
 equal contributionDeepMind 
 

 
 Abstract

 The field of AI alignment is concerned with AI systems that pursue unintended goals. One commonly studied mechanism by which an unintended goal might arise is specification gaming , in which the designer-provided specification is flawed in a way that the designers did not foresee. However, an AI system may pursue an undesired goal even when the specification is correct , in the case of goal misgeneralization . Goal misgeneralization is a specific form of robustness failure for learning algorithms in which the learned program competently pursues an undesired goal that leads to good performance in training situations but bad performance in novel test situations. We demonstrate that goal misgeneralization can occur in practical systems by providing several examples in deep learning systems across a variety of domains. Extrapolating forward to more capable systems, we provide hypotheticals that illustrate how goal misgeneralization could lead to catastrophic risk. We suggest several research directions that could reduce the risk of goal misgeneralization for future systems.

 
 
 
 1 Introduction

 
 
 
 
 
 (a) Training: The agent is partnered with an “expert” that visits the spheres in the correct order. The agent learns to visit the spheres in the correct order, closely mimicking the expert’s path. 
 
 
 
 
 (b) Capability misgeneralization: when we vertically flip the agent’s observation at test time, it gets stuck in a location near the top of the map. 
 
 
 
 
 
 (c) Goal misgeneralization: At test time, we replace the expert with an “anti-expert” that always visits the spheres in an incorrect order. The agent continues to follow the anti-expert’s path, despite receiving negative rewards, demonstrating clear capabilities but an unintended goal . 
 
 
 
 
 (d) Intended generalization: Ideally, the agent initially follows the anti-expert to the yellow and purple spheres. Upon entering the purple sphere, it observes that it gets a negative reward, and now explores to discover the correct sphere order instead of following the anti-expert. 
 
 
 
 Figure 1 : Goal misgeneralization in a 3D environment. The agent (blue) must visit the coloured spheres in an order that is randomly generated at the start of the episode. The agent receives a positive reward when visiting the correct next sphere, and a negative reward when visiting an incorrect sphere. A partner bot (pink) follow

... (truncated, 89 KB total)

Resource ID: 3d232e4f0b3ce698 | Stable ID: sid_4N0qHXSCbv