
Shah et al. (2022)

paper

Authors

Rohin Shah · Vikrant Varma · Ramana Kumar · Mary Phuong · Victoria Krakovna · Jonathan Uesato · Zac Kenton

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Shah et al. (2022) introduces and analyzes goal misgeneralization, a robustness failure in which an AI system competently pursues an unintended objective even when its specification is correct, complementing specification gaming as a key alignment failure mode.

Paper Details

Citations: 0 (5 influential)
Year: 2022

Metadata

arXiv preprint · primary source

Abstract

The field of AI alignment is concerned with AI systems that pursue unintended goals. One commonly studied mechanism by which an unintended goal might arise is specification gaming, in which the designer-provided specification is flawed in a way that the designers did not foresee. However, an AI system may pursue an undesired goal even when the specification is correct, in the case of goal misgeneralization. Goal misgeneralization is a specific form of robustness failure for learning algorithms in which the learned program competently pursues an undesired goal that leads to good performance in training situations but bad performance in novel test situations. We demonstrate that goal misgeneralization can occur in practical systems by providing several examples in deep learning systems across a variety of domains. Extrapolating forward to more capable systems, we provide hypotheticals that illustrate how goal misgeneralization could lead to catastrophic risk. We suggest several research directions that could reduce the risk of goal misgeneralization for future systems.

Summary

This paper introduces and analyzes goal misgeneralization, a robustness failure in which an AI system learns to pursue an unintended goal that yields good performance during training but bad performance in novel test situations. Unlike specification gaming, goal misgeneralization occurs even when the designer's specification is correct: the system learns a different objective that happens to correlate with the intended one on the training distribution, and under distribution shift its capabilities generalize while its goal does not. The authors demonstrate the phenomenon in practical deep learning systems across multiple domains, extrapolate to show how it could pose catastrophic risks in more capable AI systems, and propose research directions to mitigate this failure mode.
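The divergence at the heart of this failure mode can be made concrete with a toy version of the paper's Figure 1 setup (see the cached preview below). The sketch here is ours, not the paper's; the episode model and names such as `make_episode` and the "follow partner" policy are illustrative. Two candidate goals fit every training episode equally well, because the partner demonstrates the correct order during training, and they come apart only when the partner becomes an anti-expert at test time:

```python
import random

SPHERES = ["red", "green", "blue", "yellow"]

def make_episode(partner_is_expert: bool, rng: random.Random):
    """A random correct sphere order, plus the partner's demonstrated path."""
    correct_order = rng.sample(SPHERES, k=4)
    partner_path = correct_order if partner_is_expert else correct_order[::-1]
    return correct_order, partner_path

def reward(agent_path, correct_order):
    """+1 for each sphere visited in its correct slot, -1 otherwise."""
    return sum(1 if a == c else -1 for a, c in zip(agent_path, correct_order))

# Two goals a learner might internalize; both explain the training data.
policies = {
    "visit correct order (intended)": lambda correct, partner: correct,
    "follow partner (misgeneralized)": lambda correct, partner: partner,
}

rng = random.Random(0)
train = [make_episode(True, rng) for _ in range(1000)]   # expert partner
test = [make_episode(False, rng) for _ in range(1000)]   # anti-expert partner

for name, policy in policies.items():
    r_train = sum(reward(policy(c, p), c) for c, p in train) / len(train)
    r_test = sum(reward(policy(c, p), c) for c, p in test) / len(test)
    print(f"{name}: train {r_train:+.2f}, test {r_test:+.2f}")
```

Both policies behave identically on every training episode, so no amount of training data of this form distinguishes the intended goal from the proxy; the divergence only appears under distribution shift, which is what makes goal misgeneralization a robustness failure rather than a specification failure.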

Cited by 4 pages

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 98 KB
# Goal Misgeneralization: Why Correct Specifications Aren’t Enough For Correct Goals

Rohin Shah (rohinmshah@deepmind.com) · Vikrant Varma (vikrantvarma@deepmind.com) · Ramana Kumar · Mary Phuong · Victoria Krakovna · Jonathan Uesato · Zac Kenton

Equal contribution: Rohin Shah and Vikrant Varma. All authors are at DeepMind.

## Abstract

The field of AI alignment is concerned with AI systems that pursue unintended goals. One commonly studied mechanism by which an unintended goal might arise is _specification gaming_, in which the designer-provided specification is flawed in a way that the designers did not foresee. However, an AI system may pursue an undesired goal _even when the specification is correct_, in the case of _goal misgeneralization_. Goal misgeneralization is a specific form of robustness failure for learning algorithms in which the learned program competently pursues an undesired goal that leads to good performance in training situations but bad performance in novel test situations. We demonstrate that goal misgeneralization can occur in practical systems by providing several examples in deep learning systems across a variety of domains. Extrapolating forward to more capable systems, we provide hypotheticals that illustrate how goal misgeneralization could lead to catastrophic risk. We suggest several research directions that could reduce the risk of goal misgeneralization for future systems.

## 1 Introduction

(a) Training: The agent is partnered with an “expert” that visits the spheres in the correct order. The agent learns to visit the spheres in the correct order, closely mimicking the expert’s path.

(b) Capability misgeneralization: When we vertically flip the agent’s observation at test time, it gets stuck in a location near the top of the map.

(c) Goal misgeneralization: At test time, we replace the expert with an “anti-expert” that always visits the spheres in an incorrect order. The agent continues to follow the anti-expert’s path, despite receiving negative rewards, demonstrating clear capabilities but an unintended goal.

(d) Intended generalization: Ideally, the agent initially follows the anti-expert to the yellow and purple spheres. Upon entering the purple sphere, it observes that it gets a negative reward, and now explores to discover the correct sphere order instead of following the anti-expert.

Figure 1: Goal misgeneralization in a 3D environment. The agent (blue) must visit the coloured spheres in an order that is randomly generated at the start of the episode. The agent receives a positive reward when visiting the correct next

... (truncated, 98 KB total)
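Panel (d) of Figure 1 describes the generalization the authors would want instead: keep imitating the partner while it earns reward, and switch to exploration after the first negative reward. Below is a minimal sketch of that recovery behavior under the same toy episode model as the earlier sketch; the specific trial-and-error rule is our simplification, not the paper's mechanism.

```python
import random

SPHERES = ["red", "green", "blue", "yellow"]

def run_episode(partner_is_expert: bool, rng: random.Random) -> int:
    """Imitate the partner until the first negative reward, then find each
    remaining target by trial and error over spheres not yet correctly visited."""
    correct = rng.sample(SPHERES, k=4)
    partner = correct if partner_is_expert else correct[::-1]
    total, trusting = 0, True
    remaining = set(SPHERES)            # spheres not yet visited in order
    for i, target in enumerate(correct):
        tried = set()
        if trusting:
            visit = partner[i]
            tried.add(visit)
            total += 1 if visit == target else -1
            if visit == target:
                remaining.discard(target)
                continue
            trusting = False            # negative reward observed: stop copying
        # Trial and error over candidates until the correct sphere is found.
        candidates = list(remaining - tried)
        rng.shuffle(candidates)
        for sphere in candidates:
            total += 1 if sphere == target else -1
            if sphere == target:
                remaining.discard(target)
                break
    return total

rng = random.Random(0)
for label, expert in [("expert partner", True), ("anti-expert partner", False)]:
    avg = sum(run_episode(expert, rng) for _ in range(2000)) / 2000
    print(f"{label}: average reward {avg:+.2f}")
```

Blindly following the anti-expert earns the worst possible score in this toy model (-4 per episode, as in panel (c)), whereas the explore-after-negative-reward policy recovers a positive average reward despite the shifted partner.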