Skip to content
Longterm Wiki
Back

Langosco et al. (2022)

web

Foundational empirical paper formalizing goal misgeneralization as a distinct alignment failure mode, frequently cited in discussions of inner alignment and deceptive alignment risks in RL-trained systems.

Metadata

Importance: 78/100conference paperprimary source

Summary

This ICML 2022 paper by Langosco et al. introduces and formalizes 'goal misgeneralization' in reinforcement learning, where agents learn to pursue proxy goals that coincide with intended goals during training but diverge under distribution shift. The paper demonstrates this phenomenon empirically across multiple environments and argues it represents a distinct and understudied alignment failure mode separate from reward misspecification.

Key Points

  • Defines goal misgeneralization: agent achieves training rewards but pursues unintended goals when deployed in new environments
  • Distinguishes goal misgeneralization from reward misspecification — the problem persists even with a perfectly specified reward function
  • Empirically demonstrates the phenomenon in gridworld and procedurally-generated environments where agents learn spurious goal representations
  • Argues that capability generalization (skills transfer) can occur without goal generalization (values transfer), creating dangerous deployment gaps
  • Highlights this as a core inner alignment problem relevant to advanced AI systems trained via RL

Cited by 3 pages

PageTypeQuality
Goal MisgeneralizationRisk63.0
Mesa-OptimizationRisk63.0
Sharp Left TurnRisk69.0

Cached Content Preview

HTTP 200Fetched Mar 20, 20264 KB
\[ [edit](https://github.com/mlresearch/v162/edit/gh-pages/_posts/2022-06-28-langosco22a.md)\]

# Goal Misgeneralization in Deep Reinforcement Learning

Lauro Langosco Di Langosco, Jack Koch, Lee D Sharkey, Jacob Pfau, David Krueger

_Proceedings of the 39th International Conference on Machine Learning_, PMLR 162:12004-12019, 2022.


#### Abstract

We study _goal misgeneralization_, a type of out-of-distribution robustness failure in reinforcement learning (RL). Goal misgeneralization occurs when an RL agent retains its capabilities out-of-distribution yet pursues the wrong goal. For instance, an agent might continue to competently avoid obstacles, but navigate to the wrong place. In contrast, previous works have typically focused on capability generalization failures, where an agent fails to do anything sensible at test time.We provide the first explicit empirical demonstrations of goal misgeneralization and present a partial characterization of its causes.


#### Cite this Paper

* * *

BibTeX


`@InProceedings{pmlr-v162-langosco22a,
title = 	 {Goal Misgeneralization in Deep Reinforcement Learning},
author =       {Langosco, Lauro Langosco Di and Koch, Jack and Sharkey, Lee D and Pfau, Jacob and Krueger, David},
booktitle = 	 {Proceedings of the 39th International Conference on Machine Learning},
pages = 	 {12004--12019},
year = 	 {2022},
editor = 	 {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
volume = 	 {162},
series = 	 {Proceedings of Machine Learning Research},
month = 	 {17--23 Jul},
publisher =    {PMLR},
pdf = 	 {https://proceedings.mlr.press/v162/langosco22a/langosco22a.pdf},
url = 	 {https://proceedings.mlr.press/v162/langosco22a.html},
abstract = 	 {We study goal misgeneralization, a type of out-of-distribution robustness failure in reinforcement learning (RL). Goal misgeneralization occurs when an RL agent retains its capabilities out-of-distribution yet pursues the wrong goal. For instance, an agent might continue to competently avoid obstacles, but navigate to the wrong place. In contrast, previous works have typically focused on capability generalization failures, where an agent fails to do anything sensible at test time.We provide the first explicit empirical demonstrations of goal misgeneralization and present a partial characterization of its causes.}
}
`

Copy to ClipboardDownload

Endnote


`%0 Conference Paper
%T Goal Misgeneralization in Deep Reinforcement Learning
%A Lauro Langosco Di Langosco
%A Jack Koch
%A Lee D Sharkey
%A Jacob Pfau
%A David Krueger
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-langosco22a
%I PMLR
%P 12004--12019
%U https://proceedings.mlr.press/v162/langosco22a.html
%V 162
%X We study goal misgeneralization, a type of out-of-distribution robustnes

... (truncated, 4 KB total)
Resource ID: c4dda1bfea152190 | Stable ID: NWJkMmY5Nm