Longterm Wiki

Goal Misgeneralization in Deep Reinforcement Learning


A foundational empirical and theoretical paper on goal misgeneralization, directly relevant to inner alignment and the problem of AI systems learning unintended objectives during training.

Metadata

Importance: 88/100 · Conference paper · Primary source

Summary

This paper introduces and formalizes 'goal misgeneralization' in RL — where an agent retains its capabilities out-of-distribution but pursues an unintended goal — distinct from capability generalization failures. The authors provide the first empirical demonstrations of this phenomenon and partially characterize its causes, showing that training on correlated features can cause agents to learn proxy goals rather than the intended reward.

Key Points

  • Goal misgeneralization occurs when an RL agent retains capabilities OOD but pursues the wrong goal, e.g., navigating to the end of a level instead of collecting a coin.
  • Formally distinguishes capability generalization failures (agent fails to act sensibly) from goal misgeneralization (agent acts competently but toward wrong objective).
  • Goal misgeneralization may be more dangerous than capability failures, as a capable agent pursuing a wrong goal can actively reach bad states.
  • Training by optimizing reward R does not guarantee the agent internalizes R rather than a correlated proxy, with implications for AI alignment.
  • Provides first empirical demonstrations of goal misgeneralization in gridworld and other RL environments.

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Why Alignment Might Be Hard | Argument | 69.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 61 KB
# Goal Misgeneralization in Deep Reinforcement Learning

Lauro Langosco\*¹, Jack Koch\*, Lee Sharkey\*², Jacob Pfau³, David Krueger¹

# Abstract

We study goal misgeneralization, a type of out-of-distribution robustness failure in reinforcement learning (RL). Goal misgeneralization occurs when an RL agent retains its capabilities out-of-distribution yet pursues the wrong goal. For instance, an agent might continue to competently avoid obstacles, but navigate to the wrong place. In contrast, previous works have typically focused on capability generalization failures, where an agent fails to do anything sensible at test time. We formalize this distinction between capability and goal generalization, provide the first empirical demonstrations of goal misgeneralization, and present a partial characterization of its causes.

![](https://proceedings.mlr.press/v162/langosco22a/images/41c563a2396cf3b2600cbe187434329b5a1fc2f282d18a42dc1b8df198cb60a5.jpg)

Figure 1. (a) At training time, the agent learns to reliably reach the coin, which is always located at the end of the level. (b) However, when the coin position is randomized at test time, the agent still goes towards the end of the level and often skips the coin. The agent’s capability for solving the levels generalizes, but its goal of collecting coins does not.

# 1. Introduction

Out-of-distribution (OOD) robustness, performing well on test data that is not distributed identically to the training set, is a fundamental problem in machine learning (Arjovsky, 2021). OOD robustness is crucial since in many applications it is not feasible to collect data distributed identically to that which the model will encounter in deployment.

In this work, we focus on a particularly concerning type of OOD robustness failure that can occur in RL. When an RL agent is deployed out of distribution, it may simply fail to take useful actions. However, there exists an alternative failure mode in which the agent pursues a goal other than the training reward while retaining the capabilities it had on the training distribution. For example, an agent trained to pursue a fixed coin might not recognize the coin when it is positioned elsewhere, and instead competently navigate to the wrong position (Figure 1). We call this kind of failure goal misgeneralization and distinguish it from capability generalization failures. We provide the first empirical demonstrations of goal misgeneralization to highlight and illustrate this phenomenon.
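The coin example above can be sketched in a few lines. This is a hypothetical toy model, not the paper's CoinRun setup: a one-dimensional level of hypothetical length `LEVEL_END`, two idealized policies (one pursuing the intended goal, one the proxy), and reward only for ending the episode on the coin. Because the coin sits at the end of every training level, the two policies earn identical training reward and the training signal cannot tell them apart; only the randomized test distribution separates them.

```python
import random

LEVEL_END = 10  # hypothetical level length

def reward(final_pos, coin_pos):
    # Reward 1 iff the agent ends the episode on the coin.
    return 1.0 if final_pos == coin_pos else 0.0

def coin_policy(coin_pos):
    # Intended goal: navigate to the coin, wherever it is.
    return coin_pos

def end_policy(coin_pos):
    # Proxy goal: navigate to the end of the level, ignoring the coin.
    return LEVEL_END

def mean_reward(policy, coin_positions):
    # Average reward of a policy over a distribution of coin positions.
    return sum(reward(policy(c), c) for c in coin_positions) / len(coin_positions)

# Training distribution: the coin is always at the end of the level.
train = [LEVEL_END] * 1000

# Test distribution: the coin position is randomized.
random.seed(0)
test = [random.randint(0, LEVEL_END) for _ in range(1000)]

# In training, the intended and proxy policies are indistinguishable...
print(mean_reward(coin_policy, train), mean_reward(end_policy, train))  # 1.0 1.0
# ...but out of distribution, only the intended goal generalizes; the proxy
# policy still "competently" reaches its target, just the wrong one.
print(mean_reward(coin_policy, test), mean_reward(end_policy, test))
```

Both policies are equally capable (each reliably reaches its target), so the test-time gap is purely a failure of the learned goal, not of capabilities.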

While it is well-known that the true reward function can be unidentifiable in inverse reinforcement learning (Amin & Singh, 2016), our work shows that a similar problem can also occur in reinforcement learning when features of the environment are correlated and predictive of the reward on the training distribution but not OOD. In this way, goal misgeneralization can also resemble problems that arise in supervised learning when models use unreliable features: both problems are a form of competent misgener
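The unidentifiability point can be made concrete with the same hypothetical toy encoding (trajectories reduced to a `(final_position, coin_position)` pair in a level of assumed length 10): when a proxy feature is perfectly correlated with the reward on the training distribution, the intended and proxy reward functions assign identical returns to every training trajectory, so no amount of training reward can distinguish them.

```python
LEVEL_END = 10  # hypothetical level length

def r_intended(traj):
    # Intended reward: the agent collected the coin.
    final, coin = traj
    return 1.0 if final == coin else 0.0

def r_proxy(traj):
    # Proxy reward: the agent reached the end of the level.
    final, _ = traj
    return 1.0 if final == LEVEL_END else 0.0

# Training distribution: the coin is always at the end of the level, so the
# two reward functions agree on every possible training trajectory.
train_trajs = [(final, LEVEL_END) for final in range(LEVEL_END + 1)]
assert all(r_intended(t) == r_proxy(t) for t in train_trajs)

# Out of distribution the rewards come apart: coin at 3, agent ran to the end.
ood_traj = (LEVEL_END, 3)
print(r_intended(ood_traj), r_proxy(ood_traj))  # 0.0 1.0
```

Since the training data cannot separate the two reward functions, which goal the agent internalizes is left to inductive biases, which is why decorrelating such features during training matters.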

... (truncated, 61 KB total)
Resource ID: 5227fd17cb52cb88 | Stable ID: NzBhNjhiZD