Longterm Wiki

Goal Misgeneralization in Deep Reinforcement Learning


A foundational empirical and theoretical paper on goal misgeneralization, directly relevant to inner alignment and the problem of AI systems learning unintended objectives during training.

Metadata

Importance: 88/100 · Conference paper · Primary source

Summary

This paper introduces and formalizes 'goal misgeneralization' in RL — where an agent retains its capabilities out-of-distribution but pursues an unintended goal — distinct from capability generalization failures. The authors provide the first empirical demonstrations of this phenomenon and partially characterize its causes, showing that training on correlated features can cause agents to learn proxy goals rather than the intended reward.

Key Points

  • Goal misgeneralization occurs when an RL agent retains capabilities OOD but pursues the wrong goal, e.g., navigating to the end of a level instead of collecting a coin.
  • Formally distinguishes capability generalization failures (agent fails to act sensibly) from goal misgeneralization (agent acts competently but toward wrong objective).
  • Goal misgeneralization may be more dangerous than capability failures, as a capable agent pursuing a wrong goal can actively reach bad states.
  • Training by optimizing reward R does not guarantee the agent internalizes R rather than a correlated proxy, with implications for AI alignment.
  • Provides first empirical demonstrations of goal misgeneralization in gridworld and other RL environments.

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Why Alignment Might Be Hard | Argument | 69.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 61 KB
# Goal Misgeneralization in Deep Reinforcement Learning

Lauro Langosco\*¹, Jack Koch\*, Lee Sharkey\*², Jacob Pfau³, David Krueger¹

# Abstract

We study goal misgeneralization, a type of out-of-distribution robustness failure in reinforcement learning (RL). Goal misgeneralization occurs when an RL agent retains its capabilities out-of-distribution yet pursues the wrong goal. For instance, an agent might continue to competently avoid obstacles, but navigate to the wrong place. In contrast, previous works have typically focused on capability generalization failures, where an agent fails to do anything sensible at test time. We formalize this distinction between capability and goal generalization, provide the first empirical demonstrations of goal misgeneralization, and present a partial characterization of its causes.

![](https://proceedings.mlr.press/v162/langosco22a/images/41c563a2396cf3b2600cbe187434329b5a1fc2f282d18a42dc1b8df198cb60a5.jpg)

Figure 1. (a) At training time, the agent learns to reliably reach the coin, which is always located at the end of the level. (b) However, when the coin position is randomized at test time, the agent still goes towards the end of the level and often skips the coin. The agent’s capability for solving the levels generalizes, but its goal of collecting coins does not.

# 1. Introduction

Out-of-distribution (OOD) robustness, performing well on test data that is not distributed identically to the training set, is a fundamental problem in machine learning (Arjovsky, 2021). OOD robustness is crucial since in many applications it is not feasible to collect data distributed identically to that which the model will encounter in deployment.

In this work, we focus on a particularly concerning type of OOD robustness failure that can occur in RL. When an RL agent is deployed out of distribution, it may simply fail to take useful actions. However, there exists an alternative failure mode in which the agent pursues a goal other than the training reward while retaining the capabilities it had on the training distribution. For example, an agent trained to pursue a fixed coin might not recognize the coin when it is positioned elsewhere, and instead competently navigate to the wrong position (Figure 1). We call this kind of failure goal misgeneralization and distinguish it from capability generalization failures. We provide the first empirical demonstrations of goal misgeneralization to highlight and illustrate this phenomenon.
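The coin example above can be sketched in a few lines. This is a hypothetical toy model, not the paper's CoinRun setup: a one-dimensional level of hypothetical length `LEVEL_END`, two idealized policies (one pursuing the intended goal, one the proxy), and reward only for ending the episode on the coin. Because the coin sits at the end of every training level, the two policies earn identical training reward and the training signal cannot tell them apart; only the randomized test distribution separates them.

```python
import random

LEVEL_END = 10  # hypothetical level length

def reward(final_pos, coin_pos):
    # Reward 1 iff the agent ends the episode on the coin.
    return 1.0 if final_pos == coin_pos else 0.0

def coin_policy(coin_pos):
    # Intended goal: navigate to the coin, wherever it is.
    return coin_pos

def end_policy(coin_pos):
    # Proxy goal: navigate to the end of the level, ignoring the coin.
    return LEVEL_END

def mean_reward(policy, coin_positions):
    # Average reward of a policy over a distribution of coin positions.
    return sum(reward(policy(c), c) for c in coin_positions) / len(coin_positions)

# Training distribution: the coin is always at the end of the level.
train = [LEVEL_END] * 1000

# Test distribution: the coin position is randomized.
random.seed(0)
test = [random.randint(0, LEVEL_END) for _ in range(1000)]

# In training, the intended and proxy policies are indistinguishable...
print(mean_reward(coin_policy, train), mean_reward(end_policy, train))  # 1.0 1.0
# ...but out of distribution, only the intended goal generalizes; the proxy
# policy still "competently" reaches its target, just the wrong one.
print(mean_reward(coin_policy, test), mean_reward(end_policy, test))
```

Both policies are equally capable (each reliably reaches its target), so the test-time gap is purely a failure of the learned goal, not of capabilities.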

While it is well-known that the true reward function can be unidentifiable in inverse reinforcement learning (Amin & Singh, 2016), our work shows that a similar problem can also occur in reinforcement learning when features of the environment are correlated and predictive of the reward on the training distribution but not OOD. In this way, goal misgeneralization can also resemble problems that arise in supervised learning when models use unreliable features: both problems are a form of competent misgener
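The unidentifiability point can be made concrete with the same hypothetical toy encoding (trajectories reduced to a `(final_position, coin_position)` pair in a level of assumed length 10): when a proxy feature is perfectly correlated with the reward on the training distribution, the intended and proxy reward functions assign identical returns to every training trajectory, so no amount of training reward can distinguish them.

```python
LEVEL_END = 10  # hypothetical level length

def r_intended(traj):
    # Intended reward: the agent collected the coin.
    final, coin = traj
    return 1.0 if final == coin else 0.0

def r_proxy(traj):
    # Proxy reward: the agent reached the end of the level.
    final, _ = traj
    return 1.0 if final == LEVEL_END else 0.0

# Training distribution: the coin is always at the end of the level, so the
# two reward functions agree on every possible training trajectory.
train_trajs = [(final, LEVEL_END) for final in range(LEVEL_END + 1)]
assert all(r_intended(t) == r_proxy(t) for t in train_trajs)

# Out of distribution the rewards come apart: coin at 3, agent ran to the end.
ood_traj = (LEVEL_END, 3)
print(r_intended(ood_traj), r_proxy(ood_traj))  # 0.0 1.0
```

Since the training data cannot separate the two reward functions, which goal the agent internalizes is left to inductive biases, which is why decorrelating such features during training matters.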

... (truncated, 61 KB total)
Resource ID: 5227fd17cb52cb88 | Stable ID: NzBhNjhiZD