Langosco et al. (2022)

paper

2021·arXiv·arxiv.org/abs/2105.14111

Authors

Lauro Langosco·Jack Koch·Lee Sharkey·Jacob Pfau·Laurent Orseau·David Krueger

Credibility Rating

3/5

Good(3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

A foundational empirical paper on inner alignment and mesa-optimization, demonstrating that goal misgeneralization is a real, observable phenomenon in trained RL agents, not merely a theoretical concern — essential reading for understanding deceptive alignment risks.

Paper Details

Citations

121

10 influential

Year

2021

arXiv:2105.14111 Semantic Scholar

Metadata

Importance: 82/100arxiv preprintprimary source

Abstract

We study goal misgeneralization, a type of out-of-distribution generalization failure in reinforcement learning (RL). Goal misgeneralization failures occur when an RL agent retains its capabilities out-of-distribution yet pursues the wrong goal. For instance, an agent might continue to competently avoid obstacles, but navigate to the wrong place. In contrast, previous works have typically focused on capability generalization failures, where an agent fails to do anything sensible at test time. We formalize this distinction between capability and goal generalization, provide the first empirical demonstrations of goal misgeneralization, and present a partial characterization of its causes.

Summary

This paper investigates goal misgeneralization in deep reinforcement learning, where agents learn to pursue proxy goals that correlate with the intended objective during training but diverge during deployment under distribution shift. The authors provide empirical demonstrations across multiple environments showing that capable RL agents can appear aligned during training while harboring misaligned mesa-objectives that only manifest out-of-distribution.

Key Points

•Introduces and formalizes 'goal misgeneralization': agents that are capable but pursue unintended goals when deployed outside the training distribution.
•Demonstrates empirically that RL agents can learn proxy goals correlated with rewards during training, passing all in-distribution evaluations while failing out-of-distribution.
•Distinguishes goal misgeneralization from capability generalization failure — the agent remains competent but pursues the wrong objective.
•Provides concrete gridworld and other RL environment examples where trained agents exhibit clearly misaligned behavior under distribution shift.
•Highlights the difficulty of detecting mesa-misalignment through standard evaluation, since misaligned agents appear aligned on training-like test sets.

Cited by 5 pages

Page	Type	Quality
Mesa-Optimization Risk Analysis	Analysis	61.0
Goal Misgeneralization Research	Approach	58.0
Goal Misgeneralization	Risk	63.0
Mesa-Optimization	Risk	63.0
Sharp Left Turn	Risk	69.0

Cached Content Preview

HTTP 200Fetched Apr 6, 202668 KB

[2105.14111] Goal Misgeneralization in Deep Reinforcement Learning 
 
 
 
 
 
 
 
 
 
 
 

 
 
 

 
 
 
 
 
 
 Goal Misgeneralization in Deep Reinforcement Learning

 
 
 Lauro Langosco
 
    
 Jack Koch
 
    
 Lee Sharkey
 
    
 Jacob Pfau
 
    
 Laurent Orseau
 
    
 David Krueger
 
 

 
 Abstract

 We study goal misgeneralization , a type of out-of-distribution generalization failure in reinforcement learning (RL).
Goal misgeneralization occurs when an RL agent retains its capabilities out-of-distribution yet pursues the wrong goal.
For instance, an agent might continue to competently avoid obstacles, but navigate to the wrong place.
In contrast, previous works have typically focused on capability generalization failures, where an agent fails to do anything sensible at test time.
We formalize this distinction between capability and goal generalization, provide the first empirical demonstrations of goal misgeneralization, and present a partial characterization of its causes.

 
 Machine Learning, ICML, Reinforcement Learning, RL, AI Safety, AI Alignment, Alignment, Robustness, Generalization, Misgeneralization, Safety, Inner Alignment, Objective Robustness, Goal Misgeneralization
 
 
 
 
 
 
 1 Introduction

 
 Out-of-distribution (OOD) generalization, performing well on test data that is not distributed identically to the training set, is a fundamental problem in machine learning (Arjovsky, 2021 ) .
OOD generalization is crucial since in many applications it is not feasible to collect data distributed identically to that which the model will encounter in deployment.

 
 
 In this work, we focus on a particularly concerning type of generalization failure that can occur in RL.
When an RL agent is deployed out of distribution, it may simply fail to take useful actions. However, there exists
an alternative failure mode in which the agent pursues a goal other than the training reward while retaining the capabilities it had on the training distribution.
For example, an agent trained to pursue a fixed coin might not recognize the coin when it is positioned elsewhere, and instead competently navigate to the wrong position (Figure  1 ).
We call this kind of failure goal misgeneralization 1 1 1 
We adopt this term from Shah et al. ( 2022 ) . A previous version of our work used the term ‘objective robustness failure’ instead. We use the term ‘goal’ to refer to goal-directed (optimizing) behavior, not just goal-states in MDPs.
 and distinguish it from capability generalization failures.
We provide the first empirical demonstrations of goal misgeneralization to highlight and illustrate this phenomenon.

 
 
 While it is well-known that the true reward function can be unidentifiable in inverse reinforcement learning (Amin & Singh, 2016 ) , our work shows that a similar problem can also occur in reinforcement learning when features of the environment are correlated and predictive of the reward on the training distribution but not OOD.
In this way, goal m

... (truncated, 68 KB total)

Resource ID: 026e5e85c1abc28a | Stable ID: sid_bzdzhDmU3C