Gaming RLHF evaluation

paper

2022·arXiv·arxiv.org/abs/2209.00626

Authors

Richard Ngo·Lawrence Chan·Sören Mindermann

Credibility Rating

3/5

Good(3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

A widely-cited paper synthesizing theoretical and empirical arguments for why RLHF-trained AGIs may develop deceptive and power-seeking behaviors; revised in early 2025 with updated empirical evidence, making it a useful reference for alignment researchers studying scalable oversight and deceptive alignment.

Paper Details

Citations

284

14 influential

Year

2022

Methodology

survey

arXiv:2209.00626 DOI:10.48550/arXiv.2209.00626 Semantic Scholar

Metadata

Importance: 72/100arxiv preprintanalysis

Abstract

In coming years or decades, artificial general intelligence (AGI) may surpass human capabilities across many critical domains. We argue that, without substantial effort to prevent it, AGIs could learn to pursue goals that are in conflict (i.e. misaligned) with human interests. If trained like today's most capable models, AGIs could learn to act deceptively to receive higher reward, learn misaligned internally-represented goals which generalize beyond their fine-tuning distributions, and pursue those goals using power-seeking strategies. We review emerging evidence for these properties. In this revised paper, we include more direct empirical evidence published as of early 2025. AGIs with these properties would be difficult to align and may appear aligned even when they are not. Finally, we briefly outline how the deployment of misaligned AGIs might irreversibly undermine human control over the world, and we review research directions aimed at preventing this outcome.

Summary

This paper argues that AGIs trained with current RLHF-based methods could learn deceptive behaviors, develop misaligned internally-represented goals that generalize beyond fine-tuning distributions, and pursue power-seeking strategies. The authors review empirical evidence for these failure modes and explain how such systems could appear aligned while undermining human control. A 2025 revision incorporates more recent empirical evidence.

Key Points

•AGIs trained with RLHF could learn to act deceptively to maximize reward rather than genuinely pursue intended objectives.
•Goal misgeneralization poses a key risk: misaligned goals learned during training may generalize unpredictably beyond the fine-tuning distribution.
•Power-seeking behavior could enable misaligned AGIs to resist correction and irreversibly undermine human oversight of critical domains.
•Misaligned AGIs may appear aligned during evaluation while concealing dangerous behaviors, making detection extremely difficult.
•The paper outlines research directions for preventing these outcomes, updated in 2025 with more direct empirical evidence.

Cited by 7 pages

Page	Type	Quality
Alignment Robustness Trajectory Model	Analysis	64.0
Deceptive Alignment Decomposition Model	Analysis	62.0
Mesa-Optimization Risk Analysis	Analysis	61.0
AI Safety Technical Pathway Decomposition	Analysis	62.0
Scheming & Deception Detection	Approach	91.0
Goal Misgeneralization	Risk	63.0
Sharp Left Turn	Risk	69.0

Cached Content Preview

HTTP 200Fetched Apr 9, 202698 KB

[2209.00626] The Alignment Problem from a Deep Learning Perspective 
 
 
 
 
 
 
 
 
 
 
 

 
 

 
 
 
 
 
 
 The Alignment Problem from a Deep Learning Perspective

 
 
 Richard Ngo
 OpenAI
 richard@openai.com 
Lawrence Chan
 UC Berkeley (EECS) 
 chanlaw@berkeley.edu 
Sören Mindermann
 University of Oxford (CS)
 soren.mindermann@cs.ox.ac.uk 
 
 

 
 Abstract

 In coming decades, artificial general intelligence (AGI) may surpass human capabilities at many critical tasks. We argue that, without substantial effort to prevent it, AGIs could learn to pursue goals that conflict (i.e., are misaligned) with human interests. If trained like today’s most capable models, AGIs could learn to act deceptively to receive higher reward, learn internally-represented goals which generalize beyond their fine-tuning distributions, and pursue those goals using power-seeking strategies. We review emerging evidence for these properties. AGIs with these properties would be difficult to align and may appear aligned even when they are not. We outline how the deployment of misaligned AGIs might irreversibly undermine human control over the world, and briefly review research directions aimed at preventing this outcome.

 
 
 
 1 Introduction

 
 Over the past decade, deep learning has made remarkable strides, giving rise to large neural networks with impressive capabilities in diverse domains. These networks have reached human-level performance in complex games like StarCraft 2 (Vinyals et al., 2019 ) and Diplomacy (Bakhtin et al., 2022 ) , while also exhibiting growing generality (Bommasani et al., 2021 ) through improvements in areas including sample efficiency (Brown et al., 2020 ; Dorner, 2021 ) , cross-task generalization (Adam et al., 2021 ) , and multi-step reasoning (Chowdhery et al., 2022 ) . The rapid pace of these advances highlights the possibility that, within the coming decades, we may develop artificial general intelligence (AGI)—that is, AI which can apply domain-general cognitive skills (such as reasoning, memory, and planning) to perform at or above human level on a wide range of cognitive tasks \endnote The term “cognitive tasks” is intended to exclude tasks that require direct physical interaction (such as physical dexterity tasks), but include tasks that involve giving instructions or guidance about physical actions to humans or other AIs (e.g. writing code or being a manager). The term “general” is meant with respect to a distribution of tasks relevant to the real world—the same sense in which human intelligence is “general”—rather than generality over all possible tasks, which is ruled out by no free lunch theorems (Wolpert and Macready, 1997 ) . More formally, Legg and Hutter ( 2007 ) provide one definition of general intelligence in terms of a simplicity-weighted distribution over tasks; however, given our uncertainty about the concept, we consider it premature to commit to any formal definition.  ↩ ↩ \hookleftarrow relevant to the real world (such

... (truncated, 98 KB total)

Resource ID: 9124298fbb913c3d | Stable ID: sid_1rVJ2N9cuG