[1706.03741] Deep Reinforcement Learning from Human Preferences
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Seminal 2017 paper by Christiano et al. (OpenAI/DeepMind) that introduced the RLHF framework now central to aligning large language models; essential reading for anyone studying scalable oversight and human feedback methods.
Paper Details
Metadata
Summary
This paper introduces a method for training RL agents using human feedback on pairs of trajectory segments rather than explicit reward functions, enabling complex behaviors to be learned from a small number of human comparisons. The approach was demonstrated on Atari games and simulated robotics tasks, showing that agents can learn sophisticated behaviors with approximately 900 human comparisons. This work is foundational to the development of RLHF (Reinforcement Learning from Human Feedback) used in modern AI alignment.
Key Points
- Trains RL agents by learning a reward model from human comparisons between short trajectory clips, bypassing the need for explicit reward engineering.
- Demonstrates scalability: achieves competitive Atari performance and novel MuJoCo locomotion behaviors with roughly 900 human labels.
- Improves sample efficiency by querying humans about the most informative trajectory pairs, and mitigates reward hacking by continuing to train the reward model as the policy improves.
- Foundational precursor to RLHF as used in InstructGPT, ChatGPT, and other aligned language models.
- Highlights that human oversight of AI behavior is feasible even for complex tasks where humans cannot easily specify reward functions.
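The reward model at the heart of the method is fit to the human comparisons with a Bradley-Terry-style preference predictor: the probability that a human prefers one clip is a softmax over the summed predicted rewards of the two clips, and the model is trained by cross-entropy against the human labels. A minimal NumPy sketch of that loss, with illustrative function and variable names (not from the paper's code):

```python
import numpy as np

def preference_loss(r_hat_1, r_hat_2, mu):
    """Cross-entropy loss for a Bradley-Terry-style preference predictor.

    r_hat_1, r_hat_2: arrays of predicted per-step rewards r_hat(s, a)
        for the two trajectory segments shown to the human.
    mu: human label in [0, 1] -- 1.0 if segment 1 was preferred,
        0.0 if segment 2 was, 0.5 for "equally good".
    """
    R1 = float(np.sum(r_hat_1))
    R2 = float(np.sum(r_hat_2))
    # P[segment 1 preferred] = exp(R1) / (exp(R1) + exp(R2)),
    # computed in log space for numerical stability.
    log_z = np.logaddexp(R1, R2)
    log_p1 = R1 - log_z
    log_p2 = R2 - log_z
    return -(mu * log_p1 + (1.0 - mu) * log_p2)
```

Minimizing this loss over the dataset of labeled clip pairs fits the reward model; the fitted reward is then used in place of a hand-specified one to train the policy with a standard RL algorithm.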
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Why Alignment Might Be Hard | Argument | 69.0 |
| Reward Modeling | Approach | 55.0 |
Cached Content Preview
Deep Reinforcement Learning
from Human Preferences
Paul F Christiano, OpenAI, paul@openai.com
Jan Leike, DeepMind, leike@google.com
Tom B Brown, nottombrown@gmail.com
Miljan Martic, DeepMind, miljanm@google.com
Shane Legg, DeepMind, legg@google.com
Dario Amodei, OpenAI, damodei@openai.com
Abstract
For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments,
we need to communicate complex goals to these systems.
In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments.
We show that this approach can effectively solve complex RL tasks without access to the reward function,
including Atari games and simulated robot locomotion,
while providing feedback on less than 1% of our agent’s interactions with the environment.
This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems.
To demonstrate the flexibility of our approach,
we show that we can successfully train complex novel behaviors
with about an hour of human time.
These behaviors and environments are considerably more complex
than any which have been previously learned from human feedback.
1 Introduction
Recent success in scaling reinforcement learning (RL) to large problems
has been driven in domains that have a well-specified reward function (Mnih et al., 2015, 2016; Silver et al., 2016).
Unfortunately, many tasks involve goals that are complex, poorly-defined, or hard to specify.
Overcoming this limitation would greatly expand the possible impact of deep RL
and could increase the reach of machine learning more broadly.
For example, suppose that we wanted to use reinforcement learning
to train a robot to clean a table or scramble an egg.
It’s not clear how to construct a suitable reward function,
which will need to be a function of the robot’s sensors.
We could try to design a simple reward function that approximately captures the intended behavior,
but this will often result in behavior that optimizes our reward function without actually
satisfying our preferences.
This difficulty underlies recent concerns about misalignment
between our values and the objectives of our RL systems (Bostrom, 2014; Russell, 2016; Amodei et al., 2016).
If we could successfully communicate our actual objectives to our agents,
it would be a significant step towards addressing these concerns.
If we have demonstrations of the desired task,
we can extract a reward function using inverse reinforcement learning (Ng and Russell, 2000).
This reward function can then be used to train an agent with reinforcement learning.
More directly, we can use imitation learning
to clone the demonstrated behavior.
However, these approaches are not directly applicable
to behaviors that are
... (truncated, 63 KB total)