[1706.03741] Deep Reinforcement Learning from Human Preferences
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Seminal 2017 paper by Christiano et al. (OpenAI/DeepMind) that introduced the RLHF framework now central to aligning large language models; essential reading for anyone studying scalable oversight and human feedback methods.
Paper Details
Metadata
Summary
This paper introduces a method for training RL agents using human feedback on pairs of trajectory segments rather than explicit reward functions, enabling complex behaviors to be learned from a small number of human comparisons. The approach was demonstrated on Atari games and simulated robotics tasks, showing that agents can learn sophisticated behaviors with approximately 900 human comparisons. This work is foundational to the development of RLHF (Reinforcement Learning from Human Feedback) used in modern AI alignment.
Key Points
- Trains RL agents by learning a reward model from human comparisons between short trajectory clips, bypassing the need for explicit reward engineering.
- Demonstrates scalability: achieves competitive Atari performance and novel MuJoCo locomotion behaviors with roughly 900 human labels.
- Improves sample efficiency by querying humans about the most informative trajectory pairs, and mitigates reward hacking by continuing to train the reward model as the policy improves.
- Foundational precursor to RLHF as used in InstructGPT, ChatGPT, and other aligned language models.
- Highlights that human oversight of AI behavior is feasible even for complex tasks where humans cannot easily specify reward functions.
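The reward model at the heart of the method is fit to the human comparisons with a Bradley-Terry-style preference predictor: the probability that a human prefers one clip is a softmax over the summed predicted rewards of the two clips, and the model is trained by cross-entropy against the human labels. A minimal NumPy sketch of that loss, with illustrative function and variable names (not from the paper's code):

```python
import numpy as np

def preference_loss(r_hat_1, r_hat_2, mu):
    """Cross-entropy loss for a Bradley-Terry-style preference predictor.

    r_hat_1, r_hat_2: arrays of predicted per-step rewards r_hat(s, a)
        for the two trajectory segments shown to the human.
    mu: human label in [0, 1] -- 1.0 if segment 1 was preferred,
        0.0 if segment 2 was, 0.5 for "equally good".
    """
    R1 = float(np.sum(r_hat_1))
    R2 = float(np.sum(r_hat_2))
    # P[segment 1 preferred] = exp(R1) / (exp(R1) + exp(R2)),
    # computed in log space for numerical stability.
    log_z = np.logaddexp(R1, R2)
    log_p1 = R1 - log_z
    log_p2 = R2 - log_z
    return -(mu * log_p1 + (1.0 - mu) * log_p2)
```

Minimizing this loss over the dataset of labeled clip pairs fits the reward model; the fitted reward is then used in place of a hand-specified one to train the policy with a standard RL algorithm.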
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Why Alignment Might Be Hard | Argument | 69.0 |
| Reward Modeling | Approach | 55.0 |
Cached Content Preview
Deep Reinforcement Learning
from Human Preferences
Paul F Christiano, OpenAI, paul@openai.com
Jan Leike, DeepMind, leike@google.com
Tom B Brown, nottombrown@gmail.com
Miljan Martic, DeepMind, miljanm@google.com
Shane Legg, DeepMind, legg@google.com
Dario Amodei, OpenAI, damodei@openai.com
Abstract
For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments,
we need to communicate complex goals to these systems.
In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments.
We show that this approach can effectively solve complex RL tasks without access to the reward function,
including Atari games and simulated robot locomotion,
while providing feedback on less than 1% of our agent’s interactions with the environment.
This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems.
To demonstrate the flexibility of our approach,
we show that we can successfully train complex novel behaviors
with about an hour of human time.
These behaviors and environments are considerably more complex
than any which have been previously learned from human feedback.
1 Introduction
Recent success in scaling reinforcement learning (RL) to large problems
has been driven in domains that have a well-specified reward function (Mnih et al., 2015, 2016; Silver et al., 2016).
Unfortunately, many tasks involve goals that are complex, poorly-defined, or hard to specify.
Overcoming this limitation would greatly expand the possible impact of deep RL
and could increase the reach of machine learning more broadly.
For example, suppose that we wanted to use reinforcement learning
to train a robot to clean a table or scramble an egg.
It’s not clear how to construct a suitable reward function,
which will need to be a function of the robot’s sensors.
We could try to design a simple reward function that approximately captures the intended behavior,
but this will often result in behavior that optimizes our reward function without actually
satisfying our preferences.
This difficulty underlies recent concerns about misalignment
between our values and the objectives of our RL systems (Bostrom, 2014; Russell, 2016; Amodei et al., 2016).
If we could successfully communicate our actual objectives to our agents,
it would be a significant step towards addressing these concerns.
If we have demonstrations of the desired task,
we can extract a reward function using inverse reinforcement learning (Ng and Russell, 2000).
This reward function can then be used to train an agent with reinforcement learning.
More directly, we can use imitation learning
to clone the demonstrated behavior.
However, these approaches are not directly applicable
to behaviors that are
... (truncated, 63 KB total)