Longterm Wiki

Reinforcement Learning from Human Feedback (RLHF)

web

Credibility Rating

4/5 (High)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: OpenAI

This 2017 OpenAI blog post describes the original RLHF paper (arXiv:1706.03741), which introduced the technique that became foundational to InstructGPT and ChatGPT; essential reading for understanding modern alignment approaches and the practical alternative to hand-coded reward functions.

Metadata

Importance: 92/100 · blog post · primary source

Summary

OpenAI and DeepMind's safety team introduced Reinforcement Learning from Human Feedback (RLHF), enabling AI systems to learn complex behaviors from comparative human judgments rather than explicit reward specification. The algorithm infers a reward function from pairwise human preference comparisons, demonstrating strong sample efficiency—requiring only ~900 bits of feedback to learn a backflip task. This work is foundational to modern alignment techniques used in systems like ChatGPT.

Key Points

  • Eliminates the need for manually specified reward functions by learning from human comparative judgments between pairs of agent behaviors.
  • Demonstrated strong sample efficiency: ~900 bits of human feedback (<1 hour) sufficed to teach a backflip, while the agent accumulated 70 hours of simulated experience.
  • Uses a 3-step cycle: human compares behavior clips, system infers reward function, RL optimizes against the learned reward.
  • Collaborative work between OpenAI and DeepMind's safety team, framed explicitly as an AI safety contribution to reduce reward misspecification.
  • Foundational precursor to InstructGPT and ChatGPT's training methodology; one of the most influential practical alignment techniques.
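The reward-inference step described above can be sketched with the pairwise preference model the paper uses: the probability that a human prefers one clip over another is a logistic function of the difference in the clips' summed predicted rewards. A minimal illustration (the function names here are hypothetical, not from the paper's code):

```python
import numpy as np

def preference_prob(r_a, r_b):
    """P(human prefers clip A) given summed predicted rewards for each clip."""
    return 1.0 / (1.0 + np.exp(r_b - r_a))

def preference_loss(r_a, r_b, human_prefers_a):
    """Cross-entropy loss used to fit the reward model to one comparison.

    human_prefers_a is 1 if the human chose clip A, else 0.
    """
    p_a = preference_prob(r_a, r_b)
    return -(human_prefers_a * np.log(p_a)
             + (1 - human_prefers_a) * np.log(1.0 - p_a))
```

Minimizing this loss over many comparisons pushes the reward model to assign higher total reward to whichever clip the human tends to prefer; equal predicted rewards give a 50/50 preference probability.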

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| AI Value Lock-in | Risk | 64.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 15, 2026 · 10 KB

Learning from human preferences | OpenAI

June 13, 2017

[Release](https://openai.com/research/index/release/)

# Learning from human preferences

[Read paper](https://arxiv.org/abs/1706.03741)

![Learning From Human Preferences](https://images.ctfassets.net/kftzwdyauwt9/745ba770-7a51-45b1-2dbb2416a1d1/8949061be0f45f0eda1ce3b8acc2a081/image-96.webp?w=3840&q=90&fm=webp)


One step towards building safe AI systems is to remove the need for humans to write goal functions, since using a simple proxy for a complex goal, or getting the complex goal a bit wrong, can lead to undesirable and even dangerous behavior. In collaboration with DeepMind’s safety team, we’ve developed an algorithm which can infer what humans want by being told which of two proposed behaviors is better.

We present a learning algorithm that uses small amounts of human feedback to solve modern RL environments. Machine learning systems with human feedback [have](https://papers.nips.cc/paper/4805-a-bayesian-approach-for-policy-learning-from-trajectory-preference-queries) [been](https://link.springer.com/chapter/10.1007/978-3-319-02675-6_46) [explored](https://arxiv.org/abs/1208.0984) [before](https://hal.inria.fr/hal-00980839), but we’ve scaled up the approach to be able to work on much more complicated tasks. Our algorithm needed 900 bits of feedback from a human evaluator to learn to backflip—a seemingly simple task which is simple to judge but [challenging](https://openai.com/index/deep-reinforcement-learning-from-human-preferences/#bflip) to specify.

![Humanfeedbackjump](https://images.ctfassets.net/kftzwdyauwt9/cf6fdf49-ea9e-489d-eb53eceeebc7/03dec4ea90925c03dea2ee6c4976e921/humanfeedbackjump.gif?w=3840&q=90&fm=webp)

Our algorithm learned to backflip using around 900 individual bits of feedback from the human evaluator.

The overall training process is a 3-step feedback cycle between the human, the agent’s understanding of the goal, and the RL training.
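A toy, runnable version of this cycle, with a simulated human standing in for the evaluator and a deliberately trivial linear reward model (all names, the hidden goal, and the model form are illustrative assumptions, not the paper's implementation):

```python
import math
import random

random.seed(0)

TARGET = 1.0  # hidden goal the simulated human judges against

def simulated_human(a, b):
    """Step 2: the 'human' prefers whichever clip is closer to the goal."""
    return 1 if abs(a - TARGET) < abs(b - TARGET) else 0

def fit_reward(comparisons, steps=2000, lr=0.5):
    """Step 3a: infer a (linear) reward w*clip from pairwise comparisons
    via logistic-regression gradient ascent on the preference likelihood."""
    w = 0.0
    for _ in range(steps):
        a, b, pref = random.choice(comparisons)
        p = 1.0 / (1.0 + math.exp(w * b - w * a))  # P(prefers A)
        w += lr * (pref - p) * (a - b)
    return w

# Step 1: the agent "acts randomly", producing pairs of behavior clips
# (here just scalar features in [0, 1]).
pairs = [(random.random(), random.random()) for _ in range(200)]
comparisons = [(a, b, simulated_human(a, b)) for a, b in pairs]

# Step 3a: infer the reward function that best explains the judgments.
w = fit_reward(comparisons)

# Step 3b: "optimize" against the learned reward by picking the candidate
# behavior it scores highest (standing in for the RL training step).
candidates = [0.1, 0.4, 0.7, 0.95]
best = max(candidates, key=lambda c: w * c)
```

Because the simulated human always prefers the larger clip value, the fitted weight comes out positive and the learned reward ranks candidates toward the goal; in the real system, each pass through this loop also generates fresh clips for the human to compare.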

![Diagram2x 2](https://images.ctfassets.net/kftzwdyauwt9/76127a33-15be-4357-ee340ba5f9d5/8a212d11cf7ae259505a6c8125c7074f/diagram2x-2.png?w=3840&q=90&fm=webp)

Our AI agent starts by acting randomly in the environment. Periodically, two video clips of its behavior are given to a human, and the human decides which of the two clips is closest to fulfilling its goal—in this case, a backflip. The AI gradually builds a model of the goal of the task by finding the reward function that best explains the human’s judgments. It then uses RL to learn how to achieve that goal. As its behavior improves, it continues to ask for human feedback

... (truncated, 10 KB total)
Resource ID: 27d22b6c3bd3fa6a | Stable ID: YzNhOWJhMD