
Reinforcement Learning from Human Feedback

reference

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: Wikipedia

A solid introductory reference for understanding RLHF, the dominant alignment technique used in modern LLMs; useful for readers new to the field or seeking a broad overview before diving into primary research papers.

Metadata

Importance: 62/100 · wiki page · reference

Summary

Wikipedia's overview of Reinforcement Learning from Human Feedback (RLHF), a technique for training AI systems using human preference data as a reward signal. It covers the foundational concepts, history, and applications of RLHF, including its central role in aligning large language models such as ChatGPT with human intentions. The article explains the process of collecting human feedback, training reward models, and fine-tuning AI systems via reinforcement learning.
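
To make the reward-modeling step concrete, here is a minimal sketch of a pairwise (Bradley-Terry style) preference loss of the kind commonly used to fit the reward model, assuming PyTorch. The `RewardModel` class and `preference_loss` function are illustrative stand-ins (a toy scorer over feature vectors rather than a language model) and are not taken from the article.

```python
# Minimal sketch of reward-model training from pairwise human preferences,
# assuming PyTorch. RewardModel is a toy feature-vector scorer standing in
# for a language-model-based reward model; all names are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a (prompt, response) representation to a scalar reward."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def preference_loss(rm: RewardModel,
                    chosen: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: push the reward of the human-preferred
    # response above the reward of the rejected one.
    return -F.logsigmoid(rm(chosen) - rm(rejected)).mean()

# Toy training step on random "representations" of preferred/rejected pairs.
rm = RewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)
loss = preference_loss(rm, chosen, rejected)
opt.zero_grad()
loss.backward()
opt.step()
```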

Key Points

  • RLHF uses human preference judgments to train a reward model, which then guides RL-based fine-tuning of AI systems toward desired behaviors.
  • It has become a standard technique for aligning large language models (LLMs) such as ChatGPT and Claude with human values and intentions.
  • The process involves three main steps: supervised fine-tuning, reward model training from human comparisons, and RL optimization (often using PPO); see the KL-shaped reward sketch after this list.
  • RLHF can reduce harmful outputs and improve helpfulness, but is subject to reward hacking, feedback biases, and scalability challenges.
  • Variants and alternatives such as RLAIF, DPO, and Constitutional AI have been developed to address limitations of standard RLHF.
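
As a rough illustration of how the RL step is commonly shaped in practice, the sketch below computes a per-token KL penalty against a frozen reference model and credits the reward-model score on the last token of each response. This is a minimal, illustrative reduction (the function and tensor names are assumptions, not from the article); a real pipeline would feed these shaped rewards into a PPO-style optimizer rather than print them.

```python
# Minimal sketch of the KL-shaped reward used in RLHF's RL fine-tuning step,
# assuming PyTorch and that per-token log-probabilities from the policy being
# tuned and from a frozen reference model are already available. Names here
# (shaped_rewards, beta) are illustrative assumptions, not from the article.
import torch

def shaped_rewards(reward_model_score: torch.Tensor,
                   policy_logprobs: torch.Tensor,
                   ref_logprobs: torch.Tensor,
                   beta: float = 0.1) -> torch.Tensor:
    # Per-token KL penalty keeps the tuned policy close to the reference
    # model, which helps mitigate reward hacking and distribution drift.
    rewards = -beta * (policy_logprobs - ref_logprobs)      # (batch, seq_len)
    # The scalar reward-model score for each response is credited on its
    # final token; PPO-style advantage estimation then propagates it back.
    rewards[:, -1] += reward_model_score
    return rewards

# Toy usage: a batch of 4 responses, 10 tokens each, with random log-probs.
score = torch.randn(4)
policy_lp, ref_lp = torch.randn(4, 10), torch.randn(4, 10)
print(shaped_rewards(score, policy_lp, ref_lp).shape)  # torch.Size([4, 10])
```

In such setups the coefficient `beta` trades off maximizing the learned reward against staying close to the supervised-fine-tuned reference model.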

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Why Alignment Might Be Hard | Argument | 69.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 98 KB


From Wikipedia, the free encyclopedia


Machine learning technique

In [machine learning](https://en.wikipedia.org/wiki/Machine_learning "Machine learning"), **reinforcement learning from human feedback** ( **RLHF**) is a technique to [align](https://en.wikipedia.org/wiki/AI_alignment "AI alignment") an [intelligent agent](https://en.wikipedia.org/wiki/Intelligent_agent "Intelligent agent") with human [preferences](https://en.wikipedia.org/wiki/Preference "Preference"). It involves training a reward model to represent preferences, which can then be used to train other models through [reinforcement learning](https://en.wikipedia.org/wiki/Reinforcement_learning "Reinforcement learning").[\[1\]](https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback#cite_note-1)

In classical reinforcement learning, an intelligent agent's goal is to learn a function that guides its behavior, called a [policy](https://en.wikipedia.org/wiki/Reinforcemen

... (truncated, 98 KB total)
Resource ID: a665c398b96149c1 | Stable ID: N2EyOGQxYm