Longterm Wiki

RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

paper

Authors

Harrison Lee·Samrat Phatale·Hassan Mansoor·Thomas Mesnard·Johan Ferret·Kellie Lu·Colton Bishop·Ethan Hall·Victor Carbune·Abhinav Rastogi·Sushant Prakash

Credibility Rating

3/5
Good(3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

This paper introduces RLAIF as a scalable alternative to RLHF for aligning LLMs by using AI-generated preference labels instead of expensive human feedback, directly addressing a key bottleneck in AI safety alignment research.

Paper Details

Citations
564
37 influential
Year
2023
Methodology
peer-reviewed
Categories
NAECON 2024 - IEEE National Aerospace and Electronics Conference

Metadata

arXiv preprint · primary source

Abstract

Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but gathering high-quality preference labels is expensive. RL from AI Feedback (RLAIF), introduced by Bai et al., offers a promising alternative that trains the reward model (RM) on preferences generated by an off-the-shelf LLM. Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, we show that RLAIF achieves comparable performance to RLHF. Furthermore, we take a step towards "self-improvement" by demonstrating that RLAIF can outperform a supervised fine-tuned baseline even when the AI labeler is the same size as the policy, or even the exact same checkpoint as the initial policy. Finally, we introduce direct-RLAIF (d-RLAIF) - a technique that circumvents RM training by obtaining rewards directly from an off-the-shelf LLM during RL, which achieves superior performance to canonical RLAIF. Our results suggest that RLAIF can achieve performance on par with using human feedback, offering a potential solution to the scalability limitations of RLHF.

Summary

This paper introduces RLAIF (Reinforcement Learning from AI Feedback), a scalable alternative to RLHF that uses an off-the-shelf LLM to generate preference labels instead of relying on expensive human annotations. The authors demonstrate that RLAIF achieves comparable performance to RLHF across summarization, helpful dialogue, and harmless dialogue tasks. They further show that RLAIF can enable self-improvement and introduce direct-RLAIF (d-RLAIF), which obtains rewards directly from an LLM during RL training, achieving superior performance. These results suggest RLAIF addresses the scalability limitations of RLHF while maintaining competitive alignment quality.
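The core labeling step described above can be sketched as follows. This is a hypothetical illustration, not the authors' code: the prompt template is paraphrased from the paper's summarization setup, and `labeler_logprobs` is a stub standing in for a real LLM API call. The paper's position-bias mitigation (averaging preferences over both candidate orderings) is included.

```python
import math

# Prompt template sketched after the paper's summarization preamble
# (wording here is illustrative, not the exact prompt).
PROMPT = (
    "A good summary is concise and faithful to the text.\n"
    "Text: {text}\nSummary 1: {a}\nSummary 2: {b}\nPreferred Summary="
)

def labeler_logprobs(prompt: str, a: str, b: str) -> dict:
    """Stub for an off-the-shelf LLM returning log-probs of tokens '1'/'2'.

    A real system would send `prompt` to an LLM; here a toy heuristic
    (shorter summary scores higher) keeps the sketch self-contained.
    """
    sa, sb = -len(a) / 50.0, -len(b) / 50.0
    z = math.log(math.exp(sa) + math.exp(sb))  # log-normalizer (softmax)
    return {"1": sa - z, "2": sb - z}

def soft_preference(text: str, a: str, b: str) -> float:
    """P(first candidate preferred), averaged over both candidate
    orderings to mitigate the labeler's position bias."""
    p_ab = math.exp(labeler_logprobs(PROMPT.format(text=text, a=a, b=b), a, b)["1"])
    p_ba = math.exp(labeler_logprobs(PROMPT.format(text=text, a=b, b=a), b, a)["1"])
    return 0.5 * (p_ab + (1.0 - p_ba))
```

In the canonical RLAIF setup, these soft preference labels would then train a reward model that scores policy samples during RL.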

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Constitutional AI | Approach | 70.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 98 KB
# RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu,

Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, Sushant Prakash

Google Research

{harrisonlee,samratph,hassan}@google.com

###### Abstract

Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences. However, gathering high-quality human preference labels can be a time-consuming and expensive endeavor. RL from AI Feedback (RLAIF), introduced by [Bai et al.](https://ar5iv.labs.arxiv.org/html/2309.00267#bib.bib3 ""), offers a promising alternative that leverages a powerful off-the-shelf LLM to generate preferences in lieu of human annotators. Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, RLAIF achieves comparable or superior performance to RLHF, as rated by human evaluators. Furthermore, RLAIF demonstrates the ability to outperform a supervised fine-tuned baseline even when the LLM preference labeler is the same size as the policy. In another experiment, directly prompting the LLM for reward scores achieves superior performance to the canonical RLAIF setup, where LLM preference labels are first distilled into a reward model. Finally, we conduct extensive studies on techniques for generating aligned AI preferences. Our results suggest that RLAIF can achieve human-level performance, offering a potential solution to the scalability limitations of RLHF.
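The "directly prompting the LLM for reward scores" variant (d-RLAIF) can be sketched as below. This is an illustrative reconstruction, not the authors' code: the labeler LLM is prompted to rate a response on a 1-10 scale, the probability-weighted mean over the score tokens is taken, and the result is normalized to [-1, 1] for use as the RL reward, skipping reward-model training entirely. The input format (a map from score token to log-probability) is an assumption of this sketch.

```python
import math

def direct_reward(score_logprobs: dict) -> float:
    """d-RLAIF-style reward sketch: given the labeler LLM's log-probs
    over the score tokens 1..10, return the probability-weighted mean
    score normalized to [-1, 1].

    `score_logprobs` maps each score (int, 1..10) to its log-probability;
    missing scores are treated as zero-probability.
    """
    probs = {s: math.exp(lp) for s, lp in score_logprobs.items()}
    total = sum(probs.values())
    mean_score = sum(s * p for s, p in probs.items()) / total
    # Map [1, 10] linearly onto [-1, 1].
    return 2.0 * (mean_score - 1.0) / 9.0 - 1.0
```

A uniform distribution over the ten scores maps to a reward of 0, and full confidence in the top score maps to 1.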


## 1 Introduction


Figure 1: Human evaluators strongly prefer RLAIF and RLHF over the SFT baseline for summarization and helpful dialogue generation. Their difference in win rates vs. SFT is not statistically significant. Furthermore, when compared head-to-head, RLAIF is equally preferred to RLHF. For harmless dialogue generation, RLAIF outperforms RLHF.

Reinforcement Learning from Human Feedback (RLHF) is an effective technique for aligning language models to human preferences (Stiennon et al., [2020](https://ar5iv.labs.arxiv.org/html/2309.00267#bib.bib39 ""); Ouyang et al., [2022](https://ar5iv.labs.arxiv.org/html/2309.00267#bib.bib34 "")). It is cited as one of the key drivers of success in modern conversational language models, such as ChatGPT (Liu et al., [2023](https://ar5iv.labs.arxiv.org/html/2309.00267#bib.bib26 "")) and Bard (Manyika, [2023](https://ar5iv.labs.arxiv.org/html/2309.00267#bib.bib28 "")). Training language mod

... (truncated, 98 KB total)