RLAIF: Scaling Reinforcement Learning from Human Feedback
Paper Authors
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
This paper introduces RLAIF as a scalable alternative to RLHF for aligning LLMs by using AI-generated preference labels instead of expensive human feedback, directly addressing a key bottleneck in AI safety alignment research.
Paper Details
Metadata
Abstract
Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but gathering high-quality preference labels is expensive. RL from AI Feedback (RLAIF), introduced by Bai et al., offers a promising alternative that trains the reward model (RM) on preferences generated by an off-the-shelf LLM. Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, we show that RLAIF achieves comparable performance to RLHF. Furthermore, we take a step towards "self-improvement" by demonstrating that RLAIF can outperform a supervised fine-tuned baseline even when the AI labeler is the same size as the policy, or even the exact same checkpoint as the initial policy. Finally, we introduce direct-RLAIF (d-RLAIF), a technique that circumvents RM training by obtaining rewards directly from an off-the-shelf LLM during RL, which achieves superior performance to canonical RLAIF. Our results suggest that RLAIF can achieve performance on par with using human feedback, offering a potential solution to the scalability limitations of RLHF.
Summary
This paper introduces RLAIF (Reinforcement Learning from AI Feedback), a scalable alternative to RLHF that uses an off-the-shelf LLM to generate preference labels instead of relying on expensive human annotations. The authors demonstrate that RLAIF achieves comparable performance to RLHF across summarization, helpful dialogue, and harmless dialogue tasks. They further show that RLAIF can enable self-improvement and introduce direct-RLAIF (d-RLAIF), which obtains rewards directly from an LLM during RL training, achieving superior performance. These results suggest RLAIF addresses the scalability limitations of RLHF while maintaining competitive alignment quality.
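The AI-labeling step at the heart of RLAIF can be sketched as follows. This is a minimal illustration only: the function names, prompt wording, and the `llm_token_logprobs` interface are assumptions, not the paper's actual code. The idea is that an off-the-shelf LLM is asked which of two responses is better, and the probabilities it assigns to the two choices become a soft preference label for reward-model training.

```python
import math

def ai_preference(context, response_1, response_2, llm_token_logprobs):
    """Return P(response_1 preferred) from an LLM labeler.

    llm_token_logprobs(prompt) -> {"1": logprob, "2": logprob} is a
    stand-in for querying an off-the-shelf LLM for the next-token
    log-probabilities of the two choice tokens.
    """
    prompt = (
        "Which response is better for the given context?\n"
        f"Context: {context}\n"
        f"Response 1: {response_1}\n"
        f"Response 2: {response_2}\n"
        "Preferred response ="
    )
    logps = llm_token_logprobs(prompt)
    p1 = math.exp(logps["1"])
    p2 = math.exp(logps["2"])
    # Normalize over the two choice tokens to get a soft label in [0, 1].
    return p1 / (p1 + p2)

# Toy stand-in labeler: always assigns 70% probability to option 1.
def toy_labeler(prompt):
    return {"1": math.log(0.7), "2": math.log(0.3)}

print(round(ai_preference("ctx", "a", "b", toy_labeler), 2))  # 0.7
```

The soft label (rather than a hard 0/1 choice) lets the reward model be trained on the labeler's uncertainty as well as its preference.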
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Constitutional AI | Approach | 70.0 |
Cached Content Preview
# RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu,
Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, Sushant Prakash
Google Research
{harrisonlee,samratph,hassan}@google.com
###### Abstract
Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences. However, gathering high-quality human preference labels can be a time-consuming and expensive endeavor. RL from AI Feedback (RLAIF), introduced by [Bai et al.](https://ar5iv.labs.arxiv.org/html/2309.00267#bib.bib3 ""), offers a promising alternative that leverages a powerful off-the-shelf LLM to generate preferences in lieu of human annotators. Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, RLAIF achieves comparable or superior performance to RLHF, as rated by human evaluators. Furthermore, RLAIF demonstrates the ability to outperform a supervised fine-tuned baseline even when the LLM preference labeler is the same size as the policy. In another experiment, directly prompting the LLM for reward scores achieves superior performance to the canonical RLAIF setup, where LLM preference labels are first distilled into a reward model. Finally, we conduct extensive studies on techniques for generating aligned AI preferences. Our results suggest that RLAIF can achieve human-level performance, offering a potential solution to the scalability limitations of RLHF.
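The d-RLAIF variant mentioned in the abstract skips reward-model training entirely: samples from the policy are scored directly by an off-the-shelf LLM during RL. A hedged sketch of one plausible scoring scheme is below; the names, prompt, and `score_logprobs` interface are illustrative assumptions, not the paper's exact implementation.

```python
import math

def direct_llm_reward(context, response, score_logprobs):
    """Return a scalar reward for a policy sample.

    The LLM is prompted to rate the response on a 1-10 scale, and the
    reward is the probability-weighted average over the score tokens.
    score_logprobs(prompt) -> {score_string: logprob} is a stand-in for
    querying the LLM for next-token log-probabilities.
    """
    prompt = (
        "Rate the quality of the response on a scale of 1-10.\n"
        f"Context: {context}\n"
        f"Response: {response}\n"
        "Rating ="
    )
    probs = {s: math.exp(lp) for s, lp in score_logprobs(prompt).items()}
    total = sum(probs.values())
    # Expected score under the LLM's distribution over ratings.
    return sum(int(s) * p for s, p in probs.items()) / total

# Toy stand-in scorer with mass on ratings 1, 5, and 9.
def toy_scorer(prompt):
    return {"1": math.log(0.1), "5": math.log(0.2), "9": math.log(0.7)}

print(round(direct_llm_reward("ctx", "resp", toy_scorer), 1))  # 7.4
```

Using the expected score rather than a single sampled rating gives a smoother reward signal for the RL optimizer.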
## 1 Introduction
Figure 1: Human evaluators strongly prefer RLAIF and RLHF over the SFT baseline for summarization and helpful dialogue generation. Their difference in win rates vs. SFT is not statistically significant. Furthermore, when compared head-to-head, RLAIF is equally preferred to RLHF. For harmless dialogue generation, RLAIF outperforms RLHF.
Reinforcement Learning from Human Feedback (RLHF) is an effective technique for aligning language models to human preferences (Stiennon et al., [2020](https://ar5iv.labs.arxiv.org/html/2309.00267#bib.bib39 ""); Ouyang et al., [2022](https://ar5iv.labs.arxiv.org/html/2309.00267#bib.bib34 "")). It is cited as one of the key drivers of success in modern conversational language models, such as ChatGPT (Liu et al., [2023](https://ar5iv.labs.arxiv.org/html/2309.00267#bib.bib26 "")) and Bard (Manyika, [2023](https://ar5iv.labs.arxiv.org/html/2309.00267#bib.bib28 "")). Training language mod
... (truncated, 98 KB total)