Training a Helpful and Harmless Assistant with RLHF (2022)
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Foundational paper on Reinforcement Learning from Human Feedback (RLHF), a core AI safety technique for aligning language models to be helpful and harmless so that model behavior matches human values.
Paper Details
Metadata
Abstract
We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness of RLHF training, and identify a roughly linear relation between the RL reward and the square root of the KL divergence between the policy and its initialization. Alongside our main results, we perform peripheral analyses on calibration, competing objectives, and the use of OOD detection, compare our models with human writers, and provide samples from our models using prompts appearing in recent related work.
Summary
This paper presents a comprehensive approach to aligning language models with human preferences using reinforcement learning from human feedback (RLHF). The authors demonstrate that preference modeling combined with RL-based finetuning improves performance across NLP evaluations while maintaining compatibility with specialized tasks like coding and summarization. They introduce an iterated online training procedure with weekly updates using fresh human feedback, and they establish a roughly linear relationship between RL reward and the square root of the KL divergence from the model's initialization, providing insights into the robustness and dynamics of RLHF training.
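The preference-modeling step trains a scalar score on pairwise human comparisons. A minimal sketch of the standard pairwise (Bradley-Terry style) loss, written as a hypothetical illustration rather than the paper's exact implementation:

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).

    Minimized when the model scores the human-preferred response
    above the rejected one; this is the Bradley-Terry style objective
    commonly used for RLHF preference models (illustrative sketch).
    """
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the preferred response is scored higher:
confident = preference_loss(2.0, 0.0)   # correct ranking, wide margin
wrong = preference_loss(0.0, 2.0)       # inverted ranking
assert confident < wrong
```

In practice the scores come from a learned model evaluated on full conversations, and the loss is averaged over a dataset of human comparisons; the sketch above only shows the per-pair objective.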
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Dario Amodei | Person | 41.0 |
Cached Content Preview
# Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, Jared Kaplan

Correspondence to: {yuntao, jared}@anthropic.com. Author contributions are listed at the end of the paper.
Anthropic
## 1 Introduction
We would like to develop techniques to train AI agents that are helpful, honest, and harmless \[ [Askell et al., 2021](https://ar5iv.labs.arxiv.org/html/2204.05862#bib.bibx2 "")\]. In this paper we show that we can train a relatively helpful and harmless¹ (HH) natural language assistant by collecting human preference data and applying the techniques of preference modeling (PMing) and reinforcement learning from human feedback (RLHF). Our full training process is summarized in Figure [2](https://ar5iv.labs.arxiv.org/html/2204.05862#S1.F2 "Figure 2 ‣ 1.1 Contributions ‣ 1 Introduction ‣ Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback").

¹ We do not focus explicitly on honesty/truthfulness in this paper, as we believe that techniques other than pure human feedback may be more efficient and effective at training models to be honest. But we certainly believe that honesty is a crucial goal for AI alignment, and our models do improve on evaluations of honesty (see Figure [5](https://ar5iv.labs.arxiv.org/html/2204.05862#S1.F5 "Figure 5 ‣ 1.2 Summary of Evaluations and Metrics ‣ 1 Introduction ‣ Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback")).
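The robustness finding highlighted in the abstract, that RL reward grows roughly linearly in the square root of the KL divergence between the policy and its initialization, can be sketched with a toy discrete KL and a hypothetical fitted slope. The coefficient below is illustrative, not a value from the paper:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical scaling relation reward(d) ~ SLOPE * sqrt(d),
# where d = KL(policy || init). The paper fits such a slope empirically
# per training run; 2.0 here is purely illustrative.
SLOPE = 2.0

def predicted_reward(kl):
    return SLOPE * math.sqrt(kl)

init = [0.25, 0.25, 0.25, 0.25]       # uniform initialization over 4 outcomes
policy = [0.40, 0.30, 0.20, 0.10]     # policy after some RLHF drift
d = kl_divergence(policy, init)       # positive once the policy moves off its init
```

Under this relation, quadrupling the KL budget only doubles the predicted reward gain, which is one way to read the paper's observation that RLHF reward is bought at a diminishing rate in KL.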