Training a Helpful and Harmless Assistant with RLHF (2022)
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Foundational paper on Reinforcement Learning from Human Feedback (RLHF), a core AI safety technique for aligning language models to be helpful and harmless so that model behavior matches human values.
Paper Details
Metadata
Abstract
We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness of RLHF training, and identify a roughly linear relation between the RL reward and the square root of the KL divergence between the policy and its initialization. Alongside our main results, we perform peripheral analyses on calibration, competing objectives, and the use of OOD detection, compare our models with human writers, and provide samples from our models using prompts appearing in recent related work.
Summary
This paper presents a comprehensive approach to aligning language models with human preferences using reinforcement learning from human feedback (RLHF). The authors demonstrate that preference modeling combined with RL-based finetuning improves performance across NLP evaluations while maintaining compatibility with specialized tasks like coding and summarization. They introduce an iterated online training procedure with weekly updates using fresh human feedback, and they establish a roughly linear relationship between RL reward and the square root of the KL divergence from the model's initialization, providing insights into the robustness and dynamics of RLHF training.
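The preference-modeling step trains a scalar score on pairwise human comparisons. A minimal sketch of the standard pairwise (Bradley-Terry style) loss, written as a hypothetical illustration rather than the paper's exact implementation:

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).

    Minimized when the model scores the human-preferred response
    above the rejected one; this is the Bradley-Terry style objective
    commonly used for RLHF preference models (illustrative sketch).
    """
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the preferred response is scored higher:
confident = preference_loss(2.0, 0.0)   # correct ranking, wide margin
wrong = preference_loss(0.0, 2.0)       # inverted ranking
assert confident < wrong
```

In practice the scores come from a learned model evaluated on full conversations, and the loss is averaged over a dataset of human comparisons; the sketch above only shows the per-pair objective.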
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Dario Amodei | Person | 41.0 |
Cached Content Preview
# Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, Jared Kaplan

Correspondence to: {yuntao, jared}@anthropic.com. Author contributions are listed at the end of the paper.
Anthropic
## 1 Introduction
We would like to develop techniques to train AI agents that are helpful, honest, and harmless \[ [Askell et al., 2021](https://ar5iv.labs.arxiv.org/html/2204.05862#bib.bibx2 "")\]. In this paper we show that we can train a relatively helpful and harmless¹ (HH) natural language assistant by collecting human preference data and applying the techniques of preference modeling (PMing) and reinforcement learning from human feedback (RLHF). Our full training process is summarized in Figure [2](https://ar5iv.labs.arxiv.org/html/2204.05862#S1.F2 "Figure 2 ‣ 1.1 Contributions ‣ 1 Introduction ‣ Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback").

¹ We do not focus explicitly on honesty/truthfulness in this paper, as we believe that techniques other than pure human feedback may be more efficient and effective at training models to be honest. But we certainly believe that honesty is a crucial goal for AI alignment, and our models do improve on evaluations of honesty (see Figure [5](https://ar5iv.labs.arxiv.org/html/2204.05862#S1.F5 "Figure 5 ‣ 1.2 Summary of Evaluations and Metrics ‣ 1 Introduction ‣ Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback")).
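The robustness finding highlighted in the abstract, that RL reward grows roughly linearly in the square root of the KL divergence between the policy and its initialization, can be sketched with a toy discrete KL and a hypothetical fitted slope. The coefficient below is illustrative, not a value from the paper:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical scaling relation reward(d) ~ SLOPE * sqrt(d),
# where d = KL(policy || init). The paper fits such a slope empirically
# per training run; 2.0 here is purely illustrative.
SLOPE = 2.0

def predicted_reward(kl):
    return SLOPE * math.sqrt(kl)

init = [0.25, 0.25, 0.25, 0.25]       # uniform initialization over 4 outcomes
policy = [0.40, 0.30, 0.20, 0.10]     # policy after some RLHF drift
d = kl_divergence(policy, init)       # positive once the policy moves off its init
```

Under this relation, quadrupling the KL budget only doubles the predicted reward gain, which is one way to read the paper's observation that RLHF reward is bought at a diminishing rate in KL.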