Self-correction research
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Research on imitation learning from language feedback (ILF) for aligning language model outputs with human preferences, addressing harmful text generation and factual errors by using richer natural-language feedback instead of pairwise comparisons alone.
Paper Details
Metadata
Abstract
Pretrained language models often generate outputs that are not in line with human preferences, such as harmful text or factually incorrect summaries. Recent work approaches the above issues by learning from a simple form of human feedback: comparisons between pairs of model-generated outputs. However, comparison feedback only conveys limited information about human preferences. In this paper, we introduce Imitation learning from Language Feedback (ILF), a new approach that utilizes more informative language feedback. ILF consists of three steps that are applied iteratively: first, conditioning the language model on the input, an initial LM output, and feedback to generate refinements. Second, selecting the refinement incorporating the most feedback. Third, finetuning the language model to maximize the likelihood of the chosen refinement given the input. We show theoretically that ILF can be viewed as Bayesian Inference, similar to Reinforcement Learning from human feedback. We evaluate ILF's effectiveness on a carefully-controlled toy task and a realistic summarization task. Our experiments demonstrate that large language models accurately incorporate feedback and that finetuning with ILF scales well with the dataset size, even outperforming finetuning on human summaries. Learning from both language and comparison feedback outperforms learning from each alone, achieving human-level summarization performance.
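The claim that ILF "can be viewed as Bayesian Inference" follows the general alignment-as-inference framing. A standard form of that framing (a sketch of the generic idea, not necessarily the paper's exact derivation) treats the aligned model as a posterior over outputs:

```latex
% Generic alignment-as-posterior-inference framing (illustrative):
% \pi_0 is the pretrained LM, and p(\text{good} \mid x, y) models the
% probability that output y satisfies human preferences for input x.
\pi^*(y \mid x) \;\propto\; \pi_0(y \mid x)\, p(\text{good} \mid x, y)
```

Under this view, finetuning on refinements that best incorporate feedback can be read as approximately sampling from, and then distilling, this posterior.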
Summary
This paper introduces Imitation learning from Language Feedback (ILF), a method for aligning language models with human preferences using natural language feedback rather than just pairwise comparisons. ILF operates iteratively by conditioning the model on inputs, initial outputs, and feedback to generate refinements, selecting the best refinement, and finetuning to maximize its likelihood. The authors provide theoretical grounding by connecting ILF to Bayesian inference and demonstrate through experiments on summarization tasks that ILF effectively incorporates feedback, scales well with dataset size, and can outperform finetuning on human-written summaries. Combining language and comparison feedback yields the best results, achieving human-level performance.
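The three-step loop described above can be sketched as runnable toy code. Everything here is an illustrative stand-in, not the paper's implementation: `ToyLM`, `refine`, and `score_feedback_fit` are hypothetical placeholders for an actual language model, refinement sampling, and feedback-incorporation scoring.

```python
import random


class ToyLM:
    """Minimal stand-in for a language model; illustration only."""

    def __init__(self):
        self.training_data = []

    def generate(self, x):
        # Pretend initial output: a naive "summary" of the input.
        return x.split(".")[0]

    def refine(self, x, y0, feedback):
        # Pretend refinement: incorporate a random-length prefix of the feedback.
        k = random.randint(0, len(feedback.split()))
        return (y0 + " " + " ".join(feedback.split()[:k])).strip()

    def score_feedback_fit(self, y, feedback):
        # Crude proxy for "how much feedback was incorporated": word overlap.
        return sum(1 for w in feedback.split() if w in y.split())

    def finetune(self, pairs):
        # Stand-in for maximizing likelihood of refinements given inputs.
        self.training_data.extend(pairs)


def ilf_round(lm, inputs, feedback_fn, n_refinements=4):
    """One ILF iteration: generate refinements, select the best, finetune."""
    chosen = []
    for x in inputs:
        y0 = lm.generate(x)                  # initial LM output
        fb = feedback_fn(x, y0)              # language feedback on that output
        # Step 1: sample refinements conditioned on (input, output, feedback).
        candidates = [lm.refine(x, y0, fb) for _ in range(n_refinements)]
        # Step 2: select the refinement incorporating the most feedback.
        best = max(candidates, key=lambda y: lm.score_feedback_fit(y, fb))
        chosen.append((x, best))
    # Step 3: finetune on the chosen (input, refinement) pairs.
    lm.finetune(chosen)
    return chosen
```

In the paper the loop is applied iteratively, so the finetuned model from one round produces the initial outputs for the next; the toy `ilf_round` would simply be called repeatedly on the same `lm` object.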
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Long-Horizon Autonomous Tasks | Capability | 65.0 |
Cached Content Preview
# Training Language Models with Language Feedback at Scale
Jérémy Scheurer
Jon Ander Campos
Tomasz Korbak
Jun Shern Chan
Angelica Chen
Kyunghyun Cho
Ethan Perez
###### Abstract
Pretrained language models often generate outputs that are not in line with human preferences, such as harmful text or factually incorrect summaries. Recent work approaches the above issues by learning from a simple form of human feedback: comparisons between pairs of model-generated outputs. However, comparison feedback only conveys limited information about human preferences. In this paper, we introduce Imitation learning from Language Feedback (ILF), a new approach that utilizes more informative language feedback. ILF consists of three steps that are applied iteratively: first, conditioning the language model on the input, an initial LM output, and feedback to generate refinements. Second, selecting the refinement incorporating the most feedback. Third, finetuning the language model to maximize the likelihood of the chosen refinement given the input. We show theoretically that ILF can be viewed as Bayesian Inference, similar to Reinforcement Learning from human feedback. We evaluate ILF's effectiveness on a carefully-controlled toy task and a realistic summarization task. Our experiments demonstrate that large language models accurately incorporate feedback and that finetuning with ILF scales well with the dataset size, even outperforming finetuning on human summaries. Learning from both language and comparison feedback outperforms learning from each alone, achieving human-level summarization performance.
Keywords: Language Models, Bayesian Inference, Reinforcement Learning from Human Feedback
Figure 1: To learn from language feedback on a language model (LM) output, we have an LM generate multiple refinements of the original output based on the feedback. We use an LM to pick the best refinement and finetune the original LM to maximize the likelihood of the chosen refinement.
## 1 Introduction
Language Models (LMs) achieve strong performance across diverse NLP tasks, from summarization to question answering and dialog (Radford & Narasimhan, [2018](https://ar5iv.labs.arxiv.org/html/2303.16755#bib.bib38 ""); Radford et al., [2019](https://ar5iv.labs.arxiv.org/html/2303.16755#bib.bib40 ""); Brown et al., [2020](https://ar5iv.labs.arxiv.org/html/2303.16755#bib.bib4 ""); Rae et al., [2021](https://ar5iv.labs.arxiv.org/html/2303.16755#bib.bib41 ""), inter alia). One of their key limitations, however, is that they generate text that violates human preferences, such as misinformation (Lin et al., [2021](https://ar5iv.labs.arxiv.org/html/2303.16755#bib.bib24 "")), offensive language (Gehman et al., [2020](https://ar5iv.labs.arxiv.org/html/2303.16755#bib.bib11 "")), and factually incorrect summaries (Stiennon et al., [2020](https://ar5iv.labs.arxiv.org/html/2303.16755#bib.bib49 "")). To alleviate such iss
... (truncated, 98 KB total)