Longterm Wiki

Self-correction research

paper

Authors

Jérémy Scheurer·Jon Ander Campos·Tomasz Korbak·Jun Shern Chan·Angelica Chen·Kyunghyun Cho·Ethan Perez

Credibility Rating

3/5
Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Research on imitation learning from language feedback to align language model outputs with human preferences, addressing harmful text generation and factual errors through improved feedback mechanisms beyond pairwise comparisons.

Paper Details

Citations
1
5 influential
Year
2023
Methodology
report

Metadata

arXiv preprint · primary source

Abstract

Pretrained language models often generate outputs that are not in line with human preferences, such as harmful text or factually incorrect summaries. Recent work approaches the above issues by learning from a simple form of human feedback: comparisons between pairs of model-generated outputs. However, comparison feedback only conveys limited information about human preferences. In this paper, we introduce Imitation learning from Language Feedback (ILF), a new approach that utilizes more informative language feedback. ILF consists of three steps that are applied iteratively: first, conditioning the language model on the input, an initial LM output, and feedback to generate refinements. Second, selecting the refinement incorporating the most feedback. Third, finetuning the language model to maximize the likelihood of the chosen refinement given the input. We show theoretically that ILF can be viewed as Bayesian Inference, similar to Reinforcement Learning from human feedback. We evaluate ILF's effectiveness on a carefully-controlled toy task and a realistic summarization task. Our experiments demonstrate that large language models accurately incorporate feedback and that finetuning with ILF scales well with the dataset size, even outperforming finetuning on human summaries. Learning from both language and comparison feedback outperforms learning from each alone, achieving human-level summarization performance.

Summary

This paper introduces Imitation learning from Language Feedback (ILF), a method for aligning language models with human preferences using natural language feedback rather than just pairwise comparisons. ILF operates iteratively by conditioning the model on inputs, initial outputs, and feedback to generate refinements, selecting the best refinement, and finetuning to maximize its likelihood. The authors provide theoretical grounding by connecting ILF to Bayesian inference and demonstrate through experiments on summarization tasks that ILF effectively incorporates feedback, scales well with dataset size, and can outperform finetuning on human-written summaries. Combining language and comparison feedback yields the best results, achieving human-level performance.
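The three-step loop described above can be sketched in Python. This is a hedged illustration, not the paper's implementation: `generate_refinements`, `select_best`, and the `lm`/`scorer` callables are hypothetical stand-ins for the LM-based generation and selection the paper describes, and the finetuning step is reduced to collecting supervised (input, refinement) pairs.

```python
# Hedged sketch of one ILF iteration. The lm and scorer callables are
# hypothetical stand-ins for the language-model calls in the paper;
# "finetuning" is represented by collecting SFT pairs.

def generate_refinements(lm, prompt, output, feedback, k=3):
    # Step 1: condition the LM on (input, initial output, feedback)
    # and sample k candidate refinements.
    return [
        lm(f"{prompt}\nOutput: {output}\nFeedback: {feedback}\nRefinement {i}:")
        for i in range(k)
    ]

def select_best(scorer, feedback, refinements):
    # Step 2: choose the refinement that best incorporates the feedback
    # (the paper uses an LM-based selector; here a generic scorer).
    return max(refinements, key=lambda r: scorer(r, feedback))

def ilf_round(lm, scorer, dataset, prompt, output, feedback):
    refinements = generate_refinements(lm, prompt, output, feedback)
    best = select_best(scorer, feedback, refinements)
    # Step 3: finetune to maximize p(best | prompt); here we simply
    # record the pair as supervised finetuning data.
    dataset.append((prompt, best))
    return best
```

Iterating `ilf_round` over a feedback dataset, then finetuning on the collected pairs, corresponds to one round of the iterative procedure the paper evaluates.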

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Long-Horizon Autonomous Tasks | Capability | 65.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 98 KB
# Training Language Models with Language Feedback at Scale

Jérémy Scheurer
Jon Ander Campos
Tomasz Korbak
Jun Shern Chan
Angelica Chen
Kyunghyun Cho
Ethan Perez

###### Abstract

Pretrained language models often generate outputs that are not in line with human preferences, such as harmful text or factually incorrect summaries. Recent work approaches the above issues by learning from a simple form of human feedback: comparisons between pairs of model-generated outputs. However, comparison feedback only conveys limited information about human preferences. In this paper, we introduce Imitation learning from Language Feedback (ILF), a new approach that utilizes more informative language feedback. ILF consists of three steps that are applied iteratively: first, conditioning the language model on the input, an initial LM output, and feedback to generate refinements. Second, selecting the refinement incorporating the most feedback. Third, finetuning the language model to maximize the likelihood of the chosen refinement given the input. We show theoretically that ILF can be viewed as Bayesian Inference, similar to Reinforcement Learning from human feedback. We evaluate ILF’s effectiveness on a carefully-controlled toy task and a realistic summarization task.
Our experiments demonstrate that large language models accurately incorporate feedback and that finetuning with ILF scales well with the dataset size, even outperforming finetuning on human summaries. Learning from both language and comparison feedback outperforms learning from each alone, achieving human-level summarization performance.

Language Models, Bayesian Inference, Reinforcement Learning from Human Feedback

![Refer to caption](https://ar5iv.labs.arxiv.org/html/2303.16755/assets/x1.png)

Figure 1: To learn from language feedback on a language model (LM) output, we have an LM generate multiple refinements of the original output based on the feedback. We use an LM to pick the best refinement and finetune the original LM to maximize the likelihood of the chosen refinement.

## 1 Introduction

Language Models (LMs) achieve strong performance across diverse NLP tasks, from summarization to question answering and dialog (Radford & Narasimhan, [2018](https://ar5iv.labs.arxiv.org/html/2303.16755#bib.bib38 ""); Radford et al., [2019](https://ar5iv.labs.arxiv.org/html/2303.16755#bib.bib40 ""); Brown et al., [2020](https://ar5iv.labs.arxiv.org/html/2303.16755#bib.bib4 ""); Rae et al., [2021](https://ar5iv.labs.arxiv.org/html/2303.16755#bib.bib41 ""), inter alia). One of their key limitations, however, is that they generate text that violates human preferences, such as misinformation (Lin et al., [2021](https://ar5iv.labs.arxiv.org/html/2303.16755#bib.bib24 "")), offensive language (Gehman et al., [2020](https://ar5iv.labs.arxiv.org/html/2303.16755#bib.bib11 "")), and factually incorrect summaries (Stiennon et al., [2020](https://ar5iv.labs.arxiv.org/html/2303.16755#bib.bib49 "")). To alleviate such iss

... (truncated, 98 KB total)
Resource ID: 9f43ad33cfdb0c4d | Stable ID: NDJhZjZmZG