Longterm Wiki

Self-correction research

paper

Authors

Jérémy Scheurer·Jon Ander Campos·Tomasz Korbak·Jun Shern Chan·Angelica Chen·Kyunghyun Cho·Ethan Perez

Credibility Rating

3/5
Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Research on imitation learning from language feedback to align language model outputs with human preferences, addressing harmful text generation and factual errors through improved feedback mechanisms beyond pairwise comparisons.

Paper Details

Citations
1
5 influential
Year
2023
Methodology
report

Metadata

arXiv preprint · primary source

Abstract

Pretrained language models often generate outputs that are not in line with human preferences, such as harmful text or factually incorrect summaries. Recent work approaches the above issues by learning from a simple form of human feedback: comparisons between pairs of model-generated outputs. However, comparison feedback only conveys limited information about human preferences. In this paper, we introduce Imitation learning from Language Feedback (ILF), a new approach that utilizes more informative language feedback. ILF consists of three steps that are applied iteratively: first, conditioning the language model on the input, an initial LM output, and feedback to generate refinements. Second, selecting the refinement incorporating the most feedback. Third, finetuning the language model to maximize the likelihood of the chosen refinement given the input. We show theoretically that ILF can be viewed as Bayesian Inference, similar to Reinforcement Learning from human feedback. We evaluate ILF's effectiveness on a carefully-controlled toy task and a realistic summarization task. Our experiments demonstrate that large language models accurately incorporate feedback and that finetuning with ILF scales well with the dataset size, even outperforming finetuning on human summaries. Learning from both language and comparison feedback outperforms learning from each alone, achieving human-level summarization performance.

Summary

This paper introduces Imitation learning from Language Feedback (ILF), a method for aligning language models with human preferences using natural language feedback rather than just pairwise comparisons. ILF operates iteratively by conditioning the model on inputs, initial outputs, and feedback to generate refinements, selecting the best refinement, and finetuning to maximize its likelihood. The authors provide theoretical grounding by connecting ILF to Bayesian inference and demonstrate through experiments on summarization tasks that ILF effectively incorporates feedback, scales well with dataset size, and can outperform finetuning on human-written summaries. Combining language and comparison feedback yields the best results, achieving human-level performance.
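The three-step loop described above can be sketched in Python. This is a hedged illustration, not the paper's implementation: `generate_refinements`, `select_best`, and the `lm`/`scorer` callables are hypothetical stand-ins for the LM-based generation and selection the paper describes, and the finetuning step is reduced to collecting supervised (input, refinement) pairs.

```python
# Hedged sketch of one ILF iteration. The lm and scorer callables are
# hypothetical stand-ins for the language-model calls in the paper;
# "finetuning" is represented by collecting SFT pairs.

def generate_refinements(lm, prompt, output, feedback, k=3):
    # Step 1: condition the LM on (input, initial output, feedback)
    # and sample k candidate refinements.
    return [
        lm(f"{prompt}\nOutput: {output}\nFeedback: {feedback}\nRefinement {i}:")
        for i in range(k)
    ]

def select_best(scorer, feedback, refinements):
    # Step 2: choose the refinement that best incorporates the feedback
    # (the paper uses an LM-based selector; here a generic scorer).
    return max(refinements, key=lambda r: scorer(r, feedback))

def ilf_round(lm, scorer, dataset, prompt, output, feedback):
    refinements = generate_refinements(lm, prompt, output, feedback)
    best = select_best(scorer, feedback, refinements)
    # Step 3: finetune to maximize p(best | prompt); here we simply
    # record the pair as supervised finetuning data.
    dataset.append((prompt, best))
    return best
```

Iterating `ilf_round` over a feedback dataset, then finetuning on the collected pairs, corresponds to one round of the iterative procedure the paper evaluates.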

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Long-Horizon Autonomous Tasks | Capability | 65.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 98 KB
# Training Language Models with Language Feedback at Scale

Jérémy Scheurer
Jon Ander Campos
Tomasz Korbak
Jun Shern Chan
Angelica Chen
Kyunghyun Cho
Ethan Perez

###### Abstract

Pretrained language models often generate outputs that are not in line with human preferences, such as harmful text or factually incorrect summaries. Recent work approaches the above issues by learning from a simple form of human feedback: comparisons between pairs of model-generated outputs. However, comparison feedback only conveys limited information about human preferences. In this paper, we introduce Imitation learning from Language Feedback (ILF), a new approach that utilizes more informative language feedback. ILF consists of three steps that are applied iteratively: first, conditioning the language model on the input, an initial LM output, and feedback to generate refinements. Second, selecting the refinement incorporating the most feedback. Third, finetuning the language model to maximize the likelihood of the chosen refinement given the input. We show theoretically that ILF can be viewed as Bayesian Inference, similar to Reinforcement Learning from human feedback. We evaluate ILF’s effectiveness on a carefully-controlled toy task and a realistic summarization task.
Our experiments demonstrate that large language models accurately incorporate feedback and that finetuning with ILF scales well with the dataset size, even outperforming finetuning on human summaries. Learning from both language and comparison feedback outperforms learning from each alone, achieving human-level summarization performance.

Language Models, Bayesian Inference, Reinforcement Learning from Human Feedback

![Refer to caption](https://ar5iv.labs.arxiv.org/html/2303.16755/assets/x1.png)

Figure 1: To learn from language feedback on a language model (LM) output, we have an LM generate multiple refinements of the original output based on the feedback. We use an LM to pick the best refinement and finetune the original LM to maximize the likelihood of the chosen refinement.

## 1 Introduction

Language Models (LMs) achieve strong performance across diverse NLP tasks, from summarization to question answering and dialog (Radford & Narasimhan, [2018](https://ar5iv.labs.arxiv.org/html/2303.16755#bib.bib38 ""); Radford et al., [2019](https://ar5iv.labs.arxiv.org/html/2303.16755#bib.bib40 ""); Brown et al., [2020](https://ar5iv.labs.arxiv.org/html/2303.16755#bib.bib4 ""); Rae et al., [2021](https://ar5iv.labs.arxiv.org/html/2303.16755#bib.bib41 ""), inter alia). One of their key limitations, however, is that they generate text that violates human preferences, such as misinformation (Lin et al., [2021](https://ar5iv.labs.arxiv.org/html/2303.16755#bib.bib24 "")), offensive language (Gehman et al., [2020](https://ar5iv.labs.arxiv.org/html/2303.16755#bib.bib11 "")), and factually incorrect summaries (Stiennon et al., [2020](https://ar5iv.labs.arxiv.org/html/2303.16755#bib.bib49 "")). To alleviate such iss

... (truncated, 98 KB total)
Resource ID: 9f43ad33cfdb0c4d | Stable ID: NDJhZjZmZG