Longterm Wiki

Learning to summarize with human feedback

web

Credibility Rating

4/5 (High)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: OpenAI

Seminal 2020 OpenAI paper establishing RLHF as a practical alignment technique; directly informed InstructGPT and later chat models, making it essential reading for understanding modern LLM alignment methodology.

Metadata

Importance: 85/100 · blog post · primary source

Summary

OpenAI demonstrates that reinforcement learning from human feedback (RLHF) can train summarization models that significantly outperform supervised learning baselines, including models 10x larger. The work shows that a learned reward model can capture human preferences and generalize across domains, establishing RLHF as a practical alignment technique for language tasks.

Key Points

  • RLHF-trained summarization models outperform much larger supervised models as judged by human evaluators, showing data efficiency gains.
  • A reward model trained on human preference comparisons generalizes to new domains, enabling scalable oversight of model outputs.
  • The paper identifies reward hacking risks where models exploit flaws in the reward model rather than genuinely improving quality.
  • Human labeler agreement and quality significantly impact model performance, highlighting challenges in scalable human feedback pipelines.
  • This work is a foundational precursor to InstructGPT and ChatGPT's alignment methodology.
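The reward model in the second bullet is trained on pairwise human comparisons: given two candidate summaries, it should score the human-preferred one higher. A minimal sketch of that pairwise comparison loss (the standard `-log sigmoid(r_chosen - r_rejected)` form; function name and scalar inputs here are illustrative, not the paper's code):

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise comparison loss for reward-model training:
    -log(sigmoid(r_chosen - r_rejected)).
    Small when the human-preferred summary gets the higher score."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the margin in favor of the preferred summary grows,
# and blows up when the reward model ranks the pair the wrong way round.
print(preference_loss(2.0, 0.0))  # correct ranking: small loss
print(preference_loss(0.0, 2.0))  # wrong ranking: large loss
```

In practice the rewards come from a learned network over (post, summary) pairs and the loss is averaged over a batch of comparisons; the scalar version above only shows the objective's shape.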

Cited by 2 pages

PageTypeQuality
Large Language ModelsConcept62.0
SycophancyRisk65.0

Cached Content Preview

HTTP 200 · Fetched Mar 15, 2026 · 36 KB
Learning to summarize with human feedback | OpenAI

September 4, 2020

[Publication](https://openai.com/research/index/publication/)

# Learning to summarize with human feedback

We’ve applied reinforcement learning from human feedback to train language models that are better at summarization.

[Read paper](https://arxiv.org/abs/2009.01325) · [View code](https://github.com/openai/summarize-from-feedback) · [View samples](https://openaipublic.blob.core.windows.net/summarize-from-feedback/website/index.html)

![Learning To Summarize With Human Feedback](https://images.ctfassets.net/kftzwdyauwt9/3583c4ac-dfd4-4af1-d8fa819aea9b/57875fe437269844d643f77b9bffe84b/image-27.webp?w=3840&q=90&fm=webp)


## Why it matters

Our models generate summaries that are better than summaries from 10x larger models trained only with supervised learning. Even though we train our models on the Reddit TL;DR dataset, the same models transfer to generate good summaries of CNN/DailyMail news articles without any further fine-tuning. Our techniques are not specific to summarization; in the long run, our goal is to make aligning AI systems with human preferences a central component of AI research and deployment in many domains.

# Human feedback models outperform much larger supervised models and reference summaries on TL;DR

[Chart: human preference versus reference summaries (0–100%) as a function of model size (roughly 2×10⁹ to 10¹⁰ parameters), with curves for pre-trained, supervised, and human feedback models.]

Figure 1: The performance of various training procedures for different model sizes. Model performance is measured by how often summaries from that model are preferred to the human-written reference summaries. Our pre-trained models are early versions of GPT-3, our supervised baselines were fine-tuned to predict 117K human-written TL;DRs, and our human feedback models are additionally fine-tuned on a dataset of about 65K summary comparisons.

Large-scale language models are becoming increasingly capable on NLP tasks. These models are usually trained with the objective of next word prediction on a dataset of human-written text. But this objective doesn’t capture exactly what we want; usually, we don’t want our models to imitate humans — we want them to give high-quality answers. This mismatch is clear when a model is trained to imitate low-quality human-written text, but it can also happen in more subtle ways. For example, a model trained to predict what a human would say might make up facts when it is unsure, or generate sentences reflecting harmful social bias — both failure modes that have been well-documented.[3](https://openai.com/index/learning-to-summarize-with-human-feedback/#citation-bottom-3), [4](https://openai.com/index/learning-to-summarize-with-human-feedback/#citation-bottom-4), [5](https://openai.com/index/learning-to-summarize-with-human-feedback/#citation-bottom-5), [6]
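To optimize against a learned reward model rather than the next-word objective, the paper fine-tunes the policy with RL while penalizing its divergence from the supervised baseline, which discourages the policy from drifting into degenerate outputs that exploit the reward model. A minimal sketch of that KL-penalized reward (per-summary scalars; the coefficient `beta = 0.05` is illustrative, not the paper's value):

```python
def rl_reward(rm_score: float,
              logp_policy: float,
              logp_sft: float,
              beta: float = 0.05) -> float:
    """Reward fed to the RL optimizer: the reward-model score minus a
    KL-style penalty, beta * (log pi(y|x) - log pi_sft(y|x)), that keeps
    the policy close to the supervised fine-tuned (SFT) model."""
    return rm_score - beta * (logp_policy - logp_sft)

# No drift from the SFT model: the reward-model score passes through.
print(rl_reward(1.0, -2.0, -2.0))  # 1.0
# The policy assigns its sample much higher probability than SFT does:
# the penalty reduces the effective reward.
print(rl_reward(1.0, -1.0, -3.0))  # 0.9
```

The real objective applies this per token during PPO training with log-probabilities from the two language models; the scalar form above only shows how the penalty trades off against the reward-model score.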

... (truncated, 36 KB total)
Resource ID: 0948b00677caaf7e | Stable ID: NjAyN2YwNz