Longterm Wiki

Credibility Rating

High (4/5)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: OpenAI

This is the foundational InstructGPT paper from OpenAI (Ouyang et al., 2022) that popularized RLHF for LLM alignment; it directly preceded ChatGPT and is essential reading for understanding modern alignment approaches.

Metadata

Importance: 92/100 · blog post · primary source

Summary

This paper introduces InstructGPT, which uses reinforcement learning from human feedback (RLHF) to fine-tune GPT-3 to better follow user intent. The approach has three stages: supervised fine-tuning on human demonstrations, training a reward model on human preference comparisons between model outputs, and optimizing the policy against that reward model with PPO. Human evaluators preferred outputs from InstructGPT over those of much larger GPT-3 models, despite InstructGPT having far fewer parameters.
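To make the reward-model stage concrete, here is a minimal PyTorch sketch of the paper's pairwise preference loss, -log σ(r_θ(x, y_w) − r_θ(x, y_l)); the function name and toy scores are illustrative, not from the paper's codebase.

```python
import torch
import torch.nn.functional as F

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise reward-model loss: -log sigmoid(r_w - r_l), where r_w / r_l
    are scalar rewards for the human-preferred and rejected completions
    of the same prompt."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy scores for a batch of three comparison pairs (illustrative values).
r_w = torch.tensor([1.2, 0.3, 2.1])  # rewards for preferred completions
r_l = torch.tensor([0.4, 0.9, 1.5])  # rewards for rejected completions
print(preference_loss(r_w, r_l))     # scalar loss for the batch
```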

Key Points

  • Introduces the RLHF pipeline (SFT → reward model → PPO) that became the foundation for aligning large language models to human preferences
  • InstructGPT (1.3B params) was preferred over GPT-3 (175B params) by human raters, showing alignment techniques can outperform raw scale
  • Demonstrates that RLHF reduces harmful, untruthful, and unhelpful outputs compared to base language models
  • Identifies an 'alignment tax': RLHF models can regress on some public NLP benchmarks, which the paper mitigates by mixing pretraining gradients into the PPO updates (PPO-ptx); the KL-penalized RL objective is sketched after this list
  • Foundational work for ChatGPT and subsequent instruction-tuned models; established RLHF as the dominant alignment technique for LLMs
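As a rough sketch of the RL stage (assuming PyTorch; the names and the β value are illustrative), the reward optimized with PPO penalizes divergence from the SFT model, r_θ(x, y) − β log(π_RL(y|x)/π_SFT(y|x)); the PPO-ptx pretraining-gradient term is omitted here.

```python
import torch

def kl_shaped_reward(rm_score: torch.Tensor,
                     logprob_policy: torch.Tensor,
                     logprob_sft: torch.Tensor,
                     beta: float = 0.02) -> torch.Tensor:
    """Per-sample RL reward: r_theta(x, y) - beta * log(pi_RL(y|x) / pi_SFT(y|x)).
    logprob_* are summed token log-probabilities of completion y under the
    current policy and the frozen SFT model; beta is an illustrative value."""
    log_ratio = logprob_policy - logprob_sft  # estimate of the KL term
    return rm_score - beta * log_ratio

# Toy batch of two sampled completions (all values illustrative).
print(kl_shaped_reward(torch.tensor([1.0, 0.5]),
                       torch.tensor([-12.3, -8.7]),
                       torch.tensor([-11.9, -9.1])))
```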

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Paul Christiano | Person | 39.0 |

Cached Content Preview

HTTP 200 · Fetched Feb 26, 2026 · 0 KB
OpenAI

# 404

Dead end, living mind

Pivot like water through stone

Find the side channel

by [gpt-5.2-thinking](https://chatgpt.com/?model=gpt-5.2-thinking&openaicom-did=863fc569-4c56-49ad-898d-fd997fd2dbcb&openaicom_referred=true)
Resource ID: a0406a8b2e9bffe0 | Stable ID: OWUyMDQ2ZT