Implemented at OpenAI
Web Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: OpenAI
This is the foundational InstructGPT paper from OpenAI (Ouyang et al., 2022) that popularized RLHF for LLM alignment; it directly preceded ChatGPT and is essential reading for understanding modern alignment approaches.
Metadata
Summary
This paper introduces InstructGPT, which fine-tunes GPT-3 with reinforcement learning from human feedback (RLHF) to better follow user intent. The approach involves supervised fine-tuning (SFT) on human demonstrations, training a reward model from human preference comparisons, and optimizing the policy against that reward model with PPO. Human evaluators preferred InstructGPT outputs over those of GPT-3 models more than 100× larger.
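As a minimal sketch of the reward-model training step (PyTorch assumed; the function name and tensor shapes are illustrative, not the paper's released code), the reward model is trained on pairwise comparisons with a cross-entropy loss over score differences:

```python
import torch
import torch.nn.functional as F

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise reward-model loss from the paper:
    -log sigmoid(r(x, y_w) - r(x, y_l)), averaged over comparison pairs.

    r_chosen / r_rejected: scalar reward-model scores for the preferred
    and dispreferred response to the same prompt, shape (batch,).
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Minimizing this loss pushes the reward model to score the human-preferred response above the rejected one, which is all the PPO stage needs.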
Key Points
- Introduces the RLHF pipeline (SFT → reward model → PPO) that became the foundation for aligning large language models to human preferences
- InstructGPT (1.3B parameters) was preferred over GPT-3 (175B parameters) by human raters, showing that alignment techniques can outperform raw scale
- Demonstrates that RLHF reduces harmful, untruthful, and unhelpful outputs compared to base language models
- Identifies an 'alignment tax' trade-off where RLHF models regress on some public NLP benchmarks; the paper mitigates this by mixing pretraining gradients into the PPO updates (PPO-ptx) and constrains the policy with a per-token KL penalty against the SFT model (see the sketch after this list)
- Foundational work for ChatGPT and subsequent instruction-tuned models; established RLHF as the dominant alignment technique for LLMs
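The KL-penalized objective referenced in the bullets above can be sketched compactly (PyTorch assumed; `ppo_reward`, the argument names, and the `beta` value are illustrative, not taken from the paper's code):

```python
import torch

def ppo_reward(rm_score: torch.Tensor,
               logp_policy: torch.Tensor,
               logp_sft: torch.Tensor,
               beta: float = 0.02) -> torch.Tensor:
    """Per-response reward optimized by PPO: the reward-model score minus
    a KL penalty keeping the RL policy close to the SFT model.

    logp_policy / logp_sft: summed token log-probabilities of the sampled
    response under the current policy and the frozen SFT model.
    beta: illustrative KL coefficient (the paper tunes this per run).
    """
    kl_estimate = logp_policy - logp_sft  # sample-based KL(policy || SFT) estimate
    return rm_score - beta * kl_estimate
```

The KL term discourages the policy from drifting into regions where the reward model is unreliable; the PPO-ptx variant additionally mixes a pretraining log-likelihood term into the updates to reduce benchmark regressions.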
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Paul Christiano | Person | 39.0 |