Longterm Wiki

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback | Montreal AI Ethics Institute

web

This Montreal AI Ethics Institute piece covers a highly cited paper critiquing RLHF, making it essential reading for understanding the current limits of dominant AI alignment techniques used in LLM development.

Metadata

Importance: 72/100 · organizational report · commentary

Summary

This resource, hosted by the Montreal AI Ethics Institute, summarizes and analyzes a landmark paper identifying key open problems and fundamental limitations in RLHF, the dominant technique for aligning large language models. It covers issues including reward model flaws, scalable oversight challenges, human evaluator limitations, and risks of reward hacking. The analysis highlights why RLHF alone is insufficient to guarantee safe and aligned AI systems.

Key Points

  • RLHF reward models can be gamed or misspecified, leading to models that appear aligned but optimize for superficial proxies rather than true human preferences (see the sketch after this list).
  • Human evaluators providing feedback have cognitive biases, limited expertise, and can be manipulated by convincing-sounding but incorrect outputs.
  • Scalable oversight is a core unsolved problem: humans cannot reliably evaluate complex AI outputs beyond their own knowledge or reasoning capacity.
  • The paper identifies distributional shift, reward hacking, and value aggregation across diverse human populations as fundamental unresolved challenges.
  • RLHF lacks formal guarantees and may reinforce majority preferences while marginalizing minority values, raising fairness and alignment concerns.
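
For readers who want the mechanics behind the first point, the sketch below gives the standard two-stage RLHF formulation that these critiques target. It is a minimal sketch, not taken from the resource itself, and the notation (the reward model r_phi, the policy pi_theta, the KL coefficient beta) is our own.

```latex
% Stage 1: reward model fit to pairwise human comparisons (Bradley-Terry).
% For prompt x, y_w is the preferred completion and y_l the rejected one.
\mathcal{L}(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
  \big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\big]

% Stage 2: policy optimization against the learned proxy r_\phi, with a
% KL penalty keeping \pi_\theta close to the pre-RLHF reference model.
\max_{\theta}\; \mathbb{E}_{x,\; y\sim\pi_\theta(\cdot\mid x)}
  \big[r_\phi(x, y)\big]
  \;-\; \beta\,\mathbb{E}_{x}\,
  \mathrm{KL}\big(\pi_\theta(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)
```

Because the policy only ever optimizes the learned proxy r_phi, any gap between r_phi and genuine human preferences (misspecification, evaluator bias, distributional shift) is exactly what gets amplified, which is the reward-hacking failure mode the points above describe.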

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| RLHF | Research Area | 63.0 |

Cached Content Preview

HTTP 200 · Fetched Feb 26, 2026 · 13 KB


🔬 Research Summary by **Stephen Casper**, an MIT PhD student working on AI interpretability, diagnostics, and safety.

\[ [Original paper](https://arxiv.org/abs/2307.15217) by Stephen Casper,\* Xander Davies,\* Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Bıyık, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell\]

* * *

**Overview**: Reinforcement Learning from Human Feedback (RLHF) has emerged as the central alignment technique for finetuning state-of-the-art AI systems such as GPT-4, Claude, Bard, and Llama-2. Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure standards to improve societal oversight of RLHF systems.

* * *

## **Introduction**

Reinforcement Learning from Human Feedback (RLHF) is the key technique to train today’s most advanced language models, such as GPT-4, Claude, Bard, and Llama-2. In a matter of months, applications built on these systems have gained user bases well into the hundreds of millions. Given RLHF’s status as the default industry alignment technique, we should carefully evaluate its limitations. In a survey of over 250 papers, we review open challenges and fundamental limitations of RLHF with a focus on applications in large language models.

## **Key Insights**

### Contributions

1. **Concrete challenges with RLHF:** We taxonomize and survey problems with RLHF, dividing them into three primary categories: challenges with feedback, challenges with the reward model, and challenges with the policy.

... (truncated, 13 KB total)
Resource ID: 0a13bac6af967fe8 | Stable ID: M2MzYTQzY2