Longterm Wiki

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback | Montreal AI Ethics Institute

web

This Montreal AI Ethics Institute piece covers a highly cited paper critiquing RLHF, making it essential reading for understanding the current limits of dominant AI alignment techniques used in LLM development.

Metadata

Importance: 72/100 · organizational report · commentary

Summary

This resource, hosted by the Montreal AI Ethics Institute, summarizes and analyzes a landmark paper identifying key open problems and fundamental limitations in RLHF, the dominant technique for aligning large language models. It covers issues including reward model flaws, scalable oversight challenges, human evaluator limitations, and risks of reward hacking. The analysis highlights why RLHF alone is insufficient to guarantee safe and aligned AI systems.

Key Points

  • RLHF reward models can be gamed or misspecified, leading to models that appear aligned but optimize for superficial proxies rather than true human preferences (see the sketch after this list).
  • Human evaluators providing feedback have cognitive biases, limited expertise, and can be manipulated by convincing-sounding but incorrect outputs.
  • Scalable oversight is a core unsolved problem: humans cannot reliably evaluate complex AI outputs beyond their own knowledge or reasoning capacity.
  • The paper identifies distributional shift, reward hacking, and value aggregation across diverse human populations as fundamental unresolved challenges.
  • RLHF lacks formal guarantees and may reinforce majority preferences while marginalizing minority values, raising fairness and alignment concerns.
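
For readers who want the mechanics behind the first point, the sketch below gives the standard two-stage RLHF formulation that these critiques target. It is a minimal sketch, not taken from the resource itself, and the notation (the reward model r_phi, the policy pi_theta, the KL coefficient beta) is our own.

```latex
% Stage 1: reward model fit to pairwise human comparisons (Bradley-Terry).
% For prompt x, y_w is the preferred completion and y_l the rejected one.
\mathcal{L}(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
  \big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\big]

% Stage 2: policy optimization against the learned proxy r_\phi, with a
% KL penalty keeping \pi_\theta close to the pre-RLHF reference model.
\max_{\theta}\; \mathbb{E}_{x,\; y\sim\pi_\theta(\cdot\mid x)}
  \big[r_\phi(x, y)\big]
  \;-\; \beta\,\mathbb{E}_{x}\,
  \mathrm{KL}\big(\pi_\theta(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)
```

Because the policy only ever optimizes the learned proxy r_phi, any gap between r_phi and genuine human preferences (misspecification, evaluator bias, distributional shift) is exactly what gets amplified, which is the reward-hacking failure mode the points above describe.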

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| RLHF | Research Area | 63.0 |

Cached Content Preview

HTTP 200 · Fetched Feb 26, 2026 · 13 KB


🔬 Research Summary by **Stephen Casper**, an MIT PhD student working on AI interpretability, diagnostics, and safety.

\[ [Original paper](https://arxiv.org/abs/2307.15217) by Stephen Casper,\* Xander Davies,\* Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Bıyık, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell\]

* * *

**Overview**: Reinforcement Learning from Human Feedback (RLHF) has emerged as the central alignment technique for finetuning state-of-the-art AI systems such as GPT-4, Claude, Bard, and Llama-2. Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure standards to improve societal oversight of RLHF systems.

* * *

## **Introduction**

Reinforcement Learning from Human Feedback (RLHF) is the key technique to train today’s most advanced language models, such as GPT-4, Claude, Bard, and Llama-2. In a matter of months, applications built on these systems have gained user bases well into the hundreds of millions. Given RLHF’s status as the default industry alignment technique, we should carefully evaluate its limitations. In a survey of over 250 papers, we review open challenges and fundamental limitations of RLHF with a focus on applications in large language models.

## **Key Insights**

### Contributions

1. **Concrete challenges with RLHF:** We taxonomize and survey problems with RLHF, dividing them into three primary categories: challenges with feedback, challenges with the reward model, and challenges with the policy.

... (truncated, 13 KB total)
Resource ID: 0a13bac6af967fe8 | Stable ID: M2MzYTQzY2