OpenAI's influential "Let's Verify Step by Step" study
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
A highly influential OpenAI paper directly relevant to scalable oversight and reward modeling; process supervision is a core technique in alignment research for ensuring models reason correctly, not just produce correct-looking final answers.
Paper Details
Metadata
Abstract
In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare both methods. Recent work has already begun this comparison, but many questions still remain. We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set. Additionally, we show that active learning significantly improves the efficacy of process supervision. To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.
Summary
This OpenAI study compares outcome supervision (feedback on final answers) against process supervision (feedback on each reasoning step) for training reliable LLMs on complex math reasoning. Process supervision significantly outperforms outcome supervision on the MATH dataset, with the process-supervised model solving 78% of problems from a representative test subset. The authors release PRM800K, a dataset of 800,000 step-level human feedback labels, to support further research.
Key Points
- Process supervision (feedback on intermediate steps) significantly outperforms outcome supervision (feedback on final results) for training reliable reasoning models.
- Their process-supervised reward model (PRM) enables solving 78% of problems from a representative subset of the challenging MATH benchmark.
- Active learning substantially improves the efficiency and effectiveness of collecting process supervision labels.
- Released PRM800K: 800,000 step-level human feedback labels used to train their best reward model, enabling future research.
- Findings have broad implications for scalable oversight and reward modeling in AI alignment.
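The PRM described above is typically used for best-of-N reranking: sample many candidate solutions, score each one, and keep the highest-scoring candidate. The paper scores a full solution as the product of the PRM's per-step correctness probabilities. A minimal sketch of that selection rule, where the per-step probabilities are hypothetical stand-ins for a trained PRM's outputs:

```python
# Sketch of best-of-N reranking with a process reward model (PRM).
# A solution's score is the product of per-step correctness probabilities,
# as in the paper; the probabilities below are illustrative placeholders,
# not outputs of a real trained model.
from math import prod


def solution_score(step_probs):
    """Aggregate per-step correctness probabilities into one solution score."""
    return prod(step_probs)


def best_of_n(candidates):
    """Return the candidate whose PRM score is highest.

    `candidates` maps each candidate solution's final answer to the
    list of per-step correctness probabilities assigned by the PRM.
    """
    return max(candidates, key=lambda ans: solution_score(candidates[ans]))


candidates = {
    "x = 4": [0.95, 0.90, 0.97],  # every step looks sound
    "x = 7": [0.99, 0.40, 0.98],  # one dubious middle step sinks the product
}
print(best_of_n(candidates))  # prints "x = 4"
```

The product aggregation is what makes process supervision strict: a single low-confidence step drags down the whole solution's score, mirroring the observation that one logical error can derail an otherwise plausible chain of reasoning.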
Cited by 3 pages
| Page | Type | Quality |
|---|---|---|
| AI Accident Risk Cruxes | Crux | 67.0 |
| Process Supervision | Approach | 65.0 |
| Scalable Oversight | Research Area | 68.0 |
Cached Content Preview
# Let’s Verify Step by Step
Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe (OpenAI)
Correspondence to: Karl Cobbe <karl@openai.com>
###### Abstract
In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare both methods. Recent work has already begun this comparison, but many questions still remain. We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set. Additionally, we show that active learning significantly improves the efficacy of process supervision. To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.
## 1 Introduction
Large language models are capable of solving tasks that require complex multi-step reasoning by generating solutions in a step-by-step chain-of-thought format (Nye et al., [2021](https://ar5iv.labs.arxiv.org/html/2305.20050#bib.bib16 ""), Wei et al., [2022](https://ar5iv.labs.arxiv.org/html/2305.20050#bib.bib24 ""), Kojima et al., [2022](https://ar5iv.labs.arxiv.org/html/2305.20050#bib.bib10 "")). However, even state-of-the-art models are prone to producing falsehoods — they exhibit a tendency to invent facts in moments of uncertainty (Bubeck et al., [2023](https://ar5iv.labs.arxiv.org/html/2305.20050#bib.bib2 "")). These hallucinations (Maynez et al., [2020](https://ar5iv.labs.arxiv.org/html/2305.20050#bib.bib13 "")) are particularly problematic in domains that require multi-step reasoning, since a single logical error is enough to derail a much larger solution. Detecting and mitigating hallucinations is essential to improve reasoning capabilities.
One effective method involves training reward models to discriminate between desirable and undesirable outputs. The reward model can then be used in a reinforcement learning pipeline (Ziegler et al., [2019](https://ar5iv.labs.arxiv.org/html/2305.20050#bib.bib26 ""), Stiennon et al., [2020](https://ar5iv.labs.arxiv.org/html/2305.20050#bib.bib20 ""), Nakano et al., [2021](https://ar5iv.labs.arxiv.org/html/2305.20050#bib.bib14 ""), Ouyang et al., [2022](https://ar5iv.labs.arxiv.org/html/2305.20050#bib.bib18 ""))
... (truncated, 64 KB total)