OpenAI's influential "Let's Verify Step by Step" study
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
A highly influential OpenAI paper directly relevant to scalable oversight and reward modeling; process supervision is a core technique in alignment research for ensuring models reason correctly, not just produce correct-looking final answers.
Paper Details
Metadata
Abstract
In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare both methods. Recent work has already begun this comparison, but many questions still remain. We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set. Additionally, we show that active learning significantly improves the efficacy of process supervision. To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.
Summary
This OpenAI study compares outcome supervision (feedback on final answers) against process supervision (feedback on each reasoning step) for training reliable LLMs on complex math reasoning. Process supervision significantly outperforms outcome supervision on the MATH dataset, with the process-supervised model solving 78% of problems from a representative test subset. The authors release PRM800K, a dataset of 800,000 step-level human feedback labels, to support further research.
Key Points
- Process supervision (feedback on intermediate steps) significantly outperforms outcome supervision (feedback on final results) for training reliable reasoning models.
- Their process-supervised reward model (PRM) enables solving 78% of problems from a representative subset of the challenging MATH benchmark.
- Active learning substantially improves the efficiency and effectiveness of collecting process supervision labels.
- Released PRM800K: 800,000 step-level human feedback labels used to train their best reward model, enabling future research.
- Findings have broad implications for scalable oversight and reward modeling in AI alignment.
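The PRM described above is typically used for best-of-N reranking: sample many candidate solutions, score each one, and keep the highest-scoring candidate. The paper scores a full solution as the product of the PRM's per-step correctness probabilities. A minimal sketch of that selection rule, where the per-step probabilities are hypothetical stand-ins for a trained PRM's outputs:

```python
# Sketch of best-of-N reranking with a process reward model (PRM).
# A solution's score is the product of per-step correctness probabilities,
# as in the paper; the probabilities below are illustrative placeholders,
# not outputs of a real trained model.
from math import prod


def solution_score(step_probs):
    """Aggregate per-step correctness probabilities into one solution score."""
    return prod(step_probs)


def best_of_n(candidates):
    """Return the candidate whose PRM score is highest.

    `candidates` maps each candidate solution's final answer to the
    list of per-step correctness probabilities assigned by the PRM.
    """
    return max(candidates, key=lambda ans: solution_score(candidates[ans]))


candidates = {
    "x = 4": [0.95, 0.90, 0.97],  # every step looks sound
    "x = 7": [0.99, 0.40, 0.98],  # one dubious middle step sinks the product
}
print(best_of_n(candidates))  # prints "x = 4"
```

The product aggregation is what makes process supervision strict: a single low-confidence step drags down the whole solution's score, mirroring the observation that one logical error can derail an otherwise plausible chain of reasoning.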
Cited by 3 pages
| Page | Type | Quality |
|---|---|---|
| AI Accident Risk Cruxes | Crux | 67.0 |
| Process Supervision | Approach | 65.0 |
| Scalable Oversight | Research Area | 68.0 |
Cached Content Preview
# Let’s Verify Step by Step
Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe (OpenAI)
Correspondence to: Karl Cobbe <karl@openai.com>
###### Abstract
In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare both methods. Recent work has already begun this comparison, but many questions still remain. We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set. Additionally, we show that active learning significantly improves the efficacy of process supervision. To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.
## 1 Introduction
Large language models are capable of solving tasks that require complex multi-step reasoning by generating solutions in a step-by-step chain-of-thought format (Nye et al., [2021](https://ar5iv.labs.arxiv.org/html/2305.20050#bib.bib16 ""), Wei et al., [2022](https://ar5iv.labs.arxiv.org/html/2305.20050#bib.bib24 ""), Kojima et al., [2022](https://ar5iv.labs.arxiv.org/html/2305.20050#bib.bib10 "")). However, even state-of-the-art models are prone to producing falsehoods — they exhibit a tendency to invent facts in moments of uncertainty (Bubeck et al., [2023](https://ar5iv.labs.arxiv.org/html/2305.20050#bib.bib2 "")). These hallucinations (Maynez et al., [2020](https://ar5iv.labs.arxiv.org/html/2305.20050#bib.bib13 "")) are particularly problematic in domains that require multi-step reasoning, since a single logical error is enough to derail a much larger solution. Detecting and mitigating hallucinations is essential to improve reasoning capabilities.
One effective method involves training reward models to discriminate between desirable and undesirable outputs. The reward model can then be used in a reinforcement learning pipeline (Ziegler et al., [2019](https://ar5iv.labs.arxiv.org/html/2305.20050#bib.bib26 ""), Stiennon et al., [2020](https://ar5iv.labs.arxiv.org/html/2305.20050#bib.bib20 ""), Nakano et al., [2021](https://ar5iv.labs.arxiv.org/html/2305.20050#bib.bib14 ""), Ouyang et al., [2022](https://ar5iv.labs.arxiv.org/html/2305.20050#bib.bib18 ""))
... (truncated, 64 KB total)