Longterm Wiki
This is a lay summary of the foundational OpenAI paper on process reward models (PRMs); readers seeking depth should consult the original arXiv paper (2305.20050), which directly informed the chain-of-thought reasoning approach used in o1.

Metadata

Importance: 62/100 · blog post · commentary

Summary

A blog post summarizing the OpenAI research paper 'Let's Verify Step by Step' (arXiv:2305.20050), which demonstrates that process supervision (providing feedback at each reasoning step) significantly outperforms outcome supervision for training reward models on complex multi-step reasoning tasks. The paper introduced PRM800K, a dataset of 800,000 step-level human feedback labels, and its findings directly influenced the development of OpenAI's o1 model series.

Key Points

  • Process supervision provides feedback at each reasoning step rather than only on final outcomes, enabling more precise error localization and targeted learning.
  • Models trained with process supervision significantly outperform outcome-supervised models on challenging multi-step reasoning benchmarks.
  • Active learning strategies for selecting informative solutions to label substantially improve the efficiency of process supervision.
  • The PRM800K dataset of 800,000 step-level human feedback labels was released to facilitate further research in reward model training.
  • Process supervision encourages human-approved chains of thought, contributing to safer and more interpretable AI reasoning systems.

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Scalable Oversight | Research Area | 68.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 13 KB
Day 9 of reading, understanding, and writing about a research paper. Today's paper is [Let's Verify Step by Step](https://arxiv.org/pdf/2305.20050).

This paper looks into the crucial problem of training reliable reward models for large language models (LLMs) that can solve complex multi-step reasoning tasks.

The authors, a team of researchers from OpenAI, investigate two distinct methods for training reward models: outcome supervision and process supervision.

Outcome supervision focuses on providing feedback based on the final result of an LLM's reasoning chain, while process supervision, the focus of this paper, provides feedback at each individual reasoning step.
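To make the contrast concrete, here is a toy illustration of how the two labeling schemes differ for a single model-generated solution. The steps, labels, and arithmetic error are invented for the example and are not taken from the paper:

```python
# One model-generated solution to a hypothetical math problem,
# broken into individual reasoning steps.
solution_steps = [
    "48 / 2 = 24",    # correct step
    "24 + 10 = 34",   # correct step
    "34 * 3 = 112",   # arithmetic error (should be 102)
]

# Outcome supervision: a single label for the whole solution.
outcome_label = 0  # the final answer is wrong, but we don't know where

# Process supervision: one label per reasoning step,
# which pinpoints exactly where the chain went wrong.
process_labels = [1, 1, 0]

first_error = process_labels.index(0)  # -> 2, the faulty step
```

With only the outcome label, the model is penalized for the entire chain, including the two correct steps; the step-level labels localize the mistake.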

## Why Process Supervision?

Process supervision offers several key advantages over outcome supervision:

- **More precise feedback:** By evaluating each step, process supervision pinpoints the exact location of errors, leading to more targeted learning for the model.
- **Easier for humans to interpret:** Humans can easily understand and assess the correctness of individual reasoning steps, making process supervision more amenable to human feedback.
- **More directly rewards aligned behavior:** Process supervision encourages LLMs to follow a human-approved chain of thought, contributing to safer and more interpretable AI systems.

## Key Findings of the Paper

The paper presents several significant findings:

- **Process supervision outperforms outcome supervision:** The researchers demonstrate that models trained with process supervision achieve significantly better performance than those trained with outcome supervision on challenging reasoning tasks.
- **Active learning significantly improves process supervision:** By strategically selecting the most informative solutions to label, active learning significantly boosts the efficiency of process supervision.
- **Large-scale process supervision dataset (PRM800K):** To facilitate further research in this area, the paper provides a comprehensive dataset of 800,000 step-level human feedback labels used to train their best reward model.
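The active-learning idea above can be sketched in a few lines. This is a hypothetical simplification, not the paper's implementation: the heuristic is to send the incorrect solutions that the current reward model scores most highly ("convincing wrong answers") to human labelers, since those are the most informative mistakes. The function name and toy scores below are invented for illustration:

```python
def select_for_labeling(solutions, scores, correct, budget=2):
    """Pick the highest-scoring *incorrect* solutions to label next.

    solutions: candidate solutions (any hashable identifiers)
    scores:    the current reward model's score for each solution
    correct:   whether each solution's final answer is actually right
    budget:    how many solutions we can afford to have humans label
    """
    # Keep only the wrong answers: correct ones teach the model little here.
    wrong = [(s, sc) for s, sc, ok in zip(solutions, scores, correct) if not ok]
    # The most convincing wrong answers (highest current score) come first.
    wrong.sort(key=lambda pair: pair[1], reverse=True)
    return [s for s, _ in wrong[:budget]]

solutions = ["sol_a", "sol_b", "sol_c", "sol_d"]
scores    = [0.90, 0.80, 0.40, 0.95]   # toy reward-model scores
correct   = [True, False, False, True]

picked = select_for_labeling(solutions, scores, correct)
# picked == ["sol_b", "sol_c"]
```

Labeling the solutions the model is most "fooled by" yields more signal per human label than labeling random samples, which is the efficiency gain the paper reports.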

## A Practical Example: Training a Simple Reward Model with Process Supervision

Let's illustrate the concept of process supervision with a simple Python example.

We'll focus on a basic task: recognizing whether a sequence of numbers is increasing.

```python
import random

def is_increasing(sequence):
    """Checks if a sequence of numbers is increasing."""
    for i in range(1, len(sequence)):
        if sequence[i] <= sequence[i - 1]:
            return False
    return True

def generate_solution(problem_length):
    """Generates a random solution to the 'is_increasing' problem."""
    return [random.randint(0, 10) for _ in range(problem_length)]

def process_supervised_reward_model(sequence, step):
    """
    A simple process-supervised reward model that checks each step.

    Args:
        sequence: The sequence of numbers.
        step: The current step being evaluated (index of the number).

    Returns:
        1 if the number at this step keeps the sequence increasing, else 0.
    """
    # The first element has no predecessor, so it always counts as valid.
    if step == 0:
        return 1
    return 1 if sequence[step] > sequence[step - 1] else 0
```

... (truncated, 13 KB total)
Resource ID: 48af4178dfb735bd | Stable ID: ZGU3OWU1ZG