[2110.14168] Training Verifiers to Solve Math Word Problems
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Introduces GSM8K dataset and analyzes language model failures on multi-step mathematical reasoning, relevant to AI safety research on robustness and reliability of large language models in structured reasoning tasks.
Paper Details
Metadata
Abstract
State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high quality linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high test performance, despite the conceptual simplicity of this problem distribution. To increase performance, we propose training verifiers to judge the correctness of model completions. At test time, we generate many candidate solutions and select the one ranked highest by the verifier. We demonstrate that verification significantly improves performance on GSM8K, and we provide strong empirical evidence that verification scales more effectively with increased data than a finetuning baseline.
Summary
GSM8K is a benchmark dataset of 8.5K high-quality grade school math word problems designed to evaluate multi-step mathematical reasoning in language models. The paper demonstrates that state-of-the-art transformer models struggle with this conceptually simple task. To address this limitation, the authors propose a verification approach where multiple candidate solutions are generated and ranked by a trained verifier, showing that verification significantly improves performance and scales more effectively than finetuning baselines.
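The verification procedure described above (sample many candidate solutions, score each with a trained verifier, keep the top-ranked one) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate` and `verifier_score` are hypothetical stand-ins for a finetuned generator and verifier model.

```python
# Best-of-N selection with a verifier, as described in the paper:
# sample N candidate solutions, score each with a verifier, return the best.
# `generate` and `verifier_score` are hypothetical stand-ins for real models.

def select_best_solution(problem, generate, verifier_score, n_candidates=100):
    """Generate n_candidates solutions; return the one ranked highest."""
    candidates = [generate(problem) for _ in range(n_candidates)]
    return max(candidates, key=lambda sol: verifier_score(problem, sol))

# Toy usage with stub models: a stub verifier that prefers the correct answer.
problem = "Tom has 3 apples and buys 2 more. How many apples does he have?"
stub_candidates = iter(["3 - 2 = 1", "3 + 2 = 5", "3 * 2 = 6"])
best = select_best_solution(
    problem,
    generate=lambda p: next(stub_candidates),
    verifier_score=lambda p, s: 1.0 if s.endswith("= 5") else 0.0,
    n_candidates=3,
)
print(best)  # → 3 + 2 = 5
```

In the paper, the verifier is itself a language model trained to judge solution correctness; the test-time cost of sampling many candidates is what buys the improved scaling over plain finetuning.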
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| AI Capability Threshold Model | Analysis | 72.0 |
Cached Content Preview
# Training Verifiers to Solve Math Word Problems
Karl Cobbe\*, Vineet Kosaraju\*, Mohammad Bavarian, Mark Chen, Heewoo Jun, Łukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman (OpenAI). \*Equal contribution. Correspondence to: Karl Cobbe <karl@openai.com>, Vineet Kosaraju <vineet@openai.com>.
###### Abstract
State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high quality linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high test performance, despite the conceptual simplicity of this problem distribution. To increase performance, we propose training verifiers to judge the correctness of model completions. At test time, we generate many candidate solutions and select the one ranked highest by the verifier. We demonstrate that verification significantly improves performance on GSM8K, and we provide strong empirical evidence that verification scales more effectively with increased data than a finetuning baseline.
## 1 Introduction
In recent years, large language models have demonstrated impressive skills across many diverse tasks (Wang et al., [2019](https://ar5iv.labs.arxiv.org/html/2110.14168#bib.bib26 ""); Brown et al., [2020](https://ar5iv.labs.arxiv.org/html/2110.14168#bib.bib2 "")). Kaplan et al. ([2020](https://ar5iv.labs.arxiv.org/html/2110.14168#bib.bib9 "")) describe the consistent benefits of increasing model size, characterizing scaling trends that hold across many orders of magnitude. However, even the largest models falter when required to perform multi-step mathematical reasoning (Hendrycks et al., [2021](https://ar5iv.labs.arxiv.org/html/2110.14168#bib.bib6 "")). Model samples frequently contain catastrophic mistakes, even after the model has been appropriately finetuned. Mathematical reasoning thus reveals a critical weakness in modern language models.
One significant challenge in mathematical reasoning is the high sensitivity to individual mistakes (Shen et al., [2021a](https://ar5iv.labs.arxiv.org/html/2110.14168#bib.bib21 "")). When generating a solution, autoregressive models have no mechanism to correct their own errors. Solutions that veer off-course quickly become unrecoverable. If we rely purely on generative methods and extrapolate from current trends, we will require an exorbitant parameter count to achieve even moderate performance on distributions as challenging as the MATH dataset (Hendrycks et al., [2021](https://ar5iv.labs.arxiv.org/html/2110.14168#bib.bib6 "")). This evidence strongly motivates the search for methods with more favorable scaling laws.
We propose training verifiers to evaluate the correctness o
... (truncated, 50 KB total)