Longterm Wiki

[2110.14168] Training Verifiers to Solve Math Word Problems

paper

Authors

Karl Cobbe·Vineet Kosaraju·Mohammad Bavarian·Mark Chen·Heewoo Jun·Lukasz Kaiser·Matthias Plappert·Jerry Tworek·Jacob Hilton·Reiichiro Nakano·Christopher Hesse·John Schulman

Credibility Rating

3/5
Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Introduces GSM8K dataset and analyzes language model failures on multi-step mathematical reasoning, relevant to AI safety research on robustness and reliability of large language models in structured reasoning tasks.

Paper Details

Citations
1828 influential
Year
2021
Methodology
arXiv preprint

Metadata

arXiv preprint · primary source

Abstract

State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high quality linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high test performance, despite the conceptual simplicity of this problem distribution. To increase performance, we propose training verifiers to judge the correctness of model completions. At test time, we generate many candidate solutions and select the one ranked highest by the verifier. We demonstrate that verification significantly improves performance on GSM8K, and we provide strong empirical evidence that verification scales more effectively with increased data than a finetuning baseline.

Summary

GSM8K is a benchmark dataset of 8.5K high-quality grade school math word problems designed to evaluate multi-step mathematical reasoning in language models. The paper demonstrates that state-of-the-art transformer models struggle with this conceptually simple task. To address this limitation, the authors propose a verification approach where multiple candidate solutions are generated and ranked by a trained verifier, showing that verification significantly improves performance and scales more effectively than finetuning baselines.
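The test-time procedure the paper describes amounts to best-of-N selection: sample many candidate solutions from the generator, score each with the trained verifier, and return the highest-ranked one. The sketch below illustrates that selection loop only; `generate` and `verifier_score` are toy stand-ins for the finetuned language models used in the paper, not its actual implementation.

```python
# Best-of-N selection with a verifier: sample n candidate solutions,
# rank them with a verifier, and return the top-scoring one.
# The generator and verifier here are toy stand-ins, not the paper's models.

from typing import Callable


def best_of_n(problem: str,
              generate: Callable[[str], str],
              verifier_score: Callable[[str, str], float],
              n: int = 100) -> str:
    """Sample n candidate solutions and return the one the verifier ranks highest."""
    candidates = [generate(problem) for _ in range(n)]
    return max(candidates, key=lambda sol: verifier_score(problem, sol))


if __name__ == "__main__":
    # Toy demo: a "generator" that cycles through canned solutions and a
    # "verifier" that rewards solutions ending in the right answer.
    canned = ["answer: 7", "answer: 12", "answer: 9"]
    it = iter(canned * 34)  # enough samples for n=100
    gen = lambda problem: next(it)
    score = lambda problem, sol: 1.0 if sol.endswith("12") else 0.0
    print(best_of_n("toy problem", gen, score, n=100))  # -> answer: 12
```

Note the design point this captures: the generator's raw samples are unreliable, but an independent scoring pass over many samples lets a weaker discriminative signal pick out a correct solution.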

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| AI Capability Threshold Model | Analysis | 72.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 50 KB
# Training Verifiers to Solve Math Word Problems

Karl Cobbe\*, Vineet Kosaraju\*, Mohammad Bavarian, Mark Chen, Heewoo Jun, Łukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman (OpenAI)

\*Equal contribution. Correspondence to: Karl Cobbe <karl@openai.com>, Vineet Kosaraju <vineet@openai.com>

###### Abstract

State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high quality linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high test performance, despite the conceptual simplicity of this problem distribution. To increase performance, we propose training verifiers to judge the correctness of model completions. At test time, we generate many candidate solutions and select the one ranked highest by the verifier. We demonstrate that verification significantly improves performance on GSM8K, and we provide strong empirical evidence that verification scales more effectively with increased data than a finetuning baseline.

## 1 Introduction

In recent years, large language models have demonstrated impressive skills across many diverse tasks (Wang et al., [2019](https://ar5iv.labs.arxiv.org/html/2110.14168#bib.bib26 ""), Brown et al., [2020](https://ar5iv.labs.arxiv.org/html/2110.14168#bib.bib2 "")). Kaplan et al. ([2020](https://ar5iv.labs.arxiv.org/html/2110.14168#bib.bib9 "")) describe the consistent benefits of increasing model size, characterizing scaling trends that hold across many orders of magnitude. However, even the largest models falter when required to perform multi-step mathematical reasoning (Hendrycks et al., [2021](https://ar5iv.labs.arxiv.org/html/2110.14168#bib.bib6 "")). Model samples frequently contain catastrophic mistakes, even after the model has been appropriately finetuned. Mathematical reasoning thus reveals a critical weakness in modern language models.

One significant challenge in mathematical reasoning is the high sensitivity to individual mistakes (Shen et al., [2021a](https://ar5iv.labs.arxiv.org/html/2110.14168#bib.bib21 "")). When generating a solution, autoregressive models have no mechanism to correct their own errors. Solutions that veer off-course quickly become unrecoverable. If we rely purely on generative methods and extrapolate from current trends, we will require an exorbitant parameter count to achieve even moderate performance on distributions as challenging as the MATH dataset (Hendrycks et al., [2021](https://ar5iv.labs.arxiv.org/html/2110.14168#bib.bib6 "")). This evidence strongly motivates the search for methods with more favorable scaling laws.
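The sensitivity to individual mistakes described above can be made concrete with a back-of-the-envelope calculation (our illustration, not a figure from the paper): if each reasoning step is independently correct with probability p, a solution requiring k steps is fully correct with probability p^k, which decays quickly even for a strong per-step accuracy.

```python
# Back-of-the-envelope: probability that a k-step solution is entirely
# correct when each step is independently correct with probability p.
# Errors compound multiplicatively, which is why a generator with no
# self-correction mechanism degrades sharply on multi-step problems.

def solution_correct_prob(p: float, k: int) -> float:
    return p ** k

for k in (1, 4, 8):
    print(f"p=0.95, {k} steps -> {solution_correct_prob(0.95, k):.3f}")
# p=0.95, 1 steps -> 0.950
# p=0.95, 4 steps -> 0.815
# p=0.95, 8 steps -> 0.663
```

Under this simplified independence assumption, a 95%-per-step model solves only about two-thirds of 8-step problems, motivating methods such as verification that recover from individual bad samples.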

We propose training verifiers to evaluate the correctness o

... (truncated, 50 KB total)