[2110.14168] Training Verifiers to Solve Math Word Problems
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Introduces GSM8K dataset and analyzes language model failures on multi-step mathematical reasoning, relevant to AI safety research on robustness and reliability of large language models in structured reasoning tasks.
Paper Details
Metadata
Abstract
State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high quality linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high test performance, despite the conceptual simplicity of this problem distribution. To increase performance, we propose training verifiers to judge the correctness of model completions. At test time, we generate many candidate solutions and select the one ranked highest by the verifier. We demonstrate that verification significantly improves performance on GSM8K, and we provide strong empirical evidence that verification scales more effectively with increased data than a finetuning baseline.
Summary
GSM8K is a benchmark dataset of 8.5K high-quality grade school math word problems designed to evaluate multi-step mathematical reasoning in language models. The paper demonstrates that state-of-the-art transformer models struggle with this conceptually simple task. To address this limitation, the authors propose a verification approach where multiple candidate solutions are generated and ranked by a trained verifier, showing that verification significantly improves performance and scales more effectively than finetuning baselines.
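The verification procedure described above (sample many candidate solutions, score each with a trained verifier, keep the top-ranked one) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate` and `verifier_score` are hypothetical stand-ins for a finetuned generator and verifier model.

```python
# Best-of-N selection with a verifier, as described in the paper:
# sample N candidate solutions, score each with a verifier, return the best.
# `generate` and `verifier_score` are hypothetical stand-ins for real models.

def select_best_solution(problem, generate, verifier_score, n_candidates=100):
    """Generate n_candidates solutions; return the one ranked highest."""
    candidates = [generate(problem) for _ in range(n_candidates)]
    return max(candidates, key=lambda sol: verifier_score(problem, sol))

# Toy usage with stub models: a stub verifier that prefers the correct answer.
problem = "Tom has 3 apples and buys 2 more. How many apples does he have?"
stub_candidates = iter(["3 - 2 = 1", "3 + 2 = 5", "3 * 2 = 6"])
best = select_best_solution(
    problem,
    generate=lambda p: next(stub_candidates),
    verifier_score=lambda p, s: 1.0 if s.endswith("= 5") else 0.0,
    n_candidates=3,
)
print(best)  # → 3 + 2 = 5
```

In the paper, the verifier is itself a language model trained to judge solution correctness; the test-time cost of sampling many candidates is what buys the improved scaling over plain finetuning.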
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| AI Capability Threshold Model | Analysis | 72.0 |
Cached Content Preview
# Training Verifiers to Solve Math Word Problems
Karl Cobbe\*, Vineet Kosaraju\*, Mohammad Bavarian, Mark Chen, Heewoo Jun, Łukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman (OpenAI). \*Equal contribution. Correspondence to: Karl Cobbe <karl@openai.com>, Vineet Kosaraju <vineet@openai.com>.
###### Abstract
State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high quality linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high test performance, despite the conceptual simplicity of this problem distribution. To increase performance, we propose training verifiers to judge the correctness of model completions. At test time, we generate many candidate solutions and select the one ranked highest by the verifier. We demonstrate that verification significantly improves performance on GSM8K, and we provide strong empirical evidence that verification scales more effectively with increased data than a finetuning baseline.
## 1 Introduction
In recent years, large language models have demonstrated impressive skills across many diverse tasks (Wang et al., [2019](https://ar5iv.labs.arxiv.org/html/2110.14168#bib.bib26 ""); Brown et al., [2020](https://ar5iv.labs.arxiv.org/html/2110.14168#bib.bib2 "")). Kaplan et al. ([2020](https://ar5iv.labs.arxiv.org/html/2110.14168#bib.bib9 "")) describe the consistent benefits of increasing model size, characterizing scaling trends that hold across many orders of magnitude. However, even the largest models falter when required to perform multi-step mathematical reasoning (Hendrycks et al., [2021](https://ar5iv.labs.arxiv.org/html/2110.14168#bib.bib6 "")). Model samples frequently contain catastrophic mistakes, even after the model has been appropriately finetuned. Mathematical reasoning thus reveals a critical weakness in modern language models.
One significant challenge in mathematical reasoning is the high sensitivity to individual mistakes (Shen et al., [2021a](https://ar5iv.labs.arxiv.org/html/2110.14168#bib.bib21 "")). When generating a solution, autoregressive models have no mechanism to correct their own errors. Solutions that veer off-course quickly become unrecoverable. If we rely purely on generative methods and extrapolate from current trends, we will require an exorbitant parameter count to achieve even moderate performance on distributions as challenging as the MATH dataset (Hendrycks et al., [2021](https://ar5iv.labs.arxiv.org/html/2110.14168#bib.bib6 "")). This evidence strongly motivates the search for methods with more favorable scaling laws.
We propose training verifiers to evaluate the correctness o
... (truncated, 50 KB total)