[2103.03874] Measuring Mathematical Problem Solving With the MATH Dataset
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
MATH is a widely-used benchmark in AI capabilities research; results here established an early baseline showing scaling limits, later revisited as models like GPT-4 and specialized reasoning models achieved substantially higher scores.
Paper Details
Metadata
Abstract
Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are able to increase accuracy on MATH, our results show that accuracy remains relatively low, even with enormous Transformer models. Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue. While scaling Transformers is automatically solving most other text-based tasks, scaling is not currently solving MATH. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community.
Summary
This paper introduces MATH, a benchmark of 12,500 competition mathematics problems with step-by-step solutions, revealing that large Transformer models achieve surprisingly low accuracy and that scaling alone is insufficient for mathematical reasoning. The authors also release an auxiliary pretraining dataset to aid mathematical learning. The work highlights a fundamental gap between current scaling trends and genuine mathematical reasoning ability.
Key Points
- MATH contains 12,500 challenging competition math problems across 7 subjects and 5 difficulty levels, each with a full step-by-step solution (a record-layout sketch follows this list).
- Even very large Transformer models achieve low accuracy on MATH, demonstrating a significant gap in current AI mathematical reasoning capabilities.
- Simply scaling model size and compute is unlikely to solve MATH, suggesting fundamental algorithmic advances are needed.
- An auxiliary pretraining dataset was released to help models learn mathematical fundamentals, improving performance but not solving the benchmark.
- MATH has become a key benchmark for evaluating reasoning capabilities in subsequent LLM research and development.
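As an illustration of the dataset layout summarized above, the sketch below builds a record in the shape used by the public MATH release (a `problem`, a `level`, a subject `type`, and a LaTeX `solution`) and filters a small list by difficulty. The field names follow the released files; the example problem itself is invented for illustration.

```python
# Minimal sketch of working with MATH-style records.
# Field names ("problem", "level", "type", "solution") follow the public
# MATH release; the example record below is invented for illustration.

example_record = {
    "problem": r"What is $1 + 2 + \dots + 10$?",
    "level": "Level 1",                # difficulty ranges from Level 1 to Level 5
    "type": "Algebra",                 # one of the 7 subject areas
    "solution": r"The sum is $\frac{10 \cdot 11}{2} = \boxed{55}$.",
}

def filter_by_difficulty(records, level):
    """Keep only problems at the requested difficulty level."""
    return [r for r in records if r["level"] == level]

hardest = filter_by_difficulty([example_record], "Level 5")
print(len(hardest))  # 0 -- the sample record is Level 1
```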
Cited by 3 pages
| Page | Type | Quality |
|---|---|---|
| AI Capability Threshold Model | Analysis | 72.0 |
| Center for AI Safety | Organization | 42.0 |
| FAR AI | Organization | 76.0 |
Cached Content Preview
# Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks (UC Berkeley), Collin Burns (UC Berkeley), Saurav Kadavath (UC Berkeley), Akul Arora (UC Berkeley), Steven Basart (UChicago), Eric Tang (UC Berkeley), Dawn Song (UC Berkeley), Jacob Steinhardt (UC Berkeley)
###### Abstract
Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are able to increase accuracy on MATH, our results show that accuracy remains relatively low, even with enormous Transformer models. Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue. While scaling Transformers is automatically solving most other text-based tasks, scaling is not currently solving MATH. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community.
## 1 Introduction
Mathematics is a highly effective tool in many intellectual endeavors. It enables us to count and quantify objects, and it can be relied upon because it is consistent and based on logic. Mathematics pervades the sciences and can be used to model planetary orbits, atomic motion, signal frequencies, and much more. These phenomena can be encoded with mathematics precisely and concisely. This has even led some to describe mathematics as being “unreasonably effective” (Wigner, [1960](https://ar5iv.labs.arxiv.org/html/2103.03874#bib.bib40 "")). These observations speak to the broad reach and domain-generality of mathematics.
In machine learning, mathematics is a valuable testbed for _problem-solving ability_: the ability to analyze a problem, pick out good heuristics from a large set of possibilities, and chain them together to produce an answer. This contrasts with plug-and-chug calculations, a skill which ML models can already exhibit (Henighan et al., [2020](https://ar5iv.labs.arxiv.org/html/2103.03874#bib.bib12 "")). Visual or linguistic reasoning may involve limited problem-solving ability for tasks such as image classification, but unlike math this is not the focus of these domains.
To measure the problem-solving ability of machine learning models, we introduce the MATH dataset, which consists of 12,500 problems from high school math competitions. Given a problem from MATH, machine learning models generate a sequence, such as `$\frac{2}{3}$`, that encodes the final answer.
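Evaluation as described here hinges on comparing the model's generated final answer against the reference answer, which the released solutions mark with `\boxed{...}`. Below is a minimal sketch of pulling that answer out of a solution string; the brace-matching helper is an illustrative reimplementation, not the authors' released evaluation code.

```python
# Sketch: extract the final answer from a MATH-style solution string.
# The released solutions mark the answer with \boxed{...}; this helper
# is an illustrative reimplementation, not the authors' evaluation code.

def extract_boxed_answer(solution: str):
    """Return the contents of the last \\boxed{...} in `solution`, or None."""
    marker = r"\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1
    out = []
    while i < len(solution) and depth > 0:
        ch = solution[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                break
        out.append(ch)
        i += 1
    return "".join(out)

print(extract_boxed_answer(r"The answer is $\boxed{\frac{2}{3}}$."))  # \frac{2}{3}
```

Exact-match scoring would then compare the extracted string against the reference answer, typically after some normalization of equivalent LaTeX forms.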
... (truncated, 96 KB total)