[2103.03874] Measuring Mathematical Problem Solving With the MATH Dataset
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
MATH is a widely-used benchmark in AI capabilities research; results here established an early baseline showing scaling limits, later revisited as models like GPT-4 and specialized reasoning models achieved substantially higher scores.
Paper Details
Metadata
Abstract
Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are able to increase accuracy on MATH, our results show that accuracy remains relatively low, even with enormous Transformer models. Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue. While scaling Transformers is automatically solving most other text-based tasks, scaling is not currently solving MATH. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community.
Summary
This paper introduces MATH, a benchmark of 12,500 competition mathematics problems with step-by-step solutions, revealing that large Transformer models achieve surprisingly low accuracy and that scaling alone is insufficient for mathematical reasoning. The authors also release an auxiliary pretraining dataset to aid mathematical learning. The work highlights a fundamental gap between current scaling trends and genuine mathematical reasoning ability.
Key Points
- MATH contains 12,500 challenging competition math problems across 7 subjects and 5 difficulty levels, each with a full step-by-step solution (a record-layout sketch follows this list).
- Even very large Transformer models achieve low accuracy on MATH, demonstrating a significant gap in current AI mathematical reasoning capabilities.
- Simply scaling model size and compute is unlikely to solve MATH, suggesting fundamental algorithmic advances are needed.
- An auxiliary pretraining dataset was released to help models learn mathematical fundamentals, improving performance but not solving the benchmark.
- MATH has become a key benchmark for evaluating reasoning capabilities in subsequent LLM research and development.
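As an illustration of the dataset layout summarized above, the sketch below builds a record in the shape used by the public MATH release (a `problem`, a `level`, a subject `type`, and a LaTeX `solution`) and filters a small list by difficulty. The field names follow the released files; the example problem itself is invented for illustration.

```python
# Minimal sketch of working with MATH-style records.
# Field names ("problem", "level", "type", "solution") follow the public
# MATH release; the example record below is invented for illustration.

example_record = {
    "problem": r"What is $1 + 2 + \dots + 10$?",
    "level": "Level 1",                # difficulty ranges from Level 1 to Level 5
    "type": "Algebra",                 # one of the 7 subject areas
    "solution": r"The sum is $\frac{10 \cdot 11}{2} = \boxed{55}$.",
}

def filter_by_difficulty(records, level):
    """Keep only problems at the requested difficulty level."""
    return [r for r in records if r["level"] == level]

hardest = filter_by_difficulty([example_record], "Level 5")
print(len(hardest))  # 0 -- the sample record is Level 1
```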
Cited by 3 pages
| Page | Type | Quality |
|---|---|---|
| AI Capability Threshold Model | Analysis | 72.0 |
| Center for AI Safety | Organization | 42.0 |
| FAR AI | Organization | 76.0 |
Cached Content Preview
# Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks (UC Berkeley), Collin Burns (UC Berkeley), Saurav Kadavath (UC Berkeley), Akul Arora (UC Berkeley), Steven Basart (UChicago), Eric Tang (UC Berkeley), Dawn Song (UC Berkeley), Jacob Steinhardt (UC Berkeley)
###### Abstract
Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are able to increase accuracy on MATH, our results show that accuracy remains relatively low, even with enormous Transformer models. Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue. While scaling Transformers is automatically solving most other text-based tasks, scaling is not currently solving MATH. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community.
## 1 Introduction
Mathematics is a highly effective tool in many intellectual endeavors. It enables us to count and quantify objects, and it can be relied upon because it is consistent and based on logic. Mathematics pervades the sciences and can be used to model planetary orbits, atomic motion, signal frequencies, and much more. These phenomena can be encoded with mathematics precisely and concisely. This has even led some to describe mathematics as being “unreasonably effective” (Wigner, [1960](https://ar5iv.labs.arxiv.org/html/2103.03874#bib.bib40 "")). These observations speak to the broad reach and domain-generality of mathematics.
In machine learning, mathematics is a valuable testbed for _problem-solving ability_: the ability to analyze a problem, pick out good heuristics from a large set of possibilities, and chain them together to produce an answer. This contrasts with plug-and-chug calculations, a skill which ML models can already exhibit (Henighan et al., [2020](https://ar5iv.labs.arxiv.org/html/2103.03874#bib.bib12 "")). Visual or linguistic reasoning may involve limited problem-solving ability for tasks such as image classification, but unlike math this is not the focus of these domains.
To measure the problem-solving ability of machine learning models, we introduce the MATH dataset, which consists of 12,500 problems from high school math competitions. Given a problem from MATH, machine learning models generate a sequence, such as `$\frac{2}{3}$`, that encodes the final answer.
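Evaluation as described here hinges on comparing the model's generated final answer against the reference answer, which the released solutions mark with `\boxed{...}`. Below is a minimal sketch of pulling that answer out of a solution string; the brace-matching helper is an illustrative reimplementation, not the authors' released evaluation code.

```python
# Sketch: extract the final answer from a MATH-style solution string.
# The released solutions mark the answer with \boxed{...}; this helper
# is an illustrative reimplementation, not the authors' evaluation code.

def extract_boxed_answer(solution: str):
    """Return the contents of the last \\boxed{...} in `solution`, or None."""
    marker = r"\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1
    out = []
    while i < len(solution) and depth > 0:
        ch = solution[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                break
        out.append(ch)
        i += 1
    return "".join(out)

print(extract_boxed_answer(r"The answer is $\boxed{\frac{2}{3}}$."))  # \frac{2}{3}
```

Exact-match scoring would then compare the extracted string against the reference answer, typically after some normalization of equivalent LaTeX forms.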
... (truncated, 96 KB total)