Longterm Wiki

[2103.03874] Measuring Mathematical Problem Solving With the MATH Dataset

paper

Authors

Dan Hendrycks·Collin Burns·Saurav Kadavath·Akul Arora·Steven Basart·Eric Tang·Dawn Song·Jacob Steinhardt

Credibility Rating

3/5
Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

MATH is a widely-used benchmark in AI capabilities research; results here established an early baseline showing scaling limits, later revisited as models like GPT-4 and specialized reasoning models achieved substantially higher scores.

Paper Details

Citations
4,613
1,071 influential
Year
2021

Metadata

Importance: 72/100 · arxiv preprint · dataset

Abstract

Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are able to increase accuracy on MATH, our results show that accuracy remains relatively low, even with enormous Transformer models. Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue. While scaling Transformers is automatically solving most other text-based tasks, scaling is not currently solving MATH. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community.

Summary

This paper introduces MATH, a benchmark of 12,500 competition mathematics problems with step-by-step solutions, revealing that large Transformer models achieve surprisingly low accuracy and that scaling alone is insufficient for mathematical reasoning. The authors also release an auxiliary pretraining dataset to aid mathematical learning. The work highlights a fundamental gap between current scaling trends and genuine mathematical reasoning ability.

Key Points

  • MATH contains 12,500 challenging competition math problems across 7 subjects and 5 difficulty levels, each with full step-by-step solutions.
  • Even very large Transformer models achieve low accuracy on MATH, demonstrating a significant gap in current AI mathematical reasoning capabilities.
  • Simply scaling model size and compute is unlikely to solve MATH, suggesting fundamental algorithmic advances are needed.
  • An auxiliary pretraining dataset was released to help models learn mathematical fundamentals; it improves accuracy but does not close the gap.
  • MATH has become a key benchmark for evaluating reasoning capabilities in subsequent LLM research and development.

Cited by 3 pages

Page | Type | Quality
--- | --- | ---
AI Capability Threshold Model | Analysis | 72.0
Center for AI Safety | Organization | 42.0
FAR AI | Organization | 76.0

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 96 KB
# Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks (UC Berkeley) · Collin Burns (UC Berkeley) · Saurav Kadavath (UC Berkeley) · Akul Arora (UC Berkeley) · Steven Basart (UChicago) · Eric Tang (UC Berkeley) · Dawn Song (UC Berkeley) · Jacob Steinhardt (UC Berkeley)

###### Abstract

Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are able to increase accuracy on MATH, our results show that accuracy remains relatively low, even with enormous Transformer models. Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue.
While scaling Transformers is automatically solving most other text-based tasks, scaling is not currently solving MATH. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community.

## 1 Introduction

Mathematics is a highly effective tool in many intellectual endeavors.
It enables us to count and quantify objects, and it can be relied upon because it is consistent and based on logic.
Mathematics pervades the sciences and can be used to model planetary orbits, atomic motion, signal frequencies, and much more. These phenomena can be encoded with mathematics precisely and concisely. This has even led some to describe mathematics as being “unreasonably effective” (Wigner, [1960](https://ar5iv.labs.arxiv.org/html/2103.03874#bib.bib40 "")).
These observations speak to the broad reach and domain-generality of mathematics.

In machine learning, mathematics
is a valuable testbed
for _problem-solving ability_: the ability to analyze a problem, pick out good heuristics from a large set of possibilities, and chain them together to produce an answer. This contrasts with
plug-and-chug calculations,
a skill which ML models can already exhibit (Henighan et al., [2020](https://ar5iv.labs.arxiv.org/html/2103.03874#bib.bib12 "")). Visual or linguistic reasoning may involve limited problem-solving ability for tasks such as image classification, but unlike math this is not the focus of these domains.

To measure the problem-solving
ability of machine learning models, we introduce the MATH
dataset, which consists of
12,500 problems from high school math competitions.
Given a problem from MATH, machine learning models generate a sequence, such as `$\frac{2}{3}$`, that encodes the final an

... (truncated, 96 KB total)
Resource ID: 985b203c41c31efe | Stable ID: YzBjZjY0OD