Longterm Wiki

Credibility Rating

4/5 (High)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Epoch AI

FrontierMath is a key reference for understanding the current limits of AI mathematical reasoning; it is relevant to debates about AI capability timelines and about what 'expert-level' AI performance actually means in hard scientific domains.

Metadata

Importance: 72/100
Tags: tool page, dataset

Summary

FrontierMath is a benchmark of hundreds of expert-crafted mathematics problems spanning modern mathematical research, developed with over 60 mathematicians including Fields medalists. Current leading AI models solve less than 2% of problems, revealing a substantial gap between AI capabilities and expert-level mathematical reasoning. Problems are designed to be novel, automatically verifiable, and 'guessproof' to ensure genuine mathematical understanding is required.

Key Points

  • Leading AI models solve <2% of FrontierMath problems, vs near-perfect scores on benchmarks like GSM-8k and MATH, exposing a large capability gap.
  • Benchmark developed with 60+ mathematicians including Fields medalists; problems require hours/days for expert mathematicians to solve.
  • Spans major branches including number theory, algebraic geometry, and category theory — representing genuine research-level mathematics.
  • Problems are automatically verifiable (exact integers or symbolic expressions) and 'guessproof', with a <1% chance of a correct random guess; a sketch of such an exact check appears after this list.
  • Designed to track progress toward human-level mathematical reasoning, serving as a meaningful frontier benchmark as other benchmarks saturate.
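
To make the automatic-verification point above concrete, here is a minimal sketch of how an exact-answer check can work when answers are integers or closed-form symbolic expressions. It uses SymPy for illustration; the function name `verify_answer` and the sample answers are hypothetical, not Epoch AI's actual grading harness.

```python
# Minimal sketch of exact-answer verification, assuming answers are integers
# or closed-form symbolic expressions. Illustrative only; not Epoch AI's
# actual grading code.
import sympy as sp

def verify_answer(submitted: str, reference: str) -> bool:
    """Return True only if the submitted expression equals the reference exactly."""
    try:
        diff = sp.simplify(sp.sympify(submitted) - sp.sympify(reference))
    except (sp.SympifyError, TypeError):
        return False  # malformed submissions never verify
    return diff == 0

# An exact integer and an equivalent symbolic form both verify;
# a near-miss does not, which is what makes random guessing ineffective.
assert verify_answer("12", "12")
assert verify_answer("6*sqrt(2)**2", "12")
assert not verify_answer("11", "12")
```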

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Epoch AI | Organization | 51.0 |

Cached Content Preview

HTTP 200 | Fetched Mar 20, 2026 | 11 KB
We’re introducing FrontierMath, a benchmark of hundreds of original, expert-crafted mathematics problems designed to evaluate advanced reasoning capabilities in AI systems. These problems span major branches of modern mathematics, from computational number theory to abstract algebraic geometry, and typically require hours or days for expert mathematicians to solve.[1](https://epoch.ai/frontiermath/tiers-1-4/the-benchmark#user-content-fn-1)

![](https://epoch.ai/assets/images/posts/2024/the-benchmark/frontiermath-vs-other-benchmarks.png)

Figure 1. While leading AI models now achieve near-perfect scores on traditional benchmarks like GSM-8k and MATH, they solve less than 2% of FrontierMath problems, revealing a substantial gap between current AI capabilities and the collective prowess of the mathematics community. MMLU scores shown are for the College Mathematics category of the benchmark.

To understand and measure progress in artificial intelligence, we need carefully designed benchmarks that can assess how well AI systems engage in complex scientific reasoning. Mathematics offers a unique opportunity for this assessment: it requires extended chains of precise reasoning, with each step building exactly on what came before. And unlike many domains where evaluation requires subjective judgment or expensive tests, mathematical problems can be rigorously and automatically verified.

## The FrontierMath Benchmark

FrontierMath is a benchmark of hundreds of original mathematics problems spanning the breadth of modern mathematical research. These range from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory. We developed it through collaboration with over 60 mathematicians from leading institutions, including professors, IMO question writers, and Fields medalists.

FrontierMath problems typically demand hours or even days for specialist mathematicians to solve. The following Fields Medalists shared their impressions after reviewing some of the research-level problems in the benchmark:

> “These are extremely challenging. I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages…” - Terence Tao, Fields Medal (2006)

> “[The questions I looked at] were all not really in my area and all looked like things I had no idea how to solve… they appear to be at a different level of difficulty from IMO problems.” - Timothy Gowers, Fields Medal (1998)

FrontierMath features hundreds of advanced mathematics problems that require hours for expert mathematicians to solve. We release representative samples, which you may download [here](https://epoch.ai/files/sample_question_transcripts.zip).

Sample problems include:

1. Testing Artin’s primitive root con

... (truncated, 11 KB total)
Resource ID: 46010026d8feac35 | Stable ID: MTFmN2E4ND