Longterm Wiki

MATH

A dataset of 12,500 competition mathematics problems testing mathematical reasoning across difficulty levels 1-5.

Models tested: 31
Best score: 99.2%
Median score: 78.0%
Scoring: accuracy
Introduced: 2021-03
Maintainer: Dan Hendrycks et al.
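The summary statistics above follow directly from the 31 leaderboard scores below: the best score is the maximum, and the median of 31 sorted values is the 16th one. A minimal sketch of that derivation:

```python
# Derive the page's summary statistics from the 31 leaderboard scores.
from statistics import median

scores = [99.2, 98.5, 97.9, 97.3, 97.3, 96.4, 95.0, 94.8, 93.4, 90.2,
          90.0, 83.9, 83.0, 81.4, 79.1, 78.0, 77.9, 77.0, 76.6, 76.1,
          73.8, 73.4, 71.1, 70.2, 69.2, 69.1, 67.7, 60.1, 53.2, 50.4,
          42.5]

best = max(scores)    # top score on the leaderboard: 99.2
med = median(scores)  # middle (16th) of 31 sorted values: 78.0
```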

Leaderboard (31 models)

| # | Model | Developer | Score |
|---|-------|-----------|-------|
| 1 | o3 | OpenAI | 99.2% |
| 2 | o4-mini | OpenAI | 98.5% |
| 3 | o3-mini | OpenAI | 97.9% |
| 4 | Gemini 2.5 Pro | Google DeepMind | 97.3% |
| 5 | DeepSeek R1 | DeepSeek | 97.3% |
| 6 | o1 | OpenAI | 96.4% |
| 7 | Grok-3 | xAI | 95.0% |
| 8 | o1-preview | OpenAI | 94.8% |
| 9 | Gemini 2.5 Flash | Google DeepMind | 93.4% |
| 10 | DeepSeek V3 | DeepSeek | 90.2% |
| 11 | o1-mini | OpenAI | 90.0% |
| 12 | Gemini 2.0 Flash | Google DeepMind | 83.9% |
| 13 | Claude Opus 4 | Anthropic | 83.0% |
| 14 | Claude Sonnet 4 | Anthropic | 81.4% |
| 15 | GPT-4.1 | OpenAI | 79.1% |
| 16 | Claude 3.7 Sonnet | Anthropic | 78.0% |
| 17 | Gemini 1.5 Flash | Google DeepMind | 77.9% |
| 18 | Llama 3.3 | Meta AI (FAIR) | 77.0% |
| 19 | GPT-4o | OpenAI | 76.6% |
| 20 | Grok-2 | xAI | 76.1% |
| 21 | Llama 3.1 | Meta AI (FAIR) | 73.8% |
| 22 | GPT-4 Turbo | OpenAI | 73.4% |
| 23 | Claude 3.5 Sonnet | Anthropic | 71.1% |
| 24 | GPT-4o mini | OpenAI | 70.2% |
| 25 | Claude 3.5 Haiku | Anthropic | 69.2% |
| 26 | Mistral Large 2 | Mistral AI | 69.1% |
| 27 | Gemini 1.5 Pro | Google DeepMind | 67.7% |
| 28 | Claude 3 Opus | Anthropic | 60.1% |
| 29 | Gemini 1.0 Ultra | Google DeepMind | 53.2% |
| 30 | Llama 3 | Meta AI (FAIR) | 50.4% |
| 31 | GPT-4 | OpenAI | 42.5% |
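The "Scoring: accuracy" entry above means each model is graded on whether its final answer matches the reference answer. MATH solutions conventionally place the final answer in a LaTeX `\boxed{...}` expression, so a minimal sketch of such a grader extracts that expression and compares it by exact match (real graders apply additional answer normalization; the function names here are illustrative, not from any official harness):

```python
# Hedged sketch of accuracy scoring for MATH-style problems: pull the last
# \boxed{...} expression out of a model's solution text, then compare it to
# the reference answer by exact string match.

def extract_boxed(text: str):
    """Return the contents of the last \\boxed{...} in text, or None.

    Uses brace counting so nested expressions like \\boxed{\\frac{1}{2}}
    are captured whole.
    """
    start = text.rfind(r"\boxed{")
    if start == -1:
        return None
    i = start + len(r"\boxed{")
    depth = 1
    out = []
    while i < len(text):
        c = text[i]
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(c)
        i += 1
    return None  # unbalanced braces


def accuracy(predictions, references):
    """Fraction of predictions whose boxed answer exactly matches the reference."""
    correct = sum(extract_boxed(p) == r for p, r in zip(predictions, references))
    return correct / len(references)
```

For example, a model output of `"The answer is \boxed{42}."` graded against the reference `"42"` counts as correct, while any mismatch in the extracted expression counts as wrong.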