Longterm Wiki
A dataset of 12,500 competition mathematics problems testing mathematical reasoning across difficulty levels 1-5.
Models Tested: 39
Best Score: 99.2%
Median Score: 77%
Scoring: accuracy
Introduced: 2021-03
Maintainer: Dan Hendrycks et al.

Leaderboard (39 models)

| # | Model | Developer | Score |
|---|---|---|---|
| 🥇 | o3 | OpenAI | 99.2% |
| 🥈 | o4-mini | OpenAI | 98.5% |
| 🥉 | o3-mini | OpenAI | 97.9% |
| 4 | DeepSeek R1 | DeepSeek | 97.3% |
| 5 | Gemini 2.5 Pro | Google DeepMind | 97.3% |
| 6 | o1 | OpenAI | 96.4% |
| 7 | Grok-3 | xAI | 95% |
| 8 | o1-preview | OpenAI | 94.8% |
| 9 | Gemini 2.5 Flash | Google DeepMind | 93.4% |
| 10 | DeepSeek V3 | DeepSeek | 90.2% |
| 11 | o1-mini | OpenAI | 90% |
| 12 | Gemini 2.0 Flash | Google DeepMind | 83.9% |
| 13 | Claude Opus 4 | Anthropic | 83% |
| 14 | Claude Sonnet 4 | Anthropic | 81.4% |
| 15 | GPT-4.1 | OpenAI | 79.1% |
| 16 | Gemini | Google DeepMind | 78.3% |
| 17 | Claude | Anthropic | 78.3% |
| 18 | Claude 3.7 Sonnet | Anthropic | 78% |
| 19 | Gemini 1.5 Flash | Google DeepMind | 77.9% |
| 20 | Llama 3.3 | Meta AI (FAIR) | 77% |
| 21 | GPT | OpenAI | 76.6% |
| 22 | GPT-4o | OpenAI | 76.6% |
| 23 | Grok-2 | xAI | 76.1% |
| 24 | Llama | Meta AI (FAIR) | 73.8% |
| 25 | Llama 3.1 | Meta AI (FAIR) | 73.8% |
| 26 | Claude Opus 4.5 | Anthropic | 73.7% |
| 27 | GPT-4 Turbo | OpenAI | 73.4% |
| 28 | Claude 3.5 Sonnet | Anthropic | 71.1% |
| 29 | GPT-4o mini | OpenAI | 70.2% |
| 30 | Claude 3.5 Haiku | Anthropic | 69.2% |
| 31 | Mistral Large 2 | Mistral AI | 69.1% |
| 32 | Gemini 1.5 Pro | Google DeepMind | 67.7% |
| 33 | DeepSeek Models | DeepSeek | 61.6% |
| 34 | Claude 3 Opus | Anthropic | 60.1% |
| 35 | Gemini 1.0 Ultra | Google DeepMind | 53.2% |
| 36 | Llama 3 | Meta AI (FAIR) | 50.4% |
| 37 | Claude 3 Sonnet | Anthropic | 43.1% |
| 38 | GPT-4 | OpenAI | 42.5% |
| 39 | Claude 3 Haiku | Anthropic | 38.9% |
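
The summary figures at the top of the page (models tested, best score, median score) can be reproduced from the leaderboard scores. The following is a minimal sketch, assuming the summary is a plain count, maximum, and median over all 39 listed entries; the variable names are illustrative and not taken from the site.

```python
from statistics import median

# Accuracy scores (in percent) copied from the leaderboard, in rank order.
scores = [
    99.2, 98.5, 97.9, 97.3, 97.3, 96.4, 95.0, 94.8, 93.4, 90.2,
    90.0, 83.9, 83.0, 81.4, 79.1, 78.3, 78.3, 78.0, 77.9, 77.0,
    76.6, 76.6, 76.1, 73.8, 73.8, 73.7, 73.4, 71.1, 70.2, 69.2,
    69.1, 67.7, 61.6, 60.1, 53.2, 50.4, 43.1, 42.5, 38.9,
]

models_tested = len(scores)   # number of leaderboard entries
best_score = max(scores)      # top accuracy
median_score = median(scores) # middle value; with 39 entries, the 20th

print(f"Models tested: {models_tested}")  # Models tested: 39
print(f"Best score: {best_score}%")       # Best score: 99.2%
print(f"Median score: {median_score}%")   # Median score: 77.0%
```

With an odd number of entries the median is the single middle row (rank 20, Llama 3.3 at 77%), which matches the Median Score shown above.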