# MATH

A dataset of 12,500 competition mathematics problems testing mathematical reasoning across difficulty levels 1-5.
- **Models tested:** 31
- **Best score:** 99.2%
- **Median score:** 78%
- **Scoring:** accuracy
- **Introduced:** 2021-03
- **Maintainer:** Dan Hendrycks et al.
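Accuracy here is the fraction of problems whose final answer matches the reference. A minimal sketch of that metric is below; note that the `normalize_answer` helper is a simplified, hypothetical stand-in, since real MATH graders check mathematical equivalence of LaTeX expressions, not just string equality:

```python
def normalize_answer(ans: str) -> str:
    # Crude normalization for comparing final answers (illustrative only;
    # a real grader would parse LaTeX and test symbolic equivalence).
    return ans.strip().strip("$").replace(" ", "").lower()

def accuracy(predictions: list[str], references: list[str]) -> float:
    # Fraction of predictions that exactly match their reference
    # after normalization.
    correct = sum(
        normalize_answer(p) == normalize_answer(r)
        for p, r in zip(predictions, references)
    )
    return correct / len(references)

# Example: one of two answers matches.
print(accuracy(["$ 1/2 $", "41"], ["1/2", "42"]))  # → 0.5
```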
## Leaderboard (31 models)
| # | Model | Developer | Score |
|---|---|---|---|
| 🥇 | o3 | OpenAI | 99.2% |
| 🥈 | o4-mini | OpenAI | 98.5% |
| 🥉 | o3-mini | OpenAI | 97.9% |
| 4 | Gemini 2.5 Pro | Google DeepMind | 97.3% |
| 5 | DeepSeek R1 | DeepSeek | 97.3% |
| 6 | o1 | OpenAI | 96.4% |
| 7 | Grok-3 | xAI | 95% |
| 8 | o1-preview | OpenAI | 94.8% |
| 9 | Gemini 2.5 Flash | Google DeepMind | 93.4% |
| 10 | DeepSeek V3 | DeepSeek | 90.2% |
| 11 | o1-mini | OpenAI | 90% |
| 12 | Gemini 2.0 Flash | Google DeepMind | 83.9% |
| 13 | Claude Opus 4 | Anthropic | 83% |
| 14 | Claude Sonnet 4 | Anthropic | 81.4% |
| 15 | GPT-4.1 | OpenAI | 79.1% |
| 16 | Claude 3.7 Sonnet | Anthropic | 78% |
| 17 | Gemini 1.5 Flash | Google DeepMind | 77.9% |
| 18 | Llama 3.3 | Meta AI (FAIR) | 77% |
| 19 | GPT-4o | OpenAI | 76.6% |
| 20 | Grok-2 | xAI | 76.1% |
| 21 | Llama 3.1 | Meta AI (FAIR) | 73.8% |
| 22 | GPT-4 Turbo | OpenAI | 73.4% |
| 23 | Claude 3.5 Sonnet | Anthropic | 71.1% |
| 24 | GPT-4o mini | OpenAI | 70.2% |
| 25 | Claude 3.5 Haiku | Anthropic | 69.2% |
| 26 | Mistral Large 2 | Mistral AI | 69.1% |
| 27 | Gemini 1.5 Pro | Google DeepMind | 67.7% |
| 28 | Claude 3 Opus | Anthropic | 60.1% |
| 29 | Gemini 1.0 Ultra | Google DeepMind | 53.2% |
| 30 | Llama 3 | Meta AI (FAIR) | 50.4% |
| 31 | GPT-4 | OpenAI | 42.5% |