Models Tested: 42
Best Score: 94.3%
Median Score: 65%
Scoring: accuracy
Introduced: 2023-11
Maintainer: David Rein et al.
Leaderboard (42 models)
| # | Model | Developer | Score |
|---|---|---|---|
| 🥇 | Gemini | Google DeepMind | 94.3% |
| 🥈 | Claude Opus 4.6 | Anthropic | 91.3% |
| 🥉 | Claude Opus 4.5 | Anthropic | 87% |
| 4 | Gemini 2.5 Pro | Google DeepMind | 84% |
| 5 | Claude Sonnet 4.5 | Anthropic | 83.4% |
| 6 | o3 | OpenAI | 83.3% |
| 7 | Gemini 2.5 Flash | Google DeepMind | 82.8% |
| 8 | o4-mini | OpenAI | 81.4% |
| 9 | Claude Opus 4.1 | Anthropic | 80.9% |
| 10 | Grok-3 | xAI | 80% |
| 11 | o3-mini | OpenAI | 79.7% |
| 12 | o1 | OpenAI | 79.2% |
| 13 | o1-preview | OpenAI | 78% |
| 14 | Claude Opus 4 | Anthropic | 74.1% |
| 15 | Claude Sonnet 4.6 | Anthropic | 74.1% |
| 16 | DeepSeek R1 | DeepSeek | 71.5% |
| 17 | Claude Sonnet 4 | Anthropic | 70.3% |
| 18 | Llama 4 Maverick | Meta AI (FAIR) | 69.8% |
| 19 | Claude 3.7 Sonnet | Anthropic | 68% |
| 20 | Claude 3.5 Sonnet | Anthropic | 65% |
| 21 | GPT-4.1 mini | OpenAI | 65% |
| 22 | Claude | Anthropic | 65% |
| 23 | o1-mini | OpenAI | 60% |
| 24 | DeepSeek Models | DeepSeek | 59.1% |
| 25 | DeepSeek V3 | DeepSeek | 59.1% |
| 26 | Llama 4 Scout | Meta AI (FAIR) | 57.2% |
| 27 | Gemini 2.0 Flash | Google DeepMind | 57% |
| 28 | Grok-2 | xAI | 56.4% |
| 29 | GPT-4.1 | OpenAI | 56.4% |
| 30 | GPT-4o | OpenAI | 53.6% |
| 31 | Llama 3.1 | Meta AI (FAIR) | 50.7% |
| 32 | Claude 3 Opus | Anthropic | 50.4% |
| 33 | GPT-4 Turbo | OpenAI | 49.3% |
| 34 | Llama 3.3 | Meta AI (FAIR) | 49.2% |
| 35 | Mistral Large 2 | Mistral AI | 43.9% |
| 36 | Claude 3.5 Haiku | Anthropic | 41.6% |
| 37 | Claude 3 Sonnet | Anthropic | 40.4% |
| 38 | GPT-4o mini | OpenAI | 39.8% |
| 39 | Llama 3 | Meta AI (FAIR) | 39.5% |
| 40 | GPT-4 | OpenAI | 35.7% |
| 41 | Gemini 1.0 Ultra | Google DeepMind | 35.4% |
| 42 | Claude 3 Haiku | Anthropic | 33.3% |
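
The summary figures at the top of the page (models tested, best score, median score) can be recomputed from the Score column. The snippet below is a minimal sketch of that check, assuming every score is an accuracy percentage; the variable names are illustrative and are not part of the leaderboard itself.

```python
from statistics import median

# Accuracy scores (%) copied from the leaderboard table, in rank order.
scores = [
    94.3, 91.3, 87, 84, 83.4, 83.3, 82.8, 81.4, 80.9, 80,
    79.7, 79.2, 78, 74.1, 74.1, 71.5, 70.3, 69.8, 68, 65,
    65, 65, 60, 59.1, 59.1, 57.2, 57, 56.4, 56.4, 53.6,
    50.7, 50.4, 49.3, 49.2, 43.9, 41.6, 40.4, 39.8, 39.5, 35.7,
    35.4, 33.3,
]

models_tested = len(scores)     # number of rows in the table
best_score = max(scores)        # top-ranked score
median_score = median(scores)   # with 42 entries: mean of the 21st and 22nd

print(models_tested, best_score, median_score)
```

With an even number of entries, the median is the mean of the two middle values. Here both are 65, so the tie among ranks 20 to 22 does not affect the result.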