Models Tested: 22
Best Score: 90.99%
Median Score: 77.2%
Scoring: accuracy
Introduced: 2024-06
Maintainer: TIGER Lab
Leaderboard (22 models)
| # | Model | Developer | Score |
|---|---|---|---|
| 🥇 | Gemini | Google DeepMind | 90.99% |
| 🥈 | Claude Opus 4.5 | Anthropic | 89.5% |
| 🥉 | DeepSeek R1 | DeepSeek | 84% |
| 4 | Llama 4 Maverick | Meta AI (FAIR) | 80.5% |
| 5 | o1 | OpenAI | 80.3% |
| 6 | Grok | xAI | 79.9% |
| 7 | Grok-3 | xAI | 79.9% |
| 8 | Gemini 2.5 Pro | Google DeepMind | 79.1% |
| 9 | Claude 3.7 Sonnet | Anthropic | 78.4% |
| 10 | Claude 3.5 Sonnet | Anthropic | 78% |
| 11 | Claude | Anthropic | 78% |
| 12 | Gemini 2.0 Flash | Google DeepMind | 76.4% |
| 13 | Gemini 2.5 Flash | Google DeepMind | 76% |
| 14 | DeepSeek Models | DeepSeek | 75.9% |
| 15 | DeepSeek V3 | DeepSeek | 75.9% |
| 16 | Llama 4 Scout | Meta AI (FAIR) | 74.3% |
| 17 | GPT-4.1 | OpenAI | 74.3% |
| 18 | Llama | Meta AI (FAIR) | 73.3% |
| 19 | GPT-4o | OpenAI | 72.6% |
| 20 | Llama 3.3 | Meta AI (FAIR) | 68.9% |
| 21 | Gemini 1.5 Flash | Google DeepMind | 67.3% |
| 22 | GPT-4o mini | OpenAI | 63.2% |
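
The summary figures at the top of the page (model count, best score, median score) can be rechecked from the table. The following is a minimal sketch, assuming every score is an accuracy percentage on the same 0-100 scale whether or not the row shows a `%` sign; `summarize` is a hypothetical helper name, not part of any leaderboard tooling.

```python
from statistics import median

# Scores copied from the leaderboard table, in rank order.
# Assumption: all values are accuracy percentages on one scale.
scores = [
    90.99, 89.5, 84.0, 80.5, 80.3, 79.9, 79.9, 79.1, 78.4, 78.0, 78.0,
    76.4, 76.0, 75.9, 75.9, 74.3, 74.3, 73.3, 72.6, 68.9, 67.3, 63.2,
]


def summarize(values):
    """Return (count, best, median) for a list of accuracy scores."""
    # With an even number of entries, the median is the mean of the
    # two middle values; rounding hides floating-point noise.
    return len(values), max(values), round(median(values), 2)


count, best, mid = summarize(scores)
print(f"models={count} best={best} median={mid}")
```

With 22 entries, the median is the average of the 11th and 12th scores (78 and 76.4), which agrees with the reported 77.2.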