Models Tested: 10
Best Score: 92.5%
Median Score: 91.05%
Scoring: accuracy
Introduced: 2022-10
Maintainer: Google Research
Leaderboard (10 models)
| # | Model | Developer | Score |
|---|---|---|---|
| 🥇 | Claude Opus 4.5 | Anthropic | 92.5% |
| 🥈 | Claude 3.7 Sonnet | Anthropic | 92.4% |
| 🥉 | Gemini 2.5 Pro | Google DeepMind | 92.2% |
| 4 | Claude 3.5 Sonnet | Anthropic | 91.6% |
| 5 | Llama 3.1 | Meta AI (FAIR) | 91.6% |
| 6 | GPT | OpenAI | 90.5% |
| 7 | GPT-4o | OpenAI | 90.5% |
| 8 | Claude 3.5 Haiku | Anthropic | 85.6% |
| 9 | Claude 3 Sonnet | Anthropic | 83.5% |
| 10 | Gemini 1.5 Flash | Google DeepMind | 82.6% |
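
The summary figures at the top (models tested, best score, median score) appear to be derived directly from the Score column. A minimal sketch of that arithmetic follows, assuming the median is the standard one (mean of the two middle values for an even count) and that scores are accuracy percentages. The variable names are illustrative and not part of the leaderboard.

```python
from statistics import median

# Accuracy scores (%) copied from the leaderboard table, in rank order.
scores = [92.5, 92.4, 92.2, 91.6, 91.6, 90.5, 90.5, 85.6, 83.5, 82.6]

models_tested = len(scores)
best_score = max(scores)
# With 10 entries, the median is the mean of the 5th and 6th values
# after sorting: (90.5 + 91.6) / 2. Rounded to match the page's precision.
median_score = round(median(scores), 2)

print(models_tested, best_score, median_score)
```

Under these assumptions the values match the reported summary: 10 models, a best score of 92.5, and a median of 91.05.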