Models Tested: 6
Best Score: 81.7%
Median Score: 71.4%
Scoring: accuracy
Introduced: 2023-11
Leaderboard (6 models)
| # | Model | Developer | Score |
|---|---|---|---|
| 🥇 | Gemini 2.5 Pro | Google DeepMind | 81.7% |
| 🥈 | Claude Opus 4.6 | Anthropic | 76.5% |
| 🥉 | Llama 4 Maverick | Meta AI (FAIR) | 73.4% |
| 4 | Llama 4 Scout | Meta AI (FAIR) | 69.4% |
| 5 | Claude 3.7 Sonnet | Anthropic | 69.1% |
| 6 | Claude 3 Sonnet | Anthropic | 53.1% |