Models Tested: 13
Best Score: 98.4
Median Score: 92.3
Scoring: accuracy
Introduced: 2021-10
Maintainer: OpenAI
Leaderboard (13 models)
| # | Model | Developer | Score (%) |
|---|---|---|---|
| 🥇 | Claude Opus 4.6 | Anthropic | 98.4 |
| 🥈 | Llama | Meta AI (FAIR) | 96.8 |
| 🥉 | Llama 3.1 | Meta AI (FAIR) | 96.8 |
| 4 | Claude 3.7 Sonnet | Anthropic | 96.4 |
| 5 | Claude 3.5 Sonnet | Anthropic | 96.4 |
| 6 | Claude Opus 4.5 | Anthropic | 95 |
| 7 | Claude 3 Sonnet | Anthropic | 92.3 |
| 8 | GPT | OpenAI | 92 |
| 9 | Grok | xAI | 89.3 |
| 10 | DeepSeek Models | DeepSeek | 89.3 |
| 11 | DeepSeek V3 | DeepSeek | 89.3 |
| 12 | GPT-3.5 Turbo | OpenAI | 57.1 |
| 13 | Mistral | Mistral AI | 40.3 |
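
The summary figures reported for this leaderboard (models tested, best score, median score) can be recomputed from the Score column. The snippet below is a minimal sketch of that check. The variable names are illustrative, and it assumes the median is the plain middle value of the 13 listed scores; the page does not state how its median is computed.

```python
from statistics import median

# Scores copied from the Score column of the leaderboard table, in rank order.
scores = [98.4, 96.8, 96.8, 96.4, 96.4, 95, 92.3, 92, 89.3, 89.3, 89.3, 57.1, 40.3]

models_tested = len(scores)     # number of rows in the table
best_score = max(scores)        # top-ranked score
median_score = median(scores)   # middle value of the sorted scores (odd count)

print(models_tested, best_score, median_score)  # prints: 13 98.4 92.3
```

Under that assumption, the recomputed values match the reported figures.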