Models Tested
9
Best Score
93.1%
Median Score
86.8%
Scoring: accuracy
Introduced: 2022-10
Maintainer: Google / Stanford
Leaderboard (9 models)
| # | Model | Developer | Score |
|---|---|---|---|
| 🥇 | Claude 3.5 Sonnet | Anthropic | 93.1% |
| 🥈 | Llama 3.1 | Meta AI (FAIR) | 89% |
| 🥉 | DeepSeek V3 | DeepSeek | 87.6% |
| 4 | DeepSeek Models | DeepSeek | 87.5 |
| 5 | Claude 3 Opus | Anthropic | 86.8% |
| 6 | Gemini 1.5 Flash | Google DeepMind | 85.5% |
| 7 | Gemini 1.0 Ultra | Google DeepMind | 83.6% |
| 8 | GPT-4 | OpenAI | 83.1% |
| 9 | Mistral Large 2 | Mistral AI | 78.8% |